Comparator Classes
Learn about comparator classes and how to use them to support matching in the Reltio Data Cloud.
This topic provides an overview of the comparator classes that are available in the Reltio Data Cloud, how they're used to support matching, and how to perform a relevance calculation for each comparator class.
Purpose of comparator classes
- Compares strings to determine whether they can be considered a match (i.e., a comparison algorithm).
- Helps to calculate similarity between the strings (i.e., a relevance calculation). The relevance value ranges from 0 to 1; when the relevance is closer to 1, the entities are more similar.
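To make these two roles concrete, the following is a minimal Python sketch of a comparator that exposes both a boolean comparison and a relevance score. The class and method names (`ExactStringComparator`, `compare`, `relevance`) are hypothetical and chosen only for illustration; they are not the Reltio API. The behavior mirrors a simple exact string comparison, where the relevance is 1 for identical strings and 0 otherwise.

```python
class ExactStringComparator:
    """Hypothetical illustration only; not a Reltio class."""

    def compare(self, s1: str, s2: str) -> bool:
        # Comparison algorithm: do the two strings count as a match?
        return s1 == s2

    def relevance(self, s1: str, s2: str) -> float:
        # Relevance calculation: a score in [0, 1]; closer to 1 means more similar.
        return 1.0 if self.compare(s1, s2) else 0.0


comparator = ExactStringComparator()
print(comparator.compare("Smith", "Smith"))    # True
print(comparator.relevance("Smith", "Smyth"))  # 0.0
```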
How comparator classes work
Match rules contain one or more high-level comparison operators (e.g., Exact, ExactOrNull, ExactOrAllNull, notExactSame, and Fuzzy) that operate on the attributes you’ve selected for the rule.
Each comparison operator includes the word Exact or Fuzzy. The behavior of that exact or fuzzy part is governed by the comparator class, which must be mapped to the attribute. For example, you might choose to apply the ExactOrNull operator to the attribute of a person’s last name or suffix.
How to define a comparator class
Comparator class: com.reltio.match.comparator.* | Description |
---|---|
BasicStringComparator
| This comparator treats the attribute values as strings and returns true if the strings are identical. All characters are supported. It is a good starting point for a basic exact match use case.
Note: If your rule does not define a comparator class, the match engine uses this comparator class and the ExactMatchToken class. |
DamerauLevenshteinDistance
| Consider the values of two attributes as S1 and S2. This comparator counts n, the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2. The comparator returns true if n is:
Example, to make John equal to jon, n = 1. The comparator returns Example, to make John equal to jonathon, n = 6; the comparator will return
|
DynamicDamerauLevenshteinDistance
| Same as the DamerauLevenshteinDistance comparator, but this comparator supports a greater number of operations on longer strings.
|
MetaphoneComparator
| The comparator returns true if the two strings are phonetically equal based on the Metaphone algorithm. The Metaphone algorithm is thought to improve upon the Soundex algorithm because it takes into consideration various inconsistencies in the English spelling and pronunciation.
|
DoubleMetaphoneComparator | The comparator returns true if the two strings are phonetically equal based on the Double Metaphone algorithm. The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm.
|
SoundexComparator
| The comparator returns true if the two strings are phonetically equal based on the Soundex algorithm.
|
StringCharactersComparator
| This comparator strips the two strings of all non-alphabetic characters (for example, /, @, #, $) and returns true if the two resulting strings are identical and neither is empty or null. If both strings are empty or null, the comparator returns false.
|
StringComparatorIgnoringNulls
| The comparator returns true if the strings are identical, non-null, non-zero in length, and not equal to the string ‘null’.
|
PhoneNumberComparator
| The comparator strips the strings of all non-numeric characters. The comparator expects a result of 10 digits for the purpose of comparison. If either of the resulting strings has fewer than 10 characters, the comparator returns false. If both resulting strings have 10 or more characters and the right-most 10 characters are identical, the comparator returns true. Examples:
|
OrganizationNamesComparator
| For each attribute being compared, the comparator parses the attribute’s string (for example, IBM Services Corp) into a collection of words. It then compares one collection to the other, and if at least 60% of the words in the collections are the same, the comparator returns true.
|
AddressLineComparator
|
There is an exact matching on address street. Algorithm:
Note:
|
RangeNumericComparator | Returns true if the difference in magnitude between two values falls within a specified range. The range can be expressed as an absolute value (for example, 5) or as a percentage (for example, 10%). For example, if the threshold is 5 and the two values are 12 and 16, the comparator returns true. If the threshold is 10% and the two values are 12 and 16, the comparator returns false: the percentage is applied to the smaller of the two values, so the threshold is calculated as 12 * 0.1 = 1.2, and the difference of 4 exceeds it.
See example below showing the proper structure for setting required parameters in this class. |
BasicTokenizedOrganizationNameComparator | Compares two organization names. Comparison is done in two steps:
Note:
See Organization Names Garbage for the list of organization name garbage words.
|
CustomComparator | This comparator is a container for an unlimited number of groups of other comparators. Each group is configured through its own group of comparator parameters, and the parameter groups must be configured as a list of group elements, where each group is a list of parameters. The following parameters are supported:
The scenarios are described for possible values while these are enabled: The algorithm goes through configured groups, and for each group it does the following:
Total result is calculated as an expression: |
DistinctWordsComparator | Compares values by distinct words based on the parameters specified. The following parameters are supported:
If the intersection is greater than the value of the the If comparison by words failed to match the values and the When the option |
ExactMultiComparator | The comparator accepts two lists of values from the two entities that are being compared. Each list consists of the values of the attributes set in the multi operand. The result is a match if each list contains only one distinct value and that value is the same in both lists. Example 1: Entity 1 has FirstName=John, AlternateName=John (list of values is [john, john]) and Entity 2 has FirstName=John, AlternateName=John (list of values is [john, john]). The set of values consists of John, and the result is a match. Example 2: Entity 1 has FirstName=John, AlternateName=Unknown and Entity 2 has FirstName=John, AlternateName=<no value>. The Unknown value in Entity 1 is marked as a null value in the match rule and is filtered out. The set of values consists of John, and the result is a match. Example 3: Entity 1 has FirstName=John, AlternateName=Jonny (list of values is [john, jonny]) and Entity 2 has FirstName=John, AlternateName=John (list of values is [john, john]). The set of values consists of John and Jonny, and the result is not a match. |
CrossMultiComparator | Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of the operator.
|
ProximateGeoComparator |
|
ExactOrNullCrossMultiComparator |
|
JaroWinklerComparator | This comparator uses the Apache Commons Jaro Winkler algorithm in the matching rule configuration. The comparator also has an optional parameter, |
VolumeComparator | The comparator extracts volumes from attribute values, then converts and compares them. For example, if the attribute name is testname 1L, the comparator extracts the 1L from the name, converts it to milliliters, and compares the results. The parameters available are:
|
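The comparison logic described for a few of the simpler comparators above can be sketched in Python as follows. This is an illustrative approximation based only on the descriptions in the table, not the Reltio implementation. The function names and the string form of the threshold (`"5"` or `"10%"`) are hypothetical, the range check is assumed to be inclusive, and the alphabetic comparison is assumed to be case-sensitive.

```python
import re


def phone_numbers_match(s1: str, s2: str) -> bool:
    """Strip non-digits, then compare the right-most 10 digits."""
    d1 = re.sub(r"\D", "", s1)
    d2 = re.sub(r"\D", "", s2)
    if len(d1) < 10 or len(d2) < 10:
        return False
    return d1[-10:] == d2[-10:]


def alphabetic_strings_match(s1, s2) -> bool:
    """Strip non-alphabetic characters; empty or null inputs never match."""
    if s1 is None or s2 is None:
        return False
    a1 = "".join(ch for ch in s1 if ch.isalpha())
    a2 = "".join(ch for ch in s2 if ch.isalpha())
    return bool(a1) and a1 == a2


def within_range(n1: float, n2: float, threshold: str) -> bool:
    """Threshold is an absolute value ("5") or a percentage ("10%")."""
    diff = abs(n1 - n2)
    if threshold.endswith("%"):
        # The percentage is applied to the smaller of the two values.
        limit = min(n1, n2) * float(threshold[:-1]) / 100
    else:
        limit = float(threshold)
    return diff <= limit  # assumption: the bound is inclusive


print(phone_numbers_match("+1 (415) 555-0100", "415-555-0100"))  # True
print(within_range(12, 16, "5"))    # True
print(within_range(12, 16, "10%"))  # False
```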
Comparator class examples
Example of ProximateGeoComparator
Example of CrossMultiComparator
Example of ExactOrNullCrossMultiComparator
Example of CustomComparator
You must describe a comparison algorithm or the relevance score calculation algorithm.
How to perform a relevance score calculation
Comparator Class | Description |
---|---|
AddressLineComparator
|
If addressLines S1 and S2 start with PO-box number:
If addressLines do not start with a PO-box value, the relevance calculation includes the following steps:
Ncommon / (S1WordsCount + S2WordsCount) |
AlwaysTrueComparator
|
Relevance is 1. |
BasicStringComparator
|
If strings are equal then the relevance is 1. Otherwise the relevance is 0. |
BasicTokenizedOrganizationNameComparator
|
The relevance calculation includes the following steps:
Ncommon / (S1wordsCount + S2wordsCount - Ncommon) |
CrossMultiComparator |
Relevance is calculated for each combination of attributes, and the largest value is chosen as the result. For each combination of attributes, the relevance is the sum of the per-attribute relevances (R_attribute) divided by the number of attributes (N_attributes): (R_attribute1 + R_attribute2 + ... + R_attributeN) / N_attributes |
DamerauLevenshteinDistance |
This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2. The relevance is calculated as 1 - distance / max(S1_length, S2_length). The smaller the distance, the closer the relevance is to 1. |
DistinctWordsComparator |
The algorithm is based on the following:
The relevance is the ratio of the common words count (N_common) to the larger of the two word counts: N_common / max(S1_wordsCount, S2_wordsCount). Additional case: the relevance is the ratio of the common characters count (N_common) to the larger of the two character counts: N_common / max(S1_charsCount, S2_charsCount). In this case, the relevance is the larger of the relevance by words and the relevance by characters. |
DoubleMetaphoneComparator |
The comparator returns 1 if the two strings are phonetically equal based on the Double Metaphone algorithm, otherwise the relevance is 0. The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm. |
DynamicDamerauLevenshteinDistance |
This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2. The relevance is calculated as 1 - sqr(distance) / sqr(max(S1_length, S2_length)). The smaller the distance, the closer the relevance is to 1. |
ExactMultiComparator |
Returns 1 if all compared attributes are equal; otherwise, returns 0. |
MetaphoneComparator |
The comparator returns 1 if the two strings are phonetically equal based on the Metaphone algorithm, otherwise the relevance is 0. |
OrganizationNamesComparator |
The relevance calculation includes the following steps:
commonWordsCount / (S1_wordsCount + S2_wordsCount - commonWordsCount) |
PhoneNumberComparator |
The comparator strips the strings of all non-numeric characters. The comparator expects a result of 10 digits for comparison purposes. If either of the resulting strings has fewer than 10 characters, the comparator returns 0. If both resulting strings have 10 or more characters and the right-most 10 characters are identical, the comparator returns 1. |
ProximateGeoComparator |
The relevance is 1 - (distance between the two compared points / distance_threshold). The closer the compared geo-points are to each other, the closer the relevance is to 1. |
RangeNumericComparator |
Consider two attribute values (numbers) as N1 and N2. The relevance is 1 - (difference between the two numbers / max(N1, N2)). For example, if the two values are 12 and 16, the comparator returns 1 - 4/16 = 0.75. |
SoundexComparator |
The comparator returns 1 if the two strings are phonetically equal based on the Soundex algorithm, otherwise returns 0. |
StringCharactersComparator |
This comparator strips the two strings of all non-alphabetic characters (for example, /, @, #, $) and returns 1 if the two resulting strings are identical and neither is empty or null; otherwise, it returns 0. |
StringComparatorIgnoringNulls |
The comparator returns 1 if the strings are identical and do not equal the string ‘null’; otherwise, it returns 0. |
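Three of the relevance formulas in the table above can be worked through with a short Python sketch. It is an approximation built from the table descriptions, not the Reltio implementation, and all function names are hypothetical. The edit distance here handles only the insert, delete, and replace operations named in the table (plain Levenshtein distance, without transpositions). Word counts are assumed to be counts of distinct whitespace-separated words, and the handling of empty inputs is also an assumption.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character inserts, deletes, and replaces."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,           # delete c1
                            curr[j - 1] + 1,       # insert c2
                            prev[j - 1] + cost))   # replace, or keep if equal
        prev = curr
    return prev[-1]


def edit_distance_relevance(s1: str, s2: str) -> float:
    """1 - distance / max(S1_length, S2_length)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - levenshtein(s1, s2) / longest


def word_overlap_relevance(s1: str, s2: str) -> float:
    """commonWordsCount / (S1_wordsCount + S2_wordsCount - commonWordsCount)."""
    w1, w2 = set(s1.split()), set(s2.split())
    common = len(w1 & w2)
    total = len(w1) + len(w2) - common
    return common / total if total else 0.0  # assumption: no words means 0


def range_relevance(n1: float, n2: float) -> float:
    """1 - |N1 - N2| / max(N1, N2)."""
    largest = max(n1, n2)
    if largest == 0:
        return 1.0  # assumption: avoids division by zero when both are 0
    return 1 - abs(n1 - n2) / largest


print(edit_distance_relevance("john", "jon"))  # 0.75 (distance 1, max length 4)
print(range_relevance(12, 16))                 # 0.75
print(word_overlap_relevance("IBM Services Corp", "IBM Services"))  # 2 common words out of 3
```

With lowercase inputs, converting john to jon takes a single deletion, so the relevance is 1 - 1/4 = 0.75.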