Comparator Classes

Learn about comparator classes and how to use them to support matching in the Reltio Data Cloud.

This topic provides an overview of the comparator classes that are available in the Reltio Data Cloud, how they're used to support matching, and how to perform a relevance calculation for each comparator class.

Purpose of comparator classes

Reltio Data Cloud uses comparator classes to define the behavior of comparison operators and to compare attribute values. Each comparator class:
  • Compares strings to determine whether they can be considered a match (i.e., a comparison algorithm).
  • Helps to calculate similarity between the strings (i.e., a relevance calculation). The relevance value ranges from 0 to 1; when the relevance is closer to 1, the entities are more similar.

How comparator classes work

Match rules contain one or more high-level comparison operators (e.g., Exact, ExactOrNull, ExactOrAllNull, notExactSame, and Fuzzy) that operate on the attributes you’ve selected for the rule.

Each comparison operator includes the word Exact or Fuzzy. This aspect’s behavior (i.e., exact or fuzzy) is governed by the comparator class, which must be mapped to the attribute. For example, you might choose to apply the ExactOrNull operator to the attribute for a person’s last name or suffix.

Example

{
  "and": {
    "exact": [
      "configuration/entityTypes/Contact/attributes/FirstName",
      "configuration/entityTypes/Contact/attributes/LastName"
    ],
    "exactOrNull": [
      "configuration/entityTypes/Contact/attributes/SuffixName"
    ]
  }
}

How to define a comparator class

You can define a comparator class for each attribute using exact, fuzzy, exactOrNull, and exactOrAllNull match rules. Currently, the Reltio platform does not support defining custom comparators.
Note: Comparators always check for exact matches only. If you want to apply the fuzzy match logic, the Fuzzy operator option must be selected manually.
The table below lists available comparator classes:
Table 1. Description and Recommended Use of Comparator Classes
Columns: Comparator class (com.reltio.match.comparator.*), Description
BasicStringComparator This comparator treats the attribute values as strings and returns true if the strings are identical. All characters are supported. It is a good starting point for a basic exact match use case.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner with no additional processing to remove special characters or need for fuzzy variations of the strings.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: ExactMatchToken class.
Note: If your rule does not define a comparator class, the match engine will use this comparator class and the ExactMatchToken class.
DamerauLevenshteinDistance Consider the values of two attributes as S1 and S2. This comparator counts n, the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently. The comparator returns true if n is:
  • =0 (that is, the strings are already equal)
  • <=1 where the largest raw string length is <=4
  • <= 2 where the largest raw string length is > 6 and <=10

For example, to make John equal to jon, n = 1, so the comparator returns true.

For example, to make John equal to jonathon, n = 4, so the comparator returns false.

  • Behaves differently for the Fuzzy operator vs the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for Exact, ExactOrNull, ExactOrAllNull, and notExactSame, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare two strings that might have spelling inconsistencies.
  • Typical use cases are matching words that are believed to have spelling errors.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextAndNumberMatchToken class. If a match token class is not defined, the FuzzyTextAndNumberMatchToken class is used by default.
  • Supports threshold setting for Levenshtein comparators.
DynamicDamerauLevenshteinDistance Same as the DamerauLevenshteinDistance comparator, but this comparator supports a greater number of operations on longer strings. The comparator returns true if n is:
  • <= 1 where the largest raw string length is <=6
  • <= 2 where the largest raw string length is > 6 and <=10
  • <= 3 where the largest raw string length is > 10 and <= 20
  • <= 4 where the largest raw string length is > 20 and <=30
  • <= 5 where the largest raw string length is > 30
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that might have spelling mistakes.
  • Typical use cases are matching words that are believed to have spelling errors.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextAndNumberMatchToken class. If a match token class is not defined, the FuzzyTextAndNumberMatchToken class is used by default.
  • Supports threshold setting for Levenshtein comparators.
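
The length-dependent thresholds listed for DynamicDamerauLevenshteinDistance can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, the comparison is assumed to be case-insensitive (as the John/jon example suggests), and the transposition of two adjacent characters is counted as one operation only because the class is named after Damerau-Levenshtein.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character operations needed to turn s1 into s2.

    Insert, delete, and replace each cost 1. Swapping two adjacent
    characters also costs 1 (an assumption based on the comparator's name).
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s1
    for j in range(cols):
        d[0][j] = j  # insert every character of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[rows - 1][cols - 1]


def dynamic_threshold(longest: int) -> int:
    """Allowed number of operations for the longer of the two strings."""
    if longest <= 6:
        return 1
    if longest <= 10:
        return 2
    if longest <= 20:
        return 3
    if longest <= 30:
        return 4
    return 5


def dynamic_match(s1: str, s2: str) -> bool:
    # Lowercasing is an assumption, not documented behavior.
    a, b = s1.lower(), s2.lower()
    return edit_distance(a, b) <= dynamic_threshold(max(len(a), len(b)))
```

With this sketch, dynamic_match("John", "jon") is true, while dynamic_match("John", "jonathon") is false because the required number of operations exceeds the limit for an eight-character string.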
MetaphoneComparator The comparator returns true if the two strings are phonetically equal based on the Metaphone algorithm. The Metaphone algorithm is thought to improve upon the Soundex algorithm because it takes into consideration various inconsistencies in the English spelling and pronunciation.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: DictionaryStatsPhoneticFuzzyToken class. If a match token class is not defined, the DictionaryStatsPhoneticFuzzyToken class is used by default.
DoubleMetaphoneComparator The comparator returns true if the two strings are phonetically equal based on the Double Metaphone algorithm. The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: DoubleMetaphoneMatchToken class. If a match token class is not defined, the DoubleMetaphoneMatchToken class is used by default.
SoundexComparator The comparator returns true if the two strings are phonetically equal based on the Soundex algorithm.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings that are likely to sound the same when spoken even if they are spelled somewhat differently.
  • Typical use cases are matching words that are believed to sound the same.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: SoundexTextMatchToken class.
  • If a match token class is not defined, the SoundexTextMatchToken class is used by default.
  • Additional Guidance: See other phonetic comparator options such as the Metaphone and Double Metaphone comparators.
StringCharactersComparator This comparator strips all non-alphabetic characters (for example, /, @, #, $) from the two strings and returns true if the resulting strings are identical and non-empty. If both strings are empty or null, the comparator returns false.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner but do so with special characters removed automatically.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: FuzzyTextMatchToken class. If a match token class is not defined, the FuzzyTextMatchToken class is used by default.
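
The StringCharactersComparator behavior described above could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, "alphabetic" is read literally as the Latin letters A-Z and a-z (so digits are stripped as well), and the comparison is assumed to be case-sensitive because the description does not mention case folding.

```python
import re
from typing import Optional


def string_characters_match(a: Optional[str], b: Optional[str]) -> bool:
    """Compare two values after removing every non-alphabetic character."""

    def letters_only(value: Optional[str]) -> str:
        # Treat None as an empty string; keep Latin letters only (assumption).
        return re.sub(r"[^A-Za-z]", "", value or "")

    left, right = letters_only(a), letters_only(b)
    if not left and not right:
        # Both values are empty or null after stripping: never a match.
        return False
    return left == right
```

For example, string_characters_match("O'Neil", "ONeil") is true, whereas two values made up only of special characters do not match.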
StringComparatorIgnoringNulls The comparator returns true if the strings are identical and both strings are non-null, non-zero in length, and not equal to the string ‘null’.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare two strings in a basic manner but also test that both strings exist and have meaningful info.
  • Typical use cases are Exact matching on First Name, Last Name, Middle Name, Product SKU, and so on.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: ExactMatchToken class. If a match token class is not defined, the ExactTextMatchToken class is used by default.
  • Additional guidance: Do not use with ExactOrNull, and ExactOrAllNull comparison operators because both of those are designed to return true in cases where a value is null, whereas the StringComparatorIgnoringNulls comparator class will actually return false in these cases. Thus this comparator is incompatible with ExactOrNull and ExactOrAllNull.
PhoneNumberComparator The comparator strips the strings of all non-numeric characters. The comparator is specifically expecting a result of 10 digits for the purpose of comparison. So if either of the resulting strings is less than 10 characters, the comparator returns false. Whereas if the resulting strings are >= 10 chars AND the right-most 10 chars are identical, the comparator returns true.

Examples:

  • (818)777-09876 and 818-777-0987 will produce 81877709876 and 8187770987 and return false
  • +0177(818)777-0987 and 818-7770987 will produce 8187770987 and 8187770987 and return true
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare 10-digits of phone numbers in a basic manner.
  • Typical use cases are Exact matching on phone numbers.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: PhoneNumberMatchToken class. If a match token class is not defined, the PhoneNumberMatchToken class is used by default.
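
The PhoneNumberComparator rule (strip non-numeric characters, require at least 10 digits, compare the right-most 10) can be illustrated with a few lines of code. The function name is hypothetical, and this is only a sketch of the documented rule.

```python
import re


def phone_numbers_match(a: str, b: str) -> bool:
    """Compare the right-most 10 digits of two phone number strings."""
    digits_a = re.sub(r"[^0-9]", "", a)
    digits_b = re.sub(r"[^0-9]", "", b)
    if len(digits_a) < 10 or len(digits_b) < 10:
        # Fewer than 10 digits on either side is never a match.
        return False
    return digits_a[-10:] == digits_b[-10:]
```

Applied to the examples above, phone_numbers_match("(818)777-09876", "818-777-0987") is false and phone_numbers_match("+0177(818)777-0987", "818-7770987") is true.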
OrganizationNamesComparator For each attribute being compared, the comparator parses the attribute’s string (for example, IBM Services Corp) into a collection of words. It then compares one collection to the other and if at least 60% of the words in the collections are the same, the comparator returns true.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare company names and you believe there will be inconsistencies due to each having a slightly different set of words.
  • Typical use cases are for matching on organization name.
  • Supports non-Latin character sets.
    Note: For scripts in non-Latin languages in which you cannot split a value into multiple words (e.g., Chinese), the comparator may not be able to properly compare values.
  • Guidance regarding Match Token Class: OrganizationNameMatchToken class. If a match token class is not defined, the OrganizationNameMatchToken class is used by default.
  • Additional guidance: This comparator does not remove garbage words.
AddressLineComparator

Performs exact matching on the street address.

Algorithm:

  1. The two values to compare are converted to lowercase, and only letters and digits are retained. If both transformed values start with po-box, they are checked for equality. If they are the same, the comparator reports a match.
  2. If the values do not represent a PO box, the comparator normalizes them differently: the values are converted to lowercase, and only letters are retained.
  3. The values are split into words.
  4. The comparator removes noise (garbage) words such as st, ave, and rd.
    Note: If all the words are in the noise dictionary, all the words are retained. For example, the value 123 ave contains digits and the noise word 'ave', so the value to compare is just ave.
    Tip: There are currently three out-of-the-box dictionaries: addressLine, internationalOrganizationName, and organizationName. To get the full list of noise words, call this API (example for addressLine):
    Request
    GET https://{{env}}.reltio.com/reltio/api/{{tenantId}}/configuration/noiseDictionaries/addressLine
    Response (full list of noise words for the Address attribute that AddressLineComparator uses)
    [ "access", "accs", "alley", (...) ]
  5. The two sets of words obtained by splitting are compared to find how many words of the first set (M1 words) also exist in the second set (M2 words). The number of shared words, N, determines whether the initial values are considered the same: N / (M1 + M2 - N) >= 0.6. If the inequality holds, the comparator reports a match.
Note:
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: AddressLineMatchToken class. If a match token class is not defined, the AddressLineMatchToken class is used by default.
  • Click Address Line Garbage to see the list of address line garbage words.
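
The AddressLineComparator algorithm can be sketched as below, reading the ratio in step 5 as N / (M1 + M2 - N), that is, shared words divided by all distinct words. Several details are assumptions: the names are hypothetical, NOISE_WORDS is a three-word stand-in for the real addressLine dictionary, the PO box check looks for the prefix "pobox" after punctuation is removed, and two different PO boxes are treated as a non-match.

```python
import re

# Hypothetical stand-in; the real addressLine dictionary is much longer.
NOISE_WORDS = {"st", "ave", "rd"}


def _po_box_key(value: str) -> str:
    # Step 1: lowercase, keep letters and digits only.
    return re.sub(r"[^a-z0-9]", "", value.lower())


def _significant_words(value: str) -> set:
    # Steps 2-4: lowercase, keep letters only, split, drop noise words.
    words = re.sub(r"[^a-z\s]", " ", value.lower()).split()
    kept = [w for w in words if w not in NOISE_WORDS]
    # If every word is a noise word, keep the original words.
    return set(kept) if kept else set(words)


def address_lines_match(a: str, b: str, threshold: float = 0.6) -> bool:
    key_a, key_b = _po_box_key(a), _po_box_key(b)
    if key_a.startswith("pobox") and key_b.startswith("pobox"):
        return key_a == key_b
    words_a, words_b = _significant_words(a), _significant_words(b)
    if not words_a or not words_b:
        return False  # nothing left to compare (assumption)
    shared = len(words_a & words_b)  # N
    # Step 5: N / (M1 + M2 - N) >= 0.6
    return shared / (len(words_a) + len(words_b) - shared) >= threshold
```

Because digits are dropped outside the PO box branch, "45 Main St." and "45 Main Ave" both reduce to the single word main and match in this sketch.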
RangeNumericComparator Returns true if the difference in magnitude between two values falls within a specified range. The range can be expressed as an absolute value (for example, 5) or as a percentage (for example, 10%). For example, if the threshold is 5 and the two values are 12 and 16, the comparator returns true. If the threshold is 10% and the two values are 12 and 16, the comparator returns false. The percentage is applied to the smaller of the two values; in this case, the threshold is calculated as 12 * 0.1 = 1.2.
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare the difference between two numbers.
  • Typical use cases are for comparing product prices, age, height, and so on.
  • Supports non-Latin character sets.
  • Guidance regarding Match Token Class: RangeNumericMatchToken class. If a match token class is not defined, the RangeNumericMatchToken class is used by default.
  • Additional guidance: If the RangeNumericMatchToken class is chosen in association with this comparator class, then all applicable parameters defined for the comparator class will also be used for the RangeNumericMatchToken class.

See example below showing the proper structure for setting required parameters in this class.

{
  "mapping": [
    {
      "attribute": "configuration/entityTypes/Household/attributes/Address/attributes/Zip5",
      "parameters": [
        { "parameter": "threshold", "value": "2" }
      ],
      "class": "com.reltio.match.comparator.RangeNumericComparator"
    }
  ]
},
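
The threshold arithmetic described for RangeNumericComparator can be checked with a short sketch. The function name is hypothetical, and treating a difference exactly equal to the threshold as a match is an assumption.

```python
def within_range(a: float, b: float, threshold: str) -> bool:
    """Return True if |a - b| is within the threshold.

    threshold is either an absolute value such as "5" or a percentage
    such as "10%", applied to the smaller of the two values.
    """
    difference = abs(a - b)
    if threshold.endswith("%"):
        limit = min(abs(a), abs(b)) * float(threshold[:-1]) / 100
    else:
        limit = float(threshold)
    return difference <= limit
```

For the values 12 and 16, within_range(12, 16, "5") is true, and within_range(12, 16, "10%") is false because the difference of 4 exceeds 1.2.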
BasicTokenizedOrganizationNameComparator Compares two organization names. The comparison is done in two steps:
  1. The comparator normalizes each value. The value is split into separate words, and all garbage words (inc, corp, services, and so on) are removed from the word list.
  2. The comparator compares the word lists. If at least 60% of the words in the lists obtained in the first step are the same, the values are considered the same.
Note:
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of them.
  • Recommended for cases where you wish to compare company names and you believe there will be inconsistencies due to each having different garbage words.
  • Does not support non-Latin character sets.
  • Guidance regarding Match Token Class: BasicTokenizedOrganizationNameMatchToken class. If a match token class is not defined, the BasicTokenizedOrganizationNameMatchToken class is used by default.
Click Organization Names Garbage to see the list of organization name garbage words.
CustomComparator This comparator acts as a container for an unlimited number of groups of other comparators. Each group is configured with its own list of parameters, and the groups must be configured as a list of group elements. The following parameters are supported:
  • className - The name of the comparator class that is used to compare values. Default value: BasicStringComparator.
  • classParams - Parameters (if any) for the comparator class.
  • pattern - Regular expression pattern which is used to match and extract group specific values from the original attribute values. If this parameter is not specified, the original values are used as group specific values.
  • splitByWordsBoundaries - Enables preprocessing of a connected non-Latin character sequence (word) such that it is split into parts as defined by the script rules. For example, the word 'これをテストする' is split into four parts - 0 = "これ", 1 = "を", 2 = "テスト", and 3 = "する".
  • acceptMissingBoth - If both compared values are empty or missing they are considered as equal and the comparator comparison proceeds with the other parameter groups.
  • acceptMissingOneOf - If one of the values becomes empty after preprocessing, the result is considered a match.
    Note: If both values become empty, acceptMissingOneOf is set to true, and acceptMissingBoth is set to false, the result of the group is not a match.
  • transliterate - Enables the transliteration in the scope of the comparator only. The transliterated values are not available outside for other match rules. By default, this option is disabled.
  • transliterateCommand - Setting for transliterator that should be written according to rules in the com.ibm.icu.text library. If transliterate is enabled, transliterateCommand is "Any-Latin" by default.
  • TransliterateCleanser - Its results are visible across the match rules (with some restrictions). The option is intended to allow for better control over the comparison process.
  • wordReplacements - Sets a limited number of patterns to replace words in the values. The number of patterns cannot exceed 10. If more replacements are needed, use RegexpReplaceCleanser. For example: "wordReplacements": [ { "pattern": "\\d+", "replacement": "number" } ]
  • noiseDictionary - Name of a predefined noise words dictionary or URL of a customer file with dictionary words. Noise dictionary contains words that are excluded from the attribute values before they are compared by the comparator. Possible values are:
    • addressLine
    • organizationName
    • internationalOrganizationName
    • eqfOrganizationName
    • foodOrganizationName
The following options are also supported:
  • useNoiseIfEmpty - If enabled and the value contains noise words only, then no noise words are removed from the value. This indicates that tokens are generated for noise words as well. Default is true.
  • useStemmer - If enabled, the words are stemmed to their base form. Default is false.
  • useSoundex - If enabled, the words are replaced by their soundex codes. Default is false.
  • wordDelimiter - Delimiter which is used while concatenating the words into one value before passing the value to the provided comparator. Default is " " (white space).
  • sortWords - If enabled, words inside a group are sorted alphabetically before passing the value to the provided comparator. Otherwise, the original order of words is maintained.
The algorithm goes through the configured groups, and for each group it does the following:
  1. Splits the value into words according to the specified regexp pattern.
  2. Replaces words (if replacements are specified).
  3. Splits values by word boundaries (if enabled).
  4. Transliterates (if enabled and needed).
  5. Constructs a list of lowercase words.
  6. Stems words (if enabled).
  7. Removes noise words (if a noise dictionary is specified).
  8. Replaces words with Soundex codes (if enabled).
  9. Sorts words (if enabled).
  10. Joins the words into one value using the specified word delimiter and passes it to the provided comparator.

Total result is calculated as an expression: <part1_result> AND... AND <partN_result>.

DistinctWordsComparator Compares values by distinct words based on the parameters specified. The following parameters are supported:
  • pattern - Regular expression pattern which is used to match and extract distinct words from the value. Default value is \\w+
  • threshold - The minimum number of words to be evaluated to consider the compared values as matches. The threshold value can be an absolute value or a percentage of words. Default value is 50%.
  • thresholdChars - The minimum number of characters to be evaluated to consider the compared values as matches. The thresholdChars value can be an absolute value or a percentage of characters.
  • splitByWordsBoundaries - Enables preprocessing of a connected non-Latin character sequence (word) such that it is split into parts as defined by the script rules. For example, the word 'これをテストする' is split into four parts - 0 = "これ", 1 = "を", 2 = "テスト", and 3 = "する".
  • wordReplacements - Sets a limited number of patterns to replace words in the values. The number of patterns cannot exceed 10. If more replacements are needed, use RegexpReplaceCleanser. For example: "wordReplacements": [ { "pattern": "\\d+", "replacement": "number" } ]
  • substringComparison - The algorithm to consider words as similar if one of them is a substring of the other. The substringComparison has these suboptions:
    • enabled=true/false - Enables the substringComparison option. By default the option is disabled.
    • weightStrategy=fixed/proportional - Defines how to weigh the similarity. fixed - if a substring is detected, the fixed weight is assigned. proportional - if a substring is detected, the weight is the ratio of the shortest word length to the longest word length. This is the default value.
    • lengthThreshold - If any of the compared words is shorter than the threshold then the pair is discarded. The default integer value is 3.
    • fixedWeight - The value for the fixed weight, if weightStrategy=fixed. The default value is 0.6.
  • wordDistance - The algorithm to consider words as similar if their edit distance (based on DynamicDamerauLevenshteinDistance) is small. The wordDistance has these suboptions:
    • enabled=true/false - Enables the wordDistance option. By default this option is disabled.
    • weightStrategy=fixed/relevance - Defines how to weigh wordDistance. fixed - if the words are similar by the regular DynamicDamerauLevenshteinDistance then the weight of the pair is the configured fixed value. relevance - the weight of the pair of words is calculated by the edit distance (1-distance^2/maxlength^2). It is selected by default.
    • wordLengthThreshold - If weightStrategy is relevance then the pair having a shorter word is discarded. If weightStrategy is fixed then the configured fixed weights are applied for short words and long words. By default its value is 3.
    • smallWordWeight - the weight applied if any of the words in the pair is shorter than the wordLengthThreshold. By default, its value is 0.6.
    • largeWordWeight - The weight applied if both of the words in the pair have at least wordLengthThreshold characters.
  • transliterate - Enables the transliteration in the scope of the comparator only. The transliterated values are not available outside for other match rules. By default, this option is disabled.
  • transliterateCommand - Setting for transliterator that should be written according to rules in the com.ibm.icu.text library. If transliterate is enabled, transliterateCommand is "Any-Latin" by default.
  • TransliterateCleanser - Its results are visible across the match rules (with some restrictions though). The option is intended to allow for better control over the comparison process.
  • noiseDictionary - The name of a predefined noise words dictionary or the URL of a custom file with noise dictionary words. A noise dictionary contains words that are excluded from the attribute values before they are compared. Possible predefined values are:
    • addressLine
    • organizationName
    • internationalOrganizationName
    • eqfOrganizationName
    • foodOrganizationName
  • useNoiseIfEmpty - If enabled and the value contains noise words only, then no noise words are removed from the value. This indicates that tokens are generated for noise words as well. Default is true.
  • useStemmer - If enabled, the words are stemmed to their base form. Default is false.
  • useSoundex - If enabled, the words are replaced by their soundex codes. Default is false.
The algorithm is based on the following:
  1. Splits the value into distinct words according to the specified regexp pattern.
  2. Replaces words (if replacements are specified).
  3. Splits values by word boundaries (if enabled).
  4. Transliterates (if enabled and needed).
  5. Constructs a sorted set of lowercase words.
  6. Stems words (if enabled).
  7. Removes noise words (if a noise dictionary is specified).
  8. Replaces words with Soundex codes (if enabled).
  9. Compares the sets of words.

If the intersection is greater than the value of the threshold parameter, then the values are considered matches.

If comparison by words failed to match the values and the thresholdChars value is specified, then comparison is performed by character histograms. For both sets of words, the character histogram is calculated (characters are 0-9 and a-z). If the intersection is greater than the value of thresholdChars parameter, then the values are considered matches.

When the option thresholdChars is enabled, use ignoreInToken. If a match token class is not defined, DistinctWordsMatchToken is used automatically.
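
The pair weights described for the substringComparison and wordDistance suboptions of DistinctWordsComparator come down to two small formulas, sketched here. The function names are hypothetical, and returning None for a discarded or non-substring pair is an assumption made for illustration. How the weights are combined into the final threshold check is not covered by this sketch.

```python
from typing import Optional


def substring_weight(word_a: str, word_b: str,
                     strategy: str = "proportional",
                     fixed_weight: float = 0.6,
                     length_threshold: int = 3) -> Optional[float]:
    """Weight of a word pair when one word is a substring of the other."""
    shorter, longer = sorted((word_a, word_b), key=len)
    if len(shorter) < length_threshold:
        return None  # pair discarded: a word is shorter than lengthThreshold
    if shorter not in longer:
        return None  # no substring relationship
    if strategy == "fixed":
        return fixed_weight
    # proportional: ratio of the shortest word length to the longest
    return len(shorter) / len(longer)


def relevance_weight(distance: int, max_length: int) -> float:
    """wordDistance weight for the relevance strategy: 1 - d^2 / maxlength^2."""
    return 1 - distance ** 2 / max_length ** 2
```

For instance, the pair data/database gets a proportional weight of 4/8 = 0.5, and two four-letter words at edit distance 1 get a relevance weight of 1 - 1/16 = 0.9375.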

ExactMultiComparator The comparator accepts two lists of values from the two entities that are being compared. Each list consists of the values of the attributes set in the multi operand. The result is a match if both lists contain only one distinct value and that value is the same in both lists.

Example 1: Entity 1 has FirstName=John, AlternateName=John (list of values is [john, john]) and Entity 2 has FirstName=John, AlternateName=John (list of values is [john, john]). The set of values consists of John, and the result is a match.

Example 2: Entity 1 has FirstName=John, AlternateName=Unknown and Entity 2 has FirstName=John, AlternateName=<no value>. The unknown value in Entity 1 is marked as a null value in the match rule, and is filtered out. The set of values consists of John, and the result is a match.

Example 3: Entity 1 has FirstName=John, AlternateName=Jonny (list of values is [john, jonny]) and Entity 2 has FirstName=John, AlternateName=John (list of values is [john, john]). The set of values consists of John and Jonny, and the result is not a match.
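
The three ExactMultiComparator examples can be reproduced with a small sketch. The function name and the null_values argument are hypothetical; in the platform, null values are declared in the match rule itself. Lowercasing follows the lists shown in the examples.

```python
from typing import Iterable, Optional


def exact_multi_match(values_a: Iterable[Optional[str]],
                      values_b: Iterable[Optional[str]],
                      null_values: Iterable[str] = ()) -> bool:
    """Match if each list has exactly one distinct value and it is shared."""
    nulls = {n.lower() for n in null_values}

    def distinct(values: Iterable[Optional[str]]) -> set:
        # Drop missing values and values declared as null, then lowercase.
        return {v.lower() for v in values if v and v.lower() not in nulls}

    set_a, set_b = distinct(values_a), distinct(values_b)
    return len(set_a) == 1 and set_a == set_b
```

Example 2 corresponds to exact_multi_match(["John", "Unknown"], ["John", None], null_values=["Unknown"]), which is a match, while Example 3 is not a match because the first list contains two distinct values.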

CrossMultiComparator Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Can be used for the Fuzzy Comparison Operator.
  • Recommended for cases where you wish to compare two or more attributes which could be mixed while filling the values.
  • Guidance regarding Match Token Class: CrossMultiToken class. If a match token class is not defined, the CrossMultiToken class is used by default.
ProximateGeoComparator
  • Behaves the same for the Fuzzy operator as it does for any of the Exact operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame). If chosen for any of the Exact operators, the comparator’s logic is used for the Exact part of these.
  • Recommended for cases where you wish to define a maximum distance between two locations that is considered to be semantically the same as if the two locations shared the same location.
  • Typical use cases are for comparing the longitude and latitude of two objects where there might be some data quality problems regarding the longitude and latitude of the objects. Thus you might use this capability to declare that if the distance between the two objects is < 500 feet, the two locations are considered the same.
  • Guidance regarding Match Token Class: ProximateGeoToken class. If a match token class is not defined, the ProximateGeoToken class is used by default.
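
The distance check behind ProximateGeoComparator can be approximated with the haversine formula, as sketched below. The use of haversine, the Earth radius constant, and the function names are assumptions for illustration; the source does not state how the platform computes distance. The coordinates are taken from the ProximateGeoComparator example in this topic, which uses a distance_miles value of 0.2.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumption)


def distance_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_proximate(point_a, point_b, max_miles: float) -> bool:
    """point_a and point_b are (latitude, longitude) pairs in degrees."""
    return distance_miles(*point_a, *point_b) <= max_miles


# Latitude/longitude pairs from the example, converted from strings.
PROFILE_A = (float("59.939782"), float("30.314548"))
PROFILE_B = (float("59.938524"), float("30.315995"))
PROFILE_C = (float("59.936433"), float("30.317226"))
```

With a limit of 0.2 miles, is_proximate(PROFILE_A, PROFILE_B, 0.2) is true and is_proximate(PROFILE_A, PROFILE_C, 0.2) is false, which agrees with the results shown in the example.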
ExactOrNullCrossMultiComparator
  • Recommended for cases where you want to compare two or more attributes that could be mixed while filling the values and where one of the attributes can be null.
  • The match operand with the comparator does not produce the match tokens. As a result, the rule must have an additional operand that produces the match tokens and initiates match comparison.
JaroWinklerComparator

This comparator uses the Apache Commons Jaro Winkler algorithm in the matching rule configuration. The comparator also has an optional parameter, similarityThreshold, with the default value 0.9.

VolumeComparator The comparator extracts volumes from attribute values, converts them, and compares them. For example, if the attribute value is testname 1L, the comparator extracts 1L from it, converts it to milliliters, and compares the results. The available parameters are:
  • acceptMissingOneof - If set to true and one of the compared strings doesn't have a value before or after extraction, the strings are marked as a match. By default, this is set to false.
  • acceptMissingBoth - If set to true and both compared strings don't have a value before and after extraction, the strings are marked as a match. By default, this is set to false.

  • toleranceLevel - Defines the acceptable difference between the compared values. By default, this value is 0, but you can enter any value greater than 0.

    Note: The following volume units are supported:
    • ml - milliliter
    • l - liter (1000 ml)
    • g - gallon (3785.41 ml)
    • kg - kilogram (1 kg = 1 l)
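
The extraction and comparison described for VolumeComparator can be sketched roughly as follows. This is an illustrative Python approximation, not Reltio's implementation: the regular expression, the function names, and the reading of toleranceLevel as an absolute difference in milliliters are all assumptions.

```python
import re

# Conversion factors to milliliters, taken from the unit list above.
ML_PER_UNIT = {"ml": 1.0, "l": 1000.0, "g": 3785.41, "kg": 1000.0}

# Hypothetical extraction pattern: a number followed by a supported unit.
VOLUME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|kg|l|g)\b", re.IGNORECASE)


def extract_volume_ml(text):
    """Return the first volume found in text, in milliliters, or None."""
    if not text:
        return None
    match = VOLUME_RE.search(text)
    if match is None:
        return None
    amount, unit = match.groups()
    return float(amount) * ML_PER_UNIT[unit.lower()]


def volumes_match(s1, s2, tolerance_level=0.0,
                  accept_missing_one_of=False, accept_missing_both=False):
    """Compare the volumes extracted from two strings."""
    v1, v2 = extract_volume_ml(s1), extract_volume_ml(s2)
    if v1 is None and v2 is None:
        return accept_missing_both
    if v1 is None or v2 is None:
        return accept_missing_one_of
    # Assumption: toleranceLevel is an absolute difference in milliliters.
    return abs(v1 - v2) <= tolerance_level


print(volumes_match("testname 1L", "testname 1000 ml"))
```

Under these assumptions, testname 1L and testname 1000 ml both resolve to 1000 ml and are treated as a match.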

Comparator class examples

Example of ProximateGeoComparator

"comparatorClasses": {
  "mapping": [
    {
      "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
      "class": "com.reltio.match.comparator.AddressLineComparator"
    },
    {
      "attribute": "configuration/entityTypes/Location/attributes/LatLong",
      "parameters": [
        { "parameter": "distance_miles", "value": "0.2" }
      ],
      "class": "com.reltio.match.comparator.ProximateGeoComparator"
    }
  ]
},
"multi": [
  {
    "uri": "configuration/entityTypes/Location/attributes/LatLong",
    "attributes": [
      "configuration/entityTypes/Location/attributes/GeoLocation/attributes/Latitude",
      "configuration/entityTypes/Location/attributes/GeoLocation/attributes/Longitude"
    ]
  }
]

ProfileA Latitude, Longitude ["59.939782", "30.314548"]
ProfileB Latitude, Longitude ["59.938524", "30.315995"]
Result -> True, the distance between the two geo points is ~0.1 mile

ProfileA Latitude, Longitude ["59.939782", "30.314548"]
ProfileC Latitude, Longitude ["59.936433", "30.317226"]
Result -> False, the distance between the two geo points is ~0.25 mile

Example of CrossMultiComparator

"multi": [
  {
    "uri": "configuration/entityTypes/HCP/attributes/MultiGroup1",
    "attributes": [
      "configuration/entityTypes/HCP/attributes/FirstName",
      "configuration/entityTypes/HCP/attributes/LastName"
    ]
  }
],
"comparatorClasses": {
  "mapping": [
    {
      "attribute": "configuration/entityTypes/HCP/attributes/MultiGroup1",
      "class": "com.reltio.match.comparator.CrossMultiComparator"
    },
    {
      "attribute": "configuration/entityTypes/HCP/attributes/FirstName",
      "class": "com.reltio.match.comparator.BasicStringComparator"
    },
    {
      "attribute": "configuration/entityTypes/HCP/attributes/LastName",
      "class": "com.reltio.match.comparator.BasicStringComparator"
    }
  ]
}

ProfileA FirstName, LastName ["John", "Doe"]
ProfileB FirstName, LastName ["Doe", "John"]

Example of ExactOrNullCrossMultiComparator

{
  "rule": {
    "exact": [
      "configuration/entityTypes/HCP/attributes/MiddleName"
    ],
    "multi": [
      {
        "uri": "configuration/entityTypes/HCP/attributes/MultiGroup1",
        "attributes": [
          "configuration/entityTypes/HCP/attributes/FirstName",
          "configuration/entityTypes/HCP/attributes/LastName"
        ]
      }
    ],
    "comparatorClasses": {
      "mapping": [
        {
          "attribute": "configuration/entityTypes/HCP/attributes/MultiGroup1",
          "class": "com.reltio.match.comparator.ExactOrNullCrossMultiComparator"
        }
      ]
    }
  }
}

ProfileA FirstName, LastName, MiddleName ["John", null, "Bob"]
ProfileB FirstName, LastName, MiddleName ["Doe", "John", "Bob"]

Example of CustomComparator

{
  "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
  "parameters": [
    {
      "parameter": "groups",
      "values": [
        { "pattern": "[a-zA-Z]+" },
        { "pattern": "[\\d]+" }
      ]
    }
  ],
  "class": "com.reltio.match.comparator.CustomComparator"
}

ProfileA AddressLine1 ["110, Street Red Linden"]
ProfileB AddressLine1 ["Street Red Linden 110"]
Result -> True, after applying the patterns both strings are split into 2 groups: ["Street Red Linden", "110"]. These values are compared in pairs with BasicStringComparator.

ProfileA AddressLine1 ["110, Street Red Linden"]
ProfileC AddressLine1 ["Street Red Linden"]
Result -> False, after applying the patterns ProfileA is split into 2 groups: ["Street Red Linden", "110"], ProfileC into ["Street Red Linden", ""]

{
  "attribute": "configuration/entityTypes/Location/attributes/AddressLine1",
  "parameters": [
    {
      "parameter": "groups",
      "values": [
        {
          "pattern": "[a-zA-Z]+",
          "noiseDictionary": "addressLine",
          "className": "com.reltio.match.comparator.SoundexComparator",
          "useNoiseIfEmpty": "true"
        },
        { "pattern": "[\\d]+" }
      ]
    }
  ],
  "class": "com.reltio.match.comparator.CustomComparator"
}

ProfileA AddressLine1 ["24 Linden Drive"]
ProfileB AddressLine1 ["24 Lynden Beach Dr"]
Result -> True, after applying the patterns and removing noise words ProfileA is split into 2 groups: ["Linden", "24"], ProfileB into ["Lynden", "24"]. "Lynden" and "Linden" have the same Soundex code.

You must describe a comparison algorithm or the relevance score calculation algorithm.

Example of JaroWinklerComparator

{
  "attribute": "configuration/entityTypes/Person/attributes/FirstName",
  "class": "com.reltio.match.comparator.JaroWinklerComparator",
  "parameters": [
    { "parameter": "similarityThreshold", "value": 0.8 }
  ]
}

How to perform a relevance score calculation

The following table shows how to perform a relevance calculation. S1 and S2 represent two different values of an attribute that are being compared.
Table 2. Relevance Calculation for Different Comparators
Comparator Class | Description
AddressLineComparator

If address lines S1 and S2 start with a PO box number:

  • If the PO box values are equal, the relevance is 1.

  • If the PO box values are different, the relevance is 0.

If the address lines do not start with a PO box value, the relevance calculation includes the following steps:

  • Split S1 and S2 into words.

  • Remove all noise words. If a line consists of noise words only, this step is skipped.

  • The relevance is the ratio of the common word count to the count of all unique words in the compared lines:

N_common / (S1_wordsCount + S2_wordsCount - N_common)

AlwaysTrueComparator

Relevance is 1.

BasicStringComparator

If strings are equal then the relevance is 1. Otherwise the relevance is 0.

BasicTokenizedOrganizationNameComparator

The relevance calculation includes the following steps:

  1. Split the lines into words.
  2. Remove all noise words (from the organizationName noise dictionary). If a line consists of noise words only, this step is skipped.
  3. The relevance is the ratio of the common word count to the count of all unique words in the compared lines:

N_common / (S1_wordsCount + S2_wordsCount - N_common)
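
The steps above amount to a ratio of shared words to all unique words. The Python sketch below illustrates that calculation under a few assumptions that the text does not state: words are split on whitespace, compared in lowercase, and counted once per line. The function name and the sample noise word are hypothetical.

```python
def tokenized_relevance(s1, s2, noise_words=frozenset()):
    """Ratio of common words to all unique words in two lines."""

    def words(line):
        tokens = set(line.lower().split())
        cleaned = tokens - noise_words
        # If the line consists of noise words only, skip the removal step.
        return cleaned if cleaned else tokens

    w1, w2 = words(s1), words(s2)
    common = len(w1 & w2)
    unique = len(w1) + len(w2) - common
    return common / unique if unique else 0.0


# "inc" stands in for an entry in a noise dictionary.
print(tokenized_relevance("Acme Holdings Inc", "Acme Inc", {"inc"}))  # 0.5
```

Calling the function without noise words gives the plain common-to-unique word ratio.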

CrossMultiComparator

Relevance is calculated for each combination of attributes. The largest value is chosen as the result.

For each combination of attributes, the relevance is the sum of the relevances of the individual attributes, R_attribute, divided by the number of attributes, N_attributes:

(R_attribute1 + R_attribute2 + ... + R_attributeN) / N_attributes
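
As a rough illustration of this calculation, the Python sketch below tries every ordering of the second profile's values against the first and keeps the best average. Reading "each combination" as each ordering is an assumption, the function names are hypothetical, and the exact-match helper only stands in for a per-attribute comparator such as BasicStringComparator.

```python
from itertools import permutations


def exact(a, b):
    """Stand-in per-attribute comparator: 1 if equal, otherwise 0."""
    return 1.0 if a == b else 0.0


def cross_multi_relevance(values_a, values_b, compare=exact):
    """Best average per-attribute relevance over all orderings of values_b."""
    count = len(values_a)
    if count == 0:
        return 0.0
    best = 0.0
    for reordered in permutations(values_b):
        total = sum(compare(a, b) for a, b in zip(values_a, reordered))
        best = max(best, total / count)
    return best


# FirstName and LastName swapped between the two profiles.
print(cross_multi_relevance(["John", "Doe"], ["Doe", "John"]))  # 1.0
```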

DamerauLevenshteinDistance

This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently.

The relevance is calculated as 1 - distance / max(S1_length, S2_length).

The smaller the distance, the closer the relevance is to 1.
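
A minimal Python sketch of an edit-distance-based relevance follows. It uses the restricted Damerau-Levenshtein (optimal string alignment) distance, which also counts a swap of two adjacent characters as one operation; that is an assumption based on the comparator's name, because the description above lists only insert, delete, and replace. Returning 1 for two empty strings is also an assumption.

```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            # Swap of two adjacent characters.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def edit_relevance(s1, s2):
    """1 - distance / max(S1_length, S2_length)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1 - osa_distance(s1, s2) / longest


print(osa_distance("Linden", "Lynden"))  # 1
```

For Linden and Lynden the distance is 1 and the longer string has 6 characters, so the relevance is 1 - 1/6.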

DistinctWordsComparator

The algorithm is based on the following:

  1. Split the value into distinct words according to the specified regexp pattern.
  2. Construct a sorted set of lowercase words.
  3. Apply stemmer to the words (if enabled).
  4. Remove noise words if noise dictionary is specified.
  5. Replace words with soundex codes (if enabled).
  6. Compare sets of words.

The relevance is the ratio of the common word count N_common to the larger of the two word counts:

N_common / max(S1_wordsCount, S2_wordsCount)

Additional case: If the thresholdChars value is specified, the relevance is also calculated by character histograms. For both sets of words, a character histogram is calculated (characters are 0-9 and a-z).

In that case, the relevance by characters is the ratio of the common character count N_common to the larger of the two character counts:

N_common / max(S1_charsCount, S2_charsCount)

The final relevance is the larger of the relevance by words and the relevance by characters.
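
The word-based part of this calculation can be sketched in Python as shown below. The sketch covers only splitting, lowercasing, noise-word removal, and the final ratio; stemming, Soundex replacement, and the character-histogram case are left out. Treating the regexp pattern as a delimiter, and the function names, are assumptions.

```python
import re


def distinct_words(value, pattern=r"\s+", noise_words=frozenset()):
    """Split a value into a set of lowercase words, minus noise words."""
    words = {w.lower() for w in re.split(pattern, value) if w}
    return words - noise_words


def distinct_words_relevance(s1, s2, pattern=r"\s+", noise_words=frozenset()):
    """Common word count divided by the larger of the two word counts."""
    w1 = distinct_words(s1, pattern, noise_words)
    w2 = distinct_words(s2, pattern, noise_words)
    largest = max(len(w1), len(w2))
    if largest == 0:
        return 0.0
    return len(w1 & w2) / largest


print(distinct_words_relevance("Red Linden Street", "street linden"))
```

Here the two values share 2 words and the larger set has 3 words, so the relevance is 2/3.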

DoubleMetaphoneComparator

The comparator returns 1 if the two strings are phonetically equal based on the Double Metaphone algorithm, otherwise the relevance is 0.

The Double Metaphone algorithm is thought to be an improvement of the Metaphone algorithm.

DynamicDamerauLevenshteinDistance

This comparator counts the minimum number of single-character operations (insert, delete, replace) required to convert string S1 to S2 most efficiently.

The relevance is calculated as 1 - sqr(distance) / sqr(max(S1_length, S2_length)).

The smaller the distance, the closer the relevance is to 1.

ExactMultiComparator

Returns 1 if all compared attributes are equal, otherwise returns 0.

MetaphoneComparator

The comparator returns 1 if the two strings are phonetically equal based on the Metaphone algorithm, otherwise the relevance is 0.

OrganizationNamesComparator

The relevance calculation includes the following steps:

  1. Split the lines into words.
  2. The relevance is the ratio of the common word count to the count of all unique words in the compared lines:

commonWordsCount / (S1_wordsCount + S2_wordsCount - commonWordsCount)

PhoneNumberComparator

The comparator strips all non-numeric characters from both strings and expects at least 10 digits to remain in each. If either resulting string has fewer than 10 digits, the comparator returns 0. If both resulting strings have 10 or more digits and their right-most 10 digits are identical, the comparator returns 1.
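
The phone number logic can be approximated in Python as shown below. The function name is hypothetical, and returning 0 when the right-most 10 digits differ is an assumption that the description implies but does not state.

```python
import re


def phone_relevance(s1, s2):
    """Compare two phone numbers by their right-most 10 digits."""
    digits1 = re.sub(r"\D", "", s1)
    digits2 = re.sub(r"\D", "", s2)
    if len(digits1) < 10 or len(digits2) < 10:
        return 0
    return 1 if digits1[-10:] == digits2[-10:] else 0


print(phone_relevance("+1 (415) 555-0123", "415.555.0123"))  # 1
```

Because only the right-most 10 digits are compared, a leading country code does not affect the result in this sketch.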

ProximateGeoComparator

The relevance is 1 - (distance between the two compared points) / distance_threshold.

The closer the compared geo points are to each other, the closer the relevance is to 1.
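
To make the formula concrete, the Python sketch below computes the distance with the haversine formula and applies the ratio above, using the coordinates from the ProximateGeoComparator example and its 0.2 mile threshold. The choice of haversine, the Earth radius constant, the function names, and the clamping of the relevance to 0 beyond the threshold are assumptions; the actual distance calculation is not described here.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def geo_relevance(p1, p2, distance_threshold):
    """1 - distance / distance_threshold, clamped at 0 (assumption)."""
    distance = haversine_miles(*p1, *p2)
    return max(0.0, 1 - distance / distance_threshold)


profile_a = (59.939782, 30.314548)
profile_b = (59.938524, 30.315995)
profile_c = (59.936433, 30.317226)

print(round(haversine_miles(*profile_a, *profile_b), 2))
print(round(haversine_miles(*profile_a, *profile_c), 2))
print(geo_relevance(profile_a, profile_b, 0.2))
```

With these inputs, ProfileB falls inside the 0.2 mile threshold and ProfileC falls outside it, which is consistent with the True and False results in the example.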

RangeNumericComparator

Consider two attribute values (numbers), N1 and N2.

The relevance is 1 - |N1 - N2| / max(N1, N2).

For example, if the two values are 12 and 16, the comparator returns 1 - 4/16 = 0.75.

SoundexComparator

The comparator returns 1 if the two strings are phonetically equal based on the Soundex algorithm, otherwise returns 0.

StringCharactersComparator

This comparator strips the two strings of all non-alphabetic characters (for example, /, @, #, $) and returns 1 if the two resulting strings are identical. Both strings must be non-null and non-empty; otherwise, the comparator returns 0.

StringComparatorIgnoringNulls

The comparator returns 1 if the strings are identical and are not equal to null, otherwise it returns 0.
