Match Token Classes
This article describes each of the available match token classes in the Reltio Data Cloud.
Reltio Data Cloud uses match token classes to identify correct pairs of records for quick and efficient match rule comparison and adjudication. Each match token class generates a match token that helps to identify candidates to be matched.
This table describes each match token class and the potential number of tokens that may be generated.
Match token class | Description | Potential number of tokens to generate |
---|---|---|
ExactMatchToken
| This class generates a single match token that uses the lowercase version of the attribute value as the token. Only letters, digits, and spaces are kept; all other characters are removed. For example, if the attribute contains John, the token is john. If the attribute contains John B, the token is john b.
| 1 |
ExactNumberMatchToken
| This class generates a match token that represents the numeric characters from the string. For example, if the attribute value is ACL89786291D, the token is 89786291.
| 1 |
SoundexTextMatchToken
| This class generates a single match token that represents the phonetic representation of the value based on the Soundex algorithm.
| 1 |
DoubleMetaphoneMatchToken | This class is based on the Double Metaphone algorithm, which can generate two match tokens for a string: a primary code and a secondary code. The second token supports ambiguous cases such as character-ordering and double-character misspellings, for example John and Jhon.
| |
FuzzyTextMatchToken
| This class generates a token for each common spelling variant, up to a maximum of six tokens. For example, Michael can be misspelled as Michale, Michel, Micheal, and so on. In this case, one token is generated for each of Michale, Michel, and Micheal.
| 6 |
DictionaryStatsPhoneticFuzzyToken
| This class generates a metaphone token (only when the attribute is used with a fuzzy operand) and a token based on character frequency. It can catch misspellings and adjacent duplicated characters.
| 2 |
ComplexPhoneticNameToken
| The ComplexPhoneticNameToken class works in the same way as the DictionaryStatsPhoneticFuzzyToken class but selects only letters (the DictionaryStatsPhoneticFuzzyToken class selects letters and digits). The class is applicable to any values, not only names. Examples: Character ordering: john, jhon. Tokens: [JN, A+IAKK8-4fth-48], [JHN, A+IAKK8-4fth-48]. The two values share the statistical token. Double characters: john, johhn. Tokens: [JN, A+IAKK8-4fth-48], [JN, A+IAKK8-4fth-48]. The two values share both the statistical and the metaphone tokens.
| 2 |
AddressLineMatchToken
| This class generates a token for the AddressLine1 attribute as follows:
| 1 |
OrganizationNameMatchToken
| Generates tokens by using the following steps:
Note: If the attribute is used by the Fuzzy comparator operator, then this token class generates metaphone codes instead of word values. Example 1 - Anheuser-Busch InBev. No noise words: anheus-busch-inbev, inbev, anheus-busch, busch-inbev, anheus-inbev, anheus, busch. Example 2 - International Business Machines. All words are in the noise dictionary: busi-intern-machin, busi-intern, busi-machin, intern-machin. Example 3 - Reltio Connected Customer 360. Connected and Customer are in the noise dictionary (not explicitly, but after stemming), and 360 consists of digits: connect-custom-reltio, reltio. For International Business Machines, tokens are as mentioned in Example 2 above. Reltio recommends using this token class along with the
| N-number of words excluding noise |
RangeNumericMatchToken
| Generates appropriate tokens for a range of numeric values. If the attribute value is identified for an exact match, this class generates a token formatted to have four digits after the decimal point. If the attribute value is marked for a fuzzy match, this class generates a token based on the value of the threshold parameter. The value of this parameter can be a number (of type Int, Integer, Number, or Double) or a percentage, for example, 0.25 or 10%. This value indicates the maximum acceptable difference for the comparison to evaluate to True. Example 1 - The two values are 12.55 and 12.9. They should produce at least one common token. If the threshold is set to 5%, then the tokens are:
| 1 (in case of exact) or 2 (in case of fuzzy) |
BasicTokenizedOrganizationNameMatchToken | Same as the OrganizationNameMatchToken class, except that the BasicTokenizedOrganizationNameMatchToken class does not use a stemmer, has a small set of delimiters, and has a short noise dictionary. Reltio recommends using this token class along with the | |
CustomMatchToken
| This match token class is like a container for unlimited groups of other match token classes. Each group is configured using match token parameters that are also grouped. Each parameter group contains a list of parameters and these parameter groups must be configured as a list of group elements. The following parameters are supported:
Note: To update an already loaded dictionary, you must restart the API Server. Reltio therefore recommends using a different file name to ensure that the URL points to the updated file. The custom dictionary file is limited to 10M characters. The file size on disk correlates with this limit but is not strictly tied to it: spaces are trimmed and not counted toward the limit, and most characters take only one byte. For example, address-line-garbage.txt has 9,937 characters and 1,734 line breaks, and occupies 11,671 bytes on disk. If the dictionary exceeds the limit, the remaining portion is skipped during load.
The algorithm parses through all configured groups, and for each group it performs the following:
The final token list is constructed by adding generated tokens from all the groups. In other words, the final token is a join of single tokens from each of the groups separated by the colon (':') character. | |
DistinctWordsMatchToken | This match token class generates tokens by distinct words based on the parameters specified. The following parameters are supported:
Note: To update an already loaded dictionary, you must restart the API Server. Reltio therefore recommends using a different file name to ensure that the URL points to the updated file. The custom dictionary file size is limited to 20MB. If the size exceeds this limit, the remaining portion of the dictionary is skipped during load. The total number of tokens generated per value is limited to 1000. If the number of tokens for subsets of a given size is greater than the remaining space, those subsets are ignored. For example, if the threshold is 50% and there are 55 words in the value, then tokens are generated for the subsets of 54 and 55 words, and subsets of 53 or fewer words are ignored.
The algorithm does the following:
Generate tokens for the subsets of words with the size equal to threshold. If the threshold value is a percent value, additional tokens for subsets with size greater than threshold are generated. | |
VolumeMatchToken |
Enables you to extract volumes from attribute values and generate tokens based on these extracted values. For example, if the attribute name is You can specify the following optional parameters:
| 1 |
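
The ExactMatchToken and ExactNumberMatchToken descriptions in the table can be illustrated with a minimal Python sketch. This is not Reltio's implementation: the function names `exact_match_token` and `exact_number_match_token` are hypothetical, and the treatment of non-ASCII letters and digits is an assumption, since the table does not specify it.

```python
def exact_match_token(value: str) -> str:
    """Hypothetical sketch: lowercase the value and keep only
    letters, digits, and spaces, as the table describes."""
    lowered = value.lower()
    # Assumption: str.isalnum() decides what counts as a letter or digit.
    return "".join(ch for ch in lowered if ch.isalnum() or ch == " ")


def exact_number_match_token(value: str) -> str:
    """Hypothetical sketch: keep only the numeric characters of the value."""
    # Assumption: only ASCII digits 0-9 count as numeric characters.
    return "".join(ch for ch in value if ch in "0123456789")


if __name__ == "__main__":
    print(exact_match_token("John"))                 # john
    print(exact_match_token("John B"))               # john b
    print(exact_number_match_token("ACL89786291D"))  # 89786291
```

Both functions return exactly one token per value, which matches the "1" listed in the table's token-count column for these classes.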
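
The 1000-token limit described for DistinctWordsMatchToken can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not the product's algorithm: it assumes one token per subset of words, that subsets are processed from the largest size down to the threshold size, and that a percentage threshold is rounded up to a whole number of words. The function name `subset_sizes_within_budget` is hypothetical.

```python
from math import ceil, comb


def subset_sizes_within_budget(word_count: int,
                               threshold_fraction: float,
                               max_tokens: int = 1000) -> list[int]:
    """Return the subset sizes whose tokens fit within the token limit.

    Assumptions (not confirmed by the source): one token per subset,
    sizes are tried from largest to smallest, and the threshold size
    is rounded up.
    """
    min_size = ceil(word_count * threshold_fraction)
    remaining = max_tokens
    kept = []
    for size in range(word_count, min_size - 1, -1):
        needed = comb(word_count, size)  # number of subsets of this size
        if needed > remaining:
            continue  # these subsets do not fit, so they are ignored
        kept.append(size)
        remaining -= needed
    return kept


if __name__ == "__main__":
    # 55 words, threshold=50%: 1 subset of 55 words and 55 subsets of
    # 54 words fit; the 1485 subsets of 53 words exceed the remaining 944.
    print(subset_sizes_within_budget(55, 0.5))  # [55, 54]
```

Under these assumptions the calculation reproduces the documented example: with 55 words and a 50% threshold, only the subsets of 55 and 54 words produce tokens.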