Unify and manage your data

Match Token Generation

Learn about the generation of match tokens that identify potential profile matches within a tenant.

Much like you mapped a comparator class for each attribute being compared in a rule, you must also map a matchTokenClass to each attribute where tokens are generated. For a given profile, the matchTokenClass creates a set of tokens that individually and collectively represent the profile mathematically in various ways so that other profiles in your tenant with any of the same mathematical representations are easily identified as match candidates. You will see below that sometimes a token will be a simple string such as william taken directly from an attribute while other tokens will be encoded and look like YG-76 and represent a phonetic sound. It is the concatenation of tokens into token phrases that are actually used for identifying match candidates.

Now that we have a fundamental understanding of the comparison process, and of the token generation process that precedes it in the lifecycle, let's put it together and walk through the full sequence of events when a profile is loaded or edited in a tenant.

Sequence of Events when a Profile is Loaded or Updated

When a profile is loaded into (or updated in) the tenant, the full cleanse/match/merge life cycle is invoked and occurs in real time. Cleanse/match/merge is a common phrase throughout this documentation, and it's important to understand that the word cleanse in that phrase refers to general cleansing by the profile-level cleansers, such as email, phone, and address cleansing. For more information, see the topic Out-of-the-box Cleanse Functions. Additionally, the match framework offers its own localized cleanse (also known as transformation) capabilities. For more information, see the topic Match cleansers.

When a profile is updated, it is immediately cleansed outside of the match framework by the profile-level cleansers. After that, the match engine is triggered and performs the following steps in sequence on the profile:

  1. Gather all the match token classes across the rules for this entity type.
  2. Generate tokens and token phrases, use them to identify match candidates and form match pairs, and place the resulting match pairs into the match table.
  3. Execute the logic within the match rules, for the pairs in the match table.
  4. Evaluate the collective set of directives from the rules to determine a final action to perform (merge, queue for review, no action).
  5. Purge from the match table any pairs no longer relevant based on actions taken in the previous step.
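The steps above can be sketched end to end in a toy form. This is an illustrative Python sketch only; all function names, data shapes, and the trivial merge rule are invented for the example and do not reflect Reltio's actual engine or API.

```python
# Illustrative toy of the five lifecycle steps for one updated profile.

def token_phrase(profile, attrs):
    """Steps 1-2: apply the rule's token generators (here, a simple
    exact-value token per attribute) and join them into a token phrase."""
    return ":".join(profile[a].lower() for a in attrs)

def comparator(a, b, attrs):
    """Step 3 stand-in: an exact comparison across the rule's attributes."""
    return all(a[k].lower() == b[k].lower() for k in attrs)

def match_lifecycle(updated, tenant, match_table, attrs=("first", "last")):
    # Step 2: profiles sharing the token phrase become pairs in the match table.
    phrase = token_phrase(updated, attrs)
    pairs = [(updated, p) for p in tenant
             if p is not updated and token_phrase(p, attrs) == phrase]
    match_table.extend(pairs)
    # Steps 3-4: run the rule logic on each pair, then decide a final action.
    merged = [pair for pair in pairs if comparator(*pair, attrs)]
    action = "merge" if merged else "no action"
    # Step 5: purge pairs that the action has made irrelevant.
    for pair in merged:
        match_table.remove(pair)
    return action
```

In this toy, updating a profile that exactly duplicates another attracts the duplicate via the shared token phrase, the comparator confirms the pair, and the merged pair is then purged from the match table.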

Token Generation Examples

Token generation is best explained through tangible examples, provided below. Consider a tenant that holds consumer-style profiles, perhaps millions of them, with typical consumer attributes such as First Name, Last Name, Suffix, Address, City, State, and Zip. The entity type might be Individual, and you have crafted a set of match groups for that entity type.

Example 1 - Basic synonym theory and match token generation

Assume you have created a rule in which you have mapped one or more comparator classes and one or more match token classes, and you have decided to use the Reltio Name Dictionary for the First Name attribute. Assume the dictionary has a row that declares a canonical value of elizabeth with synonyms of liz, lizzie, and others.

Consider a profile where, as a result of some merging, the First Name attribute now contains several values, including Ella, Beth, and Liz. Now suppose that through survivorship rules, the OV in the profile turns out to be Liz. If in your match rule you set useOvOnly to true, the Name Dictionary cleanser (which operates before the tokenizer) will receive only Liz as the value for First Name; the Name Dictionary will recognize Liz as a synonym of Elizabeth and will send the canonical value elizabeth to the match token class you have mapped to First Name. Assume you mapped the ExactMatchToken class. It will generate the token <elizabeth> and add it internally (it is not visible) to Liz's record. (We'll use <> to indicate a token.)
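This cleanse-then-tokenize step can be sketched conceptually. The dictionary contents, function names, and token format below are simplified stand-ins for illustration, not Reltio's implementation.

```python
# Simplified stand-in for rows of the Reltio Name Dictionary: synonym -> canonical.
NAME_DICTIONARY = {"liz": "elizabeth", "lizzie": "elizabeth", "beth": "elizabeth"}

def name_dictionary_cleanse(value):
    """Toy cleanser: replace a recognized synonym with its canonical value."""
    v = value.lower()
    return NAME_DICTIONARY.get(v, v)

def exact_match_token(value):
    """Toy ExactMatchToken-style generator: the cleansed value is the token."""
    return name_dictionary_cleanse(value)

# With useOvOnly=true, only the OV "Liz" reaches the cleanser and tokenizer,
# so the token added internally to the record is <elizabeth>:
token = exact_match_token("Liz")
```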

Example 2 - Using token phrases to Identify match candidates

Suppose the record for Liz has Smith as her Last Name and you also chose the ExactMatchToken generator for the Last Name. It would generate the token <smith>.

You may have thought that the engine would attract all records that contain <elizabeth> as a token for the first name, and additionally all records that have the token <smith> for the last name. It doesn't do that, because it would make no sense to attract profiles for Liz Brown or Lizzie Goldberg, which have nothing to do with Smith, nor Debbie Smith, who has no association to Liz. Instead, the engine uses token phrases, which are combinations of tokens, to constrain the candidates it attracts to just the meaningful ones.

The resulting token phrase, <elizabeth:smith> is generated and linked internally (in the match table) to Liz’s profile. Thus any other profile in the tenant that has <elizabeth:smith> as a token phrase will be matched to Liz’s profile as a match candidate. For example, the profile of Lizzie Smith will also contain the token phrase <elizabeth:smith>. The result is that the pairing of Liz Smith and Lizzie Smith will be placed into the pool (aka Match Table) of match candidates for this match rule to evaluate using the comparator class chosen.
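A minimal sketch of how token phrases gate candidate selection follows. The synonym table, profile shapes, and phrase format are invented for this illustration, not Reltio's internals.

```python
# Toy synonym table standing in for the Name Dictionary.
SYNONYMS = {"liz": "elizabeth", "lizzie": "elizabeth"}

def phrase_for(profile):
    """Build the token phrase <first:last> from per-attribute tokens."""
    first = profile["first"].lower()
    first = SYNONYMS.get(first, first)
    return ":".join([first, profile["last"].lower()])

liz    = {"first": "Liz",    "last": "Smith"}
lizzie = {"first": "Lizzie", "last": "Smith"}
debbie = {"first": "Debbie", "last": "Smith"}
brown  = {"first": "Liz",    "last": "Brown"}

# Only a profile sharing the entire phrase becomes a match candidate:
candidates = [p for p in (lizzie, debbie, brown)
              if phrase_for(p) == phrase_for(liz)]
```

Lizzie Smith shares the phrase elizabeth:smith and is attracted; Debbie Smith and Liz Brown each differ in one token, so the whole phrase differs and they are excluded.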

Now, consider that Liz Smith is a pretty common name and in a tenant that contains millions of consumer profiles, there might be hundreds, perhaps thousands of profiles that are considered synonyms of Liz, even with the last name of Smith.

Example 3 - Constraining the Match Candidates

Our token strategy thus far might be casting too wide a net, identifying an abundance of match pairs that our chosen comparator strategy will ultimately reject. This is a poor use of resources and degrades the performance of the tenant. Let us add the Street Addr attribute to the rule and map the AddressLineMatchToken to it. Suppose our record for Liz has an address of 123 Acorn St. If we run a Rebuild Match Table job, the old token phrase will be purged from the match table and the new one will be <elizabeth:smith:acorn>. Notice how this strategy purposefully, and perhaps dramatically, constrained the pool of candidates by adding the street name into the token phrase. Thus it won't attract any Lizzie Smith records that have 11987 Oak Blvd for an address. But what if there is another record, semantically the same as Liz's, in which the first name is misspelled as Alisabeth?
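The constraining effect of the street token can be sketched as follows. The street-name extraction here is a toy stand-in; the real AddressLineMatchToken is far more sophisticated.

```python
def street_name_token(address):
    """Toy stand-in for AddressLineMatchToken: pick out the street name.
    (Illustration only; the real class handles far more address variation.)"""
    parts = [w.lower() for w in address.split()]
    if parts and parts[0].isdigit():             # drop a leading house number
        parts = parts[1:]
    if parts and parts[-1] in {"st", "ave", "blvd", "rd"}:  # drop street type
        parts = parts[:-1]
    return " ".join(parts)

def phrase(first, last, address):
    return ":".join([first.lower(), last.lower(), street_name_token(address)])

liz_phrase   = phrase("elizabeth", "smith", "123 Acorn St")
other_phrase = phrase("elizabeth", "smith", "11987 Oak Blvd")
# The extra street token keeps the Oak Blvd profile out of Liz's candidate pool.
```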

Example 4 - Attracting variant match candidates, intelligently

Continuing with our example, let's intentionally cast a slightly wider net, this time intelligently, by replacing the ExactMatchToken generator with the SoundexTextExactMatchToken class for the first name. The new token phrase might look like <XH87:smith:acorn>. We get all of the benefits of the previous examples, but now we will also attract profiles that have variant spellings of elizabeth that sound the same. So perhaps we will attract versions of elizabeth spelled alisabeth, lisabeth, and so on.
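A toy implementation of classic American Soundex shows how phonetically similar spellings collapse into one token. Note that classic Soundex preserves the first letter, so a variant like alisabeth would still receive a different code; Reltio's SoundexTextExactMatchToken may refine the algorithm, and this sketch is not its actual implementation.

```python
def soundex(word):
    """Classic American Soundex code: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":            # h and w do not break a run of one code
            continue
        code = codes.get(ch, "")
        if code and code != prev:  # skip vowels and collapse repeated codes
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

# Variant spellings that sound alike collapse to the same token:
soundex("elizabeth")   # -> "E421"
soundex("elisabeth")   # -> "E421"
```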

Tokenization: Precision vs. Flexibility

As more attributes are added to a rule and included in tokenization, the sheer number of match candidates identified by the engine will naturally decrease (this is a good thing) as a result of candidates having to meet ALL the criteria in the token phrase. However, when the additional attributes have a fuzzy token class mapped to them, a counteracting dynamic can occur, as we saw in Example 4 - Attracting variant match candidates, intelligently, where the number of candidates intentionally increases.

Token Class Consistency for Rule Cohesion

It’s usually a good principle to minimize the range of token classes used for the same attribute across different rules because each will generate its own style of tokens, different from those of the other token classes. For example, if you want to use a phonetic strategy in multiple rules and you have chosen the DoubleMetaphoneMatchToken class for one rule, it's best to stay consistent by using DoubleMetaphoneMatchToken class in the other rules, if possible.

Efficient Tokenization Strategies

Be mindful that the maximum number of profiles that can be associated with a single token phrase is 300. A tokenization strategy that casts too wide a net, or that operates on an attribute with low cardinality, will cause the engine to find hundreds or even thousands of candidate matches, and large numbers of match candidates can cause performance issues.

The table of comparator classes suggests a suitable token generator class for each comparator. It's important to follow that guidance and choose comparators and token generators that are aligned. Example 5 - Misalignment of comparator class and matchToken class, below, illustrates this.

Example 5 - Misalignment of comparator class and matchToken class

The SoundexTextExactMatchToken token generator used on the Last Name attribute might quickly identify Smith from profile A as a candidate match for Smithe from profile B. If this is acceptable to you based on your knowledge of the data set and the data profiling you have done, then great. But if you chose the BasicStringComparator as your comparator class for the attribute, it will return false, since the spellings of the two values are not identical. Had you chosen the SoundexComparator class, which is also based on a phonetic algorithm, the comparator would likely return true for smith and smithe.

The example above reinforces the importance of choosing a match token class for an attribute that aligns to the behavior of the chosen comparator class. In this case the choice of a phonetic token generator coupled with an exact comparator is not prudent since it's very predictable that many of the candidate matches will fail the comparison test.
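To make the misalignment concrete, here is an illustrative sketch with toy comparators and a minimal classic Soundex, not the actual Reltio classes.

```python
def basic_string_comparator(a, b):
    """Toy exact comparator: true only when the spellings are identical."""
    return a.lower() == b.lower()

def soundex4(word):
    """Minimal classic Soundex, enough for this demo."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

def soundex_comparator(a, b):
    """Toy phonetic comparator: true when the Soundex codes agree."""
    return soundex4(a) == soundex4(b)

# A phonetic tokenizer pairs smith with smithe, but an exact comparator then
# rejects the pair, wasting the work spent identifying the candidate:
basic_string_comparator("smith", "smithe")   # -> False
soundex_comparator("smith", "smithe")        # -> True
```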

Optimize Rule Development: Mapping Comparator and Token Classes

A comparator class must ALWAYS be specified for each attribute referenced in your match rule definition. If you feel tokens are important for this attribute in the rule, then map a match token class. If not, then map the ignoreInToken element. Generally, the methodology for developing a rule and selecting comparator classes and match token classes is:

  1. Define the rule

    Fundamentally the rule must include one or more Comparison Operators. At your discretion, it may include cleansers and Helper Operators.

  2. Select and map comparator classes

    For each Comparison Operator in the rule, review the attributes being operated upon and select and map suitable comparator classes for those attributes that will perform the specific type of evaluation you desire.

  3. Select and map Token Generators

    For each attribute you have referenced in the rule, decide if it makes sense to map a match token class to the attribute. If it does, then choose a token class that aligns well with the comparator class you mapped for the attribute. You will see that the table of comparator classes provides recommendations for match token classes. That said, it is often advantageous to suppress token generation when the tokens will not provide meaningful benefit; in that case, use the ignoreInToken element to suppress generation of tokens for the attribute. For more information, see ignoreInToken.