Match Token Generation
Learn about the generation of match tokens that identify potential profile matches within a tenant.
Much like you mapped a comparator class for each attribute being compared in a rule, you must also map a matchTokenClass to each attribute where tokens are generated. For a given profile, the matchTokenClass creates a set of tokens that individually and collectively represent the profile mathematically in various ways, so that other profiles in your tenant with any of the same mathematical representations are easily identified as match candidates. You will see below that sometimes a token is a simple string such as william taken directly from an attribute, while other tokens are encoded, look like YG-76, and represent a phonetic sound. It is the concatenation of tokens into token phrases that is actually used for identifying match candidates.
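To make the idea concrete, here is a minimal Python sketch of the two token styles and of a token phrase. The helper functions and the encoded value they produce are purely illustrative and are not how Reltio actually generates tokens.

```python
# Illustrative only: shows the *shape* of tokens and token phrases,
# not how Reltio generates them internally.

def exact_token(value: str) -> str:
    """A simple token taken directly from an attribute value."""
    return value.strip().lower()

def phonetic_token(value: str) -> str:
    """Stand-in for an encoded phonetic token (real encodings differ)."""
    # Hypothetical encoding; real phonetic classes produce codes like "YG-76".
    consonants = [c for c in value.upper() if c.isalpha() and c not in "AEIOU"]
    return "".join(consonants[:2]) + "-" + str(len(value))

first = exact_token("William")          # -> "william"
last = phonetic_token("Yaeger")         # -> "YG-6" (illustrative code only)

# A token phrase is a concatenation of the individual tokens.
token_phrase = ":".join([first, last])
print(f"<{token_phrase}>")              # -> <william:YG-6>
```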
Now that we have a fundamental understanding of the comparison process, and of the token generation process that precedes it in the lifecycle, let's put it all together and understand the full sequence of events when a profile is loaded or edited in a tenant.
Sequence of Events when a Profile is Loaded or Updated
When a profile is loaded (or updated) into the tenant, the full life cycle of cleanse/match/merge is invoked and occurs in real time. Cleanse/match/merge is a common phrase throughout this documentation, and it's important to understand that the word cleanse in that phrase refers to the general cleansing performed by the profile-level cleansers, such as email, phone, and address cleansing. For more information, see topic Out-of-the-box Cleanse Functions. Additionally, the match framework offers its own localized cleanse (also known as transformation) capabilities. For more information, see topic Match cleansers.
When a profile is updated, it is immediately cleansed outside of the match framework by the profile-level cleansers. After that, the match engine is triggered for the profile and performs the following steps in sequence:
- Gather all the match token classes across the rules for this entity type.
- Generate tokens and token phrases, use them to identify match candidates (forming match pairs), and place the resulting match pairs into the match table.
- Execute the logic within the match rules for the pairs in the match table.
- Evaluate the collective set of directives from the rules to determine a final action to perform (merge, queue for review, no action).
- Purge from the match table any pairs no longer relevant based on actions taken in the previous step.
Token Generation Examples
Token generation is best explained by way of the tangible examples provided below.
Consider a tenant that has consumer-style profiles in it, perhaps millions of them, with typical consumer attributes such as First Name, Last Name, Suffix, Address, City, State, and Zip. The entity type might be Individual, and you have crafted a set of match groups for that entity type.
Example 1 - Basic synonym theory and match token generation
Assume you have created a rule where you have mapped one or more comparator classes and one or more match token classes, and you have decided to use the Reltio Name Dictionary for the First Name attribute. Assume the dictionary has a row which declares a canonical value of elizabeth with synonyms of liz, lizzie, and others.
Consider a profile where, as a result of some merging, the First Name attribute now contains several values including Ella, Beth, and Liz. Now suppose that, through survivorship rules, the OV in the profile turns out to be Liz. Then, if in your match rule you set useOvOnly to true, the Name Dictionary cleanser (which operates before the tokenizer) will only receive Liz as the value for First Name; the Name Dictionary will recognize Liz as a synonym of Elizabeth and will send the canonical value elizabeth to the match token class you have mapped to First Name. Assume you mapped the ExactMatchToken class. It will generate and add the token <elizabeth> internally (it is not visible) to Liz's record. (We'll use <> to indicate a token.)
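Conceptually, the dictionary lookup and exact tokenization work something like the following minimal Python sketch. The dictionary contents and function names are illustrative stand-ins, not Reltio's implementation.

```python
# Illustrative sketch of Example 1: a name dictionary maps synonyms to a
# canonical value, and an exact-style token class tokenizes that value.

NAME_DICTIONARY = {
    # canonical value -> synonyms (a tiny, illustrative subset)
    "elizabeth": {"liz", "lizzie", "beth", "eliza"},
}

def canonicalize(first_name: str) -> str:
    """Return the canonical value if the name is a known synonym."""
    value = first_name.strip().lower()
    for canonical, synonyms in NAME_DICTIONARY.items():
        if value == canonical or value in synonyms:
            return canonical
    return value

def exact_match_token(value: str) -> str:
    """Exact-style token: the (cleansed) value itself."""
    return value

# With useOvOnly set to true, only the OV "Liz" reaches the cleanser and tokenizer.
ov_first_name = "Liz"
token = exact_match_token(canonicalize(ov_first_name))
print(f"<{token}>")   # -> <elizabeth>
```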
Example 2 - Using token phrases to identify match candidates
Suppose the record for Liz has Smith as her Last Name and you also chose the ExactMatchToken generator for the Last Name. It would generate the token <smith>.
You may have thought that the engine would attract all records that contain <elizabeth> as a token for the first name, and then additionally all records that have the token <smith> for the last name. It doesn't do that, because it would make no sense to attract profiles for Liz Brown or Lizzie Goldberg, which have nothing to do with Smith, nor for Debbie Smith, which has no association to Liz. So instead the engine uses token phrases, which are combinations of tokens, to constrain the candidates it attracts to just the meaningful ones.
The resulting token phrase, <elizabeth:smith>, is generated and linked internally (in the match table) to Liz's profile. Thus any other profile in the tenant that has <elizabeth:smith> as a token phrase will be matched to Liz's profile as a match candidate. For example, the profile of Lizzie Smith will also contain the token phrase <elizabeth:smith>. The result is that the pairing of Liz Smith and Lizzie Smith will be placed into the pool (aka the Match Table) of match candidates for this match rule to evaluate using the comparator class chosen.
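A small Python sketch of the idea, using hypothetical profiles whose first names have already been canonicalized by the dictionary. It is conceptual only, not the engine's actual data structures.

```python
# Illustrative sketch of Example 2: token phrases, not individual tokens,
# determine which profiles become match candidates for each other.

def token_phrase(first_name_token: str, last_name_token: str) -> str:
    """Concatenate the per-attribute tokens into a single token phrase."""
    return f"<{first_name_token}:{last_name_token}>"

# Assume the name dictionary has already canonicalized the first names.
profiles = {
    "Liz Smith":       token_phrase("elizabeth", "smith"),
    "Lizzie Smith":    token_phrase("elizabeth", "smith"),
    "Liz Brown":       token_phrase("elizabeth", "brown"),
    "Lizzie Goldberg": token_phrase("elizabeth", "goldberg"),
    "Debbie Smith":    token_phrase("deborah", "smith"),
}

# Only profiles sharing the full phrase <elizabeth:smith> become candidates,
# so Liz Brown, Lizzie Goldberg, and Debbie Smith are never attracted.
candidates = [name for name, phrase in profiles.items()
              if phrase == profiles["Liz Smith"]]
print(candidates)   # -> ['Liz Smith', 'Lizzie Smith']
```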
Now, consider that Liz Smith is a pretty common name; in a tenant that contains millions of consumer profiles, there might be hundreds, perhaps thousands, of profiles whose first name is considered a synonym of Liz, even among those with the last name of Smith.
Example 3 - Constraining the match candidates
Our token strategy thus far might be casting too wide a net, identifying an abundance of match pairs that our chosen comparator strategy will only reject. This is a poor use of resources and degrades the performance of the tenant.
Let us add the Street Addr attribute to the rule and map the AddressLineMatchToken class to it. Suppose our record for Liz has an address of 123 Acorn St. If we run a Rebuild Match Table job, the old token phrase will be purged from the match table and the new one will be <elizabeth:smith:acorn>. Notice how this strategy purposefully, and perhaps dramatically, constrains the pool of candidates by adding the street name into the token phrase. Thus it won't attract any Lizzie Smith records that have 11987 Oak Blvd for an address. But what if there is another record that is semantically the same as Liz's, except the first name is misspelled as Alisabeth?
Example 4 - Attracting variant match candidates, intelligently
Continuing with our example, let's intentionally cast a slightly wider net, this time intelligently, by replacing the ExactMatchToken generator with the SoundexTextExactMatchToken class for the first name. The new token phrases might look like <XH87:smith:acorn>. We get all of the benefits of the previous examples, but now we will also attract profiles that have variant spellings of elizabeth that sound the same. So perhaps we will attract versions of elizabeth spelled like alisabeth, lisabeth, and so on.
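A conceptual Python sketch of Examples 3 and 4 together. The phonetic encoder below is a crude stand-in for whatever encoding SoundexTextExactMatchToken actually uses, and the token values shown are illustrative only.

```python
# Illustrative sketch of Examples 3 and 4: the street token narrows the
# candidate pool, while a phonetic first-name token widens it to variant
# spellings. The encoder is NOT the algorithm used by SoundexTextExactMatchToken.

def phonetic_code(name: str) -> str:
    """Hypothetical phonetic encoding: normalize a few sounds, drop vowels."""
    s = name.lower().replace("ph", "f").replace("z", "s")
    return "".join(c for c in s if c not in "aeiou")

def token_phrase(first: str, last: str, street: str) -> str:
    return f"<{phonetic_code(first)}:{last.lower()}:{street.lower()}>"

liz     = token_phrase("Elizabeth", "Smith", "Acorn")   # canonicalized OV
variant = token_phrase("Alisabeth", "Smith", "Acorn")   # misspelled first name
other   = token_phrase("Elizabeth", "Smith", "Oak")     # different street

print(liz)              # -> <lsbth:smith:acorn>
print(liz == variant)   # True: the phonetic token attracts the variant spelling
print(liz == other)     # False: the street token keeps this record out
```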
Tokenization: Precision vs. Flexibility
As more attributes are added to a rule and included in tokenization, the sheer number of match candidates identified by the engine will naturally decrease (this is a good thing) as a result of candidates having to meet ALL the criteria in the token phrase. However, when the additional attributes have a fuzzy token class mapped to them, a counter-dynamic can occur, as we saw in Example 4 - Attracting variant match candidates, intelligently, where the number of candidates intentionally starts to increase.
Token Class Consistency for Rule Cohesion
It's usually a good principle to minimize the range of token classes used for the same attribute across different rules, because each will generate its own style of tokens, different from those of the other token classes. For example, if you want to use a phonetic strategy in multiple rules and you have chosen the DoubleMetaphoneMatchToken class for one rule, it's best to stay consistent by using the DoubleMetaphoneMatchToken class in the other rules, if possible.
Efficient Tokenization Strategies
Be mindful that the maximum number of profiles that can be associated with a single token phrase is 300. If you choose a tokenization strategy that casts too wide a net, or that operates on an attribute with low cardinality, it will cause the engine to find hundreds or even thousands of candidate matches, and large numbers of match candidates can cause performance issues.
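For a sense of scale, the number of candidate pairs that a single token phrase can produce grows roughly quadratically with the number of profiles sharing it. The following quick Python sketch uses the standard n-choose-2 upper bound; it is an illustration, not a documented Reltio metric.

```python
# Rough illustration of why wide-net token phrases are expensive: the number
# of candidate pairs grows quadratically with the profiles sharing a phrase.

def candidate_pairs(profiles_sharing_phrase: int) -> int:
    """Upper bound on pairwise comparisons for one token phrase."""
    n = profiles_sharing_phrase
    return n * (n - 1) // 2

for n in (10, 100, 300):   # 300 is the per-phrase profile cap noted above
    print(n, "profiles ->", candidate_pairs(n), "potential pairs")
# 10 profiles -> 45 potential pairs
# 100 profiles -> 4950 potential pairs
# 300 profiles -> 44850 potential pairs
```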
The table of comparator classes suggests a suitable token generator class for each comparator. It’s important you follow the guidance and choose comparators and token generators that are aligned. Example 5 - Misalignment of comparator class and matchToken class, below, illustrates this.
Example 5 - Misalignment of comparator class and matchToken class
The SoundexTextExactMatchToken token generator used on the Last Name attribute might quickly identify Smith from profile A as a candidate match for Smithe from profile B. If this is acceptable to you based on your knowledge of the data set and the data profiling you have done, then great. But if you chose the BasicStringComparator as your comparator class for the attribute, it will return false, since the spelling of the two values is not identical. Had you chosen the SoundexComparator class, which is also based on a phonetic algorithm, the comparator would likely return true for smith and smithe.
The example above reinforces the importance of choosing a match token class for an attribute that aligns with the behavior of the chosen comparator class. In this case, the choice of a phonetic token generator coupled with an exact comparator is not prudent, since it is very predictable that many of the candidate matches will fail the comparison test.
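A conceptual Python sketch of the misalignment. The encoder and comparator functions are crude stand-ins for the Reltio classes named above, used only to show why the pairing work is wasted.

```python
# Illustrative sketch of Example 5: a phonetic token generator pairs the two
# records, but an exact comparator then rejects the pair, wasting the work.
# The encoder and comparators below are stand-ins, not Reltio's classes.

def phonetic_code(value: str) -> str:
    """Hypothetical phonetic encoding: lowercase and drop vowels."""
    return "".join(c for c in value.lower() if c not in "aeiou")

def exact_comparator(a: str, b: str) -> bool:
    """Exact-style comparison: spellings must be identical."""
    return a.lower() == b.lower()

def phonetic_comparator(a: str, b: str) -> bool:
    """Phonetic-style comparison: encodings must be identical."""
    return phonetic_code(a) == phonetic_code(b)

a, b = "Smith", "Smithe"

# The phonetic token class puts the pair into the match table...
print(phonetic_code(a) == phonetic_code(b))   # True -> candidate pair created

# ...but an exact comparator rejects it, while a phonetic comparator accepts it.
print(exact_comparator(a, b))     # False
print(phonetic_comparator(a, b))  # True
```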
Optimize Rule Development: Mapping Comparator and Token Classes
A comparator class must ALWAYS be specified for each attribute referenced in your match rule definition. If you feel tokens are important for this attribute in the rule, then map a match token class. If not, then map the ignoreInToken element. Generally, the methodology for developing a rule and selecting comparator classes and match token classes is as follows (a schematic sketch appears after the list):
- Define the rule
Fundamentally the rule must include one or more Comparison Operators. At your discretion, it may include cleansers and Helper Operators.
- Select and map comparator classes
For each Comparison Operator in the rule, review the attributes being operated upon and select and map suitable comparator classes for those attributes that will perform the specific type of evaluation you desire.
- Select and map Token Generators
For each attribute you have referenced in the rule, decide if it makes sense to map a match token class to the attribute. If it does, then choose a token class that aligns well with the comparator class you mapped for the attribute. You will see that the table of comparator classes provides recommendations for match token classes. That said, it is often advantageous to suppress token generation if the tokens will not provide meaningful benefit; in that case, use the ignoreInToken element to suppress generation of tokens for the attribute. For more information, see ignoreInToken.
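To visualize the end result of those three steps, here is a schematic Python sketch of the mapping decisions for one hypothetical rule. The structure and field names are illustrative shorthand, not Reltio configuration syntax; only the class names come from this topic.

```python
# Schematic summary of the mapping decisions for one hypothetical rule.
# Illustrative shorthand only, not Reltio configuration syntax.

rule_mappings = {
    "rule": "NamePhoneticAndAddress",
    # Step 2: a comparator class for every attribute referenced in the rule.
    "comparatorClasses": {
        "FirstName":  "SoundexComparator",        # phonetic comparison
        "LastName":   "BasicStringComparator",    # exact comparison
        "StreetAddr": "BasicStringComparator",
        "Suffix":     "BasicStringComparator",
    },
    # Step 3: token classes aligned with the comparators chosen above.
    "matchTokenClasses": {
        "FirstName":  "SoundexTextExactMatchToken",  # aligned with SoundexComparator
        "LastName":   "ExactMatchToken",
        "StreetAddr": "AddressLineMatchToken",
    },
    # Attributes compared in the rule but deliberately excluded from
    # token generation (the ignoreInToken element).
    "ignoreInToken": ["Suffix"],
}

for attr, comparator in rule_mappings["comparatorClasses"].items():
    token_class = rule_mappings["matchTokenClasses"].get(attr, "none (ignoreInToken)")
    print(f"{attr}: comparator={comparator}, tokenizer={token_class}")
```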