
Design your Match Tokenization Scheme

Match tokens identify candidates for matching by the specified match rules.

Overview

You now have a set of rules, and the attributes within each rule are assigned to comparator classes of your choice. You have also developed a strategy for remediating data quality problems so that the comparison formulas can be as effective as possible at determining which records are a match and which are not.

For your rules to do anything, they need profiles to work with. That is what your tokenization scheme provides, and it should do so efficiently and effectively.

The Primary Tokenization Objective

Map only as many attributes as necessary within the rule to identify the correct number of match candidates, which will evaluate to true by the comparison formula.

While working to achieve the above, try to reuse token class mappings across your rules as much as possible, thus producing the fewest variations of token phrase definitions. Let’s break this down to understand the reasoning behind each piece of it:

Tokenize as few attributes as necessary within the rule: As we saw in the examples, when your rule’s tokenization scheme is applied to a specific profile, the result is a set of token phrases for the profile equal to the cross-product [values of token-enabled rule attribute(1)] × [values of token-enabled rule attribute(2)] × … × [values of token-enabled rule attribute(n)], where a token-enabled rule attribute is an attribute in the rule that has a token class mapped to it (including attributes that receive a default class because ignoreInToken was not applied to them).

Why is this important? In order to protect the performance of your tenant you want to generate as few token phrases as necessary for each profile. Here are some reminders to simplify your thinking process for this piece of the Primary Tokenization Objective.

An attribute that presents only a single value to the token generator from a profile doesn't increase the number of combinations, since its multiplier is 1. So it’s generally fine to include such attributes without much concern (examples: SSN, driver’s license, suffix, zip code, middle initial). Conversely, the cases you should be concerned about, because they introduce a multiplicative factor (illustrated in the sketch after this list), are:

  • An attribute that has multiple OV values (example: the address attribute with survivorship set to aggregation).
  • An attribute that uses a fuzzy comparator class (and thus a fuzzy token class) that produces a collection of values for each single value the profile has. For example, the FuzzyTextMatchToken generator produces a collection of values that represent common misspellings of names. Each of those values (for example: michael, micheal, mikael) is fed into the token generator, multiplying the number of token phrases by a factor of 3 in this example.
  • The use of the Name Dictionary Cleanser, which by design produces multiple values for a given name, each of which is fed into the token generator to produce tokens.
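To make the multiplication concrete, here is a hypothetical, highly simplified profile fragment (the attribute names and flattened value shapes are illustrative, not an exact representation of your data model). Its Address attribute carries two OV values because of aggregation survivorship, and its First Name is mapped to a fuzzy token class that expands Michael into the three spelling variants above.

  {
    "attributes": {
      "FirstName": [ { "value": "Michael", "ov": true } ],
      "Address": [
        { "value": "12 Main St, Springfield", "ov": true },
        { "value": "12 Main Street, Springfield", "ov": true }
      ]
    }
  }

The token generator combines the 3 name variants with the 2 address values, so this profile yields 3 × 2 = 6 token phrases instead of 1, and each additional multi-valued or fuzzy-tokenized attribute multiplies that count again.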

...To identify the correct number of match candidates

Identification of match candidates for the rule is the reason for tokenization. A perfect token scheme would find exactly the match candidates that will pass the comparison formula, but no more than that. This lofty goal is unrealistic, so in order not to leave any records out of the match table for this rule (put another way, to minimize the number of semantically identical profiles not found), it’s best to devise a scheme that finds the right records plus a small percentage more. Some people describe this as “a scheme that casts a slightly wider net than the records you need.”

...Which will evaluate to true by the comparison formula

It makes no sense to present profile pairs to the comparison formula that obviously won’t evaluate to true, because you’re consuming processing power without any accompanying value. The interesting dynamic here is that increasing the number of attributes mapped to token classes increases the criteria represented in the token phrase, which reduces the number of profiles identified and placed into the match table for the rule. So it is a bit of a balancing act: adding attributes to your tokenization scheme, especially those with a fuzzy match token class, increases the number of token phrases, yet generally reduces the number of match candidate pairs placed into the match table. Most agree that it is more important from a performance perspective to reduce the number of token phrases, even at the expense of increasing the number of match candidate pairs returned. The one exception to this dynamic is that an additional attribute that returns a single OV value and has an exact token class mapped to it will not increase the number of token phrases.

Don’t allow match token classes to be assigned by default

For any given attribute in a rule, if you don’t explicitly assign a token generator class and you don’t explicitly use ignoreInToken for the attribute, the match engine will assign a default token class anyway. Don’t allow that to happen: for each attribute in your rule, always either specify a token class or use ignoreInToken on it.

So, armed with the above knowledge, here is the general step-by-step methodology you should use when designing a token scheme for each rule.

Step 1: Order the Attributes in Terms of their Ability to Narrow Scope

Each attribute could be a contributor to the token phrase, but it doesn’t have to be. Remember, the purpose that drives your choice of attributes for tokenization is different from the purpose that drove your attribute choices for the comparison formulas. For example, Middle Initial might be a good attribute for comparison purposes, but it is generally poor for tokenization because it has only 26 unique values and won’t help much to narrow the scope of match candidates. So in this step, jot down the attributes in the rule, ordered from left to right with the best attribute for tokenization on the left and the worst on the right. Use the following guidelines:

Leverage traditional identifiers first

Generally, the attributes you should consider first for token generation are any traditional identifiers, because they contain highly unique information that will return just the records you want: for example, SSN, license numbers, professional IDs, and so on. Suppose you choose the SSN attribute in a rule for tokenization. It might be the only attribute in the rule that you need to map to a token class, and if so, use ignoreInToken for all the other attributes in the rule to suppress token generation on them.
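As a rough sketch of that pattern (the attribute URIs, group name, and class name are illustrative; confirm the exact token classes available in your tenant against the documentation), a rule that tokenizes only on SSN while still comparing names might look like this:

  {
    "uri": "configuration/entityTypes/Individual/matchGroups/BySSN",
    "label": "Match on SSN, compare names",
    "type": "suspect",
    "rule": {
      "and": {
        "exact": [ "configuration/entityTypes/Individual/attributes/SSN" ],
        "fuzzy": [
          "configuration/entityTypes/Individual/attributes/FirstName",
          "configuration/entityTypes/Individual/attributes/LastName"
        ]
      },
      "matchTokenClasses": {
        "mapping": [
          {
            "attribute": "configuration/entityTypes/Individual/attributes/SSN",
            "class": "com.reltio.match.token.ExactMatchToken"
          }
        ]
      },
      "ignoreInToken": [
        "configuration/entityTypes/Individual/attributes/FirstName",
        "configuration/entityTypes/Individual/attributes/LastName"
      ]
    }
  }

Here the SSN token alone selects the candidates; First Name and Last Name still participate in the comparison formula but, because they are listed in ignoreInToken, contribute nothing to the token phrases.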

Leverage pseudo-identifiers next

While email addresses and phone numbers are not traditional identifiers, they are the next best thing. In fact, for all intents and purposes an email address is a unique identifier for a person (though it isn't always well populated within the data, which often limits its usefulness).

Leverage normal attributes

If the rule doesn't contain an attribute equivalent to an identifier or pseudo-identifier, then look for the next most unique attribute, that is, the one that will produce the fewest, yet still relevant and suitable, records for comparison.

By way of example, here is a list of attributes in a typical rule, ordered from best to worst in terms of ability to narrow scope: SSN, Email address, Street address, Postal code, Last name, First name, State, and Middle initial.

Example thought process for choosing attributes for tokenization

Using the preceding example, SSN is clearly the best choice: it is the most unique and probably a reliable identifier in terms of being populated. Middle Initial is the poorest choice for tokenization, since it will only carve a million records into 26 unique groups (if the initials were equally distributed across the million records, the initial J would present roughly 38,000 records to the rules for comparison). State would be the next poorest choice, identifying only 50 unique groups, each with about 20,000 records. Conversely, there are about 40,000 postal codes in the US, so a token representing a postal code (again, if equally distributed) would present only about 25 records for comparison.

Step 2: Include Any Attributes Mapped to Fuzzy Comparator Classes

If you chose a fuzzy comparator class, you are saying you’re willing to tolerate a range of variation in the data for the attribute. To find the records that have those variations, you must map a corresponding fuzzy match token class. This is how you 'cast a wider net' to find the profiles that have data quality issues and thus have values that differ somewhat from the perfect value.
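For example, the fragment below (an illustrative sketch of the rule object; the Damerau-Levenshtein comparator shown is just one of several fuzzy string comparators and may not be the one you chose) pairs a fuzzy comparator on First Name with the FuzzyTextMatchToken generator mentioned earlier:

  "comparatorClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Individual/attributes/FirstName",
        "class": "com.reltio.match.comparator.DamerauLevenshteinDistance"
      }
    ]
  },
  "matchTokenClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Individual/attributes/FirstName",
        "class": "com.reltio.match.token.FuzzyTextMatchToken"
      }
    ]
  }

The token class generates the spelling variants that cast the wider net, while the comparator decides whether each candidate retrieved actually falls within the tolerance you intended.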

Step 3: Finalize the Attributes Within Each Rule That You Feel Are Correct for Tokenization

Now look across all your rules to see how much reuse there is of the same attributes earmarked for tokenization. Try to maximize reuse, because it will minimize the overall population of token phrases, and thus the number of profiles identified to be processed, across ALL your rules.
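For instance, if two rules both earmark Last Name and Postal code for tokenization, give them the identical token class mapping (a simplified fragment; the attribute URIs and classes are illustrative) so that the same token phrase definition serves both rules rather than each rule producing its own variation:

  "matchTokenClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Individual/attributes/LastName",
        "class": "com.reltio.match.token.FuzzyTextMatchToken"
      },
      {
        "attribute": "configuration/entityTypes/Individual/attributes/AddressZip5",
        "class": "com.reltio.match.token.ExactMatchToken"
      }
    ]
  }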

Step 4: Choose a Suitable Match Token Class for Each Attribute

Examine the comparator class mapped to each attribute from your previous steps. The documentation lists a recommended matchToken class for it; use that recommended token class, because it produces a distribution of tokens that will attract profiles aligned to the tolerance of the comparator. Add any additional parameters into the JSON.

Step 5: Create any Custom Comparators and Match Token Classes as Needed

If your design requires the ability to override standard parameters, use a stemmer, or perhaps leverage your noise-words dictionary, define one or more custom comparators and match token classes as needed and reference them in your match rules.
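The exact mechanism depends on the class being customized, so treat the fragment below purely as a shape sketch: the comparator class and the parameter names (useNoiseDictionary, noiseDictionary) are placeholders rather than guaranteed names, and you should consult the documentation of each comparator or token generator for the parameters it actually accepts.

  "comparatorClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Individual/attributes/AddressLine1",
        "class": "com.reltio.match.comparator.DistinctWordsComparator",
        "parameters": [
          { "parameter": "useNoiseDictionary", "value": "true" },
          { "parameter": "noiseDictionary", "value": "street-noise-words" }
        ]
      }
    ]
  }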

Step 6: Set ignoreInToken for all Other Attributes

Any attributes remaining in your match rules that you have elected not to include in tokenization must be referenced using ignoreInToken. Otherwise, the match engine will pick a match token class by default and tokenize them anyway.

After you have performed the six steps, you have on paper:
  • A set of match rules
  • Each rule has a comparison formula
  • Each rule has a strategy for remediating data quality problems
  • Each rule has a tokenization strategy