ignoreInToken
The ignoreInToken element prevents the generation of tokens for the attributes that are
specified within it.
Use ignoreInToken to suppress token generation for attributes whose tokens would do little
to help find match candidates and would reduce the performance of your rules, either
because of the number of tokens generated or because of the number of match candidates
returned. Technically it is optional, but it is used so often (and should be used often)
that it might as well be required.
If you do not map a token class for an attribute in a match group, the match engine maps
one for you by default. It does this because, without one, you would get no match
candidates based on that attribute. As a best practice, never allow the system to use the
default matchToken class. Instead, if you want tokens created for a specific attribute,
ALWAYS explicitly map a token class of your choice; if you do not want tokens created for
the attribute, use ignoreInToken to suppress token generation for that attribute.
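The best practice above can be checked mechanically. The following is a minimal sketch, not part of the product: the function name find_undecided_attributes and the FirstName attribute are hypothetical, and the sketch assumes a rule shaped like the JSON shown on this page, inspecting only the "exact" operator by default. It lists every compared attribute that has neither an explicit token class nor an ignoreInToken entry, that is, every attribute that would fall back to the default token class.

```python
from typing import Any, Dict, List, Sequence


def find_undecided_attributes(
    rule: Dict[str, Any], operators: Sequence[str] = ("exact",)
) -> List[str]:
    """List compared attributes with no explicit token class and no ignoreInToken entry."""
    # Collect attribute URIs from operators whose value is a plain list of URIs.
    # Only "exact" is inspected by default; this is an assumption of the sketch.
    compared = set()
    for operator in operators:
        compared.update(rule.get(operator, []))

    # Attributes that have an explicitly mapped token class.
    mapped = {
        entry["attribute"]
        for entry in rule.get("matchTokenClasses", {}).get("mapping", [])
    }

    # Attributes whose token generation is suppressed.
    ignored = set(rule.get("ignoreInToken", []))

    return sorted(compared - mapped - ignored)


# Hypothetical rule: LastName has an explicit token class, FirstName has no decision.
RULE = {
    "exact": [
        "configuration/entityTypes/Contact/attributes/LastName",
        "configuration/entityTypes/Contact/attributes/FirstName",
    ],
    "matchTokenClasses": {
        "mapping": [
            {
                "attribute": "configuration/entityTypes/Contact/attributes/LastName",
                "class": "com.reltio.match.token.ExactMatchToken",
            }
        ]
    },
    "ignoreInToken": [],
}

UNDECIDED = find_undecided_attributes(RULE)
print(UNDECIDED)
```

Any attribute reported by such a check needs a decision: either map a token class for it or add it to ignoreInToken.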
The use of ignoreInToken is strongly recommended in the important cases described below.
When using the notEquals operator
In short, if you only want to compare records that do not have a specific value, then you certainly don’t want to generate tokens whose objective is to find profiles that have that value.
When the cardinality resulting from a token class is too high
Let’s use a very simplified example to make the point. Consider a tenant with 10M consumer profiles whose attributes include Full Name, Phone, Addr, and SSN, aggregated from 6 sources. There might be 10,000 John Smiths in a population of that size. If your comparison strategy requires an Exact SSN or Exact Phone, both of which are highly unique across any population, then it is far more prudent to tokenize the SSN attribute, which might efficiently find six John Smith profiles to compare, than to tokenize the Full Name attribute, which might find 10,000 profiles to compare, only 6 of which will pass your comparison strategy.
This illustration highlights the fact that while the name is an important attribute for comparison purposes, it is a poor choice for efficiently and conservatively finding match candidates because it is not very unique in a population of that size. NOTE: Remember also that the tokenization engine allows a maximum of 300 profiles for a given token phrase (in this case the token phrase being <john smith>, depending on the token class chosen). A token scheme that returns very large candidate sets is therefore likely to produce a set in which some or many of the results are never processed by the comparators.
In this case the ignoreInToken section of the match rule could look like this:
"rule": {
  "ignoreInToken": [
    "configuration/entityTypes/HCP/attributes/FullName"
  ],
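The arithmetic behind the John Smith example can be made concrete with a small simulation. This is a simplified model, not the match engine: the helper names, the placeholder SSN strings, and the lowercase token phrase are all hypothetical, and the sketch assumes that profiles beyond the 300-profile limit are simply not compared. How the engine chooses which profiles fall inside the limit is not covered here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MAX_PROFILES_PER_TOKEN = 300  # limit stated in the text above


def build_token_index(
    profiles: Dict[int, Dict[str, str]], attribute: str
) -> Dict[str, List[int]]:
    """Group profile ids by a single exact token built from one attribute."""
    index: Dict[str, List[int]] = defaultdict(list)
    for profile_id, attributes in profiles.items():
        index[attributes[attribute].lower()].append(profile_id)
    return index


def candidates_for(index: Dict[str, List[int]], value: str) -> Tuple[int, int]:
    """Return (profiles compared, profiles left out) for one token phrase."""
    bucket = index.get(value.lower(), [])
    compared = min(len(bucket), MAX_PROFILES_PER_TOKEN)
    return compared, len(bucket) - compared


# 10,000 profiles named John Smith. The first six are the same person seen in
# six sources and share one placeholder SSN; every other profile has its own.
profiles = {}
for i in range(10_000):
    ssn = "SSN-0" if i < 6 else f"SSN-{i}"
    profiles[i] = {"FullName": "John Smith", "SSN": ssn}

name_result = candidates_for(build_token_index(profiles, "FullName"), "John Smith")
ssn_result = candidates_for(build_token_index(profiles, "SSN"), "SSN-0")

print(name_result)  # (300, 9700): 9,700 profiles are never compared
print(ssn_result)   # (6, 0): all six true candidates are compared
```

Under these assumptions, tokenizing on Full Name leaves most of the John Smith profiles outside the limit, while tokenizing on SSN returns exactly the six profiles worth comparing.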
When using the DistinctWordsComparator
Normally, use of the DistinctWordsComparator would imply also using the
DistinctWordsMatchToken class. As a general rule, however, use of that token class is not
advised. The DistinctWords concept has merit for comparison purposes, but not for
tokenization, where it tends to quickly create many more tokens that clutter the system
and degrade performance. So, best practice is to use ignoreInToken when using the
DistinctWordsComparator.
Configuration details for ignoreInToken
- Example JSON that includes ignoreInToken is as follows:
{
  "exact": [
    "configuration/entityTypes/Contact/attributes/LastName"
  ],
  "comparatorClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Contact/attributes/LastName",
        "class": "com.reltio.match.comparator.BasicStringComparator"
      }
    ]
  },
  "matchTokenClasses": {
    "mapping": [
      {
        "attribute": "configuration/entityTypes/Contact/attributes/LastName",
        "class": "com.reltio.match.token.ExactMatchToken"
      }
    ]
  },
  "ignoreInToken": [
    "configuration/entityTypes/Contact/attributes/LastName"
  ]
}