Design your Comparison Formulas
This first phase is exclusively about designing a set of match rules, strictly from the standpoint of the formulas they will use to compare the match candidates presented to them.
Each rule has one comparison formula, and the set of rules must work together in an efficient, non-overlapping manner. Every unnecessary rule or redundant tactic degrades the performance of the match engine and your tenant experience. Rule design begins with profiling your data.
Profile your Data
Profiling your data is the first critical step in the configuration process. Unless you intimately understand the complexion of your data set, you will not be able to formulate a strategy for remediating the duplication in your tenant, nor will you be able to envision the collection of match rules, cleansers, comparators, token generators, and so on needed to execute that strategy. And the best match strategy cannot make up for poor data quality in your source records, so profile your source data thoroughly. Understanding the data is a multi-faceted exercise that involves assessing completeness, accuracy, timeliness, relevance, value frequency, distribution, and so on. Some traditional things to look for are discussed below:
Identify attributes that can be used in a meaningful way for matching; discard the others
An attribute that is populated in only 30% of the profiles will not be useful for finding match candidates effectively. Likewise, a phone attribute that contains 999-999-9999 in a significant number of records is not helpful for comparison or tokenization purposes, since it does not narrow the scope of records presented to the comparators for evaluation. So, use this stage of the process to clearly understand and document the profile of each attribute you believe will be used in your match rules, and do so across all your sources, since the profile will likely differ from source to source.
Characterize the attributes in terms of usefulness for tokenization vs comparison
Remember that some attributes work well for comparing records but not so well for identifying match candidates through tokenization. For example, State might be a decent element in a match rule for your comparison tactic, but a poor attribute for tokenization because it has only around 50 unique values, which in a data set of a million records produces a very large number of records per state.
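To make the scale concrete: one million records spread across roughly 50 state values yields an average of 1,000,000 / 50 = 20,000 records per state token, and every record in a bucket must be evaluated against the others, which is far too coarse to narrow the candidate pool effectively.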
Look for noise words
Noise words like Mr., Mrs., Corp., Corporation, Inc., LLC, Street, St., Avenue, and so on are common to many entities and thus do not help differentiate one profile from another. They are referred to as noise or garbage words. Certain out-of-the-box comparator classes (see the list of comparator classes) provide the ability to remove these during the match process, or you can create a custom comparator to do this as well.
Look for non-standard abbreviations
For example, MEDCL might be found in some profiles and MEDICAL in others. Matching would be greatly enhanced if such terms were standardized to a single canonical term. You might standardize certain terms during the data load, either with Reltio’s RDM (Reference Data Management) product or with a common ETL tool. Alternatively, the Reltio Match engine provides the String Replacement Cleanser, which can perform this standardization as well.
Look for non-harmonized reference data
Imagine three different sources of data coming into your tenant using the values 01, F, and F_CD to represent the gender Female. Left as such, gender would be difficult to use as an effective criterion for a match rule. It will be important to standardize on a canonical value, say “Female”, in each of the records that contains some variant of it.
Look for incorrectly mapped attributes
For example, Address Line 1 and Address Line 2 are swapped, or First Name and Last Name are swapped. The cross-attribute matching feature may help resolve this within your data.
Create a Grid of Sources
Create a grid of sources (rows) and attributes (columns) so you can easily see the commonality of attributes offered across the sources. This will inform the design of your match rules. A sample grid is shown below.
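For illustration, such a grid might look like the following (the sources and attribute coverage shown here are hypothetical):

              First   Last   NPI   St Number   Zip   Phone
  Source 1      x       x                        x      x
  Source 2      x       x             x          x
  Source 3              x      x      x          x
  Source 4              x      x      x                 x
  Source 5      x       x      x                 x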
Develop a Matching Scheme
After studying the grid above, design a matching strategy that leverages common attributes to bridge across sources and create rich merged records. For example, NPI and Last Name can be used to bring Sources 4 and 5 together. The combined record can then match to Source 3 using Last Name, St Number, and NPI.
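Expressed in the lightweight rule notation introduced in the next step, that scheme might be sketched as follows (the attribute names are illustrative):

Exact(Last, NPI)             (bridges Sources 4 and 5)
Exact(Last, St Number, NPI)  (matches the combined record to Source 3)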
Develop your Match Rules
Begin framing, on paper or in a document, a set of match rules that supports the strategy you have in mind. The rules will contain attributes that represent the criteria needed for the profile comparisons. Design the comparison formula within each rule using Comparison Operators and Helper Operators. We recommend a lightweight, descriptive approach to help you frame your thoughts. Example notation:
Exact(First, Last, Address, Zip) + ExactOrNull(SSN)
Exact(Last, NPI) + Equals(Country=USA)
Important: Start with an assumption of error-free data. This helps you form the initial draft of your rules and their comparison formulas without the distraction of how to handle data quality issues. As such, make strong use of the Exact Comparison Operators (Exact, ExactOrNull, ExactOrAllNull, and notExactSame).
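For instance, an early draft under the error-free assumption might lean entirely on these operators (the attributes are illustrative):

Exact(First, Last, DOB) + ExactOrNull(SSN)
Exact(Last, Address, Zip) + ExactOrAllNull(MiddleName)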
Additional guidance for developing match rules
Typically the total number of rules should be less than 10; five to seven is a good target. If you have more than 10, it is likely that some of your rules are redundant in purpose or produce the same outcome. The analyzer tool, which you can use later, can help detect these scenarios.
Remember: So far the guidance has been to assume perfect data and therefore restrict yourself to the exact comparison operators. If nothing changed throughout the remainder of the steps, all of your rules could comfortably be set as ‘automatic’, meaning their outcomes would automatically merge the records that evaluate to true per your rules. But later, when we consider realistic data quality issues, you will likely modify your rules and/or add rules that contain fuzzy operators, and in those cases you will probably change those rules to type ‘suspect’, meaning the pairs that evaluate to true will be linked as potential matches to be reviewed by data stewards.
Compensate for Data Quality Problems
Now that we have a set of rules (still in descriptive form) that support the comparison goals identified so far, it is time to back off from the assumption of perfect data and start considering the data quality issues surfaced by your profiling efforts. The next step is to decide how you will treat each type of issue you identified. There are various options:
- Exclude the attribute from your strategy entirely because it won’t be useful for your comparison goals (for example, attributes that just aren’t populated well).
- Remediate the data quality issue using one or more tactics (discussed below).
- Convert the operator to a fuzzy comparison operator to widen the band of tolerance for variations in values.
The following section describes options for fixing data quality problems. As you make decisions on these options, make notes next to each of your rules explaining which data quality remediation strategy you will employ to keep the rule effective.
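For example, an annotated rule in your working document might read as follows (the attributes and the chosen remediation are purely illustrative):

Exact(First, Last, Address, Zip) + ExactOrNull(Phone)
Note: Phone contains 999-999-9999 in many records; plan to map it to null with the nullValues helper (described below).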
Options for Fixing Data Quality Issues
The following options are the most easily applied best practices for steering poor data back toward the perfect data your initial rules assume.
Fix the data within the source itself
For some types of problems, it makes sense from a business perspective to simply fix the data within the source before it is extracted and loaded into the hub. (This is usually the best option.)
Leverage an ETL tool
For some types of problems, it makes sense (again from a business perspective) to fix the data in transit to the hub, typically using an ETL tool. (This is usually the second-best option.)
Leverage Reltio’s entity-level cleansers
When an entity is initially POSTED into the hub or updated, the data in it can be cleansed in a variety of ways with Reltio’s entity-level cleansers. The classic cleansers available out-of-the-box are email, address, and phone cleansing. It is quite common for all three of these to be enabled and used for all your customer data within a tenant. And this should help quite a bit since email, address, and phone attributes are widely used in matching. (Note: Typically the cleansers add a cleansed version of the data to the entity, and you leverage the survivorship rules to enable the cleansed version to be the OV for the attribute, which is then consumed by the match engine.)
Note about the match-engine options that follow: they can make values like MEDCL SYS and MEDICAL SYSTEMS INC semantically the same during the match process, but an untrained data steward looking at the two records (which continue to contain the values you see here) might wonder why they matched. Just something to consider.
Use Reltio RDM (Reference Data Management) to harmonize reference data
With Reltio RDM, targeted source values are transcoded to a canonical value on the fly for all applications and systems that require it. The canonical value is then used by the match engine.
Use the match-engine’s string replacement cleanser
If it doesn’t make business sense to modify the actual data in the record, the match engine allows you to replace any whole attribute value with another whole attribute value (for example, replace MEDCL with MEDICAL). For more information, see String Replacement Cleanser.
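Illustrative replacement pairs (your actual list should come from your profiling results):

MEDCL → MEDICAL
SVCS → SERVICES
CTR → CENTER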
Use the match-engine’s “noise words removal” capability
If you are using Organization names or Addresses in your match rules and the data contains very common words (noise words, for example, Corp, Inc., LLC, Street, Avenue), two of the out-of-the-box comparator classes and match token classes will strip the data of these words during the match process. If needed, you can develop your own list of noise words and use it within a custom comparator and custom token class.
Use the match-engine’s in-built Name Dictionary
The Reltio Name Dictionary contains a list of North American first names that are considered synonyms of each other (for example, Robert, Bob, Bobby, and so on); the match engine treats them as if they were the same during tokenization and comparison. For more information, see Name Dictionary Cleanser.
Use the match-engine’s custom Dictionary capability
Create a custom dictionary to hold word collections whose members are semantically the same. This is essentially the same capability as the in-built Name Dictionary, but it provides the flexibility to develop and leverage your own list of synonym names, or even to apply this capability to attributes other than First Name (for example, US, USA, United States, Estados Unidos, Etats Unis, アメリカ).
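For example, a custom dictionary might group synonyms conceptually like this (shown as synonym sets, not in the product’s actual file format):

US | USA | United States | Estados Unidos | Etats Unis | アメリカ
Bill | Will | William | Billy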
Use the match-engine’s support for a custom comparator
In addition to the out-of-the-box comparator classes offered, you can design a custom comparator class that leverages regex instructions to perform string cleanup and manipulation. The custom comparator class uses an out-of-the-box class of your choosing but allows you to override its settings, which includes providing a regex expression as a parameter. For more information, see Comparator Classes.
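For example, supplying a pattern such as [^A-Za-z0-9]+ with an empty replacement string could strip punctuation and spacing before comparison; the pattern and the way it is parameterized here are illustrative, so consult the Comparator Classes documentation for the exact settings.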
Use the nullValues Helper Operator
The nullValues helper operator can be leveraged in your match rule to replace values like 999-999-9999 with <null>, which then facilitates the use of the ExactOrNull Comparison Operator.
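Sketched in the descriptive notation used earlier (the attribute and the placeholder values are illustrative):

ExactOrNull(Phone), with nullValues(Phone: 999-999-9999, 000-000-0000)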
Convert to Fuzzy Versions to Compensate for Lingering Data Quality Issues
It is likely that you can’t fix all issues even after applying the techniques listed above, which means your data is not in the perfect state of quality assumed when your rules were first developed. So now review the lingering data quality issues and do two things:
1. Review each rule to see if you need to change any of the Exact operators to Fuzzy, so that they tolerate the lingering data quality issues yet still consider two profiles to be semantically the same.
2. Decide if there are cases where you want to keep the Exact version of the rule and let it produce a directive of automatic merge, while creating an additional, fuzzier version of the rule that produces a directive of ‘queue for review’ (aka “potential match”).
As an example of #1, suppose the Last Name attribute contains a lot of misspellings (which are typically not easy to remediate using data transformation techniques). The simple solution is to use the Fuzzy comparison operator and assign a fuzzy comparator class such as the DynamicDamerauLevenshteinDistance class, or a phonetic class.
As an example of #2, suppose you have a rule like:
automatic : Exact(First, Last, Suffix, Address, Zip)
It might make sense to develop a fuzzy version of it that allows data stewards to review close matches. For example:
suspect : Fuzzy(First, Last, Zip) + ExactOrNull(Suffix) + Fuzzy(Address)
Sort the Rules in Order of Exactness to Fuzziness
It is usually good housekeeping to order your rules so that the overall matchGroup starts with rules based on exact matches, moves on to a blend of exact and fuzzy, and ends with heavily fuzzy rules. After you have finished testing and tuning your rules, the rules at the top of the list (the strict ones) might be good candidates to change to type ‘automatic’; some of the next ones that involve a bit of fuzziness might be as well. Those at the bottom that use only fuzzy comparison operators will likely remain of type ‘suspect’.
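For example, an ordered matchGroup might read as follows (the rules are illustrative):

1. automatic : Exact(First, Last, Suffix, Address, Zip)
2. automatic : Exact(Last, NPI) + ExactOrNull(Suffix)
3. suspect : Fuzzy(First, Last, Zip) + ExactOrNull(Suffix)
4. suspect : Fuzzy(Name) + Fuzzy(Address)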
Consider Using Relevance-Based Matching
In short, Relevance-Based Matching uses an arithmetic scoring approach to rule design (rather than a boolean evaluation) and can often reduce your number of rules. Now that you have a clear strategy for your rules, consider converting them to a relevance-based scheme. For more information, see Relevance-Based Matching - Detailed Explanation.
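As a hypothetical sketch of the idea: rather than maintaining several boolean rules, you might assign weights to attributes (say First = 0.2, Last = 0.3, NPI = 0.5), sum the weights of the attributes that match for a given pair, and map score ranges to actions, for example treating a total of 0.9 or above as an automatic merge and 0.6 to 0.9 as a potential match for review. The weights and thresholds here are illustrative only.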
Consider Using a Custom Rule Type
If your data profiling exercise reveals entities that should be linked to each other with a relationship rather than merged as semantically identical entities (for example, father/son scenarios, spousal scenarios, or even product scenarios like chocolate bar 8-oz and chocolate bar 16-oz), then a custom rule type can enable the creation of a relationship, either automatically or via review by a data steward.
Choose Comparator Classes for your Rule Attributes
For each attribute referenced in a rule, you must map an actual comparator class that executes the precise type of comparison you wish. Suppose you have chosen the ExactOrNull comparison operator for the phone field: the Exact part of that operator will be performed by the comparator class you choose. Similarly, suppose you chose the Fuzzy comparison operator for the Last Name field: again, the type of fuzzy comparison that occurs depends on the comparator class you choose for Last Name in that rule.
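For example, the comparator assignments for a rule might be noted as follows (the class choices are illustrative, not recommendations):

Fuzzy(Last) → DynamicDamerauLevenshteinDistance class
Fuzzy(First) → a phonetic comparator class
ExactOrNull(Phone) → a basic exact-string comparator class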