Relevance-Based Matching - Detailed Explanation
Learn about configuring relevance-based matching rules for optimal data matching and merging.
Relevance-based matching is designed primarily as a replacement for the strategy that uses the automatic and suspect rule types. With relevance-based matching, you use a scoring algorithm of your own design. The advantage is that, in most cases, a strategy built on relevance-based matching reduces the complexity and overall number of rules, because the two directives of merge and queue for review, which normally require separate rules (automatic and suspect, respectively), can often be represented by a single relevance-based rule.
When you configure relevance-based match rules, two effective comparators are the Jaro and Levenshtein distances. The Jaro distance measures the similarity between two sequences, useful for short texts. The Levenshtein distance, or edit distance, quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. Both of these comparators return scores from 0 to 1 and are therefore useful for relevance matching. However, if you use a comparator that does not return a score from 0 to 1, relevance matching may not work as desired.
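To illustrate how an edit distance can be turned into a score between 0 and 1, here is a minimal Python sketch. The normalization used (1 minus the distance divided by the length of the longer string) is an assumption chosen for illustration, and the function names are hypothetical; the platform's own Levenshtein comparator may normalize differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this assumed normalization, identical strings score 1.0 and strings with nothing in common score 0.0, which is the range relevance matching expects.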
Declaring Action Thresholds
When declaring the ranges for queue_for_review and auto_merge, make sure that together they span the entire available range of 0.0 to 1.0 with no gap and no overlap, except that the upper endpoint of queue_for_review should equal the lower endpoint of auto_merge so that the two ranges share a common touchpoint (for example, 0.0 to 0.6 for queue_for_review and 0.6 to 1.0 for auto_merge). If the actionThresholds leave a gap, any score falling within the gap produces no action. Conversely, if the actionThresholds overlap (for example, 0.4 to 0.6 for queue_for_review and 0.5 to 0.7 for auto_merge) and a score lands within the intersection (0.55 in this example) or on the touchpoint, the queue_for_review directive takes precedence.
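The gap, overlap, and touchpoint behavior described above can be sketched in Python as follows. This is an illustrative model only: the tuple representation of the thresholds and the resolve_action function are hypothetical, and the inclusive bounds are an assumption based on the touchpoint rule.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of actionThresholds: (action, lower, upper),
# with both bounds treated as inclusive.
Threshold = Tuple[str, float, float]


def resolve_action(score: float, thresholds: List[Threshold]) -> Optional[str]:
    """Return the directive for a score, or None if it falls in a gap."""
    matching = [action for action, low, high in thresholds if low <= score <= high]
    if not matching:
        return None  # score fell into a gap: no action is produced
    if "queue_for_review" in matching:
        return "queue_for_review"  # wins on overlap and on the touchpoint
    return matching[0]


contiguous = [("queue_for_review", 0.0, 0.6), ("auto_merge", 0.6, 1.0)]
overlapping = [("queue_for_review", 0.4, 0.6), ("auto_merge", 0.5, 0.7)]

print(resolve_action(0.6, contiguous))    # touchpoint
print(resolve_action(0.55, overlapping))  # inside the intersection
print(resolve_action(0.3, overlapping))   # below both ranges: a gap
```

With the overlapping ranges, scores below 0.4 or above 0.7 produce no action at all, which is why contiguous ranges covering 0.0 to 1.0 are recommended.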
Primary Use of Relevance-Based Matching
With relevance-based matching, you can create a match rule that contains a collection of attributes you wish to test as a group, and assign weights to those attributes to govern their relative importance in the rule. Start with all weights equal to 1.0 and, for example, actionThresholds of 0.0-0.5 for queue_for_review and 0.5-1.0 for auto_merge. Do some trial runs and examine the results. If you find that too many obvious matches are being set to queue_for_review, you can adjust the weights and even modify the actionThresholds, perhaps to 0.0-0.7 and 0.7-1.0. Iterate and experiment until you get optimized results with your data set. Always remember that data quality is a key component of effective matching, so cleansing the data before it is posted and making proper use of the out-of-the-box match cleansers are both important.
Support for customer-defined match actions in a relevance-based rule
In the image below, notice that on the left the admin has defined two actionThresholds that use the system values auto_merge and potential_match. However, if the admin had defined matchActions as shown on the right, she could replace the system values with her own matchActions. This lets her create more advanced outcomes when a threshold is reached. For more details on matchActions, see Understanding Custom Rules.
Calculation of Score
Relevance-based match rules are configured much like suspect and automatic match rules. However, the comparison of the two entities works differently. Every comparator class provides a relevance value when comparing values. The relevance is in the range of 0 to 1. For example, BasicStringComparator returns 0 if two values are different and 1 if the two values are identical. Fractional values can result from DistinctWordsComparator or other comparators. Every attribute is assigned a weight according to its importance. If the weight is not assigned explicitly, it is equal to:
- 1 for simple attributes
- The maximum of the weights of the sub-nested attributes for nested or reference attributes
If an attribute has multiple values, the maximum relevance value is selected.
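Based on the term definitions below, the relevance score of the AND operand is the weighted average of the relevances of its direct operands, which is consistent with the relevance of 0.909 in the verifyMatches sample response later in this section:

$$
\mathrm{RelevanceScore}_{AND} = \frac{\sum_{i=1}^{N_{simple}} weight_i \cdot relevance_i + \sum_{j=1}^{N_{nest}} weight_j \cdot relevance_j + \sum_{k=1}^{N_{logical}} relevance_k}{\sum_{i=1}^{N_{simple}} weight_i + \sum_{j=1}^{N_{nest}} weight_j + N_{logical}}
$$

- $\mathrm{RelevanceScore}_{AND}$: the relevance score of the AND operand, or the relevance score of the match rule
- $N_{simple}$: number of simple attributes (for example, FirstName, LastName) participating in the AND operator directly
- $weight_i$: configured weight of the i-th simple attribute
- $relevance_i$: calculated relevance of the i-th simple attribute
- $N_{nest}$: number of nested and reference attributes (for example, Phone-no, Email-ID, Address) participating in the AND operator directly
- $weight_j$: configured weight of the j-th nested or reference attribute
- $relevance_j$: calculated relevance of the j-th nested or reference attribute
- $N_{logical}$: number of logical operands (for example, AND or OR) participating in the AND operator directly
- $relevance_k$: calculated relevance of the k-th logical operand. Note that the weight of a logical operand is fixed to 1.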
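For the OR operand, the relevance score is the maximum relevance among its direct operands:

$$
\mathrm{RelevanceScore}_{OR} = \max(relevance_1, \ldots, relevance_i, \ldots, relevance_N)
$$

- $relevance_i$: relevance of the i-th simple attribute, nested attribute, or logical operand participating in the OR operand directly.

For the NOT operand:

$$
\mathrm{RelevanceScore}_{NOT} = 1 - \mathrm{RelevanceScore}_{AND,\,OR,\,exact,\,\ldots}
$$

The relevance score of the NOT operand is equal to 1 minus the relevance score of the operand being negated.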
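The scoring rules for AND, OR, and NOT can be sketched as a small recursive evaluator in Python. This is a sketch under stated assumptions, not the product's implementation: the class names and the score function are hypothetical, and the AND score is assumed to be the weighted average of its direct operands (attributes use their configured weight, logical operands use a fixed weight of 1), an interpretation of the definitions above that reproduces the 0.909 relevance in the verifyMatches sample.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Attribute:
    """A compared attribute with its comparator relevance and configured weight."""
    name: str
    relevance: float
    weight: float = 1.0  # default weight when none is configured


@dataclass
class And:
    operands: List["Node"]


@dataclass
class Or:
    operands: List["Node"]


@dataclass
class Not:
    operand: "Node"


Node = Union[Attribute, And, Or, Not]


def score(node: Node) -> float:
    if isinstance(node, Attribute):
        return node.relevance
    if isinstance(node, Or):
        return max(score(op) for op in node.operands)
    if isinstance(node, Not):
        return 1.0 - score(node.operand)
    # And: weighted average; logical operands carry a fixed weight of 1.
    weighted_sum = 0.0
    total_weight = 0.0
    for op in node.operands:
        weight = op.weight if isinstance(op, Attribute) else 1.0
        weighted_sum += weight * score(op)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


rule = And([
    Attribute("FirstName", relevance=1.0),
    Attribute("LastName", relevance=1.0),
    Attribute("Suffix", relevance=0.0, weight=0.2),
])
print(round(score(rule), 4))
```

Lowering the weight of an attribute reduces the penalty when that attribute fails to match, which is how weights govern relative importance in the rule.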
Sample configuration using relevance-based matching for auto_merge and queue_for_review
{
  "uri": "configuration/entityTypes/HCP/matchGroups/ByNames",
  "label": "By Names",
  "type": "relevance_based",
  "rule": {
    "and": {
      "exact": [
        "configuration/entityTypes/HCP/attributes/FirstName",
        "configuration/entityTypes/HCP/attributes/LastName"
      ],
      "exactOrAllNull": [
        "configuration/entityTypes/HCP/attributes/Suffix"
      ]
    },
    "weights": [
      {
        "attribute": "configuration/entityTypes/HCP/attributes/Suffix",
        "weight": 0.2
      }
    ],
    "actionThresholds": [
      {
        "type": "auto_merge",
        "threshold": "0.8-1.0"
      },
      {
        "type": "potential_match",
        "threshold": "0.4-0.8"
      }
    ],
    "matchTokenClass": "com.reltio.match.token.ExactMatchToken"
  },
  "matchServiceClass": "com.reltio.businesslogic.match.providers.internal.InternalMatchService"
}
- Three attributes are used: FirstName, LastName, and Suffix.
- A weight of 0.2 is assigned only to Suffix, so FirstName and LastName each receive the default weight of 1.0.
- The actionThresholds for auto_merge and potential_match do not overlap, but they share a common score point of 0.8 so that their ranges are contiguous with no gaps.
Example profile pair for the configuration above:
{
  "type": "configuration/entityTypes/HCP",
  "attributes": {
    "FirstName": [
      {
        "value": "Robert"
      }
    ],
    "LastName": [
      {
        "value": "Downy"
      }
    ],
    "Suffix": [
      {
        "value": "Sr."
      }
    ]
  }
},
{
  "type": "configuration/entityTypes/HCP",
  "attributes": {
    "FirstName": [
      {
        "value": "Robert"
      }
    ],
    "LastName": [
      {
        "value": "Downy"
      }
    ],
    "Suffix": [
      {
        "value": "Jr."
      }
    ]
  }
}
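BasicStringComparator provides the relevance values for each attribute:
- FirstName: true (relevance 1)
- LastName: true (relevance 1)
- Suffix: false (relevance 0)
In this case the logic tree uses AND as the root, so the formula for the AND operator applies. If all three attributes had the default weight of 1.0, the score would be (1*1 + 1*1 + 1*0) / (1 + 1 + 1) = 2/3 ≈ 0.67, which falls in the 0.4-0.8 range and would queue the pair for review. With the weight of 0.2 configured for Suffix, the score is (1*1 + 1*1 + 0.2*0) / (1 + 1 + 0.2) = 2/2.2 ≈ 0.91, which falls in the 0.8-1.0 range, so the directive for this pair is auto_merge. This agrees with the relevance of 0.909 reported in the verifyMatches response below.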
Using the Verify Matches API with Relevance-Based Matching
Below is a sample response from the verifyMatches API when using relevance-based matching. Noteworthy items are:
- Relevance values appear for every attribute comparison, along with the relevance for the entire rule.
- The match action name is shown if the relevance is within the corresponding threshold range, and is null if the relevance is not within any actionThreshold range.
- The matched field is true if the relevance is within any actionThreshold range.
{
  "versions": {
    "first": {
      "EntityVersion": 1000000000
    },
    "second": {
      "EntityVersion": 1000000000
    }
  },
  "rules": {
    "configuration/entityTypes/HCP/matchGroups/ByNames": {
      "label": "By Names",
      "useOvOnly": false,
      "matchTokens": {
        "first": {
          "foundInMatchTables": true,
          "tokens": [
            "robert:downy"
          ],
          "tokensGenerated": 1
        },
        "second": {
          "foundInMatchTables": true,
          "tokens": [
            "robert:downy"
          ],
          "tokensGenerated": 1
        },
        "intersection": {
          "tokens": [
            "robert:downy"
          ]
        }
      },
      "rule": {
        "relevance": 0.9090909090909091,
        "and": [
          {
            "relevance": 0.9090909090909091,
            "and": [
              {
                "exact": {
                  "FirstName": {
                    "ignoreInToken": false,
                    "relevance": 1.0
                  },
                  "LastName": {
                    "ignoreInToken": false,
                    "relevance": 1.0
                  }
                },
                "exactOrAllNull": {
                  "Suffix": {
                    "ignoreInToken": true,
                    "relevance": 0.2
                  }
                },
                "relevance": 0.9090909090909091
              }
            ]
          }
        ]
      },
      "matched": true,
      "matchAction": "AUTO_MERGE",
      "relevance": 0.9090909090909091
    }
  }
}
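When inspecting a response like the one above, it can help to pull out the per-attribute relevance values programmatically. The Python sketch below walks the "rule" portion of a response with the shape shown in this section. The attribute_relevances helper is hypothetical, and identifying attribute comparisons by the presence of both "ignoreInToken" and "relevance" keys is an assumption based only on this sample.

```python
def attribute_relevances(node, found=None):
    """Collect {attribute name: relevance} from a nested rule structure."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            # Assumption: an attribute comparison is a dict carrying both keys.
            if isinstance(value, dict) and "ignoreInToken" in value and "relevance" in value:
                found[key] = value["relevance"]
            else:
                attribute_relevances(value, found)
    elif isinstance(node, list):
        for item in node:
            attribute_relevances(item, found)
    return found


# Trimmed copy of the "rule" object from the sample response.
sample_rule = {
    "relevance": 0.9090909090909091,
    "and": [{
        "relevance": 0.9090909090909091,
        "and": [{
            "exact": {
                "FirstName": {"ignoreInToken": False, "relevance": 1.0},
                "LastName": {"ignoreInToken": False, "relevance": 1.0},
            },
            "exactOrAllNull": {
                "Suffix": {"ignoreInToken": True, "relevance": 0.2},
            },
            "relevance": 0.9090909090909091,
        }],
    }],
}

print(attribute_relevances(sample_rule))
```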
Weights and Action Thresholds
In the match group configuration, you define weights and actionThresholds. The weight property allows you to assign a relative weight (strength) to each attribute. For example, you may decide that Middle Name is less reliable and thus less important than First Name.
The actionThreshold allows you to define a range of scores that drives a directive. For example, you might decide that the match group should merge the profile pair if the score is between 0.9 and 1.0, but should queue the pair for review if the score falls into the lower range of 0.6 to 0.9.
Assigning Weights
Weights can be set from 0.0 to 1.0. If you do not explicitly set a weight for an attribute, it receives a default weight of 1.0 during execution of the rule.
Multiple Actions of the Same Type
You can configure a relevance-based match rule with multiple action thresholds having the same action type but with a different relevance score range.
Example Match Group Configuration
{
  "uri": "configuration/entityTypes/HCP/matchGroups/PotentialMatchByEmail",
  "label": "By Email",
  "type": "relevance_based",
  "rule": {
    "and": {
      "fuzzy": [
        "configuration/entityTypes/HCP/attributes/Email/attributes/Email"
      ]
    },
    "actionThresholds": [
      {
        "type": "potential_match",
        "threshold": "0.99-1",
        "label": "candidate for merge"
      },
      {
        "type": "potential_match",
        "threshold": "0.9-0.99",
        "label": "suspect match for review"
      }
    ]
  }
}
In the above example, the type is potential_match for two different action thresholds. You can differentiate such thresholds by assigning appropriate labels. Because each potential match carries the label of the range its relevance score falls into, you can tell higher-scoring matches from lower-scoring ones and resolve them quickly. In the example above, some potential matches can be considered for merging directly, while others must be reviewed before any action is taken. The results of the API to get potential matches and of the external match API contain a relevance value and a matchActionLabel corresponding to each action type configured under the actionThreshold parameter. For more information, see Potential matches API and External match API.
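As an illustration of how a relevance score maps to one of several labeled thresholds of the same type, consider the following Python sketch. The parse_range and label_for helpers are hypothetical, and the choice to let the first listed threshold win on a shared endpoint (0.99 here) is an assumption, not documented behavior.

```python
from typing import Optional, Tuple

# The actionThresholds from the example configuration above.
ACTION_THRESHOLDS = [
    {"type": "potential_match", "threshold": "0.99-1", "label": "candidate for merge"},
    {"type": "potential_match", "threshold": "0.9-0.99", "label": "suspect match for review"},
]


def parse_range(threshold: str) -> Tuple[float, float]:
    """Split a 'low-high' threshold string into two floats."""
    low, high = threshold.split("-")
    return float(low), float(high)


def label_for(score: float, thresholds=ACTION_THRESHOLDS) -> Optional[Tuple[str, str]]:
    """Return (action type, label) for the first range containing the score."""
    for entry in thresholds:
        low, high = parse_range(entry["threshold"])
        if low <= score <= high:
            return entry["type"], entry["label"]
    return None  # score is outside every configured range


print(label_for(0.995))
print(label_for(0.95))
print(label_for(0.5))
```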
Including Equals and Not Equals Constraints in Tokenization and Relevance Score Calculation
By default, operators such as equals and notEquals prevent tokenization from generating tokens when the constraint is not met. These operators should not affect tokenization when you want to conclude that, even though address, email, or phone values differ, the remaining attributes match well enough to take the score above the threshold.
The following options can be set on equals, notEquals, and in constraints:
- strict (boolean, default=true): Allows the constraint to be skipped before the match tokens and relevance score are computed.
- weight (decimal, default=0.0): Allows the constraint to participate in the relevance score calculation.
Business Configuration
The following example sets the strict and weight options on a notEquals constraint:
{
  "notEquals": [
    {
      "uri": "configuration/entityTypes/Individual/attributes/FirstName",
      "value": "unknown",
      "strict": false,
      "weight": 2.5
    }
  ]
}
Options | Condition of Constraint | Match Tokens | Relevance Score |
---|---|---|---|
strict=undefined (true by default), weight=undefined (0.0 by default) | Met | Generated | Calculated; no contribution to the overall score |
strict=undefined (true by default), weight=undefined (0.0 by default) | Not met | Not generated | Not calculated |
strict=true, weight=undefined (0.0 by default) | Met | Generated | Calculated; no contribution to the overall score |
strict=true, weight=undefined (0.0 by default) | Not met | Not generated | Not calculated |
strict=false, weight=undefined (0.0 by default) | Met | Generated | Calculated; no contribution to the overall score |
strict=false, weight=undefined (0.0 by default) | Not met | Generated | Calculated; no contribution to the overall score |
strict=undefined (true by default), weight=R | Met | Generated | Calculated; the condition contributes to the overall score according to the weight R (see Formulae to Calculate Relevance Score) |
strict=undefined (true by default), weight=R | Not met | Not generated | Not calculated |
strict=false, weight=R | Met | Generated | Calculated; the condition contributes to the overall score according to the weight R (see Formulae to Calculate Relevance Score) |
strict=false, weight=R | Not met | Generated | Calculated; the condition contributes to the overall score according to the weight R (see Formulae to Calculate Relevance Score) |
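The table can be summarized as a small decision function. The Python sketch below is only a restatement of the rows above; the ConstraintOutcome type and the constraint_outcome function are hypothetical names, and treating any weight greater than 0.0 as "weight=R" is an assumption.

```python
from typing import NamedTuple


class ConstraintOutcome(NamedTuple):
    tokens_generated: bool
    score_calculated: bool
    contributes_to_score: bool


def constraint_outcome(condition_met: bool, strict: bool = True,
                       weight: float = 0.0) -> ConstraintOutcome:
    """Model the effect of the strict and weight options on a constraint."""
    if not condition_met and strict:
        # A strict constraint that is not met stops tokens and scoring.
        return ConstraintOutcome(False, False, False)
    # Otherwise tokens are generated and the score is calculated; the
    # constraint contributes only when a non-zero weight is configured.
    return ConstraintOutcome(True, True, weight > 0.0)


print(constraint_outcome(condition_met=False))
print(constraint_outcome(condition_met=False, strict=False, weight=2.5))
```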