Relevance-Based Matching - Detailed Explanation
This type of matching uses a scoring algorithm to match records.
Relevance-based matching is designed primarily as a replacement for the strategy that uses automatic and suspect rule types. With relevance-based matching, you can create a scoring algorithm of your own design. In most cases, a strategy built on relevance-based matching reduces both the complexity and the overall number of rules you need. This is because the two directives of merge and queue for review, which normally require separate rules (automatic and suspect, respectively), can often be represented by a single relevance-based rule.
Declaring Action Thresholds
When declaring the ranges for queue_for_review and auto_merge, the two ranges together should span the entire available range of 0.0 to 1.0 with no gap and no overlap, except that the upper endpoint of queue_for_review should equal the lower endpoint of auto_merge so that the ranges share a common touchpoint (for example, 0.0 to 0.6 for queue_for_review and 0.6 to 1.0 for auto_merge). If the actionThresholds leave a gap, any score falling within the gap produces no action. Conversely, if the actionThresholds overlap (for example, 0.4 to 0.6 for queue_for_review and 0.5 to 0.7 for auto_merge) and a score lands within the intersection (0.55 in this example) or on the touchpoint, the queue_for_review directive takes precedence.
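The rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the tuple representation of a range are hypothetical, and the sketch assumes both endpoints of a range are inclusive, which is what the touchpoint rule suggests.

```python
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # hypothetical (lower, upper) representation of a threshold


def check_ranges(review: Range, merge: Range) -> List[str]:
    """List the problems with a queue_for_review / auto_merge pair of ranges."""
    problems = []
    if review[0] > 0.0 or merge[1] < 1.0:
        problems.append("ranges do not span 0.0 to 1.0")
    if review[1] < merge[0]:
        problems.append("gap between ranges")
    elif review[1] > merge[0]:
        problems.append("ranges overlap")
    return problems


def resolve_action(score: float, review: Range, merge: Range) -> Optional[str]:
    """Pick the directive for a score; None means the score fell into a gap."""
    # queue_for_review is tested first, so it wins in an overlap or on the touchpoint.
    if review[0] <= score <= review[1]:
        return "queue_for_review"
    if merge[0] <= score <= merge[1]:
        return "auto_merge"
    return None


print(check_ranges((0.0, 0.6), (0.6, 1.0)))           # []
print(resolve_action(0.55, (0.4, 0.6), (0.5, 0.7)))   # queue_for_review
print(resolve_action(0.65, (0.0, 0.6), (0.7, 1.0)))   # None
```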
Primary Use of Relevance-Based Matching
Using relevance-based matching, you can create a match rule that contains a collection of attributes you wish to test as a group, and assign weights to those attributes to govern their relative importance in the rule. Start with all weights equal to 1.0 and with actionThresholds of, for example, 0.0-0.5 for queue_for_review and 0.5-1.0 for auto_merge. Do some trial runs and examine the results. If you find that too many obvious matches are being set to queue_for_review, you can adjust the weights and even modify the actionThresholds to perhaps 0.0-0.7 and 0.7-1.0. Iterate and experiment until you get optimized results with your data set. Always remember that data quality is a key component of effective matching. Cleansing the data before it is posted and making proper use of the out-of-the-box match cleansers are both important.
Support for customer-defined match actions in a relevance-based rule
In the image below, notice that on the left the admin has defined two actionThresholds that use the system values auto_merge and potential_match. However, if the admin has defined matchActions, as shown on the right, those matchActions can replace the system values. This lets the admin create more advanced outcomes when a threshold is reached. For more details on matchActions, see Understanding Custom Rules.
Calculation of Score
suspect
and automatic
match rules. However, the
comparison of the two entities works differently. Every comparator class provides a relevance value when comparing values. The relevance is in the range of 0 to 1. For example, BasicStringComparator returns 0 if two values are different and 1 if the two values are identical. Fractional values can result from DistinctWordsComparator or other comparators. Every attribute is assigned a weight according to the importance of the attribute. If the weight is not assigned explicitly, it is equal to:
- 1 for simple attributes
- Maximum of the weights of sub-nested attributes for nested or reference attributes
If an attribute has multiple values, the maximum relevance value is selected.
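The relevance score of an AND operand is the weighted average of the relevance values of its direct participants:

$$
\mathrm{RelevanceScore}_{AND} = \frac{\sum_{i=1}^{N_{simple}} \mathrm{weight}_i \cdot \mathrm{relevance}_i + \sum_{j=1}^{N_{nest}} \mathrm{weight}_j \cdot \mathrm{relevance}_j + \sum_{k=1}^{N_{logical}} \mathrm{relevance}_k}{\sum_{i=1}^{N_{simple}} \mathrm{weight}_i + \sum_{j=1}^{N_{nest}} \mathrm{weight}_j + N_{logical}}
$$

- $\mathrm{RelevanceScore}_{AND}$ - the relevance score of the AND operand, the relevance score of the match rule
- $N_{simple}$ - number of simple attributes (for example, FirstName, LastName) participating in the AND operator directly
- $\mathrm{weight}_i$ - configured weight of the i-th simple attribute
- $\mathrm{relevance}_i$ - calculated relevance of the i-th simple attribute
- $N_{nest}$ - number of nested and reference attributes (for example, Phone-no, Email-ID, Address) participating in the AND operator directly
- $\mathrm{weight}_j$ - configured weight of the j-th nested or reference attribute
- $\mathrm{relevance}_j$ - calculated relevance of the j-th nested or reference attribute
- $N_{logical}$ - number of logical operands (for example, AND or OR) participating in the AND operator directly
- $\mathrm{relevance}_k$ - calculated relevance of the k-th logical operand. Note that the weight of a logical operand is fixed to 1.

$$
\mathrm{RelevanceScore}_{OR} = \max(\mathrm{relevance}_1, \ldots, \mathrm{relevance}_i, \ldots, \mathrm{relevance}_N)
$$

- $\mathrm{relevance}_i$ - relevance of the i-th simple attribute, nested attribute, or logical operand participating in the OR operand directly

$$
\mathrm{RelevanceScore}_{NOT} = 1 - \mathrm{RelevanceScore}_{AND,\,OR,\,exact,\,\ldots}
$$

The relevance score of the NOT operand is equal to 1 minus the relevance score of the operand being negated.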
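The three operand scores can be combined recursively over a rule tree. The sketch below is a minimal illustration, not the product's implementation. It assumes the AND score is the weighted average implied by the term definitions above (attributes use their configured weight, logical operands always weigh 1). The `Attr` and `Op` classes are hypothetical, and returning 0.0 when all weights are zero is an assumption made only to avoid a division by zero.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Attr:
    """A simple, nested, or reference attribute whose relevance is already known."""
    relevance: float
    weight: float = 1.0  # default weight when none is configured


@dataclass
class Op:
    """A logical operand; kind is 'and', 'or', or 'not'."""
    kind: str
    children: list = field(default_factory=list)


Node = Union[Attr, Op]


def score(node: Node) -> float:
    if isinstance(node, Attr):
        return node.relevance
    child_scores = [score(child) for child in node.children]
    if node.kind == "or":
        return max(child_scores)
    if node.kind == "not":
        if len(child_scores) != 1:
            raise ValueError("'not' takes exactly one operand")
        return 1.0 - child_scores[0]
    if node.kind == "and":
        # Attributes use their configured weight; logical operands always weigh 1.
        weights = [c.weight if isinstance(c, Attr) else 1.0 for c in node.children]
        total = sum(weights)
        if total == 0:
            return 0.0  # assumption: no weighted participants means no score
        return sum(w * s for w, s in zip(weights, child_scores)) / total
    raise ValueError(f"unknown operand kind: {node.kind}")


# Hypothetical rule: AND(simple attr, weighted attr, OR(attr, attr))
tree = Op("and", [Attr(1.0), Attr(0.5, weight=0.5), Op("or", [Attr(0.0), Attr(0.8)])])
print(round(score(tree), 2))  # 0.82
```

Here the AND score is (1.0*1.0 + 0.5*0.5 + 1*0.8) / (1.0 + 0.5 + 1) = 2.05 / 2.5 = 0.82.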
Sample configuration using relevance-based matching for auto_merge and queue_for_review
{
  "uri": "configuration/entityTypes/HCP/matchGroups/ByNames",
  "label": "By Names",
  "type": "relevance_based",
  "rule": {
    "and": {
      "exact": [
        "configuration/entityTypes/HCP/attributes/FirstName",
        "configuration/entityTypes/HCP/attributes/LastName"
      ],
      "exactOrAllNull": [
        "configuration/entityTypes/HCP/attributes/Suffix"
      ]
    },
    "weights": [
      {
        "attribute": "configuration/entityTypes/HCP/attributes/Suffix",
        "weight": 0.2
      }
    ],
    "actionThresholds": [
      {
        "type": "auto_merge",
        "threshold": "0.8-1.0"
      },
      {
        "type": "potential_match",
        "threshold": "0.4-0.8"
      }
    ],
    "matchTokenClass": "com.reltio.match.token.ExactMatchToken"
  },
  "matchServiceClass": "com.reltio.businesslogic.match.providers.internal.InternalMatchService"
}
- Three attributes are used: FirstName, LastName, and Suffix.
- A weight of 0.2 is assigned only to Suffix, so the default weight of 1.0 is assigned to FirstName and LastName.
- The actionThresholds for auto_merge and potential_match do not overlap, but they do share a common score point of 0.8 in order to make their ranges contiguous with no gaps.
Example profile pair for the configuration above:
{
"type": "configuration/entityTypes/HCP",
"attributes": {
"FirstName": [
{
"value": "Robert"
}
],
"LastName": [
{
"value": "Downy"
}
],
"Suffix": [
{
"value": "Sr."
}
]
}
},
{
"type": "configuration/entityTypes/HCP",
"attributes": {
"FirstName": [
{
"value": "Robert"
}
],
"LastName": [
{
"value": "Downy"
}
],
"Suffix": [
{
"value": "Jr."
}
]
}
}
BasicStringComparator provides the relevance values, and the score is calculated as follows:
- true for FirstName
- true for LastName
- false for Suffix
In this case the logic tree uses AND as the root, so the formula for the AND operator applies. The score is calculated as (1*1 + 1*1 + 0*1) / (1 + 1 + 1) = 2/3 ≈ 0.67.
With a score of 0.67, the directive for this pair is set to queue_for_review.
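As a quick check of the arithmetic above, the following Python sketch computes the weighted average for this profile pair. The helper name `and_score` is hypothetical and not part of any product API. The weights are passed in explicitly, so the same helper can be evaluated with the weights used in the calculation above (1.0 for each attribute) or with the Suffix weight of 0.2 from the sample configuration.

```python
def and_score(relevances, weights):
    """Weighted average of relevance values for attributes under an AND root."""
    return sum(r * w for r, w in zip(relevances, weights)) / sum(weights)


# FirstName and LastName are identical (1.0); Suffix "Sr." vs "Jr." differs (0.0).
relevances = [1.0, 1.0, 0.0]

# Weights as they appear in the calculation above.
as_written = and_score(relevances, [1.0, 1.0, 1.0])

# Same pair, using the Suffix weight of 0.2 from the sample configuration.
with_suffix_weight = and_score(relevances, [1.0, 1.0, 0.2])

print(round(as_written, 2))          # 0.67
print(round(with_suffix_weight, 3))  # 0.909
```

Because a lower weight shrinks both the numerator term and the denominator, a mismatch on a low-weight attribute pulls the score down far less than a mismatch on a full-weight attribute.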
Using the Verify Matches API with Relevance-Based Matching
The following is a sample response from the verifyMatches API when using relevance-based matching. Noteworthy items are:
- Relevance values appear for every attribute comparison, along with the relevance for the entire rule.
- The match action name is shown if the relevance is within the corresponding threshold range, and null if it is not within any actionThreshold range.
- The matched field is true if the relevance is within any actionThreshold range.
{
"versions": {
"first": {
"EntityVersion": 1000000000
},
"second": {
"EntityVersion": 1000000000
}
},
"rules": {
"configuration/entityTypes/HCP/matchGroups/ByNames": {
"label": "By Names",
"useOvOnly": false,
"matchTokens": {
"first": {
"foundInMatchTables": true,
"tokens": [
"robert:downy"
],
"tokensGenerated": 1
},
"second": {
"foundInMatchTables": true,
"tokens": [
"robert:downy"
],
"tokensGenerated": 1
},
"intersection": {
"tokens": [
"robert:downy"
]
}
},
"rule": {
"relevance": 0.9090909090909091,
"and": [
{
"relevance": 0.9090909090909091,
"and": [
{
"exact": {
"FirstName": {
"ignoreInToken": false,
"relevance": 1.0
},
"LastName": {
"ignoreInToken": false,
"relevance": 1.0
}
},
"exactOrAllNull": {
"Suffix": {
"ignoreInToken": true,
"relevance": 0.2
}
},
"relevance": 0.9090909090909091
}
]
}
]
},
"matched": true,
"matchAction": "AUTO_MERGE",
"relevance": 0.9090909090909091
}
}
}
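A response like the one above can be reduced to the fields called out in the list of noteworthy items. The Python sketch below is illustrative only: the helper names are hypothetical, the input is a trimmed copy of the sample response, and identifying attribute entries by the presence of both `ignoreInToken` and `relevance` is a heuristic based on this sample rather than a documented contract.

```python
def summarize_rules(response):
    """Return matched, matchAction, and rule-level relevance for each rule."""
    return [
        {
            "rule": uri,
            "matched": rule.get("matched"),
            "matchAction": rule.get("matchAction"),
            "relevance": rule.get("relevance"),
        }
        for uri, rule in response.get("rules", {}).items()
    ]


def attribute_relevances(node, found=None):
    """Walk the nested rule tree and collect relevance per attribute name."""
    found = {} if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            # Heuristic: an attribute entry carries both of these keys.
            if isinstance(value, dict) and "ignoreInToken" in value and "relevance" in value:
                found[key] = value["relevance"]
            else:
                attribute_relevances(value, found)
    elif isinstance(node, list):
        for item in node:
            attribute_relevances(item, found)
    return found


# Trimmed copy of the sample response above.
sample = {
    "rules": {
        "configuration/entityTypes/HCP/matchGroups/ByNames": {
            "label": "By Names",
            "rule": {
                "relevance": 0.9090909090909091,
                "and": [
                    {
                        "exact": {
                            "FirstName": {"ignoreInToken": False, "relevance": 1.0},
                            "LastName": {"ignoreInToken": False, "relevance": 1.0},
                        },
                        "exactOrAllNull": {
                            "Suffix": {"ignoreInToken": True, "relevance": 0.2}
                        },
                    }
                ],
            },
            "matched": True,
            "matchAction": "AUTO_MERGE",
            "relevance": 0.9090909090909091,
        }
    }
}

summary = summarize_rules(sample)
per_attribute = attribute_relevances(
    sample["rules"]["configuration/entityTypes/HCP/matchGroups/ByNames"]["rule"]
)
print(summary[0]["matchAction"], per_attribute)
```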
Weights and Action Thresholds
In the match group configuration, you define weights and actionThresholds. The weight property allows you to assign a relative weight (strength) to each attribute. For example, you may decide that Middle Name is less reliable and thus less important than First Name.
The actionThreshold allows you to define a range of scores that drives a directive. For example, you might decide that the match group should merge the profile pair if the score is between 0.9 and 1.0, but should queue the pair for review if the score falls into the lower range of 0.6 to 0.9.
Assigning Weights
Weights can be set from 0.0 to 1.0. If you do not explicitly set a weight for an attribute, it receives a default weight of 1.0 during execution of the rule.
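The defaulting behavior for simple attributes can be pictured with a small Python sketch. The function name and the dictionary shape are hypothetical. The range check reflects the 0.0 to 1.0 range stated above and is an assumption about how out-of-range values might be handled, not documented behavior.

```python
def effective_weights(attributes, configured):
    """Map each attribute to its configured weight, or to 1.0 if none is set."""
    weights = {}
    for attribute in attributes:
        weight = configured.get(attribute, 1.0)  # default weight is 1.0
        if not 0.0 <= weight <= 1.0:
            # Assumption: reject weights outside the stated 0.0 to 1.0 range.
            raise ValueError(f"weight for {attribute} must be between 0.0 and 1.0")
        weights[attribute] = weight
    return weights


weights = effective_weights(["FirstName", "LastName", "Suffix"], {"Suffix": 0.2})
print(weights)  # {'FirstName': 1.0, 'LastName': 1.0, 'Suffix': 0.2}
```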
Multiple Actions of the Same Type
You can configure a relevance-based match rule with multiple action thresholds having the same action type but with a different relevance score range.
Example Match Group Configuration
{
  "uri": "configuration/entityTypes/HCP/matchGroups/PotentialMatchByEmail",
  "label": "By Email",
  "type": "relevance_based",
  "rule": {
    "and": {
      "fuzzy": [
        "configuration/entityTypes/HCP/attributes/Email/attributes/Email"
      ]
    },
    "actionThresholds": [
      {
        "type": "potential_match",
        "threshold": "0.99-1",
        "label": "candidate for merge"
      },
      {
        "type": "potential_match",
        "threshold": "0.9-0.99",
        "label": "suspect match for review"
      }
    ]
  }
}
In the above example, the type is potential_match for two different action thresholds. You can differentiate such thresholds by assigning appropriate labels. Generating potential matches with different labels based on the relevance score range lets you distinguish higher-scoring matches from lower-scoring ones and resolve matches quickly based on the label. In the example above, based on the relevance score, some potential matches can be considered for merging directly, while others must be reviewed before any action is taken. The results of the API to get potential matches and the external match API contain a relevance value and a matchActionLabel corresponding to each action type configured under the actionThreshold parameter. For more information, see Potential matches API and External match API.
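To illustrate how a relevance score maps to a label, the Python sketch below parses the threshold strings from the example configuration and returns the label of the first range that contains the score. The function names are hypothetical. Treating both endpoints as inclusive and letting the first listed threshold win on a shared endpoint are assumptions made for this sketch.

```python
def parse_threshold(text):
    """Split a threshold string such as '0.9-0.99' into (lower, upper) floats."""
    low, high = text.split("-")
    return float(low), float(high)


def match_label(score, action_thresholds):
    """Return the label of the first threshold containing the score, else None."""
    for threshold in action_thresholds:
        low, high = parse_threshold(threshold["threshold"])
        if low <= score <= high:  # assumption: inclusive endpoints
            return threshold["label"]
    return None


# The actionThresholds from the example match group configuration.
action_thresholds = [
    {"type": "potential_match", "threshold": "0.99-1", "label": "candidate for merge"},
    {"type": "potential_match", "threshold": "0.9-0.99", "label": "suspect match for review"},
]

print(match_label(0.995, action_thresholds))  # candidate for merge
print(match_label(0.95, action_thresholds))   # suspect match for review
print(match_label(0.5, action_thresholds))    # None
```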
Including Equals and Not Equals Constraints in Tokenization and Relevance Score Calculation
Using operators like equals and notEquals prevents tokenization from generating tokens. These operators should not affect tokenization when the goal is to compare two profiles and conclude that, even though address, email, or phone are different, the remaining attributes match well enough to take the score above the threshold.
The following options are available for equals, notEquals, and in constraints:
- strict (boolean value with default=true): Allows the constraint to be skipped before the match tokens and relevance score are computed.
- weight (decimal with default=0.0): Allows the constraint to participate in the relevance score calculation.
Business Configuration
The following example shows a constraint with the strict and weight options:
{
  "notEquals": [
    {
      "uri": "configuration/entityTypes/Individual/attributes/FirstName",
      "value": "unknown",
      "strict": false,
      "weight": 2.5
    }
  ]
}
Options | Condition of Constraint | Match Tokens | Relevance Score |
---|---|---|---|
strict=undefined (true by default), weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score |
strict=undefined (true by default), weight=undefined (0.0 by default) | Not met | Not generated | Not calculated |
strict=true, weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score |
strict=true, weight=undefined (0.0 by default) | Not met | Not generated | Not calculated |
strict=false, weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score |
strict=false, weight=undefined (0.0 by default) | Not met | Generated | Calculated. No contribution to the overall score |
strict=undefined (true by default), weight=R | Met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score) |
strict=undefined (true by default), weight=R | Not met | Not generated | Not calculated |
strict=false, weight=R | Met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score) |
strict=false, weight=R | Not met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score) |
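The table can be condensed into a small decision function. The Python sketch below is one reading of the table, not product code: the `Outcome` type and `constraint_outcome` function are hypothetical names, and the sketch assumes a configured weight R is greater than 0.0.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    tokens_generated: bool
    score_calculated: bool
    contributes_to_score: bool


def constraint_outcome(met: bool, strict: bool = True, weight: float = 0.0) -> Outcome:
    """Look up the table row for a constraint with the given options."""
    if strict and not met:
        # A strict constraint that is not met stops tokens and scoring.
        return Outcome(False, False, False)
    # Otherwise tokens are generated and the score is calculated;
    # the constraint only contributes when it has a non-zero weight.
    return Outcome(True, True, weight > 0.0)


print(constraint_outcome(met=False))                            # strict by default
print(constraint_outcome(met=False, strict=False, weight=2.5))  # non-strict, weighted
```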
Formulae to Calculate Relevance Score
- Roperand - the relevance score of an operand (for example: exact, exactOrNull, exactOrAllNull, fuzzy)
- Rconstraint - the relevance score calculated for a constraint (for example: equals, notEquals, in)
- Woperand - configured weight for an operand
- Wconstraint - configured weight for a constraint