Relevance-Based Matching - Detailed Explanation

Learn about configuring relevance-based matching rules for optimal data matching and merging.

Relevance-based matching is designed primarily as a replacement for the strategy that uses automatic and suspect rule types. With relevance-based matching, you use a scoring algorithm of your own design. In most cases this reduces both the complexity and the overall number of rules, because the two directives of merge and queue for review, which normally require separate rules (automatic and suspect, respectively), can often be represented by a single relevance-based rule.

When you configure relevance-based match rules, two effective comparators are the Jaro and Levenshtein distances. The Jaro distance measures the similarity between two sequences and is useful for short texts. The Levenshtein distance, or edit distance, quantifies how dissimilar two strings are by counting the minimum number of single-character edits required to transform one string into the other. Both comparators return scores from 0 to 1 and are therefore useful for relevance matching. If you use a comparator that does not return a score from 0 to 1, relevance matching may not work as desired.
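To illustrate what a 0-to-1 comparator score can look like, here is a minimal Python sketch that turns a raw edit distance into a similarity value. Dividing by the length of the longer string is an assumption made for this example, and the function names are hypothetical; this is not the platform's comparator implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map an edit distance onto 0..1, where 1.0 means identical.

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

A score produced this way can be compared directly against actionThreshold ranges, which is why comparators that stay within 0 to 1 suit relevance-based rules.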

Declaring Action Thresholds

When declaring the ranges for queue_for_review and auto_merge, make sure that together they span the entire available range of 0.0 to 1.0 with no gap and no overlap. The only shared value should be the touchpoint where the upper endpoint of queue_for_review equals the lower endpoint of auto_merge (for example, 0.0 to 0.6 for queue_for_review and 0.6 to 1.0 for auto_merge). If the actionThresholds leave a gap, any score falling within the gap produces no action. Conversely, if the actionThresholds overlap (for example, 0.4 to 0.6 for queue_for_review and 0.5 to 0.7 for auto_merge) and a score lands within the intersection (0.55 in this example) or on the touchpoint, the directive of queue_for_review takes precedence.
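The precedence and gap behavior described above can be sketched in a few lines of Python. The helper names (parse_range, resolve_action) are hypothetical, and treating both endpoints of each range as inclusive is an assumption made for this illustration.

```python
from typing import Optional, Tuple


def parse_range(text: str) -> Tuple[float, float]:
    """Split a range such as "0.4-0.6" into its two endpoints."""
    low, high = text.split("-")
    return float(low), float(high)


def resolve_action(score: float, review_range: str, merge_range: str) -> Optional[str]:
    """Return the directive for a score, or None if the score falls in a gap."""
    review_low, review_high = parse_range(review_range)
    merge_low, merge_high = parse_range(merge_range)
    # queue_for_review is checked first, so it wins inside an overlap
    # and on the shared touchpoint.
    if review_low <= score <= review_high:
        return "queue_for_review"
    if merge_low <= score <= merge_high:
        return "auto_merge"
    return None  # the score landed in a gap: no action


print(resolve_action(0.55, "0.4-0.6", "0.5-0.7"))  # queue_for_review
print(resolve_action(0.6, "0.0-0.6", "0.6-1.0"))   # queue_for_review
print(resolve_action(0.5, "0.0-0.4", "0.6-1.0"))   # None
```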

Primary Use of Relevance-Based Matching

With relevance-based matching, you can create a match rule that contains a collection of attributes you want to test as a group, and assign weights to those attributes to govern their relative importance in the rule. Start with all weights equal to 1.0 and with actionThresholds of, for example, 0.0-0.5 for queue_for_review and 0.5-1.0 for auto_merge. Do some trial runs and examine the results. If you find that too many obvious matches are being set to queue_for_review, adjust the weights and, if needed, the actionThresholds (for example, to 0.0-0.7 and 0.7-1.0). Iterate and experiment until you get optimized results with your data set. Remember that data quality is a key component of effective matching, so cleanse the data before it is posted and make proper use of the out-of-the-box match cleansers.
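One way to examine trial-run results is to count, for each candidate boundary between the two ranges, how many known matches end up in review and how many known non-matches would be merged. The sketch below assumes you have exported (score, is_true_match) pairs from a trial run and reviewed them by hand; the sample values and the function name are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def sweep_boundaries(scored_pairs: List[Tuple[float, bool]],
                     boundaries: Iterable[float]) -> Dict[float, Tuple[int, int]]:
    """For each boundary, return (true matches sent to review, non-matches auto-merged)."""
    report = {}
    for boundary in boundaries:
        # A score on the boundary goes to review, mirroring the touchpoint rule.
        matches_in_review = sum(1 for score, is_match in scored_pairs
                                if is_match and score <= boundary)
        non_matches_merged = sum(1 for score, is_match in scored_pairs
                                 if not is_match and score > boundary)
        report[boundary] = (matches_in_review, non_matches_merged)
    return report


# Hypothetical trial-run data: (relevance score, reviewer says it is a real match)
trial = [(0.95, True), (0.72, True), (0.65, True), (0.55, False), (0.62, False)]
print(sweep_boundaries(trial, [0.5, 0.7]))
# {0.5: (0, 2), 0.7: (1, 0)}
```

Comparing the two counts per boundary makes the trade-off visible before you change the configuration.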

Support for customer-defined match actions in a relevance-based rule

In the image below, notice that on the left the admin has defined two actionThresholds that use the system values of auto_merge and potential_match. However, if the admin has defined matchActions, as shown on the right, those matchActions can replace the system values. This makes it possible to create more advanced outcomes when a threshold is reached. For more details on matchActions, see Understanding Custom Rules.

Calculation of Score

The relevance_based match rules use the match token classes in the same way as suspect and automatic match rules. However, the comparison of the two entities works differently. Every comparator class provides a relevance value when comparing values. The relevance is in the range of 0 to 1. For example, BasicStringComparator returns 0 if two values are different and 1 if the two values are identical. Fractional values can result from DistinctWordsComparator or other comparators. Every attribute is assigned a weight according to its importance. If the weight is not assigned explicitly, then it is equal to:
  • 1 for the simple attributes
  • Maximum of the weights of sub-nested attributes for nested or reference attributes

If an attribute has multiple values then the maximum value of relevance is selected.

The relevance score of the AND operand is the weighted average of the relevance values of its direct operands:

RelevanceScore_AND = (Σ_i weight_i * relevance_i + Σ_j weight_j * relevance_j + Σ_k relevance_k) / (Σ_i weight_i + Σ_j weight_j + N_logical)

Here i runs from 1 to N_simple, j from 1 to N_nest, and k from 1 to N_logical.

The following information describes participants of the formulae:
  • RelevanceScore_AND: the relevance score of the AND operand, the relevance score of the match rule
  • N_simple: number of simple attributes (for example, FirstName, LastName) participating in the AND operator directly
  • weight_i: configured weight of the i-th simple attribute
  • relevance_i: calculated relevance of the i-th simple attribute
  • N_nest: number of nested and reference attributes (for example, Phone-no, Email-ID, Address) participating in the AND operator directly
  • weight_j: configured weight of the j-th nested or reference attribute
  • relevance_j: calculated relevance of the j-th nested or reference attribute
  • N_logical: number of logical operands (for example, AND or OR) participating in the AND operator directly
  • relevance_k: calculated relevance of the k-th logical operand. Note that the weight of a logical operand is fixed to 1.

RelevanceScore_OR = max(relevance_1, ..., relevance_i, ..., relevance_N)

  • relevance_i: relevance of a simple attribute, nested attribute, or logical operand participating in the OR operand directly.

RelevanceScore_NOT = 1 - RelevanceScore_{AND, OR, exact, ...}

The relevance score of the NOT operand is equal to 1 minus the relevance score of the operand having this negation.
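The three operand scores can be sketched in Python as follows. The OR and NOT functions follow the formulae above directly. The AND function assumes the weighted-average form implied by the participant list (weighted attributes plus logical operands at a fixed weight of 1); the function names and the zero-denominator fallback are choices made for this illustration only.

```python
from typing import Iterable, Tuple


def score_and(weighted: Iterable[Tuple[float, float]],
              logical: Iterable[float] = ()) -> float:
    """Weighted average over the direct operands of an AND.

    weighted: (weight, relevance) pairs for simple, nested, and reference attributes.
    logical:  relevance values of nested logical operands; each has a fixed weight of 1.
    """
    weighted = list(weighted)
    logical = list(logical)
    numerator = sum(w * r for w, r in weighted) + sum(logical)
    denominator = sum(w for w, _ in weighted) + len(logical)
    # Fallback for an empty or all-zero-weight operand list (an assumption).
    return numerator / denominator if denominator else 0.0


def score_or(relevances: Iterable[float]) -> float:
    """The OR operand takes the maximum relevance among its direct operands."""
    return max(relevances)


def score_not(relevance: float) -> float:
    """The NOT operand is 1 minus the relevance of the negated operand."""
    return 1.0 - relevance


# An AND over two equally weighted attributes plus a nested OR operand.
nested_or = score_or([0.2, 0.5])                        # 0.5
print(score_and([(1.0, 1.0), (1.0, 0.0)], [nested_or]))  # 0.5
```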

Sample configuration using relevance-based matching for auto_merge and queue_for_review

{
       "uri": "configuration/entityTypes/HCP/matchGroups/ByNames",
       "label": "By Names",
       "type": "relevance_based",
       "rule": {
           "and": {
               "exact": [
                   "configuration/entityTypes/HCP/attributes/FirstName",
                   "configuration/entityTypes/HCP/attributes/LastName"
               ],
               "exactOrAllNull": [
                   "configuration/entityTypes/HCP/attributes/Suffix"
               ]
           },
           "weights": [
               {
                   "attribute": "configuration/entityTypes/HCP/attributes/Suffix",
                   "weight": 0.2
               }
           ],
           "actionThresholds": [
               {
                   "type": "auto_merge",
                   "threshold": "0.8-1.0"
               },
               {
                   "type": "potential_match",
                   "threshold": "0.4-0.8"
               }
           ],
           "matchTokenClass": "com.reltio.match.token.ExactMatchToken"
       },
       "matchServiceClass": "com.reltio.businesslogic.match.providers.internal.InternalMatchService"
}
The example indicates:
  • Three attributes are used: FirstName, LastName, and Suffix.
  • A weight of 0.2 is assigned only to Suffix. FirstName and LastName therefore receive the default weight of 1.0.
  • The actionThresholds for auto_merge and potential_match do not overlap, but they share a common score point of 0.8 so that their ranges are contiguous with no gap.

Example profile pair for the configuration above:

{
   "type": "configuration/entityTypes/HCP",
   "attributes": {
       "FirstName": [
           {
               "value": "Robert"
           }
       ],
       "LastName": [
           {
               "value": "Downy"
           }
       ],
       "Suffix": [
           {
               "value": "Sr."
           }
       ]
   }
},

{
   "type": "configuration/entityTypes/HCP",
   "attributes": {
       "FirstName": [
           {
               "value": "Robert"
           }
       ],
       "LastName": [
           {
               "value": "Downy"
           }
       ],
       "Suffix": [
           {
               "value": "Jr."
           }
       ]
   }
}
BasicStringComparator provides the relevance values for the comparison:
  • true (relevance 1) for FirstName
  • true (relevance 1) for LastName
  • false (relevance 0) for Suffix

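In this case the logic tree uses AND as the root, so the AND formula applies: the sum of weight * relevance over the attributes, divided by the sum of the weights. With the configured Suffix weight of 0.2, the score is (1*1 + 1*1 + 0.2*0) / (1 + 1 + 0.2) = 2/2.2 ≈ 0.909.

A score of 0.909 falls in the 0.8-1.0 range, so the directive for this pair is auto_merge. If Suffix carried the default weight of 1.0 instead, the score would be (1*1 + 1*1 + 1*0) / (1 + 1 + 1) = 2/3 ≈ 0.67, and the pair would be queued for review as a potential_match.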
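The arithmetic for this profile pair can be checked with a short Python sketch that evaluates two weightings side by side: all weights at 1.0, and the Suffix weight of 0.2 from the sample configuration. The weighted-average form of the AND score is assumed, and the helper names are hypothetical.

```python
def weighted_average(pairs):
    """Sum of weight * relevance divided by the sum of the weights."""
    return sum(w * r for w, r in pairs) / sum(w for w, _ in pairs)


# Relevance values from the exact comparison of the two profiles.
relevances = {"FirstName": 1.0, "LastName": 1.0, "Suffix": 0.0}

equal_weights = {"FirstName": 1.0, "LastName": 1.0, "Suffix": 1.0}
configured_weights = {"FirstName": 1.0, "LastName": 1.0, "Suffix": 0.2}


def score(weights):
    return weighted_average([(weights[name], relevances[name]) for name in relevances])


print(round(score(equal_weights), 2))       # 0.67
print(round(score(configured_weights), 3))  # 0.909
```

Lowering the weight of the mismatching attribute is what moves the pair from the 0.4-0.8 range into the 0.8-1.0 range.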

Using the Verify Matches API with Relevance-Based Matching

The example below shows the use of the verifyMatches API when using Relevance-based matching. Noteworthy items are:
  • Relevance values appear for every attribute comparison, along with the relevance for the entire rule.
  • The match action name is shown if the relevance is within the corresponding threshold range, and is null if the relevance is not within any actionThreshold range.
  • The matched field is true if the relevance is within any actionThreshold range.
{
 "versions": {
   "first": {
     "EntityVersion": 1000000000
   },
   "second": {
     "EntityVersion": 1000000000
   }
 },
 "rules": {
   "configuration/entityTypes/HCP/matchGroups/ByNames": {
   "label": "By Names",
   "useOvOnly": false,
   "matchTokens": {
		"first": {
			"foundInMatchTables": true,
			"tokens": [
				"robert:downy"
			],
			"tokensGenerated": 1
       },
       "second": {
			"foundInMatchTables": true,
			"tokens": [
				"robert:downy"
			],
			"tokensGenerated": 1
       },
       "intersection": {
			"tokens": [
				"robert:downy"
			]
       }
     },
     "rule": {
       "relevance": 0.9090909090909091,
       "and": [
         {
           "relevance": 0.9090909090909091,
           "and": [
             {
               "exact": {
                 "FirstName": {
                   "ignoreInToken": false,
                   "relevance": 1.0
                 },
                 "LastName": {
                   "ignoreInToken": false,
                   "relevance": 1.0
                 }
               },
               "exactOrAllNull": {
                 "Suffix": {
                   "ignoreInToken": true,
                   "relevance": 0.2
                 }
               },
               "relevance": 0.9090909090909091
             }
           ]
         }
       ]
     },
     "matched": true,
     "matchAction": "AUTO_MERGE",
     "relevance": 0.9090909090909091
   }
 }
}
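A response of this shape can be reduced to the fields called out above with a few lines of Python. This sketch only reads keys that appear in the sample response (rules, matched, matchAction, relevance); the function name is hypothetical, and how the response is fetched is outside the scope of the example.

```python
from typing import Any, Dict, List


def summarize_rules(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the per-rule outcome from a verifyMatches-style response."""
    summary = []
    for rule_uri, result in response.get("rules", {}).items():
        summary.append({
            "rule": rule_uri.rsplit("/", 1)[-1],        # last URI segment, e.g. "ByNames"
            "matched": result.get("matched", False),
            "matchAction": result.get("matchAction"),   # None when no threshold applies
            "relevance": result.get("relevance"),
        })
    return summary


# Trimmed copy of the sample response shown above.
sample = {
    "rules": {
        "configuration/entityTypes/HCP/matchGroups/ByNames": {
            "label": "By Names",
            "matched": True,
            "matchAction": "AUTO_MERGE",
            "relevance": 0.9090909090909091,
        }
    }
}

for row in summarize_rules(sample):
    print(row["rule"], row["matched"], row["matchAction"], round(row["relevance"], 3))
# ByNames True AUTO_MERGE 0.909
```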

Weights and Action Thresholds

In the match group configuration, you define weights and actionThresholds. The weight property lets you assign a relative weight (strength) to each attribute. For example, you may decide that Middle Name is less reliable and thus less important than First Name.

The actionThreshold lets you define a range of scores that drives a directive. For example, you might decide that the match group should merge the profile pair if the score is between 0.9 and 1.0, but should queue the pair for review if the score falls into the lower range of 0.6 to 0.9.

Assigning Weights

Weights can be set from 0.0 to 1.0. If you do not explicitly set a weight for an attribute, it will receive a default weight of 1.0 during execution of the rule.

Multiple Actions of the Same Type

You can configure a relevance-based match rule with multiple action thresholds that have the same action type but different relevance score ranges.

Example Match Group Configuration

{
  "uri": "configuration/entityTypes/HCP/matchGroups/PotentialMatchByEmail",
  "label": "By Email",
  "type": "relevance_based",
  "rule": {
    "and": {
      "fuzzy": [
        "configuration/entityTypes/HCP/attributes/Email/attributes/Email"
      ]
    },
    "actionThresholds": [
      {
        "type": "potential_match",
        "threshold": "0.99-1",
        "label": "candidate for merge"
      },
      {
        "type": "potential_match",
        "threshold": "0.9-0.99",
        "label": "suspect match for review"
      }
    ]
  }
}

In the above example, the type is potential_match for two different action thresholds. You can differentiate such thresholds by assigning appropriate labels. Potential matches are then generated with different labels depending on the range their relevance score falls into, which lets you distinguish higher-scoring matches from lower-scoring ones and resolve matches quickly based on the label. In the example above, some potential matches can be considered for merging directly, while others must be reviewed before any action is taken. The results of the API to get potential matches and of the external match API contain a relevance value and a matchActionLabel corresponding to each action type configured under the actionThresholds parameter. For more information, see Potential matches API and External match API.
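To make the label lookup concrete, the following Python sketch maps a relevance score to the type and label of the first configured threshold that contains it. The threshold strings and labels are copied from the example configuration; the function name, the inclusive endpoints, and the "first entry in configured order wins" rule are assumptions made for this illustration, not documented platform behavior.

```python
from typing import Dict, List, Optional, Tuple

# The actionThresholds from the example match group configuration.
ACTION_THRESHOLDS = [
    {"type": "potential_match", "threshold": "0.99-1", "label": "candidate for merge"},
    {"type": "potential_match", "threshold": "0.9-0.99", "label": "suspect match for review"},
]


def label_for(score: float,
              action_thresholds: List[Dict[str, str]]) -> Optional[Tuple[str, Optional[str]]]:
    """Return (type, label) of the first threshold containing the score, else None."""
    for entry in action_thresholds:
        low, high = (float(part) for part in entry["threshold"].split("-"))
        if low <= score <= high:  # inclusive endpoints (assumption)
            return entry["type"], entry.get("label")
    return None


print(label_for(0.995, ACTION_THRESHOLDS))  # ('potential_match', 'candidate for merge')
print(label_for(0.93, ACTION_THRESHOLDS))   # ('potential_match', 'suspect match for review')
print(label_for(0.5, ACTION_THRESHOLDS))    # None
```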

Including Equals and Not Equals Constraints in Tokenization and Relevance Score Calculation

Using operators like equals and notEquals prevents tokenization from generating tokens when the constraint is not met. These operators should not affect tokenization when the goal is to compare profiles and conclude that, even though the address, email, or phone differ, the remaining attributes match well enough to take the score above the threshold.

The following options are available for the equals, notEquals, and in constraints:
  1. strict (boolean, default=true): When set to false, an unmet constraint is skipped, so the match tokens and relevance score are still computed.
  2. weight (decimal, default=0.0): Allows the constraint to participate in the relevance score calculation.
Note: The default values of the two options ensure backward compatibility.

Business Configuration

The business configuration (L3) is enhanced so that match rule constraints support the strict and weight options.
{
  "notEquals": [
    {
      "uri": "configuration/entityTypes/Individual/attributes/FirstName",
      "value": "unknown",
      "strict": false,
      "weight": 2.5
    }
  ]
}
Table 1. Behavior of Constraints Based on Options
Options | Condition of Constraint | Match Tokens | Relevance Score
--- | --- | --- | ---
strict=undefined (true by default), weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score
strict=undefined (true by default), weight=undefined (0.0 by default) | Not met | Not generated | Not calculated
strict=true, weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score
strict=true, weight=undefined (0.0 by default) | Not met | Not generated | Not calculated
strict=false, weight=undefined (0.0 by default) | Met | Generated | Calculated. No contribution to the overall score
strict=false, weight=undefined (0.0 by default) | Not met | Generated | Calculated. No contribution to the overall score
strict=undefined (true by default), weight=R | Met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score)
strict=undefined (true by default), weight=R | Not met | Not generated | Not calculated
strict=false, weight=R | Met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score)
strict=false, weight=R | Not met | Generated | Calculated. Condition contributes to the overall score according to the weight R (See: Formulae to Calculate Relevance Score)
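The behavior in Table 1 can be summarized as a small decision function. The Python sketch below encodes only what the table states; the type and function names are hypothetical, and a weight of 0.0 is taken to mean "no contribution", matching the default.

```python
from typing import NamedTuple


class ConstraintOutcome(NamedTuple):
    tokens_generated: bool
    score_calculated: bool
    contributes_to_score: bool


def constraint_outcome(met: bool, strict: bool = True, weight: float = 0.0) -> ConstraintOutcome:
    """Look up the Table 1 behavior for one constraint.

    strict defaults to True and weight to 0.0, the backward-compatible defaults.
    """
    if strict and not met:
        # A strict constraint that is not met blocks tokens and scoring entirely.
        return ConstraintOutcome(False, False, False)
    # Otherwise tokens are generated and the score is calculated; the constraint
    # contributes to the overall score only when a non-zero weight is configured.
    return ConstraintOutcome(True, True, weight > 0.0)


print(constraint_outcome(met=False))                            # strict default blocks everything
print(constraint_outcome(met=False, strict=False, weight=2.5))  # still scored, constraint contributes
```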