Match Rule Analyzer version 2 (Dynamic)

Learn how to profile the matching process using the Match Rule Analyzer.

Overview

You can use Match Rule Analyzer version 2 to profile match rules based on a relevant subset of data in a tenant and view statistics for data that is profiled.

You can use this information and recommendations to:
  • Analyze match rules and identify rules that are causing performance issues.
  • Tune match rules to ensure optimum performance before using them on your production data.
  • Remove rules that aren’t being used for matching.
You can also inspect a subset of data in your tenant in the following ways:

Request Body

The request body of Match Rule Analyzer version 2 contains a new section called profiling, under which you can run several types of analysis and generate statistics that give detailed insight into the overall matching process for your tenant.

Requests:

POST /tools/{tenantId}/analyzeMatchRules/v2

Example request body

{
  "entityTypes": [
    "Type1",
    "Type2",
    "Type3"
  ],
  "profiling": {
    "enabled": true
  },
  "model": {
     
  }
}
  • If entity types is empty or missing, then all entity types from the configuration are analyzed.
  • If the profiling section is empty or missing, then no analysis is performed. The enabled flag is false by default.
  • The model section contains a business (L3) configuration that needs testing. If no model is specified, then the current tenant business configuration is tested.
Note: When the match rule analysis is complete, an email notification is sent to the user who submitted the request.
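
The request described above can be assembled programmatically. The following is a minimal sketch, not a definitive client: the base URL, the bearer-token Authorization header, and the function name build_analyze_request are assumptions for illustration and don't come from this page. The sketch only builds the request object and doesn't send it.

```python
import json
import urllib.request


def build_analyze_request(base_url, tenant_id, token, entity_types=None, model=None):
    """Build (but do not send) the POST request for Match Rule Analyzer v2."""
    # The enabled flag is false by default, so it is set explicitly.
    body = {"profiling": {"enabled": True}}
    if entity_types:
        # An empty or missing list means all entity types are analyzed.
        body["entityTypes"] = list(entity_types)
    if model is not None:
        # Omit the model to test the current tenant business configuration.
        body["model"] = model
    url = f"{base_url.rstrip('/')}/tools/{tenant_id}/analyzeMatchRules/v2"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
        },
        method="POST",
    )


# Hypothetical host and token, shown only to illustrate the call.
request = build_analyze_request("https://example.invalid/api", "abcd", "TOKEN", ["Type1", "Type2"])
print(request.get_method(), request.full_url)
```

To submit the request, pass the returned object to urllib.request.urlopen, or port the same body and URL to the HTTP client you already use.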

Example response

{
  "profiling": {
    "description": "Summary about execution of the match groups based on the existing data (tenantId=abcd)",
    "id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
    "uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395"
  }
}

The response contains a short description of the profiling task, the unique identifier of the submitted task, and the URI for requesting the profiling results. You can also use the identifier to access the profiling task directly through the Tasks API.
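
As a sketch of how the identifier and URI might be used, the helper below turns a submit response into the URL of the results endpoint. It assumes that the uri field is relative to /tools/{tenantId}/, which matches the GET endpoint documented later on this page; the base URL and the function name are hypothetical.

```python
def profiling_results_url(base_url, tenant_id, submit_response):
    """Return (profiling id, absolute results URL) from a submit response."""
    profiling = submit_response["profiling"]
    # Assumption: the uri value is relative to /tools/{tenantId}/.
    url = f"{base_url.rstrip('/')}/tools/{tenant_id}/{profiling['uri']}"
    return profiling["id"], url


# Response copied from the example above.
response = {
    "profiling": {
        "description": "Summary about execution of the match groups based on the existing data (tenantId=abcd)",
        "id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
        "uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395",
    }
}
profiling_id, url = profiling_results_url("https://example.invalid/api", "abcd", response)
print(profiling_id)
print(url)
```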

Profiling section

The following subsections are available under the profiling section:

{
    "profiling": {
        "enabled": true,
        "maxObjectsPerType": 5000,
        "timeout": 3600,
        "samplingAlgorithm": {},
        "inspections": {},
        "wordAnalysis": {},
        "analysisTypes": [],
        "useSkippedRules": false,
        "scopes": [
            "EXTERNAL",
            "ALL",
            "INTERNAL",
            "NONE"
        ]
    }
}
The following table explains the fields in the profiling section:
Field | Description
enabled | Profiling is enabled when this field is set to true. It’s set to false by default.
maxObjectsPerType | Specifies the maximum number of objects for each type of match rule.
timeout | Specifies the timeout period in seconds. Processing a large number of entities can be time consuming and can cause failures. The default timeout is 3600 seconds. You can increase it to 7200 seconds to avoid failures.
samplingAlgorithm | Specifies the sampling algorithm used to build a subset of entities. There are three supported sampling algorithms: SEQUENCE, SEARCH, and MATCHES_AWARE.
inspections | Specifies the type of inspection to run on a subset of data in the tenant.
analysisTypes | Specifies the types of analysis used to check the matching performance of the tenant.
useSkippedRules | Specifies whether match rules that are bypassed or skipped are included in the analysis. By default, skipped match rules are included in the analysis.
scopes | Defines the scopes of the match rules included in the analysis. By default, the ALL, NONE, INTERNAL, and EXTERNAL scopes are evaluated.

maxObjectsPerType

This field sets the maximum number of objects for profiling. The default value is 10000. The value must be less than or equal to 100000000. When matchDocumentMatches is used, you can’t use a value greater than 20000. Disable matchDocumentMatches to use the maximum limit of 100000000.
{
    "profiling": {
        "enabled": true,
        "maxObjectsPerType": 5000,
        "timeout": 3600,
        "samplingAlgorithm": {
            "name": "..."
        },
        "inspections": {},
        "wordAnalysis": {},
        "analysisTypes": []
    }
}
Note: name is the name of the sampling algorithm: SEQUENCE, SEARCH, or MATCHES_AWARE.
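
The maxObjectsPerType limits described above can be checked on the client before a request is submitted. This is a minimal sketch: the function name is hypothetical, and the lower bound of 1 is an assumption that this page doesn't state.

```python
DEFAULT_MAX_OBJECTS = 10_000            # documented default
ABSOLUTE_LIMIT = 100_000_000            # documented upper bound
MATCH_DOCUMENT_MATCHES_LIMIT = 20_000   # upper bound when matchDocumentMatches is used


def check_max_objects_per_type(value=DEFAULT_MAX_OBJECTS, uses_match_document_matches=False):
    """Return value if it respects the documented limits, else raise ValueError."""
    limit = MATCH_DOCUMENT_MATCHES_LIMIT if uses_match_document_matches else ABSOLUTE_LIMIT
    if not isinstance(value, int) or value < 1:  # lower bound is an assumption
        raise ValueError("maxObjectsPerType must be a positive integer")
    if value > limit:
        raise ValueError(f"maxObjectsPerType must be less than or equal to {limit}")
    return value


print(check_max_objects_per_type(5000, uses_match_document_matches=True))
```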

Sampling Algorithm

There are three supported sampling algorithms:
  1. SEQUENCE - It iterates through entities as they’re stored in the database. You can specify the starting point of iterations by using the rangeStart field, which corresponds to startToken in the database.
    {
       "name": "SEQUENCE",
       "rangeStart": 123456789123
    }
  2. SEARCH - This algorithm works over a set of identified objects. The object URIs are provided directly in the section or found by using the specified queries. If objects are defined explicitly in the objects.includeList section, then the list is passed directly to the profiling task. The task reads the objects, excluding those specified in the objects.excludeList section, and filters them by using the query, if one is specified. If no objects are explicitly defined, then QueryObjectsTask is submitted to search for the objects that match the specified query.
    {
          "objects": {
            "includeList": [
              "entities/ID1",
              "entities/ID2",
              "entities/ID3"
            ],
            "excludeList": [
              "entities/ID4",
              "entities/ID5"
            ]
          },
          "query": [
            {
              "filter": "equals(attributes.Address.City, 'Las Vegas') and equals(type, 'configuration/entityTypes/HCP')",
              "activeness": "active",
              "options": "searchByOv"
            },
            {
              "filter": "equals(attributes.Address.State, 'Ohio') and equals(type, 'configuration/entityTypes/HCO')",
              "activeness": "all"
            }
          ]
    }
  3. MATCHES_AWARE - This sampling algorithm uses the existing match information in the tenant to build a subset of entities. In addition to other sampling algorithms, you can use the correlation and conditionalCorrelation calculators, which are optimized to provide accurate information for matchDocumentMatches analysis.

    Correlation Calculator

    The correlation coefficient C(MR1, MR2) of two match rules MR1 and MR2 is calculated from the following terms:
    • Nobs - the number of observations. The number of observations depends on the calculator parameter emulateWholeSubsetComparisonByFakeNotMatches. The default value is false.
      Note:
      • If set to false, the number of observations is equal to the number of entity pairs.
      • If set to true, the number of observations is equal to the number of entity pairs plus the number of possible entity pairs that were not considered because the result of the comparison MR1(Ex, Ey) == MR2(Ex, Ey) is zero.
      • All the entities are compared to each other, when emulateWholeSubsetComparisonByFakeNotMatches=true.
    • The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter, which is set to true by default. The calculator maps the outcomes of automatic and suspect match rules to the known outcomes of relevance_based rules: automatic -> auto_merge, suspect -> potential_match. There are two approaches to comparing the outcomes of two rules of any type:
      • Strict equality of outcomes. For example, when treatSimilarMatchActionAsSame=false:
        • auto_merge==auto_merge === 1
        • auto_merge==potential_match === 0
        • <no outcome>==auto_merge === 0
        • <no outcome>==<no outcome> === 1
        • <no outcome>==not_a_match === 0
        • etc
      • Grouping positive and negative outcomes. For example, when treatSimilarMatchActionAsSame=true:
        • auto_merge==auto_merge === 1
        • auto_merge==potential_match === 1
        • <no outcome>==auto_merge === 0
        • <no outcome>==<no outcome> === 1
        • <no outcome>==not_a_match === 1
    • The formula is based on the fraction of observations in which the two match rules have the same outcome. The values -0.5 and 2 are used in the formula to shift and scale that fraction so that the result clearly represents the following:
      • C(MR1, MR2) = 1 - full correlation (all outcomes are the same)
      • C(MR1, MR2) = -1 - anti-correlation (all outcomes are different)
      • C(MR1, MR2) = 0 - non-correlation (no correlation between outcomes)
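
The formula itself isn't reproduced on this page, so the following sketch assumes a form that is consistent with the description above: C = 2 × (S / Nobs - 0.5), where S is the number of observations in which both rules have the same outcome. The function names and the use of None for <no outcome> are illustrative choices, not part of the product API.

```python
POSITIVE_OUTCOMES = {"auto_merge", "potential_match"}


def same_outcome(a, b, treat_similar_as_same=True):
    """Compare two rule outcomes; None stands for <no outcome>."""
    if not treat_similar_as_same:
        return a == b  # strict equality of outcomes
    # Grouping: positive outcomes equal each other, as do negative ones.
    return (a in POSITIVE_OUTCOMES) == (b in POSITIVE_OUTCOMES)


def correlation(observations, treat_similar_as_same=True):
    """observations: list of (MR1 outcome, MR2 outcome), one per entity pair."""
    if not observations:
        raise ValueError("at least one observation is required")
    same = sum(same_outcome(a, b, treat_similar_as_same) for a, b in observations)
    # Assumed formula: scale the fraction of equal outcomes to [-1, 1].
    return 2 * (same / len(observations) - 0.5)


pairs = [("auto_merge", "potential_match"), (None, "not_a_match"),
         (None, "auto_merge"), ("potential_match", None)]
print(correlation(pairs))                               # grouped comparison
print(correlation(pairs, treat_similar_as_same=False))  # strict comparison
```

With the grouped comparison, two of the four sample pairs agree, so the coefficient is 0. With the strict comparison, no pairs agree, so the coefficient is -1.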

    Conditional Correlation Calculator

    The conditional correlation coefficient of two match rules MR1 and MR2 is calculated from the following terms:
    • Nobs([MR1=]) - the number of observations where the outcome of MR1 is positive.
    • The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter option, which is set to true by default.
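
As with the correlation calculator, the exact formula isn't reproduced here. The sketch below assumes the same scaling, 2 × (S / Nobs - 0.5), applied only to observations in which the outcome of MR1 is positive. The names and the use of None for <no outcome> are hypothetical.

```python
POSITIVE_OUTCOMES = {"auto_merge", "potential_match"}


def conditional_correlation(observations, treat_similar_as_same=True):
    """observations: list of (MR1 outcome, MR2 outcome); None means <no outcome>."""
    # Keep only the observations where MR1 has a positive outcome.
    observed = [(a, b) for a, b in observations if a in POSITIVE_OUTCOMES]
    if not observed:
        return None  # undefined when MR1 is never positive
    if treat_similar_as_same:
        same = sum(b in POSITIVE_OUTCOMES for _, b in observed)
    else:
        same = sum(a == b for a, b in observed)
    # Assumed formula: scale the fraction of equal outcomes to [-1, 1].
    return 2 * (same / len(observed) - 0.5)


pairs = [("auto_merge", "potential_match"), ("auto_merge", None), (None, "auto_merge")]
print(conditional_correlation(pairs))
```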

Analysis types

The following analysis types are available:
  • matchToken
  • matchTokenIntersections
  • matchGroupsPerMatchDocument
  • matchDocumentsPerMatchGroup
  • matchDocumentMatches
You can specify each analysis type by using a section as follows:
{
    "analysisType": "analysis name",
    "perMatchGroup": false,
    "splitByMatchGroupType": false,
    "statistics": [
        {
            "name": "statistic name",
            "enabled": true,
            "parameters": [
                {
                    "name": "parameterName",
                    "value": "parameterValue or a JSON object"
                }
            ]
        }
    ]
}
Each analysis is enabled by default and can be disabled if required. An analysis can be performed on match groups in the following ways:
  • perMatchGroup flag - for each specific match group. The default value is true.
  • splitByMatchGroupType flag - for each set of match groups that have the same rule type (suspect, automatic, or relevance-based). The default value is true.
  • all match groups - for the entire set of match groups.

Some analyses can be run only for all match groups. For these, the perMatchGroup and splitByMatchGroupType settings are ignored.

If you don’t specify a statistics section, then the default set of statistics is used. Each statistic definition contains a name, enabled flag, and the parameters list. Each parameter contains the name and value fields. If a statistic has enabled=false, then it isn’t calculated. If some statistics are specified as disabled, then the corresponding default statistics are used. For most statistics, parameters are required but for some, no parameters are required.
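
The structure of an analysis type section can be generated rather than written by hand. The helper below is a sketch only. Its name and signature are hypothetical, and it emits the fields shown in the section template above.

```python
import json


def analysis_type(name, statistics=None, per_match_group=None, split_by_match_group_type=None):
    """Build one analysisTypes entry; statistics maps a statistic name to its parameters."""
    section = {"analysisType": name}
    if per_match_group is not None:
        section["perMatchGroup"] = per_match_group
    if split_by_match_group_type is not None:
        section["splitByMatchGroupType"] = split_by_match_group_type
    if statistics is not None:  # omit to use the default set of statistics
        section["statistics"] = [
            {
                "name": stat_name,
                "enabled": True,
                "parameters": [{"name": key, "value": value} for key, value in params.items()],
            }
            for stat_name, params in statistics.items()
        ]
    return section


entry = analysis_type("matchToken", {"upperBoundOutliers": {"k": 2}, "max": {}}, per_match_group=True)
print(json.dumps(entry, indent=4))
```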

You can use the following types of analysis to analyze the matching performance of your tenant:

matchToken

This analyzes the match tokens generated by different match groups for each object. It provides information about how match tokens are distributed across match groups in the relevant subset of data.

Table 1. Supported statistics for matchToken
Statistic name | Parameters | Description
min | - | The minimum number of tokens generated for an object.
max | - | The maximum number of tokens generated for an object.
range | - | The range of token numbers. Equal to max - min + 1.
total | - | The total number of tokens generated for all processed objects.
mode | - | The most frequent number of tokens for all processed objects.
mean | - | Mean (average) value of the number of tokens.
std | - | Standard deviation of the number of tokens.
se | - | Standard error of the mean of the generated tokens.
skewness | - | The skewness of the distribution of the generated tokens.
kurtosis | - | The kurtosis of the distribution of the generated tokens.
variance | - | The variance of the distribution of the generated tokens.
median | - | The median of the token numbers, equal to percentile (50%).
firstQuartile | - | The first quartile of the number of tokens, equal to percentile (25%).
thirdQuartile | - | The third quartile of the number of tokens, equal to percentile (75%).
lowerBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects that have fewer tokens than most can be considered outliers. An outlier is an object whose number of tokens is less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
upperBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects that have more tokens than most can be considered outliers. An outlier is an object whose number of tokens is greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
mostFrequent | k - the number of most frequent tokens to return. The default is 10. | The list of most frequent tokens is returned with their corresponding frequencies.
histogram | nbins - the number of histogram bins. The default value is 10. start - the starting point of the histogram. binSize - the size of the bin. If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the generated tokens. It’s calculated as the number of tokens that are associated with a particular histogram bin.
The following example shows the use of some statistics:

{
      "enabled": true,
      "perMatchGroup": true,
      "splitByMatchGroupType": true,
      "statistics": [
        {
          "name": "max",
          "enabled": true
        },
        {
          "name": "min",
          "enabled": false
        },
        {
          "name": "histogram",
          "parameters": [
            {
              "name": "nbins",
              "value": 10
            },
            {
              "name": "start",
              "value": 0
            },
            {
              "name": "binSize",
              "value": 100
            }
          ]
        }
      ]
}

matchTokenIntersections

This provides information about the distribution of match token intersections in the relevant subset of data. The intersection of match tokens of two objects is a subset of match tokens that are common to both objects.
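
The intersection defined above is a plain set intersection. A minimal sketch, with hypothetical token values modeled on the ones in the example response on this page:

```python
def token_intersection(tokens_a, tokens_b):
    """Return the match tokens that two objects have in common."""
    return set(tokens_a) & set(tokens_b)


# Hypothetical match tokens for two objects.
object_1 = ["firstname23 0", "lastname81 1", "firstname23lastname81 2"]
object_2 = ["firstname40 0", "lastname81 1"]
common = token_intersection(object_1, object_2)
print(len(common), sorted(common))
```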

Table 2. Supported statistics for matchTokenIntersections
Statistic name | Parameters | Description
correlation | - | The ratio of the number of entity pairs identified as candidates to the total number of processed entity pairs of two match rules. The value of the correlation is between 0 and 1, where 0 indicates no intersection between match pairs and 1 indicates that the two sets of match pairs are exactly the same. firstConditionalCorrelation=1 - all the candidate pairs found by the first match rule were also found by the second match rule. secondConditionalCorrelation=1 - all the candidate pairs found by the second match rule were also found by the first match rule.
min | - | The minimum number of match token intersections.
max | - | The maximum number of match token intersections.
range | - | The range of token intersections. Equal to max - min + 1.
total | - | The total number of intersections.
mode | - | The most frequent number of intersections.
mean | - | Mean (average) value of the number of token intersections.
std | - | Standard deviation of the number of token intersections.
se | - | Standard error of the mean of the number of token intersections.
skewness | - | The skewness of the distribution of token intersections.
kurtosis | - | The kurtosis of the distribution of token intersections.
variance | - | The variance of the distribution of token intersections.
median | - | The median of token intersections, equal to percentile (50%).
firstQuartile | - | The first quartile of token intersections, equal to percentile (25%).
thirdQuartile | - | The third quartile of token intersections, equal to percentile (75%).
lowerBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects that have fewer common tokens than most can be considered outliers. An outlier is an object whose number of common tokens is less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
upperBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects that have more common tokens than most can be considered outliers. An outlier is an object whose number of common tokens is greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
histogram | nbins - the number of histogram bins. The default value is 10. start - the starting point of the histogram. binSize - the size of the bin. If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of match token intersections. It’s calculated as the number of intersections that are associated with a particular histogram bin.
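
The page doesn't say how start and binSize are derived when they're omitted. The sketch below assumes start = min and binSize = ceil((max - min + 1) / nbins), which is consistent with the histogram parameters shown in the example response on this page but remains a guess. The output mirrors the x/y pairs of that response.

```python
import math


def histogram(values, nbins=10, start=None, bin_size=None):
    """Count integer values per bin; returns a list of {"x": bin start, "y": count}."""
    low, high = min(values), max(values)
    if start is None:
        start = low  # assumption: the histogram starts at the minimum value
    if bin_size is None:
        # Assumption: bins cover the range max - min + 1, rounded up.
        bin_size = max(1, math.ceil((high - low + 1) / nbins))
    counts = [0] * nbins
    for value in values:
        index = int((value - start) // bin_size)
        if 0 <= index < nbins:  # values outside the bins are not counted
            counts[index] += 1
    return [{"x": start + i * bin_size, "y": count} for i, count in enumerate(counts)]


# Hypothetical numbers of token intersections for six object pairs.
print(histogram([0, 0, 1, 2, 2, 2], nbins=3))
```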

matchGroupsPerMatchDocument

This analyzes the match groups that are used to create a particular match document. It provides information about the distribution of match groups per match document. This analysis considers the entire set of known match groups and therefore, perMatchGroup and splitByMatchGroupType settings are ignored.

Table 3. Supported statistics for matchGroupsPerMatchDocument
Statistic name | Parameters | Description
min | - | The minimum number of match groups.
max | - | The maximum number of match groups.
range | - | The range of match group numbers. Equal to max - min + 1.
total | - | The total number of match groups used to build a match document (sum of match groups per match document).
mode | - | The most frequent number of match groups.
mean | - | Mean (average) value of the number of match groups.
std | - | Standard deviation of the number of match groups.
se | - | Standard error of the mean of the number of match groups.
skewness | - | The skewness of the distribution of match groups.
kurtosis | - | The kurtosis of the distribution of match groups.
variance | - | The variance of the distribution of match groups.
median | - | The median of the number of match groups, equal to percentile (50%).
firstQuartile | - | The first quartile of the number of match groups, equal to percentile (25%).
thirdQuartile | - | The third quartile of the number of match groups, equal to percentile (75%).
lowerBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects with few match groups in a match document can be considered outliers. An outlier is an object whose number of match groups is less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
upperBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects with many match groups in a match document can be considered outliers. An outlier is an object whose number of match groups is greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
histogram | nbins - the number of histogram bins. The default value is 10. start - the starting point of the histogram. binSize - the size of the bin. If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the number of match groups. It’s calculated as the number of match groups in a match document that are associated with a particular histogram bin.
mostFrequent | k - the number of most frequent match groups to return. The default is 10. | The list of most frequent match groups is returned along with their corresponding frequencies.
covariance | - | The covariance of match groups in a match document. The result is a list of match group pairs with the covariance value.
correlation | - | The correlation of match groups in a match document. The result is a list of match group pairs with the correlation value.

matchDocumentsPerMatchGroup

This analyzes and provides information about the distribution of match documents that have specific match groups. The analysis considers the entire set of match groups and therefore, perMatchGroup and splitByMatchGroupType settings are ignored.

Table 4. Supported statistics for matchDocumentsPerMatchGroup
Statistic name | Parameters | Description
frequencies | - | The number of match documents that have a particular match group.
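
The frequencies statistic can be pictured as a count of match documents per match group. The following sketch assumes a simple in-memory mapping from a match document ID to the match group URIs it was built from. That input shape and the function name are hypothetical, while the output mirrors the x/frequency pairs in the example response on this page.

```python
from collections import Counter


def frequencies(match_documents):
    """match_documents: mapping of document id -> iterable of match group URIs."""
    counts = Counter()
    for groups in match_documents.values():
        counts.update(set(groups))  # count each group once per document
    return [{"x": uri, "frequency": n} for uri, n in counts.most_common()]


# Hypothetical match documents and shortened match group names.
documents = {
    "doc1": ["SameFirstName", "SameLastName"],
    "doc2": ["SameFirstName"],
    "doc3": ["SameFirstName", "SameLastName", "SameLastName"],
}
print(frequencies(documents))
```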

matchDocumentMatches

This analyzes match documents and their matches based on the specified match groups. It provides information about the distribution of match document matches across match groups.

You can analyze a particular match group, a subset of match groups that have a specific type, or the entire set of match groups. However, the analysis is computationally expensive, so the perMatchGroup and splitByMatchGroupType settings are false by default.

The match is detected when two match documents are equal based on a match group. The detection happens by comparing objects directly by using the match group without considering the generated match tokens.

Table 5. Supported statistics for matchDocumentMatches
Statistic name | Parameters | Description
min | - | The minimum number of matches.
max | - | The maximum number of matches.
range | - | The range of matches. Equal to max - min + 1.
total | - | The total number of detected matches. A match between object 1 and object 2 is considered the same as a match between object 2 and object 1 and is counted as one match.
mode | - | The most frequent number of matches.
mean | - | Mean (average) value of the number of matches.
std | - | Standard deviation of the number of matches.
se | - | Standard error of the mean of the number of matches.
skewness | - | The skewness of the distribution of matches.
kurtosis | - | The kurtosis of the distribution of matches.
variance | - | The variance of the distribution of matches.
median | - | The median of the number of matches, equal to percentile (50%).
firstQuartile | - | The first quartile of the number of matches, equal to percentile (25%).
thirdQuartile | - | The third quartile of the number of matches, equal to percentile (75%).
lowerBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects with few matches in a match document can be considered outliers. An outlier is an object whose number of matches is less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
upperBoundOutliers | k - the coefficient used to evaluate outliers. The default is 1.5. | Objects with many matches in a match document can be considered outliers. An outlier is an object whose number of matches is greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient.
histogram | nbins - the number of histogram bins. The default value is 10. start - the starting point of the histogram. binSize - the size of the bin. If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the number of matches. It’s calculated as the number of matches in a match document that are associated with a particular histogram bin.
covariance | - | The covariance of matches in a match document. The result is a list of matches with the covariance value.
correlation | - | The correlation of matches in a match document. The result is a list of matches with the correlation value.
conditionalCorrelation | - | The correlation coefficient of two match groups when at least one match group's result is positive.
countFrequency | - | The number of entities that have a specified number of matches.
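
The total and countFrequency statistics above can be illustrated with a small sketch. It assumes matches arrive as pairs of object IDs that may be listed in both directions. The input shape and function name are hypothetical, and entities without any match don't appear in the result.

```python
from collections import Counter


def match_totals(matches):
    """matches: iterable of (id1, id2) pairs, possibly listed in both directions."""
    # A match between 1 and 2 is the same as a match between 2 and 1.
    unique = {frozenset(pair) for pair in matches if pair[0] != pair[1]}
    per_object = Counter(obj for pair in unique for obj in pair)
    # countFrequency: how many entities have a given number of matches.
    count_frequency = Counter(per_object.values())
    return len(unique), dict(per_object), dict(count_frequency)


total, per_object, count_frequency = match_totals([("a", "b"), ("b", "a"), ("a", "c")])
print(total, per_object, count_frequency)
```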

Retrieving the results of profiling

Requests:

GET /tools/{tenantId}/analyzeMatchRules/v2/profiling/{profilingId}

Example Response

{
  "status": "COMPLETED",
  "startTimestamp": 1569931108398,
  "startTime": "2019-10-01T11:58:28.398Z",
  "finishTimestamp": 1569931372217,
  "finishTime": "2019-10-01T12:02:52.217Z",
  "duration": 263819,
  "totalObjectsProcessed": 1000,
  "useSkippedRules": true,
  "scopes": [ "ALL", "NONE", "INTERNAL", "EXTERNAL" ],
  "entityTypes": [
    {
      "uri": "configuration/entityTypes/Individual",
      "objectsProcessed": 1000,
      "matchToken": {
        "enabled": true,
        "perMatchGroup": true,
        "splitByMatchGroupType": true,
        "projections": [
          {
            "type": "match-group-single",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              }
            ],
            "statistics": [
              {
                "name": "total",
                "value": 989
              },
              {
                "name": "mode",
                "value": [1]
              },
              {
                "name": "mostFrequent",
                "parameters": {
                  "k": 10
                },
                "value": [{"item": "firstname23 0","count": 10},{"item": "firstname40 0","count": 9}]
              },
              {
                "name": "histogram",
                "parameters": {
                  "start": 0,
                  "nbins": 10,
                  "binSize": 1
                },
                "value": [{"x": 0,"y": 328},{"x": 1,"y": 355},{"x": 2,"y": 317}]
              }
            ]
          },
          {
            "type": "match-group-single",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              }
            ],
            "statistics": []
          },
          {
            "type": "match-group-type-suspect",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": []
          },
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": []
          }
        ]
      },
      "matchTokenIntersections": {
        "enabled": true,
        "perMatchGroup": true,
        "splitByMatchGroupType": true,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "upperBoundOutliers",
                "parameters": {
                  "k": 1.5
                },
                "value": [
                  {
                    "ids": ["id781","id681"],
                    "token": "lastname81 1",
                    "count": 90
                  }
                ],
                "details": {
                  "fence": 33
                }
              }
            ]
          }
        ]
      },
      "matchGroupsPerMatchDocument": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "mostFrequent",
                "parameters": {
                  "k": 10
                },
                "value": [
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "count": 672
                  },
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "count": 669
                  },
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "count": 451
                  }
                ]
              },
              {
                "name": "covariance",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": 0.001433
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.148076
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.14943
                  }
                ]
              },
              {
                "name": "correlation",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": 0.006482
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.63322
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.637534
                  }
                ]
              }
            ]
          }
        ]
      },
      "matchDocumentsPerMatchGroup": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "frequencies",
                "value": [
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "frequency": 672
                  },
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "frequency": 451
                  },
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "frequency": 669
                  }
                ]
              }
            ]
          }
        ]
      },
      "matchDocumentMatches": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "correlation",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": -0.540045
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.479254
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.479866
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  ]

The following table describes the sections that are available in the response:

Table 6. Section descriptions
Section Name Description
status The status of profiling.
startTimestamp The profiling start time in epoch milliseconds.
startTime The start time in human-readable form.
finishTimestamp The profiling finish time in epoch milliseconds.
finishTime The finish time in human-readable form.
duration The duration of profiling in milliseconds.
totalObjectsProcessed The total number of processed objects.
useSkippedRules Specifies whether match rules that are bypassed or skipped are included in the analysis. By default, skipped match rules are included.
scopes Defines the scopes of the match rules in the analysis. By default, ALL, NONE, INTERNAL and EXTERNAL scopes are evaluated.
entityTypes The array of results for each entity type.
entityTypes.uri The URI of the processed entity type.
entityTypes.objectsProcessed The number of accepted objects of a specific type.
entityTypes.matchToken The section that describes the analysis of match tokens for a specific entity type.
entityTypes.matchToken.projections The array of profiling results related to a specific set of match groups.
entityTypes.matchToken.projections.type The type of projection: match-groups-single, match-groups-type-suspect, match-groups-type-automatic, match-groups-type-relevance_based, and match-groups-all.
entityTypes.matchToken.projections.matchGroups The list of match groups used for profiling.
entityTypes.matchToken.projections.statistics The array of statistics results.
entityTypes.matchToken.projections.statistics.name The name of a particular statistics calculator.
entityTypes.matchToken.projections.statistics.value The value of the result. The result can be a single number (for mean, max, total, and so on), an array of numbers (mode), or an object or list of objects (such as correlation).
entityTypes.matchToken.projections.statistics.parameters The parameters of the statistics calculator.
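
As a rough illustration of how the nested sections fit together, the following Python sketch walks a parsed result and lists the match group pairs whose correlation is strong. The function name correlated_pairs, the 0.5 threshold, and the trimmed sample document are assumptions made for this example; only the field names and the correlation values are taken from the response above.

```python
import json

# Trimmed sample shaped like the response above. The entity type URI and the
# exact nesting under "entityTypes" are assumptions for this illustration.
SAMPLE = json.loads("""
{
  "entityTypes": [
    {
      "uri": "configuration/entityTypes/Individual",
      "matchDocumentMatches": {
        "projections": [
          {
            "type": "match-group-all",
            "statistics": [
              {
                "name": "correlation",
                "value": [
                  {"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                   "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                   "value": -0.540045},
                  {"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                   "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                   "value": 0.479254},
                  {"t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                   "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                   "value": 0.479866}
                ]
              }
            ]
          }
        ]
      }
    }
  ]
}
""")


def short_name(uri):
    """Return the last path segment of a match group URI."""
    return uri.rsplit("/", 1)[-1]


def correlated_pairs(result, section, threshold):
    """Collect (entity type URI, group 1, group 2, value) where |value| >= threshold."""
    pairs = []
    for entity_type in result.get("entityTypes", []):
        analysis = entity_type.get(section, {})
        for projection in analysis.get("projections", []):
            for stat in projection.get("statistics", []):
                if stat.get("name") != "correlation":
                    continue
                for pair in stat["value"]:
                    if abs(pair["value"]) >= threshold:
                        pairs.append((entity_type.get("uri"),
                                      short_name(pair["t1"]),
                                      short_name(pair["t2"]),
                                      pair["value"]))
    return pairs


for row in correlated_pairs(SAMPLE, "matchDocumentMatches", 0.5):
    print(row)
```

The same loop applies to any section that carries a projections array, so the section name is passed as an argument rather than hard-coded.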

Count frequency calculator

A match rule can generate many tokens for an entity. A high number of tokens overloads the storage during matching. If the number of tokens exceeds a certain value, such as 1000, then the tokens aren’t written to the database and such entities don’t participate in matching. If no tokens are generated, then the match rule or set of match rules doesn’t participate in matching.

The analyzer includes a calculator that counts the entities whose token count falls within a specified lower and upper limit. If the lower value is omitted, the lower limit is 0. If the upper value is omitted, the upper limit is effectively infinite (the default is 2147483647). If neither limit is specified, the calculator's result is the total number of entities.

The calculator applies to matchToken analysis only. The calculator works for a single match rule, a set of match rules having a specific type, or all match rules.

Table 7. Parameters
Name Default Description
lower 0 The number that indicates the lower limit.
upper 2147483647 The number that indicates the upper limit.
examplesAmount 10 The maximum number of example entities to return. Each example pairs an entity with its number of tokens within the specified limits.
examplesOrder ANY Enables you to sort the identified entities in ascending or descending order. By default, there’s no ordering.
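
To make the parameters concrete, here is a minimal Python sketch of the counting logic as described above. It is not the analyzer's implementation. The function name count_frequency and the input mapping of entity IDs to token counts are hypothetical, and the sketch assumes that both limits are inclusive and that examples are ordered by token count.

```python
INT_MAX = 2147483647  # default value of "upper" from the parameters table


def count_frequency(token_counts, lower=0, upper=INT_MAX,
                    examples_amount=10, examples_order="ANY"):
    """Count entities whose token count lies in [lower, upper].

    token_counts maps an entity ID to its number of tokens (hypothetical input).
    Inclusive limits and ordering by token count are assumptions.
    """
    in_range = [(entity_id, count)
                for entity_id, count in token_counts.items()
                if lower <= count <= upper]
    if examples_order == "ASC":
        in_range.sort(key=lambda item: item[1])
    elif examples_order == "DESC":
        in_range.sort(key=lambda item: item[1], reverse=True)
    # "ANY" leaves the examples in whatever order they were found.
    examples = [{"id": entity_id, "count": count}
                for entity_id, count in in_range[:examples_amount]]
    return {"frequency": len(in_range), "examples": examples}


counts = {"id71": 3, "id61": 2, "id13": 1, "id99": 7, "id00": 0}
print(count_frequency(counts, lower=1, upper=5,
                      examples_amount=3, examples_order="DESC"))
# With no limits, the frequency is the total number of entities.
print(count_frequency(counts)["frequency"])
```

Setting lower to 0 and upper to 0 in such a model would isolate entities that produced no tokens, which is the case the calculator is meant to expose.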

Example of the Calculator Payload

{
  "name": "countFrequency",
  "enabled": true,
  "parameters": {
    "examplesOrder": "ASC",
    "lower": 300,
    "examplesAmount": 20
  }
}
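
The payload above can also be assembled programmatically. The following Python sketch builds the same structure and leaves out any parameter that isn't set, so that the documented defaults apply. The helper name count_frequency_payload and its validation rules are assumptions for illustration (the allowed order values are taken from the ASC and DESC examples and the ANY default), and where the calculator entry is placed inside the profiling section of the request isn't shown here.

```python
import json

ALLOWED_ORDERS = {"ASC", "DESC", "ANY"}  # assumed from the examples and the default


def count_frequency_payload(lower=None, upper=None,
                            examples_amount=None, examples_order=None):
    """Build a countFrequency calculator payload; unset parameters are omitted."""
    if examples_order is not None and examples_order not in ALLOWED_ORDERS:
        raise ValueError(f"unsupported examplesOrder: {examples_order}")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("lower must not be greater than upper")
    parameters = {}
    if examples_order is not None:
        parameters["examplesOrder"] = examples_order
    if lower is not None:
        parameters["lower"] = lower
    if upper is not None:
        parameters["upper"] = upper
    if examples_amount is not None:
        parameters["examplesAmount"] = examples_amount
    return {"name": "countFrequency", "enabled": True, "parameters": parameters}


payload = count_frequency_payload(lower=300, examples_amount=20, examples_order="ASC")
print(json.dumps(payload, indent=2))
```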

Example of the Calculator Result

{
  "name": "countFrequency",
  "parameters": {
    "examplesOrder": "DESC",
    "lower": 1,
    "upper": 5,
    "examplesAmount": 3
  },
  "value": {
    "frequency": 211,
    "examples": [
      {
        "id": "id71",
        "count": 3
      },
      {
        "id": "id61",
        "count": 2
      },
      {
        "id": "id13",
        "count": 1
      }
    ]
  }
}

The following table explains the sections in the JSON output:

Table 8. JSON sections
Section Description
name Name of the calculator.
parameters Parameter values that the calculator used.
parameters.lower The lower limit that was applied.
parameters.upper The upper limit that was applied.
parameters.examplesOrder The examplesOrder value that was applied.
parameters.examplesAmount The examplesAmount value that was applied.
value Result of the calculator.
value.frequency Number of entities whose token count is within the specified limits.
value.examples Examples of entities whose token count is within the specified limits.
value.examples.id Identifier of the example entity.
value.examples.count Number of tokens for the example entity.
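
As a sketch of how these sections relate to each other, the following Python function checks a calculator result against its own parameters. The function name check_result is hypothetical, and the checks assume that the limits are inclusive and that examples are ordered by token count. The sample is the calculator result shown earlier.

```python
import json

EXAMPLE_RESULT = json.loads("""
{
  "name": "countFrequency",
  "parameters": {"examplesOrder": "DESC", "lower": 1, "upper": 5, "examplesAmount": 3},
  "value": {
    "frequency": 211,
    "examples": [
      {"id": "id71", "count": 3},
      {"id": "id61", "count": 2},
      {"id": "id13", "count": 1}
    ]
  }
}
""")


def check_result(result):
    """Return a list of inconsistencies between a result and its parameters."""
    params = result.get("parameters", {})
    lower = params.get("lower", 0)
    upper = params.get("upper", 2147483647)
    amount = params.get("examplesAmount", 10)
    order = params.get("examplesOrder", "ANY")
    examples = result["value"]["examples"]
    counts = [example["count"] for example in examples]

    problems = []
    if len(examples) > amount:
        problems.append("more examples than examplesAmount")
    if any(not (lower <= count <= upper) for count in counts):
        problems.append("example token count outside the limits")
    if order == "ASC" and counts != sorted(counts):
        problems.append("examples are not in ascending order")
    if order == "DESC" and counts != sorted(counts, reverse=True):
        problems.append("examples are not in descending order")
    if result["value"]["frequency"] < len(examples):
        problems.append("frequency is smaller than the number of examples")
    return problems


print(check_result(EXAMPLE_RESULT))
```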