Match Rule Analyzer version 2 (Dynamic)

Learn how to profile the matching process using the Match Rule Analyzer.

Overview

You can use Match Rule Analyzer version 2 to profile match rules based on a relevant subset of data in a tenant and view statistics for data that is profiled.

You can use this information and recommendations to:

Analyze match rules and identify rules that are causing performance issues.
Tune match rules to ensure optimum performance before using them on your production data.
Remove rules that aren’t being used for matching.

You can also inspect a subset of data in your tenant in the following ways:

Request Body

The request body of Match Rule Analyzer version 2 contains a new section called profiling. In this section, you can configure several types of analysis and generate statistics that give detailed insight into the overall matching process for your tenant.

Requests:

POST /tools/{tenantId}/analyzeMatchRules/v2

Example request body

{
  "entityTypes": [
    "Type1",
    "Type2",
    "Type3"
  ],
  "profiling": {
    "enabled": true
  },
  "model": {
     
  }
}

If entityTypes is empty or missing, then all entity types from the configuration are analyzed.
If the profiling section is empty or missing, then no analysis is performed. The enabled flag is false by default.
The model section contains a business (L3) configuration that needs testing. If no model is specified, then the current tenant business configuration is tested.

Note: When the match rule analysis is complete, an email notification is sent to the user who submitted the request.
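
The request described above can be assembled programmatically. The following is a minimal sketch, not an official client: the base URL, tenant ID, token, and the bearer-token Authorization header are assumptions for illustration, and only the path /tools/{tenantId}/analyzeMatchRules/v2 and the body fields come from this page. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_analyze_request(base_url, tenant_id, token, entity_types=None, profiling=None):
    """Build (without sending) a POST request for /tools/{tenantId}/analyzeMatchRules/v2."""
    # Profiling is off by default, so the sketch enables it unless told otherwise.
    body = {"profiling": profiling if profiling is not None else {"enabled": True}}
    # Omitting entityTypes means all entity types in the configuration are analyzed.
    if entity_types:
        body["entityTypes"] = list(entity_types)
    url = f"{base_url.rstrip('/')}/tools/{tenant_id}/analyzeMatchRules/v2"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication; adjust to your environment.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical base URL, tenant ID, and token, for illustration only.
request = build_analyze_request("https://example.invalid/api", "abcd", "TOKEN", ["Type1", "Type2"])
print(request.get_method(), request.full_url)
```

To actually submit the request, pass the object to urllib.request.urlopen. The model section is left out here, so the current tenant business configuration would be tested.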

Example response

{
  "profiling": {
    "description": "Summary about execution of the match groups based on the existing data (tenantId=abcd)",
    "id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
    "uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395"
  }
}

The response contains a description of the profiling task, its unique identifier, and the URI to request the profiling results. You can also use the identifier to access the profiling task directly through the Tasks API.
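
As a small illustration of how the id and uri fields might be used, the sketch below derives the path for retrieving the results from a submit response. It assumes the returned uri is relative to /tools/{tenantId}/, which is how the retrieval endpoint on this page is laid out. The helper name is hypothetical.

```python
def profiling_results_path(tenant_id, submit_response):
    """Return the path to poll for results, given the submit response as a dict."""
    profiling = submit_response["profiling"]
    # Assumption: the uri field is relative to /tools/{tenantId}/.
    return f"/tools/{tenant_id}/{profiling['uri'].lstrip('/')}"


# The example response shown above, as a Python dict.
submit_response = {
    "profiling": {
        "description": "Summary about execution of the match groups based on the existing data (tenantId=abcd)",
        "id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
        "uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395",
    }
}
path = profiling_results_path("abcd", submit_response)
print(path)
```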

Profiling section

The following subsections are available under the profiling section:

{
    "profiling": {
        "enabled": true,
        "maxObjectsPerType": 5000,
        "timeout": 3600,
        "samplingAlgorithm": {},
        "inspections": {},
        "wordAnalysis": {},
        "analysisTypes": [],
        "useSkippedRules": false,
        "scopes": [
            "EXTERNAL",
            "ALL",
            "INTERNAL",
            "NONE"
        ]
    }
}

The following table explains the fields in the profiling section:

Field	Description
`enabled`	Profiling is enabled when this field is set to `true`. It’s set to `false` by default.
`maxObjectsPerType`	Specifies the maximum number of objects for each type of match rule.
`timeout`	Specifies the timeout period in seconds. Processing a large number of entities can be time-consuming and can cause failures. The default timeout is 3600 seconds. You can increase it to 7200 seconds to avoid failures.
`samplingAlgorithm`	Specifies the sampling algorithm used to build a subset of entities. There are three supported sampling algorithms: SEQUENCE, SEARCH and MATCHES_AWARE.
`inspections`	Specifies the type of inspection that is running on a subset of data on the tenant.
`analysisTypes`	Specifies the type of analysis that is checking the matching performance of the tenant.
`useSkippedRules`	Specifies whether match rules that are bypassed or skipped are included in the analysis. By default, skipped match rules are included in the analysis.
`scopes`	Defines the scopes of the match rules in the analysis. By default, `ALL`, `NONE`, `INTERNAL` and `EXTERNAL` scopes are evaluated.

maxObjectsPerType

This field sets the maximum number of objects for profiling. The default value is 10000. The value must be less than or equal to 100000000. You can’t use a value greater than 20000 when matchDocumentMatches is used. Disable matchDocumentMatches to use the maximum limit of 100000000.

{
    "profiling": {
        "enabled": true,
        "maxObjectsPerType": 5000,
        "timeout": 3600,
        "samplingAlgorithm": {
            "name": "..."
        },
        "inspections": {},
        "wordAnalysis": {},
        "analysisTypes": []
    }
}

Note: name is the name of the sampling algorithm. For example, SEQUENCE, SEARCH and MATCHES_AWARE.
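
The limits on maxObjectsPerType can be checked before a request is submitted. The following is a rough sketch under stated assumptions: it treats matchDocumentMatches as "used" only when it is listed explicitly in analysisTypes and not disabled, which may not match how the service decides. The function name and error messages are hypothetical.

```python
DEFAULT_MAX_OBJECTS = 10_000          # default stated on this page
MAX_OBJECTS_LIMIT = 100_000_000       # absolute upper limit
MAX_OBJECTS_WITH_MATCHES = 20_000     # limit when matchDocumentMatches is used


def check_max_objects(profiling):
    """Return a list of problems with maxObjectsPerType; an empty list means OK."""
    value = profiling.get("maxObjectsPerType", DEFAULT_MAX_OBJECTS)
    # Assumption: an analysis counts as used only if listed and not disabled.
    used = {
        a.get("analysisType")
        for a in profiling.get("analysisTypes", [])
        if a.get("enabled", True)
    }
    problems = []
    if value > MAX_OBJECTS_LIMIT:
        problems.append(f"maxObjectsPerType must be <= {MAX_OBJECTS_LIMIT}")
    if "matchDocumentMatches" in used and value > MAX_OBJECTS_WITH_MATCHES:
        problems.append(
            f"maxObjectsPerType must be <= {MAX_OBJECTS_WITH_MATCHES} with matchDocumentMatches"
        )
    return problems


print(check_max_objects({
    "maxObjectsPerType": 50_000,
    "analysisTypes": [{"analysisType": "matchDocumentMatches"}],
}))
```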

Sampling Algorithm

There are three supported sampling algorithms:

SEQUENCE - Iterates through entities as they’re stored in the database. You can specify the starting point of the iteration by using the rangeStart field, which corresponds to startToken in the database.

{
    "name": "SEQUENCE",
    "rangeStart": 123456789123
}

SEARCH - Works over a set of identified objects. The object URIs are provided directly in the section or searched by using the specified queries. If objects are defined explicitly in the objects.includeList section, then the list is passed directly to the profiling task. The task reads the objects, excluding those specified in the objects.excludeList section, and filters them by using the query, if one is specified. If no objects are explicitly defined, then QueryObjectsTask is submitted to search the objects for the specified query.

{
    "objects": {
        "includeList": [
            "entities/ID1",
            "entities/ID2",
            "entities/ID3"
        ],
        "excludeList": [
            "entities/ID4",
            "entities/ID5"
        ]
    },
    "query": [
        {
            "filter": "equals(attributes.Address.City, 'Las Vegas') and equals(type, 'configuration/entityTypes/HCP')",
            "activeness": "active",
            "options": "searchByOv"
        },
        {
            "filter": "equals(attributes.Address.State, 'Ohio') and equals(type, 'configuration/entityTypes/HCO')",
            "activeness": "all"
        }
    ]
}

MATCHES_AWARE - Uses the existing match information in the tenant to build a subset of entities. In addition to other sampling algorithms, you can use the correlation and conditionalCorrelation calculators, which are optimized to provide accurate information for matchDocumentMatches analysis.

Correlation Calculator
The correlation coefficient C(MR₁, MR₂) of two match rules MR₁ and MR₂ is calculated from the following terms:
- N_obs - the number of observations. The number of observations depends on the calculator parameter emulateWholeSubsetComparisonByFakeNotMatches. The default value is false.
  Note:
  - If set to false, the number of observations is equal to the number of entity pairs.
  - If set to true, the number of observations is equal to the number of entity pairs plus the number of possible entity pairs that were not considered because the result of the comparison MR₁(E_x, E_y) == MR₂(E_x, E_y) is zero.
  - All the entities are compared to each other, when emulateWholeSubsetComparisonByFakeNotMatches=true.
- The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter, which is set to true by default. The calculator generalizes the outcomes of automatic and suspect match rules into the known outcomes of relevance_based rules: automatic maps to auto_merge, and suspect maps to potential_match. There are two approaches to comparing the outcomes of two rules of any type:
  - Strict equality of outcomes. For example, when treatSimilarMatchActionAsSame=false:
    - auto_merge==auto_merge === 1
    - auto_merge==potential_match === 0
    - <no outcome>==auto_merge === 0
    - <no outcome>==<no outcome> === 1
    - <no outcome>==not_a_match === 0
    - and so on
  - Grouping positive and negative outcomes. For example, when treatSimilarMatchActionAsSame=true:
    - auto_merge==auto_merge === 1
    - auto_merge==potential_match === 1
    - <no outcome>==auto_merge === 0
    - <no outcome>==<no outcome> === 1
    - <no outcome>==not_a_match === 1
- The formula is based on the fraction of observations in which the two match rules produce similar outcomes. The values -0.5 and 2 are used in the formula so that the result clearly represents the following:
  - C(MR₁, MR₂) = 1: full correlation (all outcomes are the same)
  - C(MR₁, MR₂) = -1: anti-correlation (all outcomes are different)
  - C(MR₁, MR₂) = 0: non-correlation (no correlation between outcomes)
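
The formula itself is not reproduced in this text. The sketch below shows one form that is consistent with the description above, namely C = 2 × (fraction of observations with the same outcome - 0.5). Treat that form as an assumption rather than the analyzer's definition. None stands for <no outcome>, and the function names are hypothetical.

```python
# Outcomes treated as positive when similar match actions are grouped.
POSITIVE = {"auto_merge", "potential_match"}


def same_outcome(a, b, treat_similar_as_same=True):
    """Compare two rule outcomes; None stands for <no outcome>."""
    if treat_similar_as_same:
        # Grouping: positive outcomes match each other, and so do negative ones.
        return (a in POSITIVE) == (b in POSITIVE)
    # Strict equality of outcomes.
    return a == b


def correlation(outcomes_1, outcomes_2, treat_similar_as_same=True):
    """Assumed form: 2 * (share of observations with the same outcome - 0.5)."""
    if len(outcomes_1) != len(outcomes_2) or not outcomes_1:
        raise ValueError("need two non-empty outcome lists of equal length")
    same = sum(
        same_outcome(a, b, treat_similar_as_same)
        for a, b in zip(outcomes_1, outcomes_2)
    )
    return 2 * (same / len(outcomes_1) - 0.5)


# Outcomes of two rules for the same four entity pairs (hypothetical data).
mr1 = ["auto_merge", "potential_match", None, "not_a_match"]
mr2 = ["potential_match", "auto_merge", "not_a_match", None]
print(correlation(mr1, mr2), correlation(mr1, mr2, treat_similar_as_same=False))
```

With grouping enabled, every pair above counts as the same outcome. With strict equality, none of them does.
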
Conditional Correlation Calculator
The conditional correlation coefficient of two match rules MR₁ and MR₂ is calculated from the following terms:
- N_obs([MR₁=]) - the number of observations where the outcome of MR₁ is positive.
- The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter option, which is set to true by default.

Analysis types

The following analysis types are available:

matchToken
matchTokenIntersections
matchGroupsPerMatchDocument
matchDocumentsPerMatchGroup
matchDocumentMatches

You can specify each analysis type by using a section as follows:

{
    "analysisType": "analysis name",
    "perMatchGroup": false,
    "splitByMatchGroupType": false,
    "statistics": [
        {
            "name": "statistic name",
            "enabled": true,
            "parameters": [
                {
                    "name": "parameterName",
                    "value": "parameterValue or a JSON object"
                }
            ]
        }
    ]
}

The analysis is enabled by default and can be disabled if required. The analysis can be performed on match groups in the following ways:

perMatchGroup flag - runs the analysis for each individual match group. The default value is true.
splitByMatchGroupType flag - runs the analysis for each set of match groups that have the same rule type: suspect, automatic, or relevance-based. The default value is true.
all match groups - runs the analysis for all match groups together.

Some analyses can be run only for all match groups. For these, the perMatchGroup and splitByMatchGroupType settings are ignored.

If you don’t specify a statistics section, then the default set of statistics is used. Each statistic definition contains a name, enabled flag, and the parameters list. Each parameter contains the name and value fields. If a statistic has enabled=false, then it isn’t calculated. If some statistics are specified as disabled, then the corresponding default statistics are used. For most statistics, parameters are required but for some, no parameters are required.
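
The structure of an analysis type section can be generated with a couple of small helpers. This is an illustrative sketch only: the helper names are hypothetical, and the output simply mirrors the analysisType, statistics, name, enabled, and parameters fields described above.

```python
import json


def statistic(name, enabled=True, **parameters):
    """Build one statistic definition; keyword arguments become name/value parameters."""
    entry = {"name": name, "enabled": enabled}
    if parameters:
        entry["parameters"] = [{"name": k, "value": v} for k, v in parameters.items()]
    return entry


def analysis(analysis_type, statistics=None, per_match_group=None, split_by_type=None):
    """Build one analysis section; unset flags are omitted so the defaults apply."""
    section = {"analysisType": analysis_type}
    if per_match_group is not None:
        section["perMatchGroup"] = per_match_group
    if split_by_type is not None:
        section["splitByMatchGroupType"] = split_by_type
    if statistics:
        # Omitting the statistics section selects the default set of statistics.
        section["statistics"] = list(statistics)
    return section


section = analysis(
    "matchToken",
    statistics=[statistic("max"), statistic("min", enabled=False), statistic("histogram", nbins=10)],
    per_match_group=True,
)
print(json.dumps(section, indent=4))
```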

You can use the following types of analysis to analyze the matching performance of your tenant:

matchToken

This analyzes the match tokens generated by different match groups for each object. It provides information about how match tokens are distributed across match groups in the relevant subset of data.

Table 1. Supported statistics for matchToken
Statistic name	Parameters	Description
`min`		The minimum number of tokens generated for an object.
`max`		The maximum number of tokens generated for an object.
`range`		The range of token numbers. Equal to `max - min + 1`.
`total`		The total number of tokens generated for all processed objects.
`mode`		The most frequent number of tokens for all processed objects.
`mean`		Mean (average) value of the number of tokens.
`std`		Standard deviation of the number of tokens.
`se`		Standard error of the mean of the generated tokens.
`skewness`		The skewness of the distribution of the generated tokens.
`kurtosis`		The kurtosis of the distribution of the generated tokens.
`variance`		The variance of the distribution of the generated tokens.
`median`		The median of the token numbers, equal to percentile (50%).
`firstQuartile`		The first quartile of the number of tokens, equal to percentile (25%).
`thirdQuartile`		The third quartile of the number of tokens, equal to percentile (75%).
`lowerBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects that have fewer tokens than most can be considered outliers. An outlier is an object whose number of tokens is less than `Q₁ - k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`upperBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects that have more tokens than most can be considered outliers. An outlier is an object whose number of tokens is greater than `Q₃ + k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`mostFrequent`	`k` - the number of most frequent tokens to return. The default is `10`.	The list of most frequent tokens is returned with their corresponding frequencies.
`histogram`	`nbins` - the number of histogram bins. The default value is `10`. `start` - the starting point of the histogram. `binSize` - the size of the bin. If `start` and `binSize` aren’t specified, they’re calculated using the `nbins`, `min`, and `max` values.	The histogram is a representation of the distribution of the generated tokens. It’s calculated as the number of tokens that are associated with a particular histogram bin.

The following example shows the use of some statistics:

{
      "enabled": true,
      "perMatchGroup": true,
      "splitByMatchGroupType": true,
      "statistics": [
        {
          "name": "max",
          "enabled": true
        },
        {
          "name": "min",
          "enabled": false
        },
        {
          "name": "histogram",
          "parameters": [
            {
              "name": "nbins",
              "value": 10
            },
            {
              "name": "start",
              "value": 0
            },
            {
              "name": "binSize",
              "value": 100
            }
          ]
        }
      ]
}
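
The lowerBoundOutliers and upperBoundOutliers statistics use Tukey's fences. The sketch below illustrates that definition on per-object token counts. How the analyzer computes quartiles is not stated on this page, so the "inclusive" quantile method used here is an assumption, and the function name and sample data are hypothetical.

```python
import statistics


def tukey_outliers(token_counts, k=1.5):
    """Split objects into lower and upper outliers using Tukey's fences.

    token_counts maps an object ID to its number of generated tokens.
    """
    # Assumption: quartiles via the "inclusive" method; the analyzer may differ.
    q1, _, q3 = statistics.quantiles(token_counts.values(), n=4, method="inclusive")
    lower_fence = q1 - k * (q3 - q1)
    upper_fence = q3 + k * (q3 - q1)
    return {
        "lowerFence": lower_fence,
        "upperFence": upper_fence,
        "lower": sorted(o for o, c in token_counts.items() if c < lower_fence),
        "upper": sorted(o for o, c in token_counts.items() if c > upper_fence),
    }


# Hypothetical token counts for nine objects; e9 generates far more tokens.
counts = {"e1": 1, "e2": 2, "e3": 3, "e4": 4, "e5": 5, "e6": 6, "e7": 7, "e8": 8, "e9": 100}
result = tukey_outliers(counts)
print(result)
```

For this data, Q₁ is 3 and Q₃ is 7, so the fences are -3 and 13 and only e9 is reported as an upper outlier.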

matchTokenIntersections

This provides information about the distribution of match token intersections in the relevant subset of data. The intersection of match tokens of two objects is a subset of match tokens that are common to both objects.

Table 2. Supported statistics for matchTokenIntersections
Statistic name	Parameters	Description
`correlation`		It’s the ratio of the number of entity pairs identified as candidates to the total number of processed entity pairs of two match rules. The value of the correlation is between 0 and 1, where 0 indicates no intersection between match pairs and 1 indicates that the two sets of match pairs are exactly the same. `firstConditionalCorrelation=1` - all the candidate pairs found by the first match rule were also found by the second match rule. `secondConditionalCorrelation=1` - all the candidate pairs found by the second match rule were also found by the first match rule.
`min`		The minimum number of match token intersections.
`max`		The maximum number of match token intersections.
`range`		The range of token intersections. Equal to `max - min + 1`.
`total`		The total number of intersections.
`mode`		The most frequent number of intersections.
`mean`		Mean (average) value of the number of token intersections.
`std`		Standard deviation of the number of token intersections.
`se`		Standard error of the mean of the number of token intersections.
`skewness`		The skewness of the distribution of token intersections.
`kurtosis`		The kurtosis of the distribution of token intersections.
`variance`		The variance of the distribution of token intersections.
`median`		The median of token intersections, equal to percentile (50%).
`firstQuartile`		The first quartile of token intersections, equal to percentile (25%).
`thirdQuartile`		The third quartile of token intersections, equal to percentile (75%).
`lowerBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects that have fewer common tokens than most can be considered outliers. An outlier is an object whose number of common tokens is less than `Q₁ - k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`upperBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects that have more common tokens than most can be considered outliers. An outlier is an object whose number of common tokens is greater than `Q₃ + k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`histogram`	`nbins` - the number of histogram bins. The default value is `10`. `start` - the starting point of the histogram. `binSize` - the size of the bin. If `start` and `binSize` aren’t specified, they’re calculated using the `nbins`, `min`, and `max` values.	The histogram is a representation of the distribution of match token intersections. It’s calculated as the number of intersections that are associated with a particular histogram bin.
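
To make the notion of a match token intersection concrete, the sketch below computes the common tokens for every pair of objects. It only illustrates the definition given above and is not the analyzer's implementation. The object IDs and tokens are hypothetical.

```python
from itertools import combinations


def token_intersections(tokens_by_object):
    """Return the non-empty sets of common tokens for each pair of objects."""
    result = {}
    for a, b in combinations(sorted(tokens_by_object), 2):
        common = set(tokens_by_object[a]) & set(tokens_by_object[b])
        if common:
            result[(a, b)] = common
    return result


# Hypothetical match tokens generated for three objects.
tokens = {
    "e1": {"firstname23 0", "lastname81 1"},
    "e2": {"lastname81 1"},
    "e3": {"firstname40 0"},
}
intersections = token_intersections(tokens)
print(intersections)
```

Statistics such as min, max, and histogram in the table above would then be taken over the sizes of these intersections.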

matchGroupsPerMatchDocument

This analyzes the match groups that are used to create a particular match document. It provides information about the distribution of match groups per match document. This analysis considers the entire set of known match groups and therefore, perMatchGroup and splitByMatchGroupType settings are ignored.

Table 3. Supported statistics for matchGroupsPerMatchDocument
Statistic name	Parameters	Description
`min`		The minimum number of match groups.
`max`		The maximum number of match groups.
`range`		The range of match group numbers. Equal to `max - min + 1`.
`total`		The total number of match groups used to build match documents (the sum of match groups per match document).
`mode`		The most frequent number of match groups.
`mean`		Mean (average) value of the number of match groups.
`std`		Standard deviation of the number of match groups.
`se`		Standard error of the mean of the number of match groups.
`skewness`		The skewness of the distribution of match groups.
`kurtosis`		The kurtosis of the distribution of match groups.
`variance`		The variance of the distribution of match groups.
`median`		The median of the number of match groups, equal to percentile (50%).
`firstQuartile`		The first quartile of the number of match groups, equal to percentile (25%).
`thirdQuartile`		The third quartile of the number of match groups, equal to percentile (75%).
`lowerBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects with few match groups in a match document can be considered outliers. An outlier is an object whose number of match groups is less than `Q₁ - k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`upperBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects with many match groups in a match document can be considered outliers. An outlier is an object whose number of match groups is greater than `Q₃ + k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`histogram`	`nbins` - the number of histogram bins. The default value is `10`. `start` - the starting point of the histogram. `binSize` - the size of the bin. If `start` and `binSize` aren’t specified, they’re calculated using the `nbins`, `min`, and `max` values.	The histogram is a representation of the distribution of the number of match groups. It’s calculated as the number of match groups in a match document that are associated with a particular histogram bin.
`mostFrequent`	`k` - the number of most frequent match groups to return. The default is `10`.	The list of most frequent match groups is returned along with their corresponding frequencies.
`covariance`		The covariance of match groups in a match document. The result is a list of match group pairs with the covariance value.
`correlation`		The correlation of match groups in a match document. The result is a list of match group pairs with the correlation value.
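
One way to read the covariance and correlation statistics is to treat each match group as a 0/1 indicator per match document and compute the sample covariance and Pearson correlation of two indicators. This reading is an assumption, not something this page states. The data below is hypothetical, and the match group names are shortened for readability.

```python
from math import sqrt


def indicator(match_documents, group):
    """1 if the match group contributed to the match document, else 0."""
    return [1 if group in groups else 0 for groups in match_documents]


def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two samples."""
    return sample_covariance(xs, ys) / sqrt(sample_covariance(xs, xs) * sample_covariance(ys, ys))


# Hypothetical data: 1000 match documents, each with its set of match groups.
match_documents = (
    [{"SameFirstName", "SameLastName", "SameFirstLastName"}] * 451
    + [{"SameFirstName"}] * 221
    + [{"SameLastName"}] * 218
    + [set()] * 110
)
first = indicator(match_documents, "SameFirstName")
last = indicator(match_documents, "SameLastName")
cov = sample_covariance(first, last)
corr = pearson(first, last)
print(round(cov, 6), round(corr, 6))
```

The counts were chosen so that SameFirstName appears in 672 documents and SameLastName in 669. Under the assumptions above, the results round to 0.001433 and 0.006482.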

matchDocumentsPerMatchGroup

This analyzes and provides information about the distribution of match documents that have specific match groups. The analysis considers the entire set of match groups and therefore, perMatchGroup and splitByMatchGroupType settings are ignored.

Table 4. Supported statistics for matchDocumentsPerMatchGroup
Statistic name	Parameters	Description
`frequencies`		The number of match documents that have a particular match group.

matchDocumentMatches

This analyzes match documents and their matches based on the specified match groups. It provides information about the distribution of match document matches across match groups.

You can analyze a particular match group, a subset of match groups that have a specific type, or the entire set of match groups. However, the analysis is computationally expensive, and therefore the perMatchGroup and splitByMatchGroupType settings are false by default.

The match is detected when two match documents are equal based on a match group. The detection happens by comparing objects directly by using the match group without considering the generated match tokens.

Table 5. Supported statistics for matchDocumentMatches
Statistic name	Parameters	Description
`min`		The minimum number of matches.
`max`		The maximum number of matches.
`range`		The range of matches. Equal to `max - min + 1`.
`total`		The total number of detected matches (match between object 1 and object 2 is considered the same as match between object 2 and object 1 and included as one match).
`mode`		The most frequent number of matches.
`mean`		Mean (average) value of the number of matches.
`std`		Standard deviation of the number of matches.
`se`		Standard error of the mean of the number of matches.
`skewness`		The skewness of the distribution of matches.
`kurtosis`		The kurtosis of the distribution of matches.
`variance`		The variance of the distribution of matches.
`median`		The median of the number of matches, equal to percentile (50%).
`firstQuartile`		The first quartile of the number of matches, equal to percentile (25%).
`thirdQuartile`		The third quartile of the number of matches, equal to percentile (75%).
`lowerBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects with few matches in a match document can be considered outliers. An outlier is an object whose number of matches is less than `Q₁ - k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`upperBoundOutliers`	`k` - coefficient to evaluate outliers. The default is `1.5`.	Objects with many matches in a match document can be considered outliers. An outlier is an object whose number of matches is greater than `Q₃ + k (Q₃ - Q₁)` (Tukey’s fence definition), where `Q₁` is the first quartile, `Q₃` is the third quartile, and `k` is the specified coefficient.
`histogram`	`nbins` - the number of histogram bins. The default value is `10`. `start` - the starting point of the histogram. `binSize` - the size of the bin. If `start` and `binSize` aren’t specified, they’re calculated using the `nbins`, `min`, and `max` values.	The histogram is a representation of the distribution of the number of matches. It’s calculated as the number of matches in a match document that are associated with a particular histogram bin.
`covariance`		The covariance of matches in a match document. The result is a list of matches with the covariance value.
`correlation`		The correlation of matches in a match document. The result is a list of matches with the correlation value.
`conditionalCorrelation`		The correlation coefficient of two match groups when at least one match group's result is positive.
`countFrequency`		The number of entities that have a specified number of matches.
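
The total and countFrequency statistics can be illustrated with a short sketch. It assumes matches arrive as pairs of entity IDs, counts a match between two objects once regardless of direction, and then tallies how many entities have each number of matches. Entities with no matches don't appear in the tally here, which may differ from the analyzer. All names and data are hypothetical.

```python
from collections import Counter


def match_statistics(matches):
    """Return (total unique matches, {number of matches: number of entities})."""
    # A match between A and B is the same as a match between B and A.
    unique_pairs = {frozenset(pair) for pair in matches if pair[0] != pair[1]}
    per_entity = Counter()
    for pair in unique_pairs:
        for entity in pair:
            per_entity[entity] += 1
    count_frequency = Counter(per_entity.values())
    return len(unique_pairs), dict(count_frequency)


# Hypothetical detected matches; the first two describe the same match.
matches = [("e1", "e2"), ("e2", "e1"), ("e1", "e3")]
total, count_frequency = match_statistics(matches)
print(total, count_frequency)
```

Here e1 has two matches, while e2 and e3 have one each.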

Retrieving the request for profiling

Requests:

GET /tools/{tenantId}/analyzeMatchRules/v2/profiling/{profilingId}

Example response

{
  "status": "COMPLETED",
  "startTimestamp": 1569931108398,
  "startTime": "2019-10-01T11:58:28.398Z",
  "finishTimestamp": 1569931372217,
  "finishTime": "2019-10-01T12:02:52.217Z",
  "duration": 263819,
  "totalObjectsProcessed": 1000,
  "useSkippedRules": true,
  "scopes": [ "ALL", "NONE", "INTERNAL", "EXTERNAL" ],
  "entityTypes": [
    {
      "uri": "configuration/entityTypes/Individual",
      "objectsProcessed": 1000,
      "matchToken": {
        "enabled": true,
        "perMatchGroup": true,
        "splitByMatchGroupType": true,
        "projections": [
          {
            "type": "match-group-single",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              }
            ],
            "statistics": [
              {
                "name": "total",
                "value": 989
              },
              {
                "name": "mode",
                "value": [1]
              },
              {
                "name": "mostFrequent",
                "parameters": {
                  "k": 10
                },
                "value": [{"item": "firstname23 0","count": 10},{"item": "firstname40 0","count": 9}]
              },
              {
                "name": "histogram",
                "parameters": {
                  "start": 0,
                  "nbins": 10,
                  "binSize": 1
                },
                "value": [{"x": 0,"y": 328},{"x": 1,"y": 355},{"x": 2,"y": 317}]
              }
            ]
          },
          {
            "type": "match-group-single",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              }
            ],
            "statistics": []
          },
          {
            "type": "match-group-type-suspect",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": []
          },
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": []
          }
        ]
      },
      "matchTokenIntersections": {
        "enabled": true,
        "perMatchGroup": true,
        "splitByMatchGroupType": true,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "upperBoundOutliers",
                "parameters": {
                  "k": 1.5
                },
                "value": [
                  {
                    "ids": ["id781","id681"],
                    "token": "lastname81 1",
                    "count": 90
                  }
                ],
                "details": {
                  "fence": 33
                }
              }
            ]
          }
        ]
      },
      "matchGroupsPerMatchDocument": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "mostFrequent",
                "parameters": {
                  "k": 10
                },
                "value": [
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "count": 672
                  },
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "count": 669
                  },
                  {
                    "item": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "count": 451
                  }
                ]
              },
              {
                "name": "covariance",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": 0.001433
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.148076
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.14943
                  }
                ]
              },
              {
                "name": "correlation",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": 0.006482
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.63322
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.637534
                  }
                ]
              }
            ]
          }
        ]
      },
      "matchDocumentsPerMatchGroup": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "frequencies",
                "value": [
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "frequency": 672
                  },
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "frequency": 451
                  },
                  {
                    "x": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "frequency": 669
                  }
                ]
              }
            ]
          }
        ]
      },
      "matchDocumentMatches": {
        "enabled": true,
        "perMatchGroup": false,
        "splitByMatchGroupType": false,
        "projections": [
          {
            "type": "match-group-all",
            "matchGroups": [
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
              },
              {
                "uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
              }
            ],
            "statistics": [
              {
                "name": "correlation",
                "value": [
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "value": -0.540045
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.479254
                  },
                  {
                    "t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                    "t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
                    "value": 0.479866
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  ]
}

The following table describes the sections that are available in the response:

Table 6. Section descriptions
Section Name	Description
`status`	The status of profiling.
`startTimestamp`	The profiling start time, in epoch milliseconds.
`startTime`	The start time in human-readable form.
`finishTimestamp`	The profiling finish time, in epoch milliseconds.
`finishTime`	The finish time in human-readable form.
`duration`	The duration of profiling, in milliseconds.
`totalObjectsProcessed`	The total number of processed objects.
`useSkippedRules`	Specifies whether match rules that are bypassed or skipped are included in the analysis. By default, skipped match rules are included.
`scopes`	Defines the scopes of the match rules in the analysis. By default, `ALL`, `NONE`, `INTERNAL` and `EXTERNAL` scopes are evaluated.
`entityTypes`	The array of results for each entity type.
`entityTypes.uri`	The URI of the processed entity type.
`entityTypes.objectsProcessed`	The number of accepted objects of a specific type.
`entityTypes.matchToken`	The section that describes the analysis of match tokens for a specific entity type.
`entityTypes.matchToken.projections`	The array of profiling results related to a specific set of match groups.
`entityTypes.matchToken.projections.type`	The type of projection: `match-groups-single`, `match-groups-type-suspect`, `match-groups-type-automatic`, `match-groups-type-relevance_based`, and `match-groups-all`.
`entityTypes.matchToken.projections.matchGroups`	The list of match groups used for profiling.
`entityTypes.matchToken.projections.statistics`	The array of statistics results.
`entityTypes.matchToken.projections.statistics.name`	The name of a particular statistics calculator.
`entityTypes.matchToken.projections.statistics.value`	The value of the result. The result can be a single number (for mean, max, total, and so on), an array of numbers (mode), or a complex object or list of objects (such as correlation).
`entityTypes.matchToken.projections.statistics.parameters`	The parameters of the statistics calculator.
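
The nesting described in the table can be walked with a short script. The following Python sketch is illustrative only: it assumes the response has already been parsed into a dictionary, the `iter_statistics` helper is hypothetical, and the `sample` dictionary is a trimmed stand-in for a real response, not actual output. The section names other than `matchToken` are taken from the example response above.

```python
from typing import Any, Iterator

# "matchToken" comes from the table; the other two section names appear in the
# example response. Adjust this tuple to the sections present in your result.
ANALYSIS_SECTIONS = ("matchToken", "matchDocumentsPerMatchGroup", "matchDocumentMatches")


def iter_statistics(result: dict[str, Any]) -> Iterator[tuple[str, str, str, str, Any]]:
    """Yield (entity type URI, section, projection type, statistic name, value)."""
    for entity_type in result.get("entityTypes", []):
        for section in ANALYSIS_SECTIONS:
            analysis = entity_type.get(section) or {}
            for projection in analysis.get("projections", []):
                for statistic in projection.get("statistics", []):
                    yield (
                        entity_type.get("uri", ""),
                        section,
                        projection.get("type", ""),
                        statistic.get("name", ""),
                        statistic.get("value"),
                    )


# Trimmed, hypothetical sample that follows the documented structure.
sample = {
    "entityTypes": [
        {
            "uri": "configuration/entityTypes/Individual",
            "matchDocumentMatches": {
                "projections": [
                    {
                        "type": "match-group-all",
                        "statistics": [
                            {
                                "name": "correlation",
                                "value": [
                                    {
                                        "t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
                                        "t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
                                        "value": -0.540045,
                                    }
                                ],
                            }
                        ],
                    }
                ]
            },
        }
    ]
}

rows = list(iter_statistics(sample))
for uri, section, projection_type, name, _value in rows:
    print(uri, section, projection_type, name)
```

Flattening the result this way makes it easier to filter by statistic name or projection type before inspecting individual values.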

Count frequency calculator

A match rule can generate many tokens for an entity. A high number of tokens overloads the storage during matching. If the number of tokens is greater than some value, such as 1000, then the tokens aren't written to the database and the entity doesn't participate in matching. If no tokens are generated, then the match rule or set of match rules doesn't participate in matching.

The analyzer includes a calculator that counts the entities whose token count falls within a specified lower and upper limit. If the lower value is omitted, the lower limit is 0. If the upper value is omitted, the upper limit is infinity. If neither limit is specified, the calculator's result is the total number of entities.

The calculator applies to matchToken analysis only. The calculator works for a single match rule, a set of match rules having a specific type, or all match rules.
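
To make the counting rule concrete, here is a minimal Python sketch of the semantics described above. It is not the analyzer's implementation. It assumes that both limits are inclusive and that examples are ordered by token count, neither of which is stated in this document. The `count_frequency` function and the `token_counts` input are hypothetical.

```python
INT_MAX = 2147483647  # stands in for "infinity" when no upper limit is given


def count_frequency(token_counts, lower=0, upper=INT_MAX,
                    examples_amount=10, examples_order="ANY"):
    """Count entities whose token count lies within [lower, upper].

    Assumption: both limits are inclusive.
    """
    in_range = [(entity_id, count) for entity_id, count in token_counts.items()
                if lower <= count <= upper]
    # Assumption: examples are ordered by their token count.
    if examples_order == "ASC":
        in_range.sort(key=lambda pair: pair[1])
    elif examples_order == "DESC":
        in_range.sort(key=lambda pair: pair[1], reverse=True)
    return {
        "frequency": len(in_range),
        "examples": [{"id": entity_id, "count": count}
                     for entity_id, count in in_range[:examples_amount]],
    }


# Hypothetical token counts per entity.
token_counts = {"id13": 1, "id61": 2, "id71": 3, "id99": 0, "id42": 1200}

bounded = count_frequency(token_counts, lower=1, upper=5,
                          examples_amount=3, examples_order="DESC")
unbounded = count_frequency(token_counts)

print(bounded["frequency"], [e["id"] for e in bounded["examples"]])
print(unbounded["frequency"])
```

With no limits specified, the frequency equals the total number of entities, which matches the behavior described above.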

Table 7. Parameters
Name	Default	Description
`lower`	0	The number that indicates the lower limit.
`upper`	2147483647	The number that indicates the upper limit.
`examplesAmount`	10	The maximum number of example entities to return. Each example includes the entity identifier and its number of tokens within the specified limits.
`examplesOrder`	ANY	Sorts the example entities in ascending (`ASC`) or descending (`DESC`) order. By default, there's no ordering.

Example of the Calculator Payload

{
  "name": "countFrequency",
  "enabled": true,
  "parameters": {
    "examplesOrder": "ASC",
    "lower": 300,
    "examplesAmount": 20
  }
}
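
Payloads like the one above can also be assembled programmatically. The following Python sketch builds two payloads that target the problem cases described earlier: entities with a very high token count and entities with no tokens. The `count_frequency_payload` helper is hypothetical, the threshold of 1000 is only the illustrative value mentioned above, and the sketch assumes that limits are inclusive and that omitted parameters fall back to their defaults.

```python
import json


def count_frequency_payload(lower=None, upper=None,
                            examples_amount=None, examples_order=None):
    """Build a countFrequency calculator payload, omitting unset parameters."""
    parameters = {}
    if lower is not None:
        parameters["lower"] = lower
    if upper is not None:
        parameters["upper"] = upper
    if examples_amount is not None:
        parameters["examplesAmount"] = examples_amount
    if examples_order is not None:
        parameters["examplesOrder"] = examples_order
    return {"name": "countFrequency", "enabled": True, "parameters": parameters}


# Entities at or above an illustrative threshold of 1000 tokens.
over_tokenized = count_frequency_payload(lower=1000, examples_amount=20,
                                         examples_order="DESC")

# Entities with no tokens (assumes the upper limit is inclusive).
no_tokens = count_frequency_payload(upper=0)

print(json.dumps(over_tokenized, indent=2))
print(json.dumps(no_tokens, indent=2))
```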

Example of the Calculator Result

{
  "name": "countFrequency",
  "parameters": {
    "examplesOrder": "DESC",
    "lower": 1,
    "upper": 5,
    "examplesAmount": 3
  },
  "value": {
    "frequency": 211,
    "examples": [
      {
        "id": "id71",
        "count": 3
      },
      {
        "id": "id61",
        "count": 2
      },
      {
        "id": "id13",
        "count": 1
      }
    ]
  }
}

The following table explains the sections in the JSON output:

Table 8. JSON sections
Section	Description
`name`	Name of the calculator.
`parameters`	Actual parameters of the calculator.
`parameters.lower`	Actual `lower` parameter.
`parameters.upper`	Actual `upper` parameter.
`parameters.examplesOrder`	Actual `examplesOrder` parameter.
`parameters.examplesAmount`	Actual `examplesAmount` parameter.
`value`	Result of the calculator.
`value.frequency`	Number of entities whose token count is within the specified limits.
`value.examples`	Examples of entities within the specified limits.
`value.examples.id`	Identifier of the example entity.
`value.examples.count`	Number of tokens for the example entity.
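
As a rough illustration of how these sections fit together, the Python sketch below parses the example calculator result shown earlier and checks that every example falls within the reported limits. The inclusive treatment of `lower` and `upper` is an assumption, and the fallback defaults are taken from the parameters table.

```python
import json

# The example calculator result from this section, in compact form.
raw = """
{
  "name": "countFrequency",
  "parameters": {"examplesOrder": "DESC", "lower": 1, "upper": 5, "examplesAmount": 3},
  "value": {
    "frequency": 211,
    "examples": [
      {"id": "id71", "count": 3},
      {"id": "id61", "count": 2},
      {"id": "id13", "count": 1}
    ]
  }
}
"""

result = json.loads(raw)
parameters = result["parameters"]
value = result["value"]

# Fall back to the documented defaults when a limit is absent.
lower = parameters.get("lower", 0)
upper = parameters.get("upper", 2147483647)

# Assumption: limits are inclusive.
out_of_range = [example for example in value["examples"]
                if not lower <= example["count"] <= upper]

print(f"{value['frequency']} entities have between {lower} and {upper} tokens")
for example in value["examples"]:
    print(f"  {example['id']}: {example['count']} tokens")
```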
