Match Rule Analyzer version 2 (Dynamic)
Learn how to profile the matching process using the Match Rule Analyzer.
Overview
You can use Match Rule Analyzer version 2 to profile match rules based on a relevant subset of data in a tenant and view statistics for data that is profiled.
- Analyze match rules and identify rules that are causing performance issues.
- Tune match rules to ensure optimum performance before using them on your production data.
- Remove rules that aren’t being used for matching.
Request Body
The request body of Match Rule Analyzer version 2 contains a new section called profiling, under which you can perform several types of analysis and generate statistics to get detailed insight into the overall matching process for your tenant.
Requests:
POST /tools/{tenantId}/analyzeMatchRules/v2
Example request body
{
"entityTypes": [
"Type1",
"Type2",
"Type3"
],
"profiling": {
"enabled": true
},
"model": {
}
}
- If entityTypes is empty or missing, then all entity types from the configuration are analyzed.
- If the profiling section is empty or missing, then no analysis is performed. The enabled flag is false by default.
- The model section contains a business (L3) configuration that needs testing. If no model is specified, then the current tenant business configuration is tested.
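The defaulting rules above can be sketched in Python. This is a minimal illustration, not a client library: the helper name build_analyze_request and its arguments are hypothetical, and it only assembles the path and body without sending anything.

```python
import json


def build_analyze_request(tenant_id, entity_types=None, profiling=None, model=None):
    """Return (path, body) for POST /tools/{tenantId}/analyzeMatchRules/v2.

    Hypothetical helper: optional sections are left out of the body so that
    the documented defaults apply.
    """
    body = {}
    if entity_types:
        # Empty or missing entityTypes: all entity types are analyzed.
        body["entityTypes"] = list(entity_types)
    if profiling is not None:
        # Empty or missing profiling: no analysis is performed.
        body["profiling"] = dict(profiling)
    if model is not None:
        # Missing model: the current tenant business configuration is tested.
        body["model"] = model
    path = f"/tools/{tenant_id}/analyzeMatchRules/v2"
    return path, body


path, body = build_analyze_request("abcd", ["Type1", "Type2"], {"enabled": True})
print(path)
print(json.dumps(body, indent=2))
```

Leaving a section out, instead of sending an empty object, keeps the request aligned with the defaults listed above.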
Example response
{
"profiling": {
"description": "Summary about execution of the match groups based on the existing data (tenantId=abcd)",
"id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
"uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395"
}
}
The response contains a description of the profiling task, the unique identifier of the submitted task, and the URI for requesting the profiling results. You can also use the identifier to access the profiling task directly through the Tasks API.
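As a small Python sketch, the results path can be derived from the submit response. It assumes that the returned uri is relative to /tools/{tenantId}/, which matches the GET endpoint documented on this page for retrieving results; the helper name is hypothetical.

```python
def profiling_result_path(tenant_id, submit_response):
    """Build the GET path for the profiling results from the POST response.

    Assumption: the "uri" field is relative to /tools/{tenantId}/.
    """
    profiling = submit_response["profiling"]
    return f"/tools/{tenant_id}/{profiling['uri']}"


response = {
    "profiling": {
        "id": "72d8918d-853d-4331-a1c1-29ccefa9a395",
        "uri": "analyzeMatchRules/v2/profiling/72d8918d-853d-4331-a1c1-29ccefa9a395",
    }
}
result_path = profiling_result_path("abcd", response)
print(result_path)
```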
Profiling section
The following subsections are available under the profiling section:
{
"profiling": {
"enabled": true,
"maxObjectsPerType": 5000,
"timeout": 3600,
"samplingAlgorithm": {},
"inspections": {},
"wordAnalysis": {},
"analysisTypes": [],
"useSkippedRules": false,
"scopes": [
"EXTERNAL",
"ALL",
"INTERNAL",
"NONE"
]
}
}
Field | Description |
---|---|
enabled | Profiling is enabled when this field is set to true. It’s set to false by default. |
maxObjectsPerType | Specifies the maximum number of objects for each type of match rule. |
timeout | Specifies the timeout period in seconds. Processing a large number of entities can be time consuming and can cause failures. The default timeout is 3600 seconds. You can increase it to 7200 seconds to avoid failures. |
samplingAlgorithm | Specifies the sampling algorithm used to build a subset of entities. There are three supported sampling algorithms: SEQUENCE, SEARCH, and MATCHES_AWARE. |
inspections | Specifies the type of inspection that runs on a subset of data in the tenant. |
analysisTypes | Specifies the types of analysis that check the matching performance of the tenant. |
useSkippedRules | Specifies whether the analysis includes match rules that are bypassed or skipped. By default, skipped match rules are included in the analysis. |
scopes | Defines the scopes of the match rules in the analysis. By default, the ALL, NONE, INTERNAL, and EXTERNAL scopes are evaluated. |
The default value of maxObjectsPerType is 10000. The value must be less than or equal to 100000000. You can’t use a value greater than 20000 when matchDocumentMatches is used. Disable matchDocumentMatches to use the maximum limit of 100000000.
{
"profiling": {
"enabled": true,
"maxObjectsPerType": 5000,
"timeout": 3600,
"samplingAlgorithm": {
"name": "..."
},
"inspections": {},
"wordAnalysis": {},
"analysisTypes": []
}
}
name is the name of the sampling algorithm, for example, SEQUENCE, SEARCH, or MATCHES_AWARE.
Sampling Algorithm
- SEQUENCE - Iterates through entities as they’re stored in the database. You can specify the starting point of the iteration by using the rangeStart field, which corresponds to startToken in the database.
{
"name": "SEQUENCE",
"rangeStart": 123456789123
}
- SEARCH - Works over a set of identified objects. The object URIs are provided directly in the section or found by using the specified queries. If objects are defined explicitly in the objects.includeList section, then the list is passed directly to the profiling task. The task reads the objects, excluding those specified in the objects.excludeList section, and filters them by using the query, if specified. If no objects are explicitly defined, then QueryObjectsTask is submitted to search for the objects that match the specified query.
{
"objects": {
"includeList": [
"entities/ID1",
"entities/ID2",
"entities/ID3"
],
"excludeList": [
"entities/ID4",
"entities/ID5"
]
},
"query": [
{
"filter": "equals(attributes.Address.City, 'Las Vegas') and equals(type, 'configuration/entityTypes/HCP')",
"activeness": "active",
"options": "searchByOv"
},
{
"filter": "equals(attributes.Address.State, 'Ohio') and equals(type, 'configuration/entityTypes/HCO')",
"activeness": "all"
}
]
}
- MATCHES_AWARE - Uses the existing match information in the tenant to build a subset of entities. In addition to other sampling algorithms, you can use the correlation and conditionalCorrelation calculators, which are optimized to provide accurate information for matchDocumentMatches analysis.
Correlation Calculator
The correlation coefficient of two match rules MR1 and MR2, C(MR1, MR2), is calculated by a formula with the following participants:
- Nobs - the number of observations. The number of observations depends on the calculator parameter emulateWholeSubsetComparisonByFakeNotMatches. The default value is false.
  Note:
  - If set to false, the number of observations is equal to the number of entity pairs.
  - If set to true, the number of observations is equal to the number of entity pairs and the number of possible pairs of entities not considered because the result of the comparison MR1(Ex, Ey) == MR2(Ex, Ey) is zero.
  - All the entities are compared to each other when emulateWholeSubsetComparisonByFakeNotMatches=true.
- The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter, which is set to true by default. The calculator generalizes the outcome of automatic and suspect match rules into the known outcomes of relevance_based rules. There are two approaches to comparing the outcomes of two rules of any type:
  - Strict equality of outcomes. For example, when treatSimilarMatchActionAsSame=false:
    - auto_merge==auto_merge === 1
    - auto_merge==potential_match === 0
    - <no outcome>==auto_merge === 0
    - <no outcome>==<no outcome> === 1
    - <no outcome>==not_a_match === 0
    - and so on
  - Grouping positive and negative outcomes. For example:
    - auto_merge==auto_merge === 1
    - auto_merge==potential_match === 1
    - <no outcome>==auto_merge === 0
    - <no outcome>==<no outcome> === 1
    - <no outcome>==not_a_match === 1
- The result of the formula is the percentage of similar match rule outcomes. The values -0.5 and 2 are used in the formula so that the value clearly represents the following:
  - C(MR1, MR2) = 1 - full correlation (all outcomes are the same)
  - C(MR1, MR2) = -1 - anti-correlation (all outcomes are different)
  - C(MR1, MR2) = 0 - non-correlation (no correlation between outcomes)
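One form that is consistent with the description above (the constants -0.5 and 2, and the three reference values) is C = 2 * (same / Nobs - 0.5), where same is the number of observations with equal outcomes. The following Python sketch assumes that form; it is not taken from the analyzer itself. The set of positive outcomes and the use of None for <no outcome> are also assumptions made for illustration.

```python
# Assumption: these two outcomes are the "positive" ones; everything else,
# including None for <no outcome>, is treated as negative.
POSITIVE = {"auto_merge", "potential_match"}


def same_outcome(a, b, treat_similar_as_same=True):
    """Return 1 if two rule outcomes count as equal, else 0."""
    if not treat_similar_as_same:
        return int(a == b)  # strict equality of outcomes
    return int((a in POSITIVE) == (b in POSITIVE))  # grouped comparison


def correlation(pairs, treat_similar_as_same=True):
    """pairs holds one (MR1 outcome, MR2 outcome) tuple per observation."""
    n_obs = len(pairs)
    if n_obs == 0:
        raise ValueError("no observations")
    same = sum(same_outcome(a, b, treat_similar_as_same) for a, b in pairs)
    return 2 * (same / n_obs - 0.5)


print(correlation([("auto_merge", "auto_merge"), (None, None)]))
print(correlation([("auto_merge", None), (None, "auto_merge")]))
print(correlation([("auto_merge", "auto_merge"), ("auto_merge", None)]))
```

With these inputs the three calls give full correlation, anti-correlation, and non-correlation respectively.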
Conditional Correlation Calculator
The correlation coefficient of two match rules MR1 and MR2 is calculated by a formula with the following participants:
- Nobs([MR1=]) - the number of observations where the outcome of MR1 is positive.
- The outcome of the comparison operator == depends on the treatSimilarMatchActionAsSame parameter, which is set to true by default.
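A rough Python sketch of the conditional variant follows. It assumes the same scaling as the unconditional coefficient, 2 * (same / Nobs - 0.5), restricted to the observations where the outcome of MR1 is positive, and it assumes the grouped comparison (treatSimilarMatchActionAsSame=true). Both the formula and the set of positive outcomes are assumptions for illustration only.

```python
# Assumption: outcomes considered positive; None stands for <no outcome>.
POSITIVE = {"auto_merge", "potential_match"}


def conditional_correlation(pairs):
    """pairs holds one (MR1 outcome, MR2 outcome) tuple per observation.

    Only observations with a positive MR1 outcome are counted. Returns None
    when there is no such observation.
    """
    conditioned = [(a, b) for a, b in pairs if a in POSITIVE]
    if not conditioned:
        return None
    # MR1 is positive here, so the grouped outcomes agree when MR2 is positive.
    same = sum(1 for _, b in conditioned if b in POSITIVE)
    return 2 * (same / len(conditioned) - 0.5)


observations = [
    ("auto_merge", "auto_merge"),
    ("auto_merge", None),
    (None, None),
    ("not_a_match", "auto_merge"),
]
print(conditional_correlation(observations))
```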
Analysis types
- matchToken
- matchTokenIntersections
- matchGroupsPerMatchDocument
- matchDocumentsPerMatchGroup
- matchDocumentMatches
{
"analysisType": "analysis name",
"perMatchGroup": false,
"splitByMatchGroupType": false,
"statistics": [
{
"name": "statistic name",
"enabled": true,
"parameters": [
{
"name": "parameterName",
"value": "parameterValue or a JSON object"
}
]
}
]
}
An analysis can be calculated for the following sets of match groups:
- perMatchGroup flag - for a specific match group. The default value is true.
- splitByMatchGroupType flag - for a set of match groups that have the same type of rules: suspect, automatic, or relevance-based. The default value is true.
- all match groups - for all match groups.
Some analyses can be run only for all match groups and therefore, the perMatchGroup and splitByMatchGroupType settings are ignored.
If you don’t specify a statistics section, then the default set of statistics is used. Each statistic definition contains a name, an enabled flag, and a parameters list. Each parameter contains the name and value fields. If a statistic has enabled=false, then it isn’t calculated. If some statistics are specified as disabled, then the corresponding default statistics are used. Some statistics require parameters and others don’t.
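The structure of a statistic definition can be sketched in Python as follows. The helper names statistic and analysis are hypothetical; the field names follow the template shown above, with parameters expressed as a list of name and value pairs.

```python
def statistic(name, enabled=True, **params):
    """Build one statistic definition (hypothetical helper)."""
    definition = {"name": name, "enabled": enabled}
    if params:
        definition["parameters"] = [
            {"name": key, "value": value} for key, value in params.items()
        ]
    return definition


def analysis(analysis_type, statistics=None, per_match_group=False, split_by_type=False):
    """Build one analysis definition; omit statistics to get the default set."""
    definition = {
        "analysisType": analysis_type,
        "perMatchGroup": per_match_group,
        "splitByMatchGroupType": split_by_type,
    }
    if statistics is not None:
        definition["statistics"] = list(statistics)
    return definition


token_analysis = analysis(
    "matchToken",
    statistics=[statistic("max"), statistic("lowerBoundOutliers", k=1.5)],
    per_match_group=True,
)
print(token_analysis)
```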
You can use the following types of analysis to analyze the matching performance of your tenant:
matchToken
This analyzes the match tokens generated by different match groups for each object. It provides information about how match tokens are distributed across match groups in the relevant subset of data.
Statistic name | Parameters | Description |
---|---|---|
min | | The minimum number of tokens generated for an object. |
max | | The maximum number of tokens generated for an object. |
range | | The range of token numbers. Equal to max - min + 1. |
total | | The total number of tokens generated for all processed objects. |
mode | | The most frequent number of tokens for all processed objects. |
mean | | Mean (average) value of the number of tokens. |
std | | Standard deviation of the number of tokens. |
se | | Standard error of the mean of the generated tokens. |
skewness | | The skewness of the distribution of the generated tokens. |
kurtosis | | The kurtosis of the distribution of the generated tokens. |
variance | | The variance of the distribution of the generated tokens. |
median | | The median of the token numbers, equal to percentile (50%). |
firstQuartile | | The first quartile of the number of tokens, equal to percentile (25%). |
thirdQuartile | | The third quartile of the number of tokens, equal to percentile (75%). |
lowerBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects that have fewer tokens than most values can be considered outliers. An outlier is an object with a number of tokens less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
upperBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects that have more tokens than most values can be considered outliers. An outlier is an object with a number of tokens greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
mostFrequent | k - the number of most frequent tokens to return. The default is 10. | The list of most frequent tokens is returned with their corresponding frequencies. |
histogram | If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the generated tokens. It’s calculated as the number of tokens that are associated with a particular histogram bin. |
The following example shows the use of some statistics:
{
"enabled": true,
"perMatchGroup": true,
"splitByMatchGroupType": true,
"statistics": [
{
"name": "max",
"enabled": true
},
{
"name": "min",
"enabled": false
},
{
"name": "histogram",
"parameters": [
{
"name": "bins",
"value": 10
},
{
"name": "left",
"value": 0
},
{
"name": "right",
"value": 1000
}
]
}
]
}
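The Tukey fence used by the lowerBoundOutliers and upperBoundOutliers statistics can be illustrated with a short Python sketch. The quartile interpolation method used by the analyzer isn’t stated here, so the inclusive method of the standard library is an assumption, and the entity identifiers are made up.

```python
import statistics


def tukey_fences(counts, k=1.5):
    """Return (lower, upper) fences: Q1 - k (Q3 - Q1) and Q3 + k (Q3 - Q1)."""
    # Assumption: quartiles use inclusive linear interpolation.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(counts_by_id, k=1.5):
    """Split objects into lower-bound and upper-bound outliers by token count."""
    lower, upper = tukey_fences(list(counts_by_id.values()), k)
    low = sorted(i for i, c in counts_by_id.items() if c < lower)
    high = sorted(i for i, c in counts_by_id.items() if c > upper)
    return low, high


# Hypothetical token counts per entity.
token_counts = {"e1": 1, "e2": 2, "e3": 3, "e4": 4, "e5": 5,
                "e6": 6, "e7": 7, "e8": 8, "e9": 100}
print(tukey_fences(list(token_counts.values())))
print(outliers(token_counts))
```

A larger k widens the fences, so fewer objects are reported as outliers.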
matchTokenIntersections
This provides information about the distribution of match token intersections in the relevant subset of data. The intersection of match tokens of two objects is a subset of match tokens that are common to both objects.
Statistic name | Parameters | Description |
---|---|---|
correlation | | The ratio of the number of entity pairs identified as candidates to the total number of processed entity pairs of two match rules. The value of the correlation is between 0 and 1, where 0 indicates no intersection between match pairs and 1 indicates that the two sets of match pairs are exactly the same. |
min | | The minimum number of match token intersections. |
max | | The maximum number of match token intersections. |
range | | The range of token intersections. Equal to max - min + 1. |
total | | The total number of intersections. |
mode | | The most frequent number of intersections. |
mean | | Mean (average) value of the number of token intersections. |
std | | Standard deviation of the number of token intersections. |
se | | Standard error of the mean of the number of token intersections. |
skewness | | The skewness of the distribution of token intersections. |
kurtosis | | The kurtosis of the distribution of token intersections. |
variance | | The variance of the distribution of token intersections. |
median | | The median of token intersections, equal to percentile (50%). |
firstQuartile | | The first quartile of token intersections, equal to percentile (25%). |
thirdQuartile | | The third quartile of token intersections, equal to percentile (75%). |
lowerBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects that have fewer common tokens than most values can be considered outliers. An outlier is an object with a number of common tokens less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
upperBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects that have more common tokens than most values can be considered outliers. An outlier is an object with a number of common tokens greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
histogram | If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of match token intersections. It’s calculated as the number of intersections that are associated with a particular histogram bin. |
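The definition of a match token intersection, the set of match tokens common to two objects, can be sketched in Python. The object identifiers and token strings are invented for illustration, and how the analyzer aggregates intersections into the statistics above isn’t covered here.

```python
from itertools import combinations


def token_intersections(tokens_by_object):
    """Map each pair of objects to the match tokens they have in common.

    Pairs with no common token are left out.
    """
    result = {}
    for a, b in combinations(sorted(tokens_by_object), 2):
        common = set(tokens_by_object[a]) & set(tokens_by_object[b])
        if common:
            result[(a, b)] = common
    return result


# Hypothetical tokens generated for three objects.
tokens = {
    "id1": {"john 0", "smith 1"},
    "id2": {"john 0"},
    "id3": {"doe 1"},
}
print(token_intersections(tokens))
```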
matchGroupsPerMatchDocument
This analyzes the match groups that are used to create a particular match document. It
provides information about the distribution of match groups per match document. This
analysis considers the entire set of known match groups and therefore,
perMatchGroup
and splitByMatchGroupType
settings are
ignored.
Statistic name | Parameters | Description |
---|---|---|
min | | The minimum number of match groups. |
max | | The maximum number of match groups. |
range | | The range of match group numbers. Equal to max - min + 1. |
total | | The total number of match groups used to build match documents (the sum of match groups per match document). |
mode | | The most frequent number of match groups. |
mean | | Mean (average) value of the number of match groups. |
std | | Standard deviation of the number of match groups. |
se | | Standard error of the mean of the number of match groups. |
skewness | | The skewness of the distribution of match groups. |
kurtosis | | The kurtosis of the distribution of match groups. |
variance | | The variance of the distribution of match groups. |
median | | The median of the number of match groups, equal to percentile (50%). |
firstQuartile | | The first quartile of the number of match groups, equal to percentile (25%). |
thirdQuartile | | The third quartile of the number of match groups, equal to percentile (75%). |
lowerBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects with fewer match groups in a match document can be considered outliers. An outlier is an object with a number of match groups less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
upperBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects with many match groups in a match document can be considered outliers. An outlier is an object with a number of match groups greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
histogram | If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the number of match groups. It’s calculated as the number of match groups in a match document that are associated with a particular histogram bin. |
mostFrequent | k - the number of most frequent match groups to return. The default is 10. | The list of most frequent match groups is returned along with their corresponding frequencies. |
covariance | | The covariance of match groups in a match document. The result is a list of match group pairs with the covariance value. |
correlation | | The correlation of match groups in a match document. The result is a list of match group pairs with the correlation value. |
matchDocumentsPerMatchGroup
This analyzes and provides information about the distribution of match documents that have specific match groups. The analysis considers the entire set of match groups and therefore, the perMatchGroup and splitByMatchGroupType settings are ignored.
Statistic name | Parameters | Description |
---|---|---|
frequencies | | The number of match documents that have a particular match group. |
matchDocumentMatches
This analyzes match documents and their matches based on the specified match groups. It provides information about the distribution of match document matches across match groups.
You can analyze a particular match group, a subset of match groups that have a specific type, or the entire set of match groups. However, the analysis is computationally expensive and therefore, the perMatchGroup and splitByMatchGroupType settings are false by default.
The match is detected when two match documents are equal based on a match group. The detection happens by comparing objects directly by using the match group without considering the generated match tokens.
Statistic name | Parameters | Description |
---|---|---|
min | | The minimum number of matches. |
max | | The maximum number of matches. |
range | | The range of matches. Equal to max - min + 1. |
total | | The total number of detected matches. A match between object 1 and object 2 is considered the same as a match between object 2 and object 1 and is counted as one match. |
mode | | The most frequent number of matches. |
mean | | Mean (average) value of the number of matches. |
std | | Standard deviation of the number of matches. |
se | | Standard error of the mean of the number of matches. |
skewness | | The skewness of the distribution of matches. |
kurtosis | | The kurtosis of the distribution of matches. |
variance | | The variance of the distribution of matches. |
median | | The median of the number of matches, equal to percentile (50%). |
firstQuartile | | The first quartile of the number of matches, equal to percentile (25%). |
thirdQuartile | | The third quartile of the number of matches, equal to percentile (75%). |
lowerBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects with fewer matches in a match document can be considered outliers. An outlier is an object with a number of matches less than Q1 - k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
upperBoundOutliers | k - coefficient to evaluate outliers. The default is 1.5. | Objects with many matches in a match document can be considered outliers. An outlier is an object with a number of matches greater than Q3 + k (Q3 - Q1) (Tukey’s fence definition), where Q1 is the first quartile, Q3 is the third quartile, and k is the specified coefficient. |
histogram | If start and binSize aren’t specified, they’re calculated using the nbins, min, and max values. | The histogram is a representation of the distribution of the number of matches. It’s calculated as the number of matches in a match document that are associated with a particular histogram bin. |
covariance | | The covariance of matches in a match document. The result is a list of matches with the covariance value. |
correlation | | The correlation of matches in a match document. The result is a list of matches with the correlation value. |
conditionalCorrelation | | The correlation coefficient of two match groups when at least one match group’s result is positive. |
countFrequency | | The number of entities that have a specified number of matches. |
Retrieving the request for profiling
Requests:
GET /tools/{tenantId}/analyzeMatchRules/v2/profiling/{profilingId}
Example Response
{
"status": "COMPLETED",
"startTimestamp": 1569931108398,
"startTime": "2019-10-01T11:58:28.398Z",
"finishTimestamp": 1569931372217,
"finishTime": "2019-10-01T12:02:52.217Z",
"duration": 263819,
"totalObjectsProcessed": 1000,
"useSkippedRules": true,
"scopes": [ "ALL", "NONE", "INTERNAL", "EXTERNAL" ],
"entityTypes": [
{
"uri": "configuration/entityTypes/Individual",
"objectsProcessed": 1000,
"matchToken": {
"enabled": true,
"perMatchGroup": true,
"splitByMatchGroupType": true,
"projections": [
{
"type": "match-group-single",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
}
],
"statistics": [
{
"name": "total",
"value": 989
},
{
"name": "mode",
"value": [1]
},
{
"name": "mostFrequent",
"parameters": {
"k": 10
},
"value": [{"item": "firstname23 0","count": 10},{"item": "firstname40 0","count": 9}]
},
{
"name": "histogram",
"parameters": {
"start": 0,
"nbins": 10,
"binSize": 1
},
"value": [{"x": 0,"y": 328},{"x": 1,"y": 355},{"x": 2,"y": 317}]
}
]
},
{
"type": "match-group-single",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
}
],
"statistics": []
},
{
"type": "match-group-type-suspect",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": []
},
{
"type": "match-group-all",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": []
}
]
},
"matchTokenIntersections": {
"enabled": true,
"perMatchGroup": true,
"splitByMatchGroupType": true,
"projections": [
{
"type": "match-group-all",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": [
{
"name": "upperBoundOutliers",
"parameters": {
"k": 1.5
},
"value": [
{
"ids": ["id781","id681"],
"token": "lastname81 1",
"count": 90
}
],
"details": {
"fence": 33
}
}
]
}
]
},
"matchGroupsPerMatchDocument": {
"enabled": true,
"perMatchGroup": false,
"splitByMatchGroupType": false,
"projections": [
{
"type": "match-group-all",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": [
{
"name": "mostFrequent",
"parameters": {
"k": 10
},
"value": [
{
"item": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"count": 672
},
{
"item": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"count": 669
},
{
"item": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"count": 451
}
]
},
{
"name": "covariance",
"value": [
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"value": 0.001433
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.148076
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.14943
}
]
},
{
"name": "correlation",
"value": [
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"value": 0.006482
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.63322
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.637534
}
]
}
]
}
]
},
"matchDocumentsPerMatchGroup": {
"enabled": true,
"perMatchGroup": false,
"splitByMatchGroupType": false,
"projections": [
{
"type": "match-group-all",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": [
{
"name": "frequencies",
"value": [
{
"x": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"frequency": 672
},
{
"x": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"frequency": 451
},
{
"x": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"frequency": 669
}
]
}
]
}
]
},
"matchDocumentMatches": {
"enabled": true,
"perMatchGroup": false,
"splitByMatchGroupType": false,
"projections": [
{
"type": "match-group-all",
"matchGroups": [
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameLastName"
},
{
"uri": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName"
}
],
"statistics": [
{
"name": "correlation",
"value": [
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"value": -0.540045
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameFirstName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.479254
},
{
"t1": "configuration/entityTypes/Individual/matchGroups/SameLastName",
"t2": "configuration/entityTypes/Individual/matchGroups/SameFirstLastName",
"value": 0.479866
}
]
}
]
}
]
}
}
]
}
The following table describes the sections that are available in the response:
Section Name | Description |
---|---|
status | The status of profiling. |
startTimestamp | Epoch milliseconds of the profiling start time. |
startTime | Human-readable value of the start time. |
finishTimestamp | Epoch milliseconds of the profiling finish time. |
finishTime | Human-readable value of the finish time. |
duration | Duration of profiling in milliseconds. |
totalObjectsProcessed | The total number of processed objects. |
useSkippedRules | Indicates whether the analysis includes match rules that are bypassed or skipped. By default, skipped match rules are included in the analysis. |
scopes | Defines the scopes of the match rules in the analysis. By default, the ALL, NONE, INTERNAL, and EXTERNAL scopes are evaluated. |
entityTypes | The array of results for each entity type. |
entityTypes.uri | The URI of the processed entity type. |
entityTypes.objectsProcessed | The number of accepted objects of a specific type. |
entityTypes.matchToken | The section that describes the analysis of match tokens for a specific entity type. |
entityTypes.matchToken.projections | The array of profiling results related to a specific set of match groups. |
entityTypes.matchToken.projections.type | The type of projection: match-groups-single, match-groups-type-suspect, match-groups-type-automatic, match-groups-type-relevance_based, and match-groups-all. |
entityTypes.matchToken.projections.matchGroups | The list of match groups used for profiling. |
entityTypes.matchToken.projections.statistics | The array of statistics results. |
entityTypes.matchToken.projections.statistics.name | The name of a particular statistics calculator. |
entityTypes.matchToken.projections.statistics.value | The value of the result. The result can be a single number (for mean, max, total, and so on), an array of numbers (mode), or a complex object or list of objects (such as correlation). |
entityTypes.matchToken.projections.statistics.parameters | The parameters of the statistics calculator. |
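To show how the sections above nest, the following Python sketch looks up a single statistic value in a profiling result. The helper name find_statistic is hypothetical, and the sample result is trimmed from the example response to the fields the lookup needs.

```python
def find_statistic(result, entity_type_uri, analysis, projection_type, stat_name):
    """Return the value of one statistic, or None if it isn't present."""
    for entity in result.get("entityTypes", []):
        if entity.get("uri") != entity_type_uri:
            continue
        for projection in entity.get(analysis, {}).get("projections", []):
            if projection.get("type") != projection_type:
                continue
            for stat in projection.get("statistics", []):
                if stat.get("name") == stat_name:
                    return stat.get("value")
    return None


# Trimmed sample based on the example response.
sample = {
    "status": "COMPLETED",
    "entityTypes": [
        {
            "uri": "configuration/entityTypes/Individual",
            "matchToken": {
                "projections": [
                    {
                        "type": "match-group-single",
                        "statistics": [
                            {"name": "total", "value": 989},
                            {"name": "mode", "value": [1]},
                        ],
                    }
                ]
            },
        }
    ],
}
total = find_statistic(
    sample, "configuration/entityTypes/Individual",
    "matchToken", "match-group-single", "total",
)
print(total)
```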
Count frequency calculator
A match rule can generate many tokens for an entity. If the number of tokens is high, then the storage is overloaded during matching. If the number of tokens is greater than some value, such as 1000, then the tokens aren’t updated into the database and such entities don’t participate in matching. If there are no tokens generated, then the match rule or a set of match rules don’t participate in matching.
The analyzer is enhanced with a calculator to calculate the number of entities that have a token count within a specified lower and upper limit. If the lower value is omitted, the lower limit is 0. If the upper value is omitted, the upper limit is infinity. If both upper and lower limits aren’t specified, then lower limit is 0 and upper limit is infinity and the calculator's result is the total number of entities.
The calculator applies to matchToken analysis only. The calculator works for a single match rule, a set of match rules having a specific type, or all match rules.
Name | Default | Description |
---|---|---|
lower | 0 | The number that indicates the lower limit. |
upper | 2147483647 | The number that indicates the upper limit. |
examplesAmount | 10 | The number of example entities to return, each with its number of tokens within the specified limits. |
examplesOrder | ANY | Enables you to sort the identified entities in ascending or descending order. By default, there’s no ordering. |
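The behavior of the calculator can be approximated with a small Python sketch. It assumes that both limits are inclusive and that examples are ordered by token count; neither detail is stated above. The function name and the input data are hypothetical.

```python
def count_frequency(token_counts, lower=0, upper=2147483647,
                    examples_amount=10, examples_order="ANY"):
    """Count entities whose token count lies within [lower, upper].

    Assumptions: both limits are inclusive, and ASC/DESC sort by token count.
    """
    matching = [(eid, c) for eid, c in token_counts.items() if lower <= c <= upper]
    if examples_order == "ASC":
        matching.sort(key=lambda pair: pair[1])
    elif examples_order == "DESC":
        matching.sort(key=lambda pair: pair[1], reverse=True)
    examples = [{"id": eid, "count": c} for eid, c in matching[:examples_amount]]
    return {"frequency": len(matching), "examples": examples}


# Hypothetical number of tokens per entity.
counts = {"id71": 3, "id61": 2, "id13": 1, "id99": 7, "id5": 0}
result = count_frequency(counts, lower=1, upper=5,
                         examples_amount=3, examples_order="DESC")
print(result)
```

With no limits specified, the frequency equals the total number of entities, as described above.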
Example of the Calculator Payload
{
"name": "countFrequency",
"enabled": true,
"parameters": {
"examplesOrder": "ASC",
"lower": 300,
"examplesAmount": 20
}
}
Example of the Calculator Result
{
"name": "countFrequency",
"parameters": {
"examplesOrder": "DESC",
"lower": 1,
"upper": 5,
"examplesAmount": 3
},
"value": {
"frequency": 211,
"examples": [
{
"id": "id71",
"count": 3
},
{
"id": "id61",
"count": 2
},
{
"id": "id13",
"count": 1
}
]
}
}
The following table explains the sections in the JSON output:
Section | Description |
---|---|
name | Name of the calculator. |
parameters | Actual parameters of the calculator. |
parameters.lower | Actual lower parameter. |
parameters.upper | Actual upper parameter. |
parameters.examplesOrder | Actual examplesOrder parameter. |
parameters.examplesAmount | Actual examplesAmount parameter. |
value | Result of the calculator. |
value.frequency | Number of entities that have a token count within the specified limits. |
value.examples | Examples of entities within the specified limits. |
value.examples.id | Identifier of the example entity. |
value.examples.count | Number of tokens for the example entity. |