Inspections for Tokens
Information about the inspections that you can perform using the dynamic analyzer.
Overview
The dynamic analyzer provides the following inspections:
- Highly correlated match tokens
- High number of match tokens
- High average number of match tokens
- Too many tokens are generated by a subset of the processed entities
- Too few tokens are generated by a subset of the processed entities
Each inspection result contains the following:
- Inspection ID (unique across all the match rule analyzer output) that points to the inspection applied
- Parameters used to perform the inspection
- Issue section with the severity and text
- Explanation and recommendation sections
Profiling Section Updated
The inspections section is added under the profiling section of the payload. It contains a field called formatted, which is set to true by default. When formatted is true, the placeholders in the inspection texts are substituted with the actual values, and the issue, explanation, and recommendation sections appear in the output. The updated profiling section is as follows:
{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
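As a minimal sketch, the payload above could be assembled programmatically as shown below. The helper name build_analyzer_payload and its arguments are hypothetical and not part of the product; only the field names come from the payload shown above.

```python
import json


def build_analyzer_payload(formatted: bool = True,
                           max_objects_per_type: int = 10000) -> dict:
    """Build a request body that mirrors the profiling section shown above.

    Hypothetical helper: only the field names are taken from the document.
    """
    return {
        "enabled": True,
        "entityTypes": [],
        "staticAnalysis": {},
        "profiling": {
            "enabled": True,
            "maxObjectsPerType": max_objects_per_type,
            # formatted defaults to true; set it to False to keep the raw
            # text templates with their placeholders.
            "inspections": {"formatted": formatted},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_analyzer_payload(formatted=False), indent=2))
```

Passing formatted=False is only useful if you want to substitute the placeholders yourself.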
Inspection - Highly correlated match tokens
The highlyCorrelatedMatchTokens inspection identifies match groups that generate highly correlated candidate pairs. [{0}] and [{1}] are placeholders for your actual tenant data.
Parameter | Description |
---|---|
Summary | Match group [{0}] has a tokenization scheme that is seemingly redundant since it is fully covered in match group [{1}]. |
Explanation | The analysis shows that the tokenization scheme of [{0}] is producing the same candidate pairs as [{1}]. |
Recommendation | Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix overcollisioned tokens and too many tokens issue. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to ignoreInToken parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones). |
Parameter | Description |
---|---|
Summary | Match group tokenization tactics are evenly distributed. |
Explanation | The analysis shows that your match groups tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently. |
Recommendation | Keep going! It appears that your match groups tokenization schemes employ a variety of tactics and are not redundant. |
Inspection - High number of match tokens
The documentsWithOmittedTokens inspection identifies match rules that generate a high number of match tokens. The limit is 1000 tokens per match document. If the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
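The effect of the limit can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function name and the list-of-strings token representation are hypothetical, and only the limit of 1000 and the all-or-nothing exclusion come from the text above.

```python
from typing import List

TOKEN_LIMIT = 1000  # limit per match document, as stated above


def tokens_used_for_matching(tokens: List[str],
                             limit: int = TOKEN_LIMIT) -> List[str]:
    """Return the tokens of one match rule that take part in matching.

    Hypothetical helper: when the rule produces more tokens than the limit,
    none of its tokens are used, mirroring the exclusion described above.
    """
    if len(tokens) > limit:
        return []
    return tokens
```

A rule that produces exactly 1000 tokens keeps all of them in this sketch, because the text describes exclusion only when the limit is exceeded.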
Inspection - High average number of match tokens
The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.
Parameter | Value | Number of Match Groups Checked | Description |
---|---|---|---|
highAverageNumberOfTokens | 20 to 40 | 1 | A warning is displayed when the average number of match tokens per match rule is between 20 and 40. |
highAverageNumberOfTokens | above 40 | 1 | An error is displayed when the average number of match tokens per match rule is higher than 40. |
highAverageNumberOfTokens | 30 to 50 | All match rules | A warning is displayed when the average number of match tokens for all match rules is between 30 and 50. |
highAverageNumberOfTokens | above 50 | All match rules | An error is displayed when the average number of match tokens for all match rules is higher than 50. |
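The thresholds in the table above can be expressed as a small classification function. This is a sketch only: the function name and the scope labels "single" and "all" are hypothetical, and treating both ends of the warning range as inclusive is an assumption, because the table does not state how the boundary values are handled.

```python
from typing import Optional

# (warning range start, error threshold) per scope, taken from the table above.
# "single" = one match rule, "all" = all match rules (hypothetical labels).
THRESHOLDS = {"single": (20, 40), "all": (30, 50)}


def classify_average(avg_tokens: float, scope: str = "single") -> Optional[str]:
    """Return "ERROR", "WARNING", or None for an average token count."""
    warn_from, error_above = THRESHOLDS[scope]
    if avg_tokens > error_above:
        return "ERROR"
    # Assumption: the warning range includes both of its end values.
    if avg_tokens >= warn_from:
        return "WARNING"
    return None
```

For example, an average of 35 tokens is a warning in either scope, while an average of 45 is an error for a single match rule but only a warning across all match rules.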
Inspection - Too Many Tokens are Generated
{0}, {1}, and {2} are placeholders for your actual tenant data.
Parameter | Value | Description |
---|---|---|
thresholdManyTokens | 150 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens. |
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
Parameter | Value | Description |
---|---|---|
thresholdManyTokens | 300 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens. |
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
Parameter | Description |
---|---|
examples | Array of strings with entity identifiers |
entitiesPercentage | Actual percentage of entities that generated too many tokens |
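The check described by the tables above can be sketched as follows. This is a simplified illustration, not the analyzer's implementation: the function name, the dictionary of per-entity token counts, and the max_examples cap are hypothetical. The parameter names, default values, and result names come from the tables.

```python
from typing import Dict


def inspect_too_many_tokens(token_counts: Dict[str, int],
                            threshold_many_tokens: int = 150,
                            threshold_entities_percentage: float = 2.0,
                            max_examples: int = 5) -> dict:
    """Evaluate the too-many-tokens check for a set of profiled entities.

    token_counts maps an entity identifier to its number of tokens.
    max_examples is a hypothetical cap on the number of returned examples.
    """
    offenders = [entity_id for entity_id, count in token_counts.items()
                 if count > threshold_many_tokens]
    percentage = (100.0 * len(offenders) / len(token_counts)
                  if token_counts else 0.0)
    is_issue = percentage > threshold_entities_percentage
    return {
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],
        "severity": "ERROR" if is_issue else "INFO",
    }
```

With four entities of which one has 200 tokens, the default thresholds give an entitiesPercentage of 25.0 and an ERROR. Raising threshold_many_tokens to 300 for the same data gives INFO.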
Parameter | Description |
---|---|
Text | {0} of the entities generated an excessive number of token phrases. |
Severity | ERROR |
Parameters | |
Parameter | Description |
---|---|
Text | The number of token phrases generated by a single match group for an entity is normally less than {0}. {1} of the entities in the data set generated more than {0} token phrase(s) for the match group ''{2}''. This can significantly affect the performance of the match process. |
Parameters | |
Parameter | Description |
---|---|
Text | The number of token phrases generated for an entity by a set of match groups is normally below {0}. {1} of the entities in the data set generated more than {0} token phrase(s). This can significantly affect the performance of the match process. |
Parameters | |
Parameter | Description |
---|---|
Text | Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of ignoreInToken to suppress attributes that are unneeded for tokenization, and use of UseOvOnly to reduce the number of attribute values used by the tokenizer. |
Information Output
If no errors or warnings are reported, then the inspection results in an information output.
Parameter | Description |
---|---|
Summary | {0} entities generated less than {1} token phrase(s). |
Severity | INFO |
Parameters | |
Parameter | Description |
---|---|
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases. |
Parameters | |
Parameter | Description |
---|---|
Text | Great job! Ideally, between {1}-{2} token phrases are generated for a set of match groups. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases. |
Parameters | |
Parameter | Description |
---|---|
Text | Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set. |
Inspection - Too Few Tokens are Generated
{0}, {1}, and {2} are placeholders for your actual tenant data.
Parameter | Value | Description |
---|---|---|
thresholdFewTokens | 1 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens. |
thresholdEntitiesPercentage | 20 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with WARNING severity. |
Parameter | Value | Description |
---|---|---|
thresholdFewTokens | 5 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens. |
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
Parameter | Description |
---|---|
examples | Array of strings with entity identifiers |
entitiesPercentage | Percentage of entities that generated too few tokens |
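The two parameter sets above can be applied with one parameterized check, sketched below. The class and function names are hypothetical, and the sketch deliberately evaluates each parameter set on its own: the tables do not say how the WARNING and ERROR configurations relate to each other, so no combination logic is assumed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FewTokensConfig:
    """One row pair from the tables above (hypothetical container)."""
    threshold_few_tokens: int
    threshold_entities_percentage: float
    severity: str


# Values taken from the two tables above.
WARNING_CONFIG = FewTokensConfig(1, 20.0, "WARNING")
ERROR_CONFIG = FewTokensConfig(5, 2.0, "ERROR")


def inspect_too_few_tokens(token_counts: Dict[str, int],
                           config: FewTokensConfig) -> dict:
    """Evaluate one too-few-tokens configuration for the profiled entities."""
    offenders = [entity_id for entity_id, count in token_counts.items()
                 if count < config.threshold_few_tokens]
    percentage = (100.0 * len(offenders) / len(token_counts)
                  if token_counts else 0.0)
    is_issue = percentage > config.threshold_entities_percentage
    return {
        "entitiesPercentage": percentage,
        "examples": offenders,
        "severity": config.severity if is_issue else "INFO",
    }
```

For example, if one entity out of four has no tokens at all, WARNING_CONFIG reports 25.0 percent and a WARNING, because 25 is more than 20.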
Parameter | Description |
---|---|
Text | {0} entities generated very few token phrases. |
Severity | WARNING |
Parameters | |
Parameter | Description |
---|---|
Text | {0} entities generated too few token phrases. |
Severity | ERROR |
Parameters | |
Parameter | Description |
---|---|
Text | The number of token phrases generated for an entity by a single match group is normally greater than {0}. {1} of the entities in the data set generated less than {0} token phrase(s) for the match group "{2}". This suggests your tokenization design is too restrictive. |
Parameters | |
Parameter | Description |
---|---|
Text | The number of token phrases generated for an entity by a set of match groups is normally greater than {0}. {1} of the entities in the data set generated fewer than {0} token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive. |
Parameters | |
Parameter | Description |
---|---|
Text | Review the tokenization design with a goal of increasing the number of token phrases it generates. You might try increasing the fuzziness of the match token classes. |
Information Output
If no errors or warnings are reported, then the inspection results in an information output.
Parameter | Description |
---|---|
Summary | {0} entities generated over {1} token phrase(s). |
Severity | INFO |
Parameters | |
Parameter | Description |
---|---|
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases. |
Parameters | |
Parameter | Description |
---|---|
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity by a set of match groups. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases. |
Parameters | |
Parameter | Description |
---|---|
Text | Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set. |
Example of the Inspections
The following sample shows the use of both the inspections:
{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ],
      "computationResults": [
        {
          "name": "entitiesPercentage",
          "value": 5.5
        },
        {
          "name": "examples",
          "values": [
            "entityId1",
            "entityId2"
          ]
        }
      ],
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases.",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          }
        ]
      },
      "explanation": {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      },
      "recommendation": {
        "text": "Some recommendation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      }
    }
  ]
}
The following table explains the sections in the JSON output:
Name | Required | Description |
---|---|---|
inspections | No | The results of the applied inspections. |
inspections[*].uid | Yes | Unique identifier of an inspection result. |
inspections[*].inspectionId | Yes | Identifier of an applied inspection. |
inspections[*].parameters | Yes/No, depends on inspectionId | Parameters of an applied inspection. For example, for the tooManyTokens inspection, the parameters thresholdManyTokens and thresholdEntitiesPercentage are mandatory. Other inspections might have no parameters. |
inspections[*].parameters[*].name | Yes | Name of an inspection parameter. |
inspections[*].parameters[*].value | Yes | Value of an inspection parameter. |
inspections[*].computationResults | Yes/No, depends on inspectionId | Details of the applied inspection. For example, for the tooManyTokens inspection, the details section contains examples and the actual percentage of entities generating too many tokens. |
inspections[*].computationResults[*].name | Yes | Name of the details. |
inspections[*].computationResults[*].value | No | Value object with the details. |
inspections[*].computationResults[*].values | No | Values array with the details. |
inspections[*].summary | No | The section with a short description of a discovered issue. |
inspections[*].summary.severity | Yes | Severity of an issue. Possible values are ERROR, WARNING, and INFO. |
inspections[*].summary.text | Yes | Short description of an issue. Might be a format string (with placeholders). |
inspections[*].summary.parameters | No | Parameters representing some information during analysis to fill the issue text placeholders. |
inspections[*].summary.parameters[*].id | Yes | Id of a parameter (unique across the parameters section). |
inspections[*].summary.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
inspections[*].explanation | No | The section with a full explanation of a discovered issue. |
inspections[*].explanation.text | Yes | Full explanation of an issue. Might be a format string (with placeholders). |
inspections[*].explanation.parameters | No | Parameters representing some information during analysis to fill the explanation text placeholders. |
inspections[*].explanation.parameters[*].id | Yes | Id of a parameter (unique across the parameters section). |
inspections[*].explanation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
inspections[*].recommendation | No | The section with a recommendation on how to resolve the issue. |
inspections[*].recommendation.text | Yes | Text of a recommendation. Might be a format string (with placeholders). |
inspections[*].recommendation.parameters | No | Parameters representing some information during analysis to fill the recommendation text placeholders. |
inspections[*].recommendation.parameters[*].id | Yes | Id of a parameter (unique across the parameters section). |
inspections[*].recommendation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
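When the text fields are format strings, the placeholders can be filled on the client side from the accompanying parameters. The following is a minimal sketch, assuming that each {n} placeholder maps to the parameter whose id is n. The function name render_section is hypothetical, and any type-specific formatting (for example, how a percentage value is displayed) is not covered here because the table does not describe it.

```python
import re


def render_section(section: dict) -> str:
    """Fill {n} placeholders in a section's text from its parameters.

    Works for any section that has a "text" field and an optional
    "parameters" list, such as the explanation or recommendation section.
    """
    values = {p["id"]: str(p["value"]) for p in section.get("parameters", [])}
    # A placeholder without a matching parameter id is left unchanged.
    return re.sub(r"\{(\d+)\}",
                  lambda m: values.get(m.group(1), m.group(0)),
                  section["text"])


if __name__ == "__main__":
    explanation = {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
            {"id": "0", "type": "percentage", "value": "3"},
            {"id": "1", "type": "matchGroupUri", "value": "some uri"},
        ],
    }
    print(render_section(explanation))
    # prints "Extended explanation 3 some uri {2}"
```

Leaving unmatched placeholders in place, rather than raising an error, is a design choice for this sketch that makes missing parameters easy to spot in the rendered text.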