Inspections for Tokens

Information about the inspections that you can perform using the dynamic analyzer.

Overview

The following inspections are available:

Highly correlated match tokens
High number of match tokens
High average number of match tokens
Too many tokens are generated by a subset of the processed entities
Too few tokens are generated by a subset of the processed entities

The inspections are applied to a single match rule, a group of match rules (suspect, automatic, and relevance-based), and all match rules. The inspection results are available for each of these subsets of match rules.

Each inspection provides the following information:

Inspection ID (unique across the entire match rule analyzer output) that identifies the applied inspection
Parameters used to perform the inspection
Issue section with the severity and text
Explanation and recommendation sections

Profiling Section Updated

The inspections section is added under the profiling section of the payload. It contains a field called formatted, which is set to true by default. As a result, the inspection texts are formatted: their placeholders are substituted with the actual values. The issue, explanation, and recommendation sections appear in the output. The updated profiling section is as follows:

{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
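
To illustrate what placeholder substitution means, here is a minimal Python sketch. It is not the analyzer's implementation. The function name format_inspection_text and the sample value are assumptions made for illustration; only the template text comes from the issue details documented for the too-many-tokens inspection.

```python
from typing import Sequence


def format_inspection_text(template: str, parameters: Sequence[str]) -> str:
    """Substitute positional placeholders such as {0} and {1} with values."""
    return template.format(*parameters)


# Template text from the too-many-tokens issue details.
# The parameter value "3.5%" is invented for this example.
template = "{0} of the entities generated an excessive number of token phrases."
print(format_inspection_text(template, ["3.5%"]))
```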

Inspection - Highly correlated match tokens

The highlyCorrelatedMatchTokens inspection identifies match groups whose tokenization schemes generate highly correlated candidate pairs.

Note: [{0}] and [{1}] are placeholders for your actual tenant data.

Table 1. Inspection Issue Details - High Correlation
Parameter	Description
`Summary`	Match group `[{0}]` has a tokenization scheme that is seemingly redundant since it is fully covered in match group `[{1}]`.
`Explanation`	The analysis shows that the tokenization scheme of `[{0}]` is producing the same candidate pairs as `[{1}]`.
`Recommendation`	Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix overcollisioned tokens and too many tokens issue. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to `ignoreInToken` parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones).
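
The idea of one match group being "fully covered" by another can be sketched in Python as a subset check on candidate pairs. This is an illustration only: the analyzer's actual correlation measure is not described here, so strict subset coverage is used as a stand-in. The function names and the entity identifiers are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def normalize_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Treat candidate pairs as unordered, so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def is_fully_covered(pairs_a: Iterable[Tuple[str, str]],
                     pairs_b: Iterable[Tuple[str, str]]) -> bool:
    """Return True when every candidate pair of group A is also produced by group B."""
    return normalize_pairs(pairs_a) <= normalize_pairs(pairs_b)


# Hypothetical candidate pairs produced by two match groups.
group_a = [("e1", "e2"), ("e3", "e4")]
group_b = [("e2", "e1"), ("e3", "e4"), ("e5", "e6")]

print(is_fully_covered(group_a, group_b))  # True: group A adds no new pairs
```

In this sketch, group A would correspond to `[{0}]` and group B to `[{1}]` in the summary text.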

Table 2. Inspection Issue Details - After Fixing Correlation Issue
Parameter	Description
`Summary`	Match group tokenization tactics are evenly distributed.
`Explanation`	The analysis shows that your match groups tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently.
`Recommendation`	Keep going! It appears that your match groups tokenization schemes employ a variety of tactics and are not redundant.

Inspection - High number of match tokens

The documentsWithOmittedTokens inspection identifies match rules that generate a high number of match tokens. The limit is 1000 tokens per match document. If the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
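
The following Python sketch shows one possible reading of this limit. It assumes that the token count is tracked per match rule for a single match document, which the text above does not state explicitly. The function name and the rule names are hypothetical.

```python
from typing import Dict, List

MAX_TOKENS_PER_MATCH_DOCUMENT = 1000  # limit stated in the text


def rules_with_omitted_tokens(token_counts_by_rule: Dict[str, int]) -> List[str]:
    """Return the match rules whose tokens would be excluded for one match document.

    Assumption: a rule's tokens are omitted when its count exceeds the limit.
    """
    return [
        rule
        for rule, count in token_counts_by_rule.items()
        if count > MAX_TOKENS_PER_MATCH_DOCUMENT
    ]


# Hypothetical token counts for one match document.
print(rules_with_omitted_tokens({"rule_a": 1200, "rule_b": 35}))
```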

Inspection - High average number of match tokens

The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.

Table 3. Parameters for High Average Number of Tokens
Parameter	Value	Number of Match Groups Checked	Description
`highAverageNumberOfTokens`	20 to 40	1	A Warning is displayed when the average number of match tokens per match rule is between 20 and 40.
`highAverageNumberOfTokens`	above 40	1	An Error is displayed when the average number of match tokens per match rule is higher than 40.
`highAverageNumberOfTokens`	30 to 50	All match rules	A Warning is displayed when the average number of match tokens for all match rules is between 30 and 50.
`highAverageNumberOfTokens`	above 50	All match rules	An Error is displayed when the average number of match tokens for all match rules is higher than 50.
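
The thresholds in the table can be expressed as a small Python classifier. This is a sketch, not the analyzer's code. The scope labels "single" and "all" and the function name are hypothetical, and treating both ends of the warning range as inclusive is an assumption, because the table does not say how boundary values are handled.

```python
from typing import Optional

# (warning lower bound, error lower bound) per scope, taken from the table.
THRESHOLDS = {
    "single": (20, 40),  # one match rule
    "all": (30, 50),     # all match rules
}


def classify_average_tokens(average: float, scope: str) -> Optional[str]:
    """Return "ERROR", "WARNING", or None for an average number of match tokens."""
    warn_from, error_above = THRESHOLDS[scope]
    if average > error_above:
        return "ERROR"
    if average >= warn_from:  # assumption: range boundaries are inclusive
        return "WARNING"
    return None


print(classify_average_tokens(45, "single"))  # ERROR
print(classify_average_tokens(45, "all"))     # WARNING
```

The same average can therefore produce different severities depending on whether it is measured for one match rule or for all match rules.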

Inspection - Too Many Tokens are Generated

Note: {0}, {1}, and {2} are placeholders for your actual tenant data.

The inspection checks whether a subset of the processed entities generates more tokens than the specified threshold for the specified match groups. The following tables provide information about the inspection parameters:

Table 4. Parameters for a Single Match Group
Parameter	Value	Description
`thresholdManyTokens`	150	If the number of tokens for an entity is more than the `thresholdManyTokens` value, then the entity is generating too many tokens.
`thresholdEntitiesPercentage`	2	If the percentage of entities generating too many tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `ERROR` severity.

Table 5. Parameters for All Match Groups or Match Groups of a Specific Type
Parameter	Value	Description
`thresholdManyTokens`	300	If the number of tokens for an entity is more than the `thresholdManyTokens` value, then the entity is generating too many tokens.
`thresholdEntitiesPercentage`	2	If the percentage of entities generating too many tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `ERROR` severity.

Table 6. Inspection Results
Parameter	Description
`examples`	Array of strings with entity identifiers
`entitiesPercentage`	Actual percentage of entities that generated too many tokens
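
The way the two parameters combine can be sketched in Python as follows. This is an illustration under stated assumptions, not the analyzer's implementation: the function name, the input shape (a mapping from entity identifier to token count), and the max_examples cap are hypothetical. The defaults are the single match group values from Table 4, and the INFO result reflects the information output that the inspection produces when no issue is reported.

```python
from typing import Dict


def inspect_too_many_tokens(
    token_counts: Dict[str, int],
    threshold_many_tokens: int = 150,
    threshold_entities_percentage: float = 2.0,
    max_examples: int = 5,  # hypothetical cap on the number of listed examples
) -> dict:
    """Flag the data set when too large a share of entities generates too many tokens."""
    offenders = [
        entity_id
        for entity_id, count in token_counts.items()
        if count > threshold_many_tokens
    ]
    percentage = 100.0 * len(offenders) / len(token_counts) if token_counts else 0.0
    severity = "ERROR" if percentage > threshold_entities_percentage else "INFO"
    return {
        "severity": severity,
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],
    }


# Hypothetical counts: one of four entities exceeds 150 tokens (25%).
result = inspect_too_many_tokens({"e1": 10, "e2": 200, "e3": 20, "e4": 30})
print(result["severity"], result["entitiesPercentage"], result["examples"])
```

For all match groups or match groups of a specific type, the same sketch applies with threshold_many_tokens set to 300.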

Table 7. Inspection Issue Details
Parameter	Description
`Text`	`{0}` of the entities generated an excessive number of token phrases.
`Severity`	`ERROR`
`Parameters`	`{0}` percentage of the entities that generated an excessive number of token phrases.

Table 8. Inspection Explanation for a Single Match Group
Parameter	Description
`Text`	The number of token phrases generated by a single match group for an entity is normally less than `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s) for the match group ''`{2}`''. This can significantly affect the performance of the match process.
`Parameters`	`{0}` expected number of token phrases generated for one entity. `{1}` percentage of entities that generated more than the expected number of token phrases. `{2}` match group URI for which entities generated more than the expected number of token phrases.

Table 9. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter	Description
`Text`	The number of token phrases generated for an entity by a set of match groups is normally below `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s). This can significantly affect the performance of the match process.
`Parameters`	`{0}` expected number of token phrases generated for one entity. `{1}` percentage of entities that generated more than the expected number of token phrases.

Table 10. Inspection Recommendation
Parameter	Description
`Text`	Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of `ignoreInToken` to suppress attributes that are unneeded for tokenization, and use of `UseOvOnly` to reduce the number of attribute values used by the tokenizer.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 11. Summary Field
Parameter	Description
`Summary`	`{0}` entities generated less than `{1}` token phrase(s).
`Severity`	`INFO`
`Parameters`	`{0}` The percentage of entities of a particular entity type that generated a small number of token phrases. `{1}` The number of token phrases generated for this entity type.

Table 12. Explanation Field for a Single Match Group
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity per match group. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
`Parameters`	`{0}` The number of entities that generated a number of token phrases that exceeds the recommended limit. `{1}-{2}` The range of token phrases generated for an entity per match group.

Table 13. Explanation Field for Several Match Groups
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for a set of match groups. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
`Parameters`	`{0}` The number of entities that generated a number of token phrases that exceeds the recommended limit. `{1}-{2}` The range of token phrases generated for an entity by a set of match groups.

Table 14. Recommendation Field
Parameter	Description
`Text`	Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.

Inspection - Too Few Tokens are Generated

The inspection checks whether a subset of the processed entities generates fewer tokens than the specified threshold for the specified match groups. When applied to a single match group, the inspection has WARNING severity, because it might be acceptable for some entities to generate no tokens. When applied to all match rules or to match rules of a specific type, the inspection has ERROR severity, because in this case the entities do not participate in matching.
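
The scope-dependent severity can be sketched in Python as follows. This is an illustration only. The parameter name threshold_few_tokens, its default of 1 (which flags entities that generate no tokens), the scope labels, and the rule that a single offending entity is enough to raise an issue are all assumptions, because the threshold parameters for this inspection are not listed here.

```python
from typing import Dict, Optional


def inspect_too_few_tokens(
    token_counts: Dict[str, int],
    scope: str,
    threshold_few_tokens: int = 1,  # hypothetical parameter name and default
) -> Optional[dict]:
    """Return an issue when entities generate fewer tokens than the threshold.

    scope is "single" for one match group, or "all" / "type" for all match
    rules or match rules of a specific type (hypothetical labels).
    """
    if scope not in ("single", "all", "type"):
        raise ValueError("unknown scope: %s" % scope)
    offenders = [
        entity_id
        for entity_id, count in token_counts.items()
        if count < threshold_few_tokens
    ]
    if not offenders:
        return None  # no issue to report
    # A single match group may legitimately produce no tokens for an entity;
    # across all rules, such an entity does not participate in matching.
    severity = "WARNING" if scope == "single" else "ERROR"
    return {"severity": severity, "examples": offenders}


# Hypothetical counts: entity e2 generates no tokens.
print(inspect_too_few_tokens({"e1": 12, "e2": 0}, scope="single"))
print(inspect_too_few_tokens({"e1": 12, "e2": 0}, scope="all"))
```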