Inspections for Tokens

Information about the inspections that you can perform using the dynamic analyzer.

Overview

The following inspections are available:

Highly correlated match tokens
High number of match tokens
High average number of match tokens
Too many tokens are generated by a subset of the processed entities
Too few tokens are generated by a subset of the processed entities

The inspections are applied to a single match rule, to a group of match rules (suspect, automatic, or relevance-based), and to all match rules. Inspection results are available for each of these subsets of match rules.

Each inspection provides the following information:

Inspection ID (unique across all the match rule analyzer output) that points to the inspection applied
Parameters used to perform the inspection
Issue section with the severity and text
Explanation and recommendation sections

Profiling Section Updated

The inspections section is added under the profiling section of the payload. It contains a field called formatted, which is set to true by default. With this setting, the inspection texts are formatted: placeholders are replaced with the actual values. The issue, explanation, and recommendation sections appear in the output. The updated profiling section is as follows:

{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
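
As a rough illustration, the payload above could be assembled programmatically. The helper name `build_analyzer_payload` is hypothetical and not part of the product; only the fields shown in the sample above are used.

```python
import json

def build_analyzer_payload(formatted=True, max_objects_per_type=10000):
    """Build the request body shown above (hypothetical helper)."""
    return {
        "enabled": True,
        "entityTypes": [],
        "staticAnalysis": {},
        "profiling": {
            "enabled": True,
            "maxObjectsPerType": max_objects_per_type,
            # True is the default described in the text: inspection texts
            # are returned with placeholders replaced by actual values.
            "inspections": {"formatted": formatted},
        },
    }

payload = build_analyzer_payload()
print(json.dumps(payload, indent=2))
```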

Inspection - Highly correlated match tokens

The highlyCorrelatedMatchTokens inspection identifies match groups that generate highly correlated candidate pairs.

Table 1. Inspection Issue Details - High Correlation
Parameter	Description
`Summary`	Match group [{0}] has a tokenization scheme that is seemingly redundant since it is fully covered in match group [{1}].
`Explanation`	The analysis shows that the tokenization scheme of [{0}] is producing the same candidate pairs as [{1}].
`Recommendation`	Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix overcollisioned tokens and too many tokens issue. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to `ignoreInToken` parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones).
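
To make the idea of one match group being "fully covered" by another concrete, the sketch below treats each match group's candidate pairs as a set and looks for subsets. This is only an illustration under that assumption, not the analyzer's actual algorithm; the function name and the match group names are hypothetical.

```python
from itertools import permutations

def find_covered_groups(candidate_pairs):
    """Return (covered, covering) tuples of match group names.

    candidate_pairs maps a match group name to a set of candidate pairs,
    each pair being a frozenset of two entity identifiers.
    """
    covered = []
    for a, b in permutations(candidate_pairs, 2):
        # Group a is redundant if every pair it produces is also produced by b.
        if candidate_pairs[a] and candidate_pairs[a] <= candidate_pairs[b]:
            covered.append((a, b))
    return covered

pairs = {
    "byName": {frozenset({"e1", "e2"}), frozenset({"e2", "e3"})},
    "byNameAndCity": {frozenset({"e1", "e2"})},
}
print(find_covered_groups(pairs))
```

In this example, every candidate pair of the hypothetical group byNameAndCity is also produced by byName, so byNameAndCity is reported as covered.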

Table 2. Inspection Issue Details - After Fixing Correlation Issue
Parameter	Description
`Summary`	Match group tokenization tactics are evenly distributed.
`Explanation`	The analysis shows that your match groups tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently.
`Recommendation`	Keep going! It appears that your match groups tokenization schemes employ a variety of tactics and are not redundant.

Inspection - High number of match tokens

The documentsWithOmittedTokens inspection identifies match rules that generate a high number of match tokens. The limit is 1000 tokens per match document. If the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
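
The all-or-nothing effect of the limit can be sketched as follows. The sketch assumes that the limit is checked against the tokens that one match rule generates for one match document and that "exceeded" means strictly more than 1000; the function and rule names are hypothetical.

```python
TOKEN_LIMIT = 1000  # limit stated in the text above

def apply_token_limit(tokens_by_rule, limit=TOKEN_LIMIT):
    """Split rules into those kept for matching and those omitted.

    tokens_by_rule maps a match rule name to the list of tokens it
    generated for a single match document.
    """
    kept, omitted_rules = {}, []
    for rule, tokens in tokens_by_rule.items():
        if len(tokens) > limit:
            # Over the limit: none of this rule's tokens are used.
            omitted_rules.append(rule)
        else:
            kept[rule] = tokens
    return kept, omitted_rules

document_tokens = {
    "ruleA": ["t1", "t2", "t3"],
    "ruleB": [f"t{i}" for i in range(1001)],
}
kept, omitted = apply_token_limit(document_tokens)
print(sorted(kept), omitted)
```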

Inspection - High average number of match tokens

The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.

Table 3. Parameters for High Average Number of Tokens
Parameter	Value	Number of Match Groups Checked	Description
`highAverageNumberOfTokens`	20 to 40	1	A Warning is displayed when the average number of match tokens per match rule is between 20 and 40.
`highAverageNumberOfTokens`	above 40	1	An Error is displayed when the average number of match tokens per match rule is higher than 40.
`highAverageNumberOfTokens`	30 to 50	All match rules	A Warning is displayed when the average number of match tokens for all match rules is between 30 and 50.
`highAverageNumberOfTokens`	above 50	All match rules	An Error is displayed when the average number of match tokens for all match rules is higher than 50.
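
The thresholds in Table 3 can be read as a simple classification, sketched below. The function name and the `scope` argument are hypothetical, and the sketch assumes that the bounds of each warning range are inclusive, which the table does not state.

```python
def classify_average_tokens(average, scope="single"):
    """Map an average token count to a severity using the Table 3 values.

    scope is "single" for one match rule or "all" for all match rules.
    Returns None when the average is below the warning range.
    """
    warning_from, error_above = {"single": (20, 40), "all": (30, 50)}[scope]
    if average > error_above:
        return "ERROR"
    if average >= warning_from:  # assumption: lower bound is inclusive
        return "WARNING"
    return None

print(classify_average_tokens(25))         # single match rule
print(classify_average_tokens(45, "all"))  # all match rules
```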

Inspection - Too Many Tokens are Generated

The inspection checks whether a subset of the processed entities generates more tokens than the specified threshold for the specified match groups. The following tables provide information about the inspection parameters:

Table 4. Parameters for a Single Match Group
Parameter	Value	Description
`thresholdManyTokens`	150	If the number of tokens for an entity is more than the `thresholdManyTokens` value, then the entity is generating too many tokens.
`thresholdEntitiesPercentage`	2	If the percentage of entities generating too many tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `ERROR` severity.

Table 5. Parameters for all Match Groups or Match Groups of a Specific Type
Parameter	Value	Description
`thresholdManyTokens`	300	If the number of tokens for an entity is more than the `thresholdManyTokens` value, then the entity is generating too many tokens.
`thresholdEntitiesPercentage`	2	If the percentage of entities generating too many tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `ERROR` severity.

Table 6. Inspection Results
Parameter	Description
`examples`	Array of strings with entity identifiers
`entitiesPercentage`	Actual percentage of entities that generated too many tokens
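
The way the two parameters combine into the inspection results can be sketched as follows. This is a minimal illustration of the rules in Tables 4 to 6, not the analyzer's implementation; the function name, the `max_examples` argument, and the entity identifiers are hypothetical.

```python
def inspect_too_many_tokens(token_counts, threshold_many_tokens=150,
                            threshold_entities_percentage=2.0, max_examples=5):
    """Evaluate the too-many-tokens rule for a set of entities.

    token_counts maps an entity identifier to the number of tokens it generated.
    """
    offenders = [eid for eid, count in token_counts.items()
                 if count > threshold_many_tokens]
    percentage = 100.0 * len(offenders) / len(token_counts) if token_counts else 0.0
    severity = "ERROR" if percentage > threshold_entities_percentage else "INFO"
    return {
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],
        "severity": severity,
    }

counts = {"e1": 10, "e2": 200, "e3": 40, "e4": 151}
single_group = inspect_too_many_tokens(counts)  # defaults from Table 4
all_groups = inspect_too_many_tokens(counts, threshold_many_tokens=300)  # Table 5
print(single_group)
print(all_groups)
```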

Table 7. Inspection Issue Details
Parameter	Description
`Text`	`{0}` of the entities generated an excessive number of token phrases.
`Formatted Text`	`{0}` of the entities generated an excessive number of token phrases.
`Severity`	`ERROR`
`Parameters`	`Id: 0, Type: percentage, Value: 4.8 (any double from 0 to 100)`

Table 8. Inspection Explanation for a Single Match Group
Parameter	Description
`Text`	The number of token phrases generated by a single match group for an entity is normally less than `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s) for the match group ''`{2}`''. This can significantly affect the performance of the match process.
`Formatted Text`	The number of token phrases generated by a single match group for an entity is normally less than `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s) for the match group ''`{2}`''. This can significantly affect the performance of the match process.
`Parameters`	`Id: 0, Type: number, Value: 150` `Id: 1, Type: percentage, Value: 4.8` `Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)`

Table 9. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter	Description
`Text`	The number of token phrases generated for an entity by a set of match groups is normally below `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s). This can significantly affect the performance of the match process.
`Formatted Text`	The number of token phrases generated for an entity by a set of match groups is normally below `{0}`. `{1}` of the entities in the data set generated more than `{0}` token phrase(s). This can significantly affect the performance of the match process.
`Parameters`	`Id: 0, Type: number, Value: 300` `Id: 1, Type: percentage, Value: 4.8`
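
The relationship between a text with placeholders and its parameters can be sketched as a small substitution routine. This is an illustration only: the function name and the `labels` lookup are hypothetical, and rendering a percentage with a trailing percent sign is an assumption.

```python
import re

def format_text(text, parameters, labels=None):
    """Replace {n} placeholders in text with the values of the parameters.

    parameters is a list of dicts with the keys id, type, and value.
    labels optionally maps a match group URI to its display label.
    """
    labels = labels or {}
    by_id = {str(p["id"]): p for p in parameters}

    def render(match):
        param = by_id.get(match.group(1))
        if param is None:
            return match.group(0)  # leave unknown placeholders untouched
        value = param["value"]
        if param["type"] == "percentage":
            return f"{value}%"  # assumption: percentages get a percent sign
        if param["type"] == "matchGroupUri":
            return labels.get(value, value)  # show the label when known
        return str(value)

    return re.sub(r"\{(\d+)\}", render, text)

params = [
    {"id": "0", "type": "number", "value": 300},
    {"id": "1", "type": "percentage", "value": 4.8},
]
print(format_text("{1} of the entities generated more than {0} token phrase(s).", params))
```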

Table 10. Inspection Recommendation
Parameter	Description
`Formatted Text`	Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of `ignoreInToken` to suppress attributes that are unneeded for tokenization, and use of `UseOvOnly` to reduce the number of attribute values used by the tokenizer.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 11. Summary Field
Parameter	Description
`Summary`	`{0}` entities generated less than `{1}` token phrase(s).
`Severity`	`INFO`

Table 12. Explanation Field for a Single Match Group
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity per match group. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
`Formatted Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity per match group. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.

Table 13. Explanation Field for Several Match Groups
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for a set of match groups. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
`Formatted Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for a set of match groups. For this data set, only `{0}` entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.

Table 14. Recommendation Field
Parameter	Description
`Text`	Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.
`Formatted Text`	Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.

Inspection - Too Few Tokens are Generated

The inspection checks whether a subset of the processed entities generates fewer tokens than the specified threshold for the specified match groups. When applied to a single match group, the inspection has WARNING severity, because it can be acceptable for some entities to generate no tokens for one group. When applied to all match rules or to match rules of a specific type, the inspection has ERROR severity, because in this case the affected entities do not participate in matching.

Table 15. Parameters for a Single Match Group
Parameter	Value	Description
`thresholdFewTokens`	1	If the number of tokens for an entity is less than the `thresholdFewTokens` value, then the entity is generating too few tokens.
`thresholdEntitiesPercentage`	20	If the percentage of entities generating too few tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `WARNING` severity.

Table 16. Parameters for all Match Groups or Match Groups of a Specific Type
Parameter	Value	Description
`thresholdFewTokens`	5	If the number of tokens for an entity is less than the `thresholdFewTokens` value, then the entity is generating too few tokens.
`thresholdEntitiesPercentage`	2	If the percentage of entities generating too few tokens is more than the `thresholdEntitiesPercentage` value, then the inspection results in an issue with `ERROR` severity.

Table 17. Inspection Results
Parameter	Description
`examples`	Array of strings with entity identifiers
`entitiesPercentage`	Actual percentage of entities that generated too few tokens
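
The difference between the two scopes can be sketched with the values from Tables 15 and 16. The function name, the `single_match_group` flag, and the entity identifiers are hypothetical, and the sketch is not the analyzer's implementation.

```python
def inspect_too_few_tokens(token_counts, single_match_group=True, max_examples=5):
    """Evaluate the too-few-tokens rule for a set of entities.

    token_counts maps an entity identifier to the number of tokens it generated.
    """
    if single_match_group:
        threshold_few, threshold_pct, issue_severity = 1, 20.0, "WARNING"
    else:
        threshold_few, threshold_pct, issue_severity = 5, 2.0, "ERROR"
    offenders = [eid for eid, count in token_counts.items() if count < threshold_few]
    percentage = 100.0 * len(offenders) / len(token_counts) if token_counts else 0.0
    severity = issue_severity if percentage > threshold_pct else "INFO"
    return {
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],
        "severity": severity,
    }

few_counts = {"e1": 0, "e2": 3, "e3": 7, "e4": 12}
print(inspect_too_few_tokens(few_counts))
print(inspect_too_few_tokens(few_counts, single_match_group=False))
```

With these example counts, one entity out of four generates no tokens for the single match group (25 percent, above the threshold of 20), and two out of four generate fewer than five tokens across all match groups (50 percent, above the threshold of 2).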

Table 18. Inspection Issue Details - For a Single Match Group
Parameter	Description
`Text`	`{0}` entities generated very few token phrases.
`Formatted Text`	`{0}` entities generated very few token phrases.
`Severity`	`WARNING`
`Parameters`	`Id: 0, Type: percentage, Value: 10 (any double from 0 to 100)`

Table 19. Inspection Issue Details - For all Match Groups or Match Groups of a Specific Type
Parameter	Description
`Text`	`{0}` entities generated too few token phrases.
`Formatted Text`	`{0}` entities generated too few token phrases.
`Severity`	`ERROR`
`Parameters`	`Id: 0, Type: percentage, Value: 10 (any double from 0 to 100)`

Table 20. Inspection Explanation for a Single Match Group
Parameter	Description
`Text`	The number of token phrases generated for an entity by a single match group is normally greater than `{0}`. `{1}` of the entities in the data set generated less than `{0}` token phrase(s) for the match group "`{2}`". This suggests your tokenization design is too restrictive.
`Formatted Text`	The number of token phrases generated for an entity by a single match group is normally greater than `{0}`. `{1}` of the entities in the data set generated less than `{0}` token phrase(s) for the match group "`{2}`". This suggests your tokenization design is too restrictive.
`Parameters`	`Id: 0, Type: number, Value: 1` `Id: 1, Type: percentage, Value: 4.8` `Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)`

Table 21. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter	Description
`Text`	The number of token phrases generated for an entity by a set of match groups is normally greater than `{0}`. `{1}` of the entities in the data set generated fewer than `{0}` token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive.
`Formatted Text`	The number of token phrases generated for an entity by a set of match groups is normally greater than `{0}`. `{1}` of the entities in the data set generated fewer than `{0}` token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive.
`Parameters`	`Id: 0, Type: number, Value: 5` `Id: 1, Type: percentage, Value: 2` `Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)`

Table 22. Inspection Recommendation
Parameter	Description
`Formatted Text`	Review the tokenization design with a goal of increasing the number of token phrases it generates. You might try increasing the fuzziness of the match token classes.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 23. Summary Field
Parameter	Description
`Summary`	`{0}` entities generated over `{1}` token phrase(s).
`Severity`	`INFO`

Table 24. Explanation Field for a Single Match Group
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity per match group. For this data set, only `{0}` entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
`Formatted Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity per match group. For this data set, only `{0}` entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.

Table 25. Explanation Field for Several Match Groups
Parameter	Description
`Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity by a set of match groups. For this data set, only `{0}` entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
`Formatted Text`	Great job! Ideally, between `{1}-{2}` token phrases are generated for an entity by a set of match groups. For this data set, only `{0}` entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.

Table 26. Recommendation Field
Parameter	Description
`Text`	Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set.
`Formatted Text`	Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set.

Example of the Inspections

The following sample shows the use of both the inspections:

{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ],
      "computationResults": [
        {
          "name": "entitiesPercentage",
          "value": 5.5
        },
        {
          "name": "examples",
          "values": [
            "entityId1",
            "entityId2"
          ]
        }
      ],
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases.",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          }
        ]
      },
      "explanation": {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      },
      "recommendation": {
        "text": "Some recommendation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      }
    }
  ]
}

The following table explains the sections in the JSON output:

Table 27. JSON Output Sections
Name	Required	Description
`inspections`	No	The results of the applied inspections.
`inspections[*].uid`	Yes	Unique identifier of an inspection result.
`inspections[*].inspectionId`	Yes	Identifier of an applied inspection.
`inspections[*].parameters`	Yes/No, depends on `inspectionId`	Parameters of an applied inspection. For example, for `tooManyTokens` inspection, the parameters `thresholdManyTokens` and `thresholdEntitiesPercentage` are mandatory. Other inspections might have no parameters.
`inspections[*].parameters[*].name`	Yes	Name of an inspection parameter.
`inspections[*].parameters[*].value`	Yes	Value of an inspection parameter.
`inspections[*].computationResults`	Yes/No, depends on `inspectionId`	Details of the applied inspection. For example, for `tooManyTokens` inspection, the details section contains examples and actual percentage of entities generating too many tokens.
`inspections[*].computationResults[*].name`	Yes	Name of the details.
`inspections[*].computationResults[*].value`	No	Value object with the details.
`inspections[*].computationResults[*].values`	No	Values array with the details.
`inspections[*].summary`	No	The section with a short description of a discovered issue.
`inspections[*].summary.severity`	Yes	Severity of an issue. Possible values are `ERROR`, `WARNING`, and `INFO`.
`inspections[*].summary.text`	Yes	Short description of an issue. Might be a format string (with placeholders).
`inspections[*].summary.parameters`	No	Parameters representing some information during analysis to fill the issue text placeholders.
`inspections[*].summary.parameters[*].id`	Yes	Id of a parameter (unique across the `parameters` section).
`inspections[*].summary.parameters[*].type`	Yes	Type of a parameter, such as `text`, `number`, `percentage`, `matchGroupUri`, `matchGroupUris`, `attributeUri`, `attributeUris`, `matchTokenClass`, `comparatorClass`, and so on.
`inspections[*].explanation`	No	The section with a full explanation of a discovered issue.
`inspections[*].explanation.text`	Yes	Full explanation of an issue. Might be a format string (with placeholders).
`inspections[*].explanation.parameters`	No	Parameters representing some information during analysis to fill the explanation text placeholders.
`inspections[*].explanation.parameters[*].id`	Yes	Id of a parameter (unique across the `parameters` section).
`inspections[*].explanation.parameters[*].type`	Yes	Type of a parameter, such as `text`, `number`, `percentage`, `matchGroupUri`, `matchGroupUris`, `attributeUri`, `attributeUris`, `matchTokenClass`, `comparatorClass`, and so on.
`inspections[*].recommendation`	No	The section with a recommendation on how to resolve the issue.
`inspections[*].recommendation.text`	Yes	Text of a recommendation. Might be a format string (with placeholders).
`inspections[*].recommendation.parameters`	No	Parameters representing some information during analysis to fill the recommendation text placeholders.
`inspections[*].recommendation.parameters[*].id`	Yes	Id of a parameter (unique across the `parameters` section).
`inspections[*].recommendation.parameters[*].type`	Yes	Type of a parameter, such as `text`, `number`, `percentage`, `matchGroupUri`, `matchGroupUris`, `attributeUri`, `attributeUris`, `matchTokenClass`, `comparatorClass`, and so on.
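
A consumer of this output might walk the `inspections` array and collect the severity and text of each result, as in the sketch below. The function name is hypothetical. Note that the table above lists the short description under `summary`, while the sample output uses the key `issue`; the sketch accepts either.

```python
import json

def summarize_inspections(analyzer_output):
    """Return (inspectionId, severity, text) for each inspection result."""
    rows = []
    for result in analyzer_output.get("inspections", []):
        # Accept both key names that appear in this document.
        section = result.get("summary") or result.get("issue") or {}
        rows.append((result["inspectionId"],
                     section.get("severity"),
                     section.get("text")))
    return rows

sample_output = json.loads("""
{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases."
      }
    }
  ]
}
""")

for inspection_id, severity, text in summarize_inspections(sample_output):
    print(f"[{severity}] {inspection_id}: {text}")
```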
