Inspections for Tokens

Information about the inspections that you can perform using the dynamic analyzer.

Overview

The following inspections are available:
  • Highly correlated match tokens
  • High number of match tokens
  • High average number of match tokens
  • Too many tokens are generated by a subset of the processed entities
  • Too few tokens are generated by a subset of the processed entities
The inspections are applied to a single match rule, to a group of match rules (suspect, automatic, and relevance-based), and to all match rules. Inspection results are available for each of these subsets of match rules.
Each inspection provides the following information:
  • Inspection ID (unique across all the match rule analyzer output) that points to the inspection applied
  • Parameters used to perform the inspection
  • Issue section with the severity and text
  • Explanation and recommendation sections

Profiling Section Updated

The inspections section is added under the profiling section of the payload. It contains a field called formatted, which is set to true by default. As a result, the inspection texts are formatted: their placeholders are substituted with the actual values. The issue, explanation, and recommendation sections appear in the output. The updated profiling section is as follows:

{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
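
As a rough illustration of what placeholder substitution means here, the following Python sketch fills {0}-style placeholders from a parameters list. The function name format_inspection_text and the rendering of percentage values with a trailing % sign are assumptions made for this example; they do not describe the analyzer's actual implementation.

```python
import re


def format_inspection_text(text, parameters):
    """Replace each {N} placeholder with the value of the parameter whose id is N."""
    by_id = {str(p["id"]): p for p in parameters}

    def render(match):
        param = by_id.get(match.group(1))
        if param is None:
            # No matching parameter: keep the placeholder as it is.
            return match.group(0)
        value = param["value"]
        # Assumption: percentage parameters are displayed with a % sign.
        if param.get("type") == "percentage":
            return f"{value}%"
        return str(value)

    return re.sub(r"\{(\d+)\}", render, text)


issue = {
    "text": "{0} of the entities generated an excessive number of token phrases.",
    "parameters": [{"id": "0", "type": "percentage", "value": 4.8}],
}
formatted = format_inspection_text(issue["text"], issue["parameters"])
print(formatted)
```

Placeholders without a matching parameter are left unchanged, so a partially populated parameters list does not raise an error.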

Inspection - Highly correlated match tokens

The highlyCorrelatedMatchTokens inspection identifies match groups whose candidate pairs are highly correlated.

Table 1. Inspection Issue Details - High Correlation
Parameter Description
Summary Match group [{0}] has a tokenization scheme that is seemingly redundant since it is fully covered in match group [{1}].
Explanation The analysis shows that the tokenization scheme of [{0}] is producing the same candidate pairs as [{1}].
Recommendation Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix overcollisioned tokens and too many tokens issue. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to ignoreInToken parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones).
Table 2. Inspection Issue Details - After Fixing Correlation Issue
Parameter Description
Summary Match group tokenization tactics are evenly distributed.
Explanation The analysis shows that your match groups tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently.
Recommendation Keep going! It appears that your match groups tokenization schemes employ a variety of tactics and are not redundant.
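
The idea of one match group being "fully covered" by another can be pictured as a subset check on candidate pairs. The sketch below is only an illustration under stated assumptions: candidate pairs are modelled as frozensets of two entity identifiers, and the function name, group names, and data are hypothetical. How the analyzer actually measures correlation is not described here.

```python
from itertools import permutations


def covered_match_groups(pairs_by_group):
    """Return (covered, covering) tuples where every candidate pair produced
    by `covered` is also produced by `covering`."""
    result = []
    for a, b in permutations(pairs_by_group, 2):
        pairs_a, pairs_b = pairs_by_group[a], pairs_by_group[b]
        # Ignore empty groups: an empty set is trivially a subset of anything.
        if pairs_a and pairs_a <= pairs_b:
            result.append((a, b))
    return result


# Hypothetical candidate pairs per match group.
pairs_by_group = {
    "byName": {frozenset(("e1", "e2")), frozenset(("e2", "e3"))},
    "byNameAndCity": {frozenset(("e1", "e2"))},
}
redundant = covered_match_groups(pairs_by_group)
print(redundant)
```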

Inspection - High number of match tokens

The documentsWithOmittedTokens inspection identifies match rules that generate a high number of match tokens. The limit is 1000 tokens per match document. If the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
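
The effect of the limit can be sketched as follows. This is a minimal illustration, assuming the limit is evaluated per match rule for a single match document; the function name and the input structure are hypothetical.

```python
MAX_TOKENS_PER_MATCH_DOCUMENT = 1000  # limit stated in the text above


def split_by_token_limit(tokens_by_rule, limit=MAX_TOKENS_PER_MATCH_DOCUMENT):
    """Separate the match rules whose tokens are kept from those whose tokens
    are omitted because the rule exceeded the limit for this document."""
    kept, omitted = {}, []
    for rule, tokens in tokens_by_rule.items():
        if len(tokens) > limit:
            # All tokens of this rule are excluded, not only the excess ones.
            omitted.append(rule)
        else:
            kept[rule] = tokens
    return kept, omitted


# Hypothetical token lists for one match document.
tokens_by_rule = {
    "ruleA": [f"t{i}" for i in range(1001)],
    "ruleB": [f"t{i}" for i in range(1000)],
}
kept, omitted = split_by_token_limit(tokens_by_rule)
print(sorted(kept), omitted)
```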

Inspection - High average number of match tokens

The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.

Table 3. Parameters for High Average Number of Tokens
Parameter | Value | Number of Match Groups Checked | Description
highAverageNumberOfTokens | 20 to 40 | 1 | A Warning is displayed when the average number of match tokens per match rule is between 20 and 40.
highAverageNumberOfTokens | above 40 | 1 | An Error is displayed when the average number of match tokens per match rule is higher than 40.
highAverageNumberOfTokens | 30 to 50 | All match rules | A Warning is displayed when the average number of match tokens for all match rules is between 30 and 50.
highAverageNumberOfTokens | above 50 | All match rules | An Error is displayed when the average number of match tokens for all match rules is higher than 50.
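
The thresholds in Table 3 can be read as a simple classification. The sketch below assumes that both ends of each warning range are inclusive, which the table does not state explicitly; the function name is hypothetical.

```python
def average_tokens_severity(average, all_match_rules=False):
    """Map an average number of match tokens to a severity, or None if the
    average is below the warning range."""
    # Thresholds from Table 3: (20, 40) for one match rule, (30, 50) for all.
    warn_from, error_above = (30, 50) if all_match_rules else (20, 40)
    if average > error_above:
        return "ERROR"
    if average >= warn_from:  # assumption: range boundaries are inclusive
        return "WARNING"
    return None


print(average_tokens_severity(25))                        # single match rule
print(average_tokens_severity(45, all_match_rules=True))  # all match rules
```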

Inspection - Too Many Tokens are Generated

The inspection checks whether a subset of the processed entities generates more tokens than the specified threshold for the specified match groups. The following tables provide information about the inspection parameters:

Table 4. Parameters for a Single Match Group
Parameter Value Description
thresholdManyTokens 150 If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens.
thresholdEntitiesPercentage 2 If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 5. Parameters for all Match Groups or Match Groups of a Specific Type
Parameter Value Description
thresholdManyTokens 300 If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens.
thresholdEntitiesPercentage 2 If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 6. Inspection Results
Parameter Description
examples Array of strings with entity identifiers
entitiesPercentage Actual percentage of entities that generated too many tokens
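
The way the two parameters combine into the inspection results can be sketched as follows. This is an illustrative reading of the parameter tables, not the analyzer's code: the function name, the max_examples cap, and the sample data are assumptions, and the defaults are the single match group values.

```python
def too_many_tokens(token_counts, threshold_many_tokens=150,
                    threshold_entities_percentage=2.0, max_examples=10):
    """Evaluate the too-many-tokens check for a mapping of entity id to token count."""
    # An entity generates too many tokens when its count exceeds the threshold.
    offenders = [eid for eid, count in token_counts.items()
                 if count > threshold_many_tokens]
    percentage = 100.0 * len(offenders) / len(token_counts) if token_counts else 0.0
    # ERROR only when the share of such entities exceeds the percentage threshold.
    severity = "ERROR" if percentage > threshold_entities_percentage else "INFO"
    return {
        "severity": severity,
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],  # assumption: examples are capped
    }


# Hypothetical data: 95 entities with 10 tokens, 5 entities with 200 tokens.
counts = {f"entity{i}": 10 for i in range(95)}
counts.update({f"heavy{i}": 200 for i in range(5)})
result = too_many_tokens(counts)
print(result["severity"], result["entitiesPercentage"], result["examples"])
```

For all match groups or match groups of a specific type, the same check would be called with threshold_many_tokens=300.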
Table 7. Inspection Issue Details
Parameter Description
Text {0} of the entities generated an excessive number of token phrases.
Formatted Text {0} of the entities generated an excessive number of token phrases.
Severity ERROR
Parameters Id: 0, Type: percentage, Value: 4.8 (any double from 0 to 100)
Table 8. Inspection Explanation for a Single Match Group
Parameter Description
Text The number of token phrases generated by a single match group for an entity is normally less than {0}. {1} of the entities in the data set generated more than {0} token phrase(s) for the match group ''{2}''. This can significantly affect the performance of the match process.
Formatted Text The number of token phrases generated by a single match group for an entity is normally less than {0}. {1} of the entities in the data set generated more than {0} token phrase(s) for the match group ''{2}''. This can significantly affect the performance of the match process.
Parameters
  • Id: 0, Type: number, Value: 150
  • Id: 1, Type: percentage, Value: 4.8
  • Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)
Table 9. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter Description
Text The number of token phrases generated for an entity by a set of match groups is normally below {0}. {1} of the entities in the data set generated more than {0} token phrase(s). This can significantly affect the performance of the match process.
Formatted Text The number of token phrases generated for an entity by a set of match groups is normally below {0}. {1} of the entities in the data set generated more than {0} token phrase(s). This can significantly affect the performance of the match process.
Parameters
  • Id: 0, Type: number, Value: 300
  • Id: 1, Type: percentage, Value: 4.8
Table 10. Inspection Recommendation
Parameter Description
Formatted Text Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of ignoreInToken to suppress attributes that are unneeded for tokenization, and use of UseOvOnly to reduce the number of attribute values used by the tokenizer.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 11. Summary Field
Parameter Description
Summary {0} entities generated less than {1} token phrase(s).
Severity INFO
Table 12. Explanation Field for a Single Match Group
Parameter Description
Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Formatted Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Table 13. Explanation Field for Several Match Groups
Parameter Description
Text Great job! Ideally, between {1}-{2} token phrases are generated for a set of match groups. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Formatted Text Great job! Ideally, between {1}-{2} token phrases are generated for a set of match groups. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Table 14. Recommendation Field
Parameter Description
Text Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.
Formatted Text Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.

Inspection - Too Few Tokens are Generated

The inspection checks whether a subset of the processed entities generates fewer tokens than the specified threshold for the specified match groups. The inspection has WARNING severity if applied to a single match group, because it might be fine for some entities to generate no tokens. The inspection has ERROR severity if applied to all match rules or match rules of a specific type, because in this case the entities do not participate in matching.

Table 15. Parameters for a Single Match Group
Parameter Value Description
thresholdFewTokens 1 If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens.
thresholdEntitiesPercentage 20 If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with WARNING severity.
Table 16. Parameters for all Match Groups or Match Groups of a Specific Type
Parameter Value Description
thresholdFewTokens 5 If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens.
thresholdEntitiesPercentage 2 If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 17. Inspection Results
Parameter Description
examples Array of strings with entity identifiers
entitiesPercentage Actual percentage of entities that generated too few tokens
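
The scope-dependent thresholds and severities from Tables 15 and 16 can be sketched in a few lines. The function name, the single_match_group flag, and the sample data are hypothetical; only the threshold values and severities come from the tables.

```python
def too_few_tokens(token_counts, single_match_group=True):
    """Evaluate the too-few-tokens check for a mapping of entity id to token count."""
    if single_match_group:
        # Table 15: thresholdFewTokens 1, thresholdEntitiesPercentage 20, WARNING.
        threshold_few, threshold_pct, severity = 1, 20.0, "WARNING"
    else:
        # Table 16: thresholdFewTokens 5, thresholdEntitiesPercentage 2, ERROR.
        threshold_few, threshold_pct, severity = 5, 2.0, "ERROR"
    few = [eid for eid, count in token_counts.items() if count < threshold_few]
    percentage = 100.0 * len(few) / len(token_counts) if token_counts else 0.0
    return {
        "severity": severity if percentage > threshold_pct else "INFO",
        "entitiesPercentage": percentage,
        "examples": few,
    }


# Hypothetical data: 3 of 10 entities generate no tokens at all.
counts = {f"entity{i}": 10 for i in range(7)}
counts.update({f"empty{i}": 0 for i in range(3)})
single = too_few_tokens(counts, single_match_group=True)
overall = too_few_tokens(counts, single_match_group=False)
print(single["severity"], overall["severity"])
```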
Table 18. Inspection Issue Details - For a Single Match Group
Parameter Description
Text {0} entities generated very few token phrases.
Formatted Text {0} entities generated very few token phrases.
Severity WARNING
Parameters Id: 0, Type: percentage, Value: 10 (any double from 0 to 100)
Table 19. Inspection Issue Details - For all Match Groups or Match Groups of a Specific Type
Parameter Description
Text {0} entities generated too few token phrases.
Formatted Text {0} entities generated too few token phrases.
Severity ERROR
Parameters Id: 0, Type: percentage, Value: 10 (any double from 0 to 100)
Table 20. Inspection Explanation for a Single Match Group
Parameter Description
Text The number of token phrases generated for an entity by a single match group is normally greater than {0}. {1} of the entities in the data set generated less than {0} token phrase(s) for the match group "{2}". This suggests your tokenization design is too restrictive.
Formatted Text The number of token phrases generated for an entity by a single match group is normally greater than {0}. {1} of the entities in the data set generated less than {0} token phrase(s) for the match group "{2}". This suggests your tokenization design is too restrictive.
Parameters
  • Id: 0, Type: number, Value: 1
  • Id: 1, Type: percentage, Value: 4.8
  • Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)
Table 21. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter Description
Text The number of token phrases generated for an entity by a set of match groups is normally greater than {0}. {1} of the entities in the data set generated fewer than {0} token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive.
Formatted Text The number of token phrases generated for an entity by a set of match groups is normally greater than {0}. {1} of the entities in the data set generated fewer than {0} token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive.
Parameters
  • Id: 0, Type: number, Value: 5
  • Id: 1, Type: percentage, Value: 2
  • Id: 2, Type: matchGroupUri, Value: The match group URI (in formatted text it appears as match group label)
Table 22. Inspection Recommendation
Parameter Description
Formatted Text Review the tokenization design with a goal of increasing the number of token phrases it generates. You might try increasing the fuzziness of the match token classes.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 23. Summary Field
Parameter Description
Summary {0} entities generated over {1} token phrase(s).
Severity INFO
Table 24. Explanation Field for a Single Match Group
Parameter Description
Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Formatted Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Table 25. Explanation Field for Several Match Groups
Parameter Description
Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity by a set of match groups. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Formatted Text Great job! Ideally, between {1}-{2} token phrases are generated for an entity by a set of match groups. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Table 26. Recommendation Field
Parameter Description
Text Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set.
Formatted Text Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set.

Example of the Inspections

The following sample shows the output of an inspection, using tooManyTokens as the example:

{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ],
      "computationResults": [
        {
          "name": "entitiesPercentage",
          "value": 5.5
        },
        {
          "name": "examples",
          "values": [
            "entityId1",
            "entityId2"
          ]
        }
      ],
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases.",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          }
        ]
      },
      "explanation": {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      },
      "recommendation": {
        "text": "Some recommendation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      }
    }
  ]
}

The following table explains the sections in the JSON output:

Table 27. JSON Output Sections
Name Required Description
inspections No The results of the applied inspections.
inspections[*].uid Yes Unique identifier of an inspection result.
inspections[*].inspectionId Yes Identifier of an applied inspection.
inspections[*].parameters Yes/No, depends on inspectionId Parameters of an applied inspection. For example, for the tooManyTokens inspection, the parameters thresholdManyTokens and thresholdEntitiesPercentage are mandatory. Other inspections might have no parameters.
inspections[*].parameters[*].name Yes Name of an inspection parameter.
inspections[*].parameters[*].value Yes Value of an inspection parameter.
inspections[*].computationResults Yes/No, depends on inspectionId Details of the applied inspection. For example, for the tooManyTokens inspection, the details section contains examples and the actual percentage of entities generating too many tokens.
inspections[*].computationResults[*].name Yes Name of the details.
inspections[*].computationResults[*].value No Value object with the details.
inspections[*].computationResults[*].values No Values array with the details.

inspections[*].summary No The section with a short description of a discovered issue.
inspections[*].summary.severity Yes Severity of an issue. Possible values are ERROR, WARNING, and INFO.
inspections[*].summary.text Yes Short description of an issue. Might be a format string (with placeholders).
inspections[*].summary.parameters No Parameters representing some information during analysis to fill the issue text placeholders.
inspections[*].summary.parameters[*].id Yes Id of a parameter (unique across the parameters section).
inspections[*].summary.parameters[*].type Yes Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.

inspections[*].explanation No The section with a full explanation of a discovered issue.
inspections[*].explanation.text Yes Full explanation of an issue. Might be a format string (with placeholders).
inspections[*].explanation.parameters No Parameters representing some information during analysis to fill the explanation text placeholders.
inspections[*].explanation.parameters[*].id Yes Id of a parameter (unique across the parameters section).
inspections[*].explanation.parameters[*].type Yes Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.
inspections[*].recommendation No The section with a recommendation on how to resolve the issue.
inspections[*].recommendation.text Yes Text of a recommendation. Might be a format string (with placeholders).
inspections[*].recommendation.parameters No Parameters representing some information during analysis to fill the recommendation text placeholders.
inspections[*].recommendation.parameters[*].id Yes Id of a parameter (unique across the parameters section).
inspections[*].recommendation.parameters[*].type Yes Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.
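
A consumer of this output might check the required fields and collect the severity of each inspection. The following Python sketch is one possible way to do that and is not part of the analyzer. Because the short description appears as issue in the sample and as summary in Table 27, the sketch accepts either key; that choice, the function name, and the trimmed sample payload are assumptions.

```python
import json


def summarize_inspections(payload):
    """Return (inspectionId, severity, text) for every inspection result."""
    rows = []
    # The inspections section is optional, so default to an empty list.
    for inspection in payload.get("inspections", []):
        for field in ("uid", "inspectionId"):  # marked as required in Table 27
            if field not in inspection:
                raise ValueError(f"inspection is missing required field: {field}")
        # Assumption: the short description is under "summary" or "issue".
        section = inspection.get("summary") or inspection.get("issue") or {}
        rows.append((inspection["inspectionId"],
                     section.get("severity"),
                     section.get("text")))
    return rows


# Trimmed version of the sample output shown earlier.
sample = json.loads("""
{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases."
      }
    }
  ]
}
""")
rows = summarize_inspections(sample)
print(rows)
```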