Inspections for Tokens

This topic describes the inspections that you can perform using the dynamic analyzer.

Overview

The following inspections are available:
  • Highly correlated match tokens
  • High number of match tokens
  • High average number of match tokens
  • Too many tokens are generated by a subset of the processed entities
  • Too few tokens are generated by a subset of the processed entities
The inspections are applied to a single match rule, to a group of match rules (suspect, automatic, or relevance-based), and to all match rules. Inspection results are available for each of these subsets of match rules.
Each inspection provides the following information:
  • Inspection ID (unique across all the match rule analyzer output) that points to the inspection applied
  • Parameters used to perform the inspection
  • Issue section with the severity and text
  • Explanation and recommendation sections

Profiling Section Updated

An inspections section is added under the profiling section of the payload. It contains a field called formatted, which is set to true by default. As a result, the inspection texts are formatted: the placeholders in the texts are replaced with the actual values. The issue, explanation, and recommendation sections appear in the output. The updated profiling section is as follows:

{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
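
The placeholder substitution described above can be sketched in Python. This is a minimal illustration, not the analyzer's implementation: it assumes that unformatted texts keep {N} placeholders and that each parameter carries an id and a value, as in the sample output later in this topic. The function name fill_placeholders is hypothetical.

```python
import re


def fill_placeholders(text, parameters):
    """Replace each {N} placeholder with the value of the parameter whose id is N.

    Placeholders that have no matching parameter are left untouched.
    """
    values = {str(p["id"]): str(p["value"]) for p in parameters}
    return re.sub(r"\{(\d+)\}", lambda m: values.get(m.group(1), m.group(0)), text)


# Hypothetical unformatted issue section, shaped like the sample output.
issue = {
    "text": "{0} entities generated too many token phrases.",
    "parameters": [{"id": "0", "type": "percentage", "value": "3"}],
}
print(fill_placeholders(issue["text"], issue["parameters"]))
```

With formatted left at its default of true, this step should not be needed on the client side, because the texts arrive with the values already substituted.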

Inspection - Highly correlated match tokens

The highlyCorrelatedMatchTokens inspection identifies match groups whose tokenization schemes generate highly correlated candidate pairs.
Note: [{0}] and [{1}] are placeholders for your actual tenant data.
Table 1. Inspection Issue Details - High Correlation
Parameter | Description
Summary | Match group [{0}] has a tokenization scheme that is seemingly redundant since it is fully covered in match group [{1}].
Explanation | The analysis shows that the tokenization scheme of [{0}] is producing the same candidate pairs as [{1}].
Recommendation | Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix overcollisioned tokens and too many tokens issue. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to ignoreInToken parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones).
Table 2. Inspection Issue Details - After Fixing Correlation Issue
Parameter | Description
Summary | Match group tokenization tactics are evenly distributed.
Explanation | The analysis shows that your match groups tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently.
Recommendation | Keep going! It appears that your match groups tokenization schemes employ a variety of tactics and are not redundant.

Inspection - High number of match tokens

The documentsWithOmittedTokens inspection identifies match rules that generate a high number of match tokens. The limit is 1000 tokens per match document. If the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
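
As a rough illustration of the limit, the following Python sketch flags the match rules whose token count for one match document exceeds 1000. The input shape (a mapping of match rule URI to token count) and the rule names are assumptions made for the example; the analyzer's internal data model is not described here.

```python
MAX_TOKENS_PER_MATCH_DOCUMENT = 1000  # limit stated in this topic


def rules_with_omitted_tokens(token_counts_by_rule):
    """Return the match rules whose token count exceeds the limit.

    token_counts_by_rule maps a match rule URI to the number of tokens the
    rule generated for one match document (hypothetical input shape).
    """
    return sorted(
        rule
        for rule, count in token_counts_by_rule.items()
        if count > MAX_TOKENS_PER_MATCH_DOCUMENT
    )


# Hypothetical rule URIs and counts.
counts = {"rules/ByName": 1200, "rules/ByAddress": 1000, "rules/ByPhone": 12}
print(rules_with_omitted_tokens(counts))
```

In this sketch a rule that generates exactly 1000 tokens is not flagged, because the limit is described as being exceeded, not reached.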

Inspection - High average number of match tokens

The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.

Table 3. Parameters for High Average Number of Tokens
Parameter | Value | Number of Match Groups Checked | Description
highAverageNumberOfTokens | 20 to 40 | 1 | A warning is displayed when the average number of match tokens per match rule is between 20 and 40.
highAverageNumberOfTokens | Above 40 | 1 | An error is displayed when the average number of match tokens per match rule is higher than 40.
highAverageNumberOfTokens | 30 to 50 | All match rules | A warning is displayed when the average number of match tokens for all match rules is between 30 and 50.
highAverageNumberOfTokens | Above 50 | All match rules | An error is displayed when the average number of match tokens for all match rules is higher than 50.
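
The thresholds in Table 3 can be expressed as a small Python function. This is a sketch under stated assumptions: the function name is hypothetical, and the lower bound of each warning range is treated as inclusive, which the table does not state explicitly.

```python
def average_tokens_severity(average, all_match_rules=False):
    """Map an average token count to a severity using the Table 3 thresholds.

    Returns "ERROR", "WARNING", or None when the average is below the
    warning range. Treating 20 (or 30) as inclusive is an assumption.
    """
    warn_from, error_above = (30, 50) if all_match_rules else (20, 40)
    if average > error_above:
        return "ERROR"
    if average >= warn_from:
        return "WARNING"
    return None


print(average_tokens_severity(25))                        # single match rule
print(average_tokens_severity(25, all_match_rules=True))  # all match rules
```

The same average can produce different results depending on scope, because the thresholds for all match rules are higher than those for a single match rule.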

Inspection - Too Many Tokens are Generated

Note: {0}, {1}, and {2} are placeholders for your actual tenant data.
The inspection checks whether a subset of the processed entities generates more tokens than the specified threshold for the specified match groups. The following tables provide information about the inspection parameters:
Table 4. Parameters for a Single Match Group
Parameter | Value | Description
thresholdManyTokens | 150 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens.
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 5. Parameters for All Match Groups or Match Groups of a Specific Type
Parameter | Value | Description
thresholdManyTokens | 300 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens.
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 6. Inspection Results
Parameter | Description
examples | Array of strings with entity identifiers
entitiesPercentage | Actual percentage of entities that generated too many tokens
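
The way the two parameters combine into the inspection results can be sketched as follows. This is an illustrative Python sketch, not the analyzer's code: the function name, the input mapping of entity identifier to token count, and the max_examples cap are assumptions. The defaults are the single match group values from Table 4; pass 300 for thresholdManyTokens to mirror Table 5.

```python
def too_many_tokens(tokens_per_entity, threshold_many_tokens=150,
                    threshold_entities_percentage=2, max_examples=5):
    """Evaluate the too-many-tokens check over a mapping of entity id to token count."""
    # An entity generates too many tokens when its count exceeds the threshold.
    offenders = [eid for eid, count in tokens_per_entity.items()
                 if count > threshold_many_tokens]
    total = len(tokens_per_entity)
    percentage = 100.0 * len(offenders) / total if total else 0.0
    # An issue is raised only when the share of offenders exceeds the percentage threshold.
    severity = "ERROR" if percentage > threshold_entities_percentage else "INFO"
    return {
        "entitiesPercentage": percentage,
        "examples": offenders[:max_examples],  # max_examples is a hypothetical cap
        "severity": severity,
    }


# Hypothetical data set: 97 entities with 10 tokens each, 3 entities with 200.
counts = {f"entity{i}": 10 for i in range(97)}
counts.update({"entity97": 200, "entity98": 200, "entity99": 200})
print(too_many_tokens(counts))
```

Here 3 of 100 entities exceed 150 tokens, so entitiesPercentage is 3.0, which is above the threshold of 2 and produces ERROR severity.
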
Table 7. Inspection Issue Details
Parameter | Description
Text | {0} of the entities generated an excessive number of token phrases.
Severity | ERROR
Parameters | {0}: The percentage of entities that generated an excessive number of token phrases.

Table 8. Inspection Explanation for a Single Match Group
Parameter | Description
Text | The number of token phrases generated by a single match group for an entity is normally less than {0}. {1} of the entities in the data set generated more than {0} token phrase(s) for the match group ''{2}''. This can significantly affect the performance of the match process.
Parameters | {0}: The expected number of token phrases generated for one entity. {1}: The percentage of entities that generated more than the expected number of token phrases. {2}: The match group URI for which entities generated more than the expected number of token phrases.

Table 9. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter | Description
Text | The number of token phrases generated for an entity by a set of match groups is normally below {0}. {1} of the entities in the data set generated more than {0} token phrase(s). This can significantly affect the performance of the match process.
Parameters | {0}: The expected number of token phrases generated for one entity. {1}: The percentage of entities that generated more than the expected number of token phrases.

Table 10. Inspection Recommendation
Parameter | Description
Text | Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of ignoreInToken to suppress attributes that are unneeded for tokenization, and use of UseOvOnly to reduce the number of attribute values used by the tokenizer.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 11. Summary Field
Parameter | Description
Summary | {0} entities generated less than {1} token phrase(s).
Severity | INFO
Parameters | {0}: The percentage of entities of a particular entity type that generated a small number of token phrases. {1}: The number of token phrases generated for this entity type.

Table 12. Explanation Field for a Single Match Group
Parameter | Description
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Parameters | {0}: The number of entities that generated a number of token phrases that exceeds the recommended limit. {1}-{2}: The range of token phrases generated for an entity per match group.

Table 13. Explanation Field for Several Match Groups
Parameter | Description
Text | Great job! Ideally, between {1}-{2} token phrases are generated for a set of match groups. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases.
Parameters | {0}: The number of entities that generated a number of token phrases that exceeds the recommended limit. {1}-{2}: The range of token phrases generated for an entity per match group.

Table 14. Recommendation Field
Parameter | Description
Text | Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set.

Inspection - Too Few Tokens are Generated

The inspection checks whether a subset of the processed entities generates fewer tokens than the specified threshold for the specified match groups. When applied to a single match group, the inspection has WARNING severity because it might be acceptable for some entities to generate no tokens. When applied to all match rules or to match rules of a specific type, the inspection has ERROR severity because in this case the entities do not participate in matching.
Note: {0}, {1}, and {2} are placeholders for your actual tenant data.
Table 15. Parameters for a Single Match Group
Parameter | Value | Description
thresholdFewTokens | 1 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens.
thresholdEntitiesPercentage | 20 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with WARNING severity.
Table 16. Parameters for All Match Groups or Match Groups of a Specific Type
Parameter | Value | Description
thresholdFewTokens | 5 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens.
thresholdEntitiesPercentage | 2 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity.
Table 17. Inspection Results
Parameter | Description
examples | Array of strings with entity identifiers
entitiesPercentage | Percentage of entities that generated too few tokens
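
The scope-dependent thresholds and severities in Tables 15 and 16 can be sketched in Python. As before, this is an assumption-laden illustration: the function name and the input mapping of entity identifier to token count are hypothetical, and the analyzer may compute the result differently.

```python
def too_few_tokens(tokens_per_entity, single_match_group=True):
    """Evaluate the too-few-tokens check over a mapping of entity id to token count."""
    if single_match_group:
        # Table 15: fewer than 1 token, more than 20% of entities, WARNING.
        threshold_few_tokens, threshold_percentage, issue_severity = 1, 20, "WARNING"
    else:
        # Table 16: fewer than 5 tokens, more than 2% of entities, ERROR.
        threshold_few_tokens, threshold_percentage, issue_severity = 5, 2, "ERROR"
    offenders = [eid for eid, count in tokens_per_entity.items()
                 if count < threshold_few_tokens]
    total = len(tokens_per_entity)
    percentage = 100.0 * len(offenders) / total if total else 0.0
    severity = issue_severity if percentage > threshold_percentage else "INFO"
    return {"entitiesPercentage": percentage, "examples": offenders, "severity": severity}


# Hypothetical data set: 3 entities without tokens, 7 entities with 3 tokens each.
counts = {"e1": 0, "e2": 0, "e3": 0}
counts.update({f"e{i}": 3 for i in range(4, 11)})
print(too_few_tokens(counts))
print(too_few_tokens(counts, single_match_group=False))
```

For a single match group, 30% of the entities generate no tokens, which yields a WARNING. Across all match groups, every entity falls below 5 tokens, which yields an ERROR.
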
Table 18. Inspection Issue Details - For a Single Match Group
Parameter | Description
Text | {0} entities generated very few token phrases.
Severity | WARNING
Parameters | {0}: The percentage of entities that generated very few token phrases.

Table 19. Inspection Issue Details - For All Match Groups or Match Groups of a Specific Type
Parameter | Description
Text | {0} entities generated too few token phrases.
Severity | ERROR
Parameters | {0}: The percentage of entities that generated too few token phrases.

Table 20. Inspection Explanation for a Single Match Group
Parameter | Description
Text | The number of token phrases generated for an entity by a single match group is normally greater than {0}. {1} of the entities in the data set generated less than {0} token phrase(s) for the match group "{2}". This suggests your tokenization design is too restrictive.
Parameters | {0}: The number of token phrases is expected to be greater than this number. {1}: The percentage of entities that generated less than the expected number of token phrases. {2}: The match group URI for which less than the expected number of token phrases was generated.

Table 21. Inspection Explanation for All Match Groups or Match Groups of a Specific Type
Parameter | Description
Text | The number of token phrases generated for an entity by a set of match groups is normally greater than {0}. {1} of the entities in the data set generated fewer than {0} token phrase(s) which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive.
Parameters | {0}: The number of token phrases is expected to be greater than this number. {1}: The percentage of entities that generated less than the expected number of token phrases.

Table 22. Inspection Recommendation
Parameter | Description
Text | Review the tokenization design with a goal of increasing the number of token phrases it generates. You might try increasing the fuzziness of the match token classes.

Information Output

If no errors or warnings are reported, then the inspection results in an information output.

Table 23. Summary Field
Parameter | Description
Summary | {0} entities generated over {1} token phrase(s).
Severity | INFO
Parameters | {0}: The percentage of entities that generated more than the expected number of token phrases. {1}: The expected number of token phrases.

Table 24. Explanation Field for a Single Match Group
Parameter | Description
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Parameters | {0}: The number of entities that generated a number of token phrases that falls below the recommended limit. {1}-{2}: The range of token phrases generated for an entity per match group.

Table 25. Explanation Field for Several Match Groups
Parameter | Description
Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity by a set of match groups. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases.
Parameters | {0}: The number of entities that generated a number of token phrases that falls below the recommended limit. {1}-{2}: The range of token phrases generated for an entity per match group.

Table 26. Recommendation Field
Parameter | Description
Text | Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set.

Example of the Inspections

The following sample shows the output of the inspections:

{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ],
      "computationResults": [
        {
          "name": "entitiesPercentage",
          "value": 5.5
        },
        {
          "name": "examples",
          "values": [
            "entityId1",
            "entityId2"
          ]
        }
      ],
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases.",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          }
        ]
      },
      "explanation": {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      },
      "recommendation": {
        "text": "Some recommendation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      }
    }
  ]
}

The following table explains the sections in the JSON output:

Table 27. JSON Output Sections
Name | Required | Description
inspections | No | The results of the applied inspections.
inspections[*].uid | Yes | Unique identifier of an inspection result.
inspections[*].inspectionId | Yes | Identifier of an applied inspection.
inspections[*].parameters | Yes/No, depends on inspectionId | Parameters of an applied inspection. For example, for tooManyTokens inspection, the parameters thresholdManyTokens and thresholdEntitiesPercentage are mandatory. Other inspections might have no parameters.
inspections[*].parameters[*].name | Yes | Name of an inspection parameter.
inspections[*].parameters[*].value | Yes | Value of an inspection parameter.
inspections[*].computationResults | Yes/No, depends on inspectionId | Details of the applied inspection. For example, for tooManyTokens inspection, the details section contains examples and actual percentage of entities generating too many tokens.
inspections[*].computationResults[*].name | Yes | Name of the details.
inspections[*].computationResults[*].value | No | Value object with the details.
inspections[*].computationResults[*].values | No | Values array with the details.
inspections[*].summary | No | The section with a short description of a discovered issue.
inspections[*].summary.severity | Yes | Severity of an issue. Possible values are ERROR, WARNING, and INFO.
inspections[*].summary.text | Yes | Short description on an issue. Might be a format string (with placeholders).
inspections[*].summary.parameters | No | Parameters representing some information during analysis to fill the issue text placeholders.
inspections[*].summary.parameters[*].id | Yes | Id of a parameter (unique across the parameters section).
inspections[*].summary.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.
inspections[*].explanation | No | The section with a full explanation of a discovered issue.
inspections[*].explanation.text | Yes | Full explanation on an issue. Might be a format string (with placeholders).
inspections[*].explanation.parameters | No | Parameters representing some information during analysis to fill the explanation text placeholders.
inspections[*].explanation.parameters[*].id | Yes | Id of a parameter (unique across the parameters section).
inspections[*].explanation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.
inspections[*].recommendation | No | The section with a recommendation on how to resolve the issue.
inspections[*].recommendation.text | Yes | Text of a recommendation. Might be a format string (with placeholders).
inspections[*].recommendation.parameters | No | Parameters representing some information during analysis to fill the recommendation text placeholders.
inspections[*].recommendation.parameters[*].id | Yes | Id of a parameter (unique across the parameters section).
inspections[*].recommendation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on.
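
A client that consumes this output might reduce it to one line per inspection. The following Python sketch is only an illustration: the function name is hypothetical, and because the sample output names the short-description section issue while Table 27 names it summary, the sketch accepts either key. The embedded JSON is a shortened copy of the sample above.

```python
import json


def summarize_inspections(output):
    """Return (inspectionId, severity, text) for each inspection result."""
    rows = []
    for inspection in output.get("inspections", []):
        # The section is optional; accept both names seen in this topic.
        section = inspection.get("issue") or inspection.get("summary") or {}
        rows.append((
            inspection["inspectionId"],
            section.get("severity"),
            section.get("text"),
        ))
    return rows


sample = json.loads("""
{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases."
      }
    }
  ]
}
""")

for inspection_id, severity, text in summarize_inspections(sample):
    print(f"{severity}: {inspection_id}: {text}")
```

Only uid and inspectionId are marked as required in Table 27, so the sketch reads every other section defensively and returns None for a missing severity or text.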