Inspections for Tokens
Information about the inspections that you can perform using the dynamic analyzer.
Overview
The dynamic analyzer supports the following inspections:
- Highly correlated match tokens
- High number of match tokens
- High average number of match tokens
- Too many tokens are generated by a subset of the processed entities
- Too few tokens are generated by a subset of the processed entities

Each inspection result contains the following:
- Inspection ID (unique across all the match rule analyzer output) that points to the inspection applied
- Parameters used to perform the inspection
- Issue section with the severity and text
- Explanation and recommendation sections
Updated Profiling Section
The inspections section is added under the profiling section of the payload. Under the inspections section, there is a field called formatted, which is set to true by default. As a result, the inspection texts are formatted and their placeholders are substituted with the actual values. The sections for the issue, explanation, and recommendation appear in the output. The updated profiling section is as follows:
{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": true
    }
  }
}
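If you prefer to perform the placeholder substitution yourself, you can presumably set formatted to false so that the inspection texts are returned as raw format strings. The following variant is a minimal sketch of that configuration, with the rest of the payload unchanged:
{
  "enabled": true,
  "entityTypes": [],
  "staticAnalysis": {},
  "profiling": {
    "enabled": true,
    "maxObjectsPerType": 10000,
    "inspections": {
      "formatted": false
    }
  }
}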
      Inspection - Highly correlated match tokens
The highlyCorrelatedMatchTokens inspection is added to identify match groups that generate candidate pairs with high correlation. [{0}] and [{1}] are placeholders for your actual tenant data.

| Parameter | Description |
|---|---|
| Summary | Match group [{0}] has a tokenization scheme that is seemingly redundant since it is fully covered in match group [{1}]. |
| Explanation | The analysis shows that the tokenization scheme of [{0}] is producing the same candidate pairs as [{1}]. |
| Recommendation | Get rid of one of the tokenization schemes for better performance. Consider fixing this only after you fix the overcollisioned tokens and too many tokens issues. You can transform one of your tokenization schemes to be identical to the other to get rid of one of the schemes. You can also add attributes to the ignoreInToken parameter and use the same tokenization and cleanse functions for the attributes (with preference to the most fuzzy ones). |
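For illustration only, the following sketch shows how attributes could be listed in the ignoreInToken parameter of a match group so that they do not contribute to token generation. The entity type, attribute URIs, and the remaining rule elements are hypothetical placeholders; adjust them to your own configuration.
{
  "uri": "configuration/entityTypes/Individual/matchGroups/NameAndAddress",
  "label": "Name and Address",
  "type": "suspect",
  "rule": {
    "ignoreInToken": [
      "configuration/entityTypes/Individual/attributes/MiddleName",
      "configuration/entityTypes/Individual/attributes/Suffix"
    ],
    "exact": [
      "configuration/entityTypes/Individual/attributes/LastName"
    ],
    "fuzzy": [
      "configuration/entityTypes/Individual/attributes/FirstName"
    ]
  }
}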
If no redundancy is found, the inspection results in an information output:

| Parameter | Description |
|---|---|
| Summary | Match group tokenization tactics are evenly distributed. |
| Explanation | The analysis shows that your match groups' tokenization schemes have a nicely distributed set of tactics, each of which is evaluating candidate pairs differently. |
| Recommendation | Keep going! It appears that your match groups' tokenization schemes employ a variety of tactics and are not redundant. |
Inspection - High number of match tokens
The documentsWithOmittedTokens inspection is added to identify match rules that generate a high number of match tokens. A limit of 1000 tokens per match document is enforced: if the limit is exceeded, all the match tokens of the match rule are excluded from the matching process.
Inspection - High average number of match tokens
The highAverageNumberOfTokens inspection detects entities with a high average number of match tokens.

| Parameter | Value | Number of Match Groups Checked | Description |
|---|---|---|---|
| highAverageNumberOfTokens | 20 to 40 | 1 | A warning is displayed when the average number of match tokens per match rule is between 20 and 40. |
| highAverageNumberOfTokens | above 40 | 1 | An error is displayed when the average number of match tokens per match rule is higher than 40. |
| highAverageNumberOfTokens | 30 to 50 | All match rules | A warning is displayed when the average number of match tokens for all match rules is between 30 and 50. |
| highAverageNumberOfTokens | above 50 | All match rules | An error is displayed when the average number of match tokens for all match rules is higher than 50. |
Inspection - Too many tokens are generated
The tooManyTokens inspection identifies entities that generate too many match tokens. {0}, {1}, and {2} are placeholders for your actual tenant data.

| Parameter | Value | Description |
|---|---|---|
| thresholdManyTokens | 150 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens. |
| thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
| Parameter | Value | Description |
|---|---|---|
| thresholdManyTokens | 300 | If the number of tokens for an entity is more than the thresholdManyTokens value, then the entity is generating too many tokens. |
| thresholdEntitiesPercentage | 2 | If the percentage of entities generating too many tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
The computation results of the inspection contain the following details:

| Parameter | Description |
|---|---|
| examples | Array of strings with entity identifiers. |
| entitiesPercentage | Actual percentage of entities that generated too many tokens. |
| Parameter | Description |
|---|---|
| Text | {0} of the entities generated an excessive number of token phrases. |
| Severity | ERROR |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | The number of token phrases generated by a single match group for an entity is normally less than {0}. {1} of the entities in the data set generated more than {0} token phrase(s) for the match group "{2}". This can significantly affect the performance of the match process. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | The number of token phrases generated for an entity by a set of match groups is normally below {0}. {1} of the entities in the data set generated more than {0} token phrase(s). This can significantly affect the performance of the match process. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Review the tokenization design with a goal of reducing the number of token phrases it generates. Typical tactics include reducing the fuzziness of the match token classes, increasing the use of ignoreInToken to suppress attributes that are unneeded for tokenization, and use of useOvOnly to reduce the number of attribute values used by the tokenizer. |
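As an illustration of those tactics (not part of the analyzer output), a match group could restrict tokenization to operational values with useOvOnly and map an attribute to a less fuzzy match token class. The URIs and the class name below are examples only; verify the values supported in your tenant configuration.
{
  "uri": "configuration/entityTypes/Individual/matchGroups/NameOnly",
  "label": "Name Only",
  "type": "suspect",
  "useOvOnly": "true",
  "rule": {
    "ignoreInToken": [
      "configuration/entityTypes/Individual/attributes/MiddleName"
    ],
    "matchTokenClasses": {
      "mapping": [
        {
          "attribute": "configuration/entityTypes/Individual/attributes/FirstName",
          "class": "com.reltio.match.token.ExactMatchToken"
        }
      ]
    },
    "exact": [
      "configuration/entityTypes/Individual/attributes/FirstName"
    ]
  }
}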
Information Output
If no errors or warnings are reported, then the inspection results in an information output.
| Parameter | Description |
|---|---|
| Summary | {0} entities generated less than {1} token phrase(s). |
| Severity | INFO |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Great job! Ideally, between {1}-{2} token phrases are generated for a set of match groups. For this data set, only {0} entities fall above the recommended upper limit of token phrase generation and so it appears that entities in your data set have not produced an excessive number of token phrases. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Keep on going! It appears your match groups are not generating an excessive number of token phrases for this data set. |
Inspection - Too few tokens are generated
This inspection identifies entities that generate too few match tokens. {0}, {1}, and {2} are placeholders for your actual tenant data.

| Parameter | Value | Description |
|---|---|---|
| thresholdFewTokens | 1 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens. |
| thresholdEntitiesPercentage | 20 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with WARNING severity. |
| Parameter | Value | Description |
|---|---|---|
| thresholdFewTokens | 5 | If the number of tokens for an entity is less than the thresholdFewTokens value, then the entity is generating too few tokens. |
| thresholdEntitiesPercentage | 2 | If the percentage of entities generating too few tokens is more than the thresholdEntitiesPercentage value, then the inspection results in an issue with ERROR severity. |
The computation results of the inspection contain the following details:

| Parameter | Description |
|---|---|
| examples | Array of strings with entity identifiers. |
| entitiesPercentage | Percentage of entities that generated too few tokens. |
| Parameter | Description |
|---|---|
| Text | {0} entities generated very few token phrases. |
| Severity | WARNING |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | {0} entities generated too few token phrases. |
| Severity | ERROR |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | The number of token phrases generated for an entity by a single match group is normally greater than {0}. {1} of the entities in the data set generated less than {0} token phrase(s) for the match group "{2}". This suggests your tokenization design is too restrictive. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | The number of token phrases generated for an entity by a set of match groups is normally greater than {0}. {1} of the entities in the data set generated fewer than {0} token phrase(s), which significantly reduces the likelihood of these entities being paired with others for matching. This suggests your tokenization design is too restrictive. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Review the tokenization design with a goal of increasing the number of token phrases it generates. You might try increasing the fuzziness of the match token classes. |
Information Output
If no errors or warnings are reported, then the inspection results in an information output.
| Parameter | Description |
|---|---|
| Summary | {0} entities generated over {1} token phrase(s). |
| Severity | INFO |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity per match group. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Great job! Ideally, between {1}-{2} token phrases are generated for an entity by a set of match groups. For this data set, only {0} entities fall below the recommended lower limit of token phrase generation and so it appears that entities in your data set have produced an adequate number of token phrases. |
| Parameters |  |
| Parameter | Description |
|---|---|
| Text | Keep on going! It appears your match groups are generating an adequate number of token phrases for this data set. |
Example of the Inspections
The following sample shows the output of the inspections:
{
  "inspections": [
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ],
      "computationResults": [
        {
          "name": "entitiesPercentage",
          "value": 5.5
        },
        {
          "name": "examples",
          "values": [
            "entityId1",
            "entityId2"
          ]
        }
      ],
      "issue": {
        "severity": "ERROR",
        "text": "{0} entities generated too many token phrases.",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          }
        ]
      },
      "explanation": {
        "text": "Extended explanation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      },
      "recommendation": {
        "text": "Some recommendation {0} {1} {2}",
        "parameters": [
          {
            "id": "0",
            "type": "percentage",
            "value": "3"
          },
          {
            "id": "1",
            "type": "matchGroupUri",
            "value": "some uri"
          }
        ]
      }
    }
  ]
}
The following table explains the sections in the JSON output:
| Name | Required | Description |
|---|---|---|
| inspections | No | The results of the applied inspections. |
| inspections[*].uid | Yes | Unique identifier of an inspection result. |
| inspections[*].inspectionId | Yes | Identifier of an applied inspection. |
| inspections[*].parameters | Yes/No, depends on inspectionId | Parameters of an applied inspection. For example, for the tooManyTokens inspection, the parameters thresholdManyTokens and thresholdEntitiesPercentage are mandatory. Other inspections might have no parameters. |
| inspections[*].parameters[*].name | Yes | Name of an inspection parameter. |
| inspections[*].parameters[*].value | Yes | Value of an inspection parameter. |
| inspections[*].computationResults | Yes/No, depends on inspectionId | Details of the applied inspection. For example, for the tooManyTokens inspection, the details section contains examples and the actual percentage of entities generating too many tokens. |
| inspections[*].computationResults[*].name | Yes | Name of the details. |
| inspections[*].computationResults[*].value | No | Value object with the details. |
| inspections[*].computationResults[*].values | No | Values array with the details. |
| inspections[*].issue | No | The section with a short description of a discovered issue. |
| inspections[*].issue.severity | Yes | Severity of an issue. Possible values are ERROR, WARNING, and INFO. |
| inspections[*].issue.text | Yes | Short description of an issue. Might be a format string (with placeholders). |
| inspections[*].issue.parameters | No | Parameters captured during analysis that fill the issue text placeholders. |
| inspections[*].issue.parameters[*].id | Yes | ID of a parameter (unique across the parameters section). |
| inspections[*].issue.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
| inspections[*].explanation | No | The section with a full explanation of a discovered issue. |
| inspections[*].explanation.text | Yes | Full explanation of an issue. Might be a format string (with placeholders). |
| inspections[*].explanation.parameters | No | Parameters captured during analysis that fill the explanation text placeholders. |
| inspections[*].explanation.parameters[*].id | Yes | ID of a parameter (unique across the parameters section). |
| inspections[*].explanation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
| inspections[*].recommendation | No | The section with a recommendation on how to resolve the issue. |
| inspections[*].recommendation.text | Yes | Text of a recommendation. Might be a format string (with placeholders). |
| inspections[*].recommendation.parameters | No | Parameters captured during analysis that fill the recommendation text placeholders. |
| inspections[*].recommendation.parameters[*].id | Yes | ID of a parameter (unique across the parameters section). |
| inspections[*].recommendation.parameters[*].type | Yes | Type of a parameter, such as text, number, percentage, matchGroupUri, matchGroupUris, attributeUri, attributeUris, matchTokenClass, comparatorClass, and so on. |
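As a rough illustration of the Required column (not a complete analyzer response), a hypothetical inspections entry for the tooManyTokens inspection carrying only its identifiers and mandatory parameters could look like this:
{
  "inspections": [
    {
      "uid": "unique-result-id",
      "inspectionId": "tooManyTokens",
      "parameters": [
        {
          "name": "thresholdManyTokens",
          "value": 300
        },
        {
          "name": "thresholdEntitiesPercentage",
          "value": "2%"
        }
      ]
    }
  ]
}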