Inspections for Overcollisioned Tokens

Information about the inspection that identifies overcollisioned tokens.

Overview

A token is overcollisioned when the matchToken and matchTokenIntersection analysis shows that it is shared by too many entities in a data set. To find out whether a token is overcollisioned, check the data storage for tokens marked as overcollisioned.

Inspection for Overcollisioned Tokens

The inspection can be applied to a single match group, to match groups of a specific type, or to all match groups. It is based on the results of the overcollisionedMatchToken calculator.

Table 1. Inspection Result Details

| Name | Description |
| --- | --- |
| examples | An array of structures describing the token phrases and the corresponding entity identifiers. 10 examples are provided, sorted in descending order by number of entities. |
| total | The total number of token phrases whose number of entities exceeds the threshold. |
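
To make the two result fields concrete, the following is a minimal Python sketch of how `total` and `examples` could be derived from a token-to-entities mapping. The function name `summarize_overcollisions`, the input shape, and the `threshold` argument are assumptions made for illustration; the sketch does not reproduce the overcollisionedMatchToken calculator itself.

```python
from typing import Dict, List


def summarize_overcollisions(
    token_entities: Dict[str, List[str]],
    threshold: int,
    max_examples: int = 10,
) -> dict:
    """Build hypothetical `total` and `examples` fields from a token -> entities map."""
    # Keep only token phrases whose entity count exceeds the threshold.
    over = {t: e for t, e in token_entities.items() if len(e) > threshold}
    # Sort by number of entities, largest first, and keep at most `max_examples`.
    ranked = sorted(over.items(), key=lambda kv: len(kv[1]), reverse=True)
    examples = [
        {"token": t, "count": len(e), "entities": e}
        for t, e in ranked[:max_examples]
    ]
    return {"total": len(over), "examples": examples}
```

With a threshold of 2, a mapping in which one token phrase has four entities, another has three, and a third has one would give a `total` of 2, with the four-entity phrase listed first in `examples`.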

Table 2. Inspection Summary

| Parameter | Description |
| --- | --- |
| Text | {0} overcollisioned token phrases found. The performance of your tenant will likely be impacted. |
| Severity | ERROR (if more than 1% of the entities generate overcollisioned tokens) or WARNING (if less than 1% of the entities generate overcollisioned tokens). |
| Parameters | Id: 0, Type: Number, Value: 20 (not more than the number of processed entities of the specific entity type) |
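
The severity rule can be sketched as a small function. This is an illustration only: the function name and its arguments are hypothetical, the INFO branch reflects the information output returned when nothing is reported, and the handling of exactly 1% (not stated in the rule) is an assumption.

```python
def overcollision_severity(affected_entities: int, processed_entities: int) -> str:
    """Return a severity label following the 1% rule described for the summary."""
    if processed_entities <= 0:
        raise ValueError("processed_entities must be positive")
    if affected_entities == 0:
        # No entity generates overcollisioned tokens: information output.
        return "INFO"
    # Integer arithmetic avoids floating-point rounding at the 1% boundary.
    # Assumption: exactly 1% is treated as WARNING; the rule does not say.
    if affected_entities * 100 > processed_entities:
        return "ERROR"
    return "WARNING"
```

For example, 2 affected entities out of 100 processed would be classified as ERROR under this sketch, while 1 out of 1,000 would be a WARNING.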

Table 3. Inspection Explanation

| Parameter | Description |
| --- | --- |
| Text | The same token phrase(s) must not be generated by more than {0} of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, {1} overcollisioned token phrases have been found. |
| Parameters | Id: 0, Type: Number, Value: 300 (fixed API threshold for overcollisioned tokens); Id: 1, Type: Number, Value: 20 (not more than the number of processed entities of the specific entity type); Id: 2, Type: Number, Value: 8 (equal to threshold value) |

Table 4. Inspection Recommendation for a Single Match Group

| Parameter | Description |
| --- | --- |
| Text | Review the match group ''{0}'' associated with these token phrases, and try creating more variation in the match group and the associated tokenization design. |
| Parameters | Id: 0, Type: matchRuleURI, Value: configuration/entityTypes/Type1/matchGroups/MG1 |

Table 5. Inspection Recommendation for Multiple Match Groups

| Parameter | Description |
| --- | --- |
| Text | Review the match groups associated with these token phrases, and try creating more variation in your match groups and the associated tokenization designs. |

Information Output

If the inspection reports no errors or warnings, it returns an information output instead.

Table 6. Summary Field

| Parameter | Description |
| --- | --- |
| Text | The distribution of token phrases across the entities in the data set is reasonable. No overcollisioned token phrases found. |
| Severity | INFO |

Table 7. Explanation Field

| Parameter | Description |
| --- | --- |
| Text | Great job! The repetition of identical token phrases from this data set does not exceed {0}, thus there are no token phrases that were generated by more than the max recommended number of entities. |

Table 8. Recommendation Field

| Parameter | Description |
| --- | --- |
| Text | Keep on going! It appears you're getting a good distribution of token phrases across your data set. |

Output Example

{
  "uid": "1-987-asdf-vcxb-sadf",
  "inspectionId": "overcollisionedTokens",
  "parameters": [
    {
      "name": "thresholdWarning",
      "value": 3
    },
    {
      "name": "thresholdError",
      "value": 6
    }
  ],
  "computationResults": [
    {
      "name": "total",
      "value": 20
    },
    {
      "name": "examples",
      "values": [
        {
          "token": "tokenPhrase1",
          "count": 8,
          "entities": [
            "entity1",
            "entity2"
          ]
        },
        {
          "token": "tokenPhrase2",
          "count": 4,
          "entities": [
            "entity3",
            "entity4"
          ]
        }
      ]
    }
  ],
  "summary": {
    "severity": "ERROR",
    "text": "{0} token phrases contain too many entities.",
    "parameters": [
      {
        "id": "0",
        "type": "number",
        "value": "20"
      }
    ]
  },
  "explanation": {
    "text": "The same token phrase(s) must not be generated by more than {0} of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, {1} overcollisioned token phrases have been found.",
    "parameters": [
      {
        "id": "0",
        "type": "number",
        "value": 300
      },
      {
        "id": "1",
        "type": "number",
        "value": 20
      },
      {
        "id": "2",
        "type": "number",
        "value": 6
      }
    ]
  },
  "recommendation": {
    "text": "Review the match group {0} associated with these token phrases, and try creating more variation in the match group and the associated tokenization scheme.",
    "parameters": [
      {
        "id": "0",
        "type": "matchGroupUri",
        "value": "configuration/entityTypes/Type1/matchGroups/MG1"
      }
    ]
  }
}
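
The `text` values in the output are templates whose `{n}` placeholders correspond to the entries of the accompanying `parameters` array. The following Python sketch shows one way a client might render them. The helper name `render_message` is hypothetical, and matching placeholders to parameters by `id` is an assumption based on the example above rather than a documented contract.

```python
import json
import re


def render_message(field: dict) -> str:
    """Replace {n} placeholders in field["text"] with the parameter whose id is n."""
    values = {str(p["id"]): str(p["value"]) for p in field.get("parameters", [])}
    # Leave a placeholder untouched if no parameter with that id is present.
    return re.sub(
        r"\{(\d+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        field["text"],
    )


# The "summary" object from the output example.
summary = json.loads("""
{
  "severity": "ERROR",
  "text": "{0} token phrases contain too many entities.",
  "parameters": [{"id": "0", "type": "number", "value": "20"}]
}
""")
print(render_message(summary))  # 20 token phrases contain too many entities.
```

Converting both the `id` and the `value` to strings lets the helper handle the mix of quoted and unquoted values that appears in the example.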