Inspections for Overcollisioned Tokens

Information about the inspection that identifies overcollisioned tokens.

Overview

If the matchToken and matchTokenIntersection analysis shows that the same tokens are shared by more entities in a data set than the threshold allows, those tokens are overcollisioned. To find out whether a token is overcollisioned, check the data storage for tokens marked as overcollisioned.
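The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: the function name `find_overcollisioned`, the input shape (entity identifier mapped to a set of token phrases), and the sample token phrases are all hypothetical.

```python
from collections import defaultdict


def find_overcollisioned(entity_tokens, threshold):
    """Return {token_phrase: sorted entity ids} for every phrase that is
    generated by more than `threshold` entities.

    `entity_tokens` is a hypothetical input shape: entity id -> set of phrases.
    """
    # Invert the mapping: token phrase -> set of entities that generate it.
    entities_by_token = defaultdict(set)
    for entity_id, tokens in entity_tokens.items():
        for token in tokens:
            entities_by_token[token].add(entity_id)

    # Keep only the phrases shared by more entities than the threshold allows.
    return {
        token: sorted(ids)
        for token, ids in entities_by_token.items()
        if len(ids) > threshold
    }


# Made-up sample data; a tiny threshold keeps the example readable.
sample = {
    "entity1": {"smith:ny", "john:smith"},
    "entity2": {"smith:ny", "jane:smith"},
    "entity3": {"smith:ny"},
    "entity4": {"doe:ca"},
}
overcollisioned = find_overcollisioned(sample, threshold=2)
print(overcollisioned)
```

Only `smith:ny` is reported here, because it is the only phrase generated by more than two entities.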

Inspection for Overcollisioned Tokens

The inspection applies to a single match group, to match groups of a specific type, or to all match groups. It is based on the results of the overcollisionedMatchToken calculator.

Table 1. Inspection Result Details
Name	Description
`examples`	An array of structures that describe the token phrases and their corresponding entity identifiers. 10 examples are provided, in descending order of the number of entities.
`total`	The total number of token phrases whose number of entities exceeds the threshold.
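As a rough sketch of how these two fields relate, the following Python builds `total` and `examples` from a mapping of token phrases to entity identifiers. The function name `build_result_details` and the input shape are hypothetical. The `token`, `count`, and `entities` keys mirror the Output Example, and treating `count` as the number of entities is an assumption.

```python
def build_result_details(entities_by_token, threshold, max_examples=10):
    """Build the `total` and `examples` fields from token phrase -> entity ids.

    Hypothetical helper: the real calculator may differ in its details.
    """
    # Token phrases whose number of entities exceeds the threshold.
    over = {t: ids for t, ids in entities_by_token.items() if len(ids) > threshold}

    # Descending by number of entities; ties broken by phrase for stable output.
    ranked = sorted(over.items(), key=lambda item: (-len(item[1]), item[0]))

    examples = [
        # Assumption: `count` is the number of entities for the phrase.
        {"token": token, "count": len(ids), "entities": sorted(ids)}
        for token, ids in ranked[:max_examples]
    ]
    return {"total": len(over), "examples": examples}


# Made-up data with a small threshold for readability.
details = build_result_details(
    {"a": {"e1", "e2", "e3"}, "b": {"e1", "e2"}, "c": {"e1"}},
    threshold=1,
)
print(details["total"], [e["token"] for e in details["examples"]])
```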

Table 2. Inspection Summary
Parameter	Description
`Text`	`{0}` overcollisioned token phrases found. The performance of your tenant will likely be impacted.
`Severity`	`ERROR` (if more than 1% of the entities generate overcollisioned tokens) or `WARNING` (if less than 1% of the entities generate overcollisioned tokens).
`Parameters`	`Id: 0, Type: Number, Value: 20` (not more than the number of processed entities of the specific entity type)
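The severity rule above can be expressed as a small function. This is a sketch, not the product's logic: the function and argument names are hypothetical, and because the table does not say what happens at exactly 1%, the sketch assumes that case is a `WARNING`. The `INFO` branch reflects the case where nothing overcollisioned is found.

```python
def summary_severity(entities_with_overcollisioned_tokens, processed_entities):
    """Pick the summary severity from the share of affected entities."""
    if processed_entities <= 0:
        raise ValueError("processed_entities must be positive")
    if entities_with_overcollisioned_tokens == 0:
        return "INFO"
    # Integer comparison avoids floating-point rounding around the 1% boundary.
    more_than_one_percent = entities_with_overcollisioned_tokens * 100 > processed_entities
    # Assumption: exactly 1% is treated as WARNING (the table leaves it open).
    return "ERROR" if more_than_one_percent else "WARNING"


print(summary_severity(5, 1000))   # WARNING
print(summary_severity(11, 1000))  # ERROR
```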

Table 3. Inspection Explanation
Parameter	Description
`Text`	The same token phrase(s) must not be generated by more than `{0}` of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, `{1}` overcollisioned token phrases have been found.
`Parameters`	`Id: 0, Type: Number, Value: 300` (fixed API threshold for overcollisioned tokens) `Id: 1, Type: Number, Value: 20` (not more than the number of processed entities of the specific entity type) `Id: 2, Type: Number, Value: 8` (equal to threshold value)
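The `{0}`, `{1}` placeholders in a `Text` field are filled from the `Parameters` list by `Id`. A minimal Python sketch of that substitution follows. The function name `render_message` is hypothetical, and the sketch assumes the text contains no braces other than the numbered placeholders.

```python
def render_message(text, parameters):
    """Substitute {0}, {1}, ... in `text` with parameter values, matched by id."""
    values = {int(p["id"]): p["value"] for p in parameters}
    # Build a positional list so that {n} picks the parameter whose id is n.
    args = [values.get(i, "") for i in range(max(values, default=-1) + 1)]
    return text.format(*args)


# Parameter values taken from the Output Example in this document.
explanation_parameters = [
    {"id": "0", "type": "number", "value": 300},
    {"id": "1", "type": "number", "value": 20},
    {"id": "2", "type": "number", "value": 6},
]
message = render_message(
    "In this data set, {1} overcollisioned token phrases have been found.",
    explanation_parameters,
)
print(message)
```

Parameters that the text does not reference, such as the one with `Id: 2` here, are simply ignored.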

Table 4. Inspection Recommendation for a Single Match Group
Parameter	Description
`Text`	Review the match group ''`{0}`'' associated with these token phrases, and try creating more variation in the match group and the associated tokenization design.
`Parameters`	`Id: 0, Type: matchRuleURI, Value: configuration/entityTypes/Type1/matchGroups/MG1`

Table 5. Inspection Recommendation for Multiple Match Groups
Parameter	Description
`Text`	Review the match groups associated with these token phrases, and try creating more variation in your match groups and the associated tokenization designs.

Information Output

If no errors or warnings are reported, the inspection returns an informational output.

Table 6. Summary Field
Parameter	Description
`Text`	The distribution of token phrases across the entities in the data set is reasonable. No overcollisioned token phrases found.
`Severity`	`INFO`

Table 7. Explanation Field
Parameter	Description
`Text`	Great job! The repetition of identical token phrases from this data set does not exceed `{0}`, thus there are no token phrases that were generated by more than the max recommended number of entities.

Table 8. Recommendation Field
Parameter	Description
`Text`	Keep on going! It appears you're getting a good distribution of token phrases across your data set.

Output Example

{
  "uid": "1-987-asdf-vcxb-sadf",
  "inspectionId": "overcollisionedTokens",
  "parameters": [
    {
      "name": "thresholdWarning",
      "value": 3
    },
    {
      "name": "thresholdError",
      "value": "6"
    }
  ],
  "computationResults": [
    {
      "name": "total",
      "value": 20
    },
    {
      "name": "examples",
      "values": [
        {
          "token": "tokenPhrase1",
          "count": 8,
          "entities": [
            "entity1",
            "entity2"
          ]
        },
        {
          "token": "tokenPhrase2",
          "count": 4,
          "entities": [
            "entity3",
            "entity4"
          ]
        }
      ]
    }
  ],
  "summary": {
    "severity": "ERROR",
    "text": "{0} token phrases contain too many entities.",
    "parameters": [
      {
        "id": "0",
        "type": "number",
        "value": "20"
      }
    ]
  },
  "explanation": {
    "text": "The same token phrase(s) must not be generated by more than {0} of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, {1} overcollisioned token phrases have been found.",
    "parameters": [
      {
        "id": "0",
        "type": "number",
        "value": 300
      },
      {
        "id": "1",
        "type": "number",
        "value": 20
      },
      {
        "id": "2",
        "type": "number",
        "value": 6
      }
    ]
  },
  "recommendation": {
    "text": "Review the match group {0} associated with these token phrases, and try creating more variation in the match group and the associated tokenization scheme.",
    "parameters": [
      {
        "id": "0",
        "type": "matchGroupUri",
        "value": "configuration/entityTypes/Type1/matchGroups/MG1"
      }
    ]
  }
}
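A consumer of this output might read it as sketched below. The helper name `read_results` is hypothetical, and the embedded JSON is an abbreviated copy of the example above. Note that, in the example, `total` is stored under `value` while `examples` is stored under `values`, so the sketch handles both keys.

```python
import json

# Abbreviated copy of the output example above.
raw = """
{
  "inspectionId": "overcollisionedTokens",
  "computationResults": [
    {"name": "total", "value": 20},
    {"name": "examples", "values": [
      {"token": "tokenPhrase1", "count": 8, "entities": ["entity1", "entity2"]},
      {"token": "tokenPhrase2", "count": 4, "entities": ["entity3", "entity4"]}
    ]}
  ],
  "summary": {"severity": "ERROR"}
}
"""


def read_results(payload):
    """Flatten `computationResults` into a {name: value} dictionary."""
    results = {}
    for item in payload.get("computationResults", []):
        # Scalar results use "value"; list results use "values".
        key = "value" if "value" in item else "values"
        results[item["name"]] = item.get(key)
    return results


payload = json.loads(raw)
results = read_results(payload)
severity = payload["summary"]["severity"]
worst = max(results["examples"], key=lambda example: example["count"])
print(severity, results["total"], worst["token"])
```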
