Inspections for Overcollisioned Tokens
Information about the inspection that identifies overcollisioned tokens.
Overview
If the matchToken and matchTokenIntersection analysis shows that the same tokens are common across several entities in a data set, those tokens are overcollisioned. To find out whether a token is overcollisioned, check the data storage for tokens marked as overcollisioned.
Inspection for Overcollisioned Tokens
The inspection can be applied to a single match group, to match groups of a specific type, or to all match groups. It is based on the results of the overcollisionedMatchToken calculator.
Name | Description |
---|---|
examples | An array of structures describing the token phrases and the corresponding entity identifiers. 10 examples are provided, in descending order of the number of entities. |
total | The total number of token phrases that have more entities than the threshold. |
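The two results above can be illustrated with a minimal sketch. It is not the calculator's actual implementation: the function name `find_overcollisioned`, the input shape (entity identifier mapped to its token phrases), the sample data, and the threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


def find_overcollisioned(entity_tokens, threshold, max_examples=10):
    """Hypothetical helper: entity_tokens maps entity id -> token phrases."""
    # Invert the mapping: token phrase -> set of entities that generate it.
    entities_by_token = defaultdict(set)
    for entity_id, tokens in entity_tokens.items():
        for token in tokens:
            entities_by_token[token].add(entity_id)

    # Keep only token phrases shared by more entities than the threshold.
    over = [
        {"token": token, "count": len(ids), "entities": sorted(ids)}
        for token, ids in entities_by_token.items()
        if len(ids) > threshold
    ]
    # Descending by entity count; ties are broken by token phrase.
    over.sort(key=lambda item: (-item["count"], item["token"]))
    return {"total": len(over), "examples": over[:max_examples]}


# Made-up sample data, not taken from a real tenant.
sample = {
    "entity1": ["smith|ny", "john|smith"],
    "entity2": ["smith|ny", "jane|smith"],
    "entity3": ["smith|ny", "john|smith"],
    "entity4": ["doe|ca"],
}
result = find_overcollisioned(sample, threshold=1)
print(result["total"])
print([example["token"] for example in result["examples"]])
```

With a threshold of 1, only the two phrases shared by more than one entity are reported. Unlike this sketch, the real output may list only a subset of the entity identifiers for each example.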
Parameter | Description |
---|---|
Text | {0} overcollisioned token phrases found. The performance of your tenant will likely be impacted. |
Severity | ERROR (if more than 1% of the entities generate overcollisioned tokens) or WARNING (if less than 1% of the entities generate overcollisioned tokens). |
Parameters | Id: 0, Type: Number, Value: 20 (not more than the number of processed entities of the specific entity type) |
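The severity rule in the table can be sketched as a small function. This is an illustration only: the name `classify_severity` and its arguments are hypothetical, and because the rule does not say what happens at exactly 1%, the sketch assumes WARNING in that case. The INFO branch follows the information output reported when nothing is found.

```python
def classify_severity(affected_entities, processed_entities):
    """Hypothetical helper: map the share of affected entities to a severity."""
    if processed_entities <= 0:
        raise ValueError("processed_entities must be positive")
    if affected_entities == 0:
        # No entity generates an overcollisioned token.
        return "INFO"
    # Integer comparison avoids floating-point rounding at the 1% boundary.
    # Assumption: exactly 1% is treated as WARNING.
    if affected_entities * 100 > processed_entities:
        return "ERROR"
    return "WARNING"


print(classify_severity(5, 1000))
print(classify_severity(11, 1000))
```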
Parameter | Description |
---|---|
Text | The same token phrase(s) must not be generated by more than {0} of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, {1} overcollisioned token phrases have been found. |
Parameters | |
Parameter | Description |
---|---|
Text | Review the match group ''{0}'' associated with these token phrases, and try creating more variation in the match group and the associated tokenization design. |
Parameters | Id: 0, Type: matchRuleURI, Value: configuration/entityTypes/Type1/matchGroups/MG1 |
Parameter | Description |
---|---|
Text | Review the match groups associated with these token phrases, and try creating more variation in your match groups and the associated tokenization designs. |
Information Output
If no errors or warnings are reported, the inspection produces an information output.
Parameter | Description |
---|---|
Text | The distribution of token phrases across the entities in the data set is reasonable. No overcollisioned token phrases found. |
Severity | INFO |
Parameter | Description |
---|---|
Text | Great job! The repetition of identical token phrases from this data set does not exceed {0}, thus there are no token phrases that were generated by more than the max recommended number of entities. |
Parameter | Description |
---|---|
Text | Keep on going! It appears you're getting a good distribution of token phrases across your data set. |
Output Example
    {
      "uid": "1-987-asdf-vcxb-sadf",
      "inspectionId": "overcollisionedTokens",
      "parameters": [
        {
          "name": "thresholdWarning",
          "value": 3
        },
        {
          "name": "thresholdError",
          "value": "6"
        }
      ],
      "computationResults": [
        {
          "name": "total",
          "value": 20
        },
        {
          "name": "examples",
          "values": [
            {
              "token": "tokenPhrase1",
              "count": 8,
              "entities": [
                "entity1",
                "entity2"
              ]
            },
            {
              "token": "tokenPhrase2",
              "count": 4,
              "entities": [
                "entity3",
                "entity4"
              ]
            }
          ]
        }
      ],
      "summary": {
        "severity": "ERROR",
        "text": "{0} token phrases contain too many entities.",
        "parameters": [
          {
            "id": "0",
            "type": "number",
            "value": "20"
          }
        ]
      },
      "explanation": {
        "text": "The same token phrase(s) must not be generated by more than {0} of the entities in your tenant. If this is happening, it suggests that your match groups are repetitively reliant on the same set of attributes within your entity type, and your tokenization design is too redundant across your match groups. In this data set, {1} overcollisioned token phrases have been found.",
        "parameters": [
          {
            "id": "0",
            "type": "number",
            "value": 300
          },
          {
            "id": "1",
            "type": "number",
            "value": 20
          },
          {
            "id": "2",
            "type": "number",
            "value": 6
          }
        ]
      },
      "recommendation": {
        "text": "Review the match group {0} associated with these token phrases, and try creating more variation in the match group and the associated tokenization scheme.",
        "parameters": [
          {
            "id": "0",
            "type": "matchGroupUri",
            "value": "configuration/entityTypes/Type1/matchGroups/MG1"
          }
        ]
      }
    }
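The summary, explanation, and recommendation sections each pair a text template with a list of parameters. As a rough sketch of how a client might turn them into readable messages, the following fills each {n} placeholder with the parameter whose id matches. The `render` function is a hypothetical helper, not part of any product API, and the embedded report is a trimmed copy of the example above with a shortened recommendation text.

```python
import json
import re


def render(section):
    """Fill {n} placeholders in section["text"] from section["parameters"]."""
    values = {
        str(param["id"]): str(param["value"])
        for param in section.get("parameters", [])
    }
    # A placeholder with no matching parameter id is left untouched.
    return re.sub(
        r"\{(\d+)\}",
        lambda match: values.get(match.group(1), match.group(0)),
        section["text"],
    )


# Trimmed, illustrative copy of the output example.
report = json.loads("""
{
  "summary": {
    "severity": "ERROR",
    "text": "{0} token phrases contain too many entities.",
    "parameters": [{"id": "0", "type": "number", "value": "20"}]
  },
  "recommendation": {
    "text": "Review the match group {0} associated with these token phrases.",
    "parameters": [
      {
        "id": "0",
        "type": "matchGroupUri",
        "value": "configuration/entityTypes/Type1/matchGroups/MG1"
      }
    ]
  }
}
""")

print(report["summary"]["severity"] + ": " + render(report["summary"]))
print(render(report["recommendation"]))
```

Parameter ids and values are converted to strings before substitution because the example mixes numeric and string values.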