Tokenization Using Query Filter

Change in the behavior of the tokenization process

If you make any of the following changes, you must retokenize your entire data set:
  • Change your match rule configuration
  • Add a string to your string replacement list
  • Add new words to your noise word list

For a large data set, retokenization can take two to three days. The process consumes a large share of the system resources and constrains other activities in the tenant. To reduce the retokenization time, the RebuildMatchTableTask accepts parameters that filter records and exclude data that doesn't need to be retokenized. Because only a subset of the data is selected, retokenization takes significantly less time.

Note: The old format without the request body or with the list of entities is still supported.

The body of the RebuildMatchTableTask has a new format that contains the objects to be included or excluded and a query section that filters data based on the match rules. The following is a sample of the body of the task:

{
  "objects": {
    "includeList": [
      "entities/ID1",
      "entities/ID2",
      "entities/ID3"
    ],
    "excludeList": [
      "entities/ID4",
      "entities/ID5"
    ]
  },
  "query": [
    {
      "filter": "equals(attributes.Address.City, 'Las Vegas') and equals(type, 'configuration/entityTypes/HCP')",
      "activeness": "active",
      "options": "searchByOv"
    },
    {
      "filter": "equals(attributes.Address.State, 'Ohio') and equals(type, 'configuration/entityTypes/HCO')",
      "activeness": "all"
    }
  ]
}
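
The sample body above could be assembled programmatically. The following is a minimal Python sketch, not part of the product: the QuerySpec class and the build_rebuild_body function are hypothetical helper names, and the choice to leave out empty sections is an assumption based on every parameter being optional.

```python
import json
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QuerySpec:
    """One entry of the "query" array (hypothetical helper type)."""
    filter: str
    activeness: Optional[str] = None
    options: Optional[str] = None

    def to_dict(self) -> dict:
        # Optional keys are omitted when unset, as in the second sample query.
        spec = {"filter": self.filter}
        if self.activeness is not None:
            spec["activeness"] = self.activeness
        if self.options is not None:
            spec["options"] = self.options
        return spec


def build_rebuild_body(
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    queries: Sequence[QuerySpec] = (),
) -> dict:
    """Assemble a request body; empty sections are left out (assumption)."""
    body: dict = {}
    objects: dict = {}
    if include:
        objects["includeList"] = list(include)
    if exclude:
        objects["excludeList"] = list(exclude)
    if objects:
        body["objects"] = objects
    if queries:
        body["query"] = [query.to_dict() for query in queries]
    return body


body = build_rebuild_body(
    include=["entities/ID1", "entities/ID2", "entities/ID3"],
    exclude=["entities/ID4", "entities/ID5"],
    queries=[
        QuerySpec(
            "equals(attributes.Address.City, 'Las Vegas') "
            "and equals(type, 'configuration/entityTypes/HCP')",
            activeness="active",
            options="searchByOv",
        ),
        QuerySpec(
            "equals(attributes.Address.State, 'Ohio') "
            "and equals(type, 'configuration/entityTypes/HCO')",
            activeness="all",
        ),
    ],
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the sample. How the body is submitted to the task endpoint is not covered here.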

| Parameter | Type | Possible Value | Required | Description |
| --- | --- | --- | --- | --- |
| objects.includeList | Array | entity URIs list | No | Explicit list of objects to iterate. If no list is specified, then all objects found by query.filter are tokenized. |
| objects.excludeList | Array | entity URIs list | No | Explicit list of objects to exclude from the tokenization. If an object is found by the search request and has a URI from the objects.excludeList, it is skipped by the task. |
| query | Array | list of query definitions | No | This section represents different search queries. The results of the search queries are aggregated and passed to rebuild. In other words, the search queries are combined with the ‘OR’ logical statement. |
| query.filter | String | Search filter as described in Filtering Entities | No | The search filter to find entities having the required characteristics. If the ‘objects.includeList’ section is not defined, the search request is passed to the search engine. If the objects.includeList section exists, the listed entities are filtered by the specified search request. |
| query.activeness | String | ACTIVE, NOT_ACTIVE, ALL | No | The query searches for objects that are active. This parameter is ignored if there is no ‘query.filter’ section. |
| query.options | String | searchByOv | No | If the searchByOv option is specified, then the search considers only Operational Values. |
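
The way these parameters combine can be sketched in Python. This is an illustrative model of the descriptions above, not the task's actual implementation. The select_uris function is a hypothetical name, and run_query stands in for the search engine. Using the includeList alone when no query is given is an assumption drawn from the objects.includeList description.

```python
from typing import Callable, Iterable, Optional, Set


def select_uris(
    run_query: Callable[[dict], Set[str]],
    queries: Iterable[dict],
    include_list: Optional[Iterable[str]] = None,
    exclude_list: Iterable[str] = (),
) -> Set[str]:
    """Model which entity URIs end up being retokenized."""
    queries = list(queries)
    if queries:
        # Results of the individual queries are combined with OR (set union).
        selected: Set[str] = set()
        for query in queries:
            selected |= run_query(query)
        # With an includeList, only listed entities that satisfy a query remain.
        if include_list is not None:
            selected &= set(include_list)
    else:
        # No query section: the explicit list alone defines the objects.
        selected = set(include_list or ())
    # URIs on the excludeList are skipped.
    return selected - set(exclude_list)


# Stand-in search engine: maps a filter string to the URIs it would return.
fake_index = {
    "city": {"entities/ID1", "entities/ID2", "entities/ID4"},
    "state": {"entities/ID2", "entities/ID6"},
}


def fake_search(query: dict) -> Set[str]:
    return fake_index[query["filter"]]


both = [{"filter": "city"}, {"filter": "state"}]
by_query = select_uris(fake_search, both, exclude_list=["entities/ID4"])
by_query_and_list = select_uris(
    fake_search, both, include_list=["entities/ID1", "entities/ID6", "entities/ID9"]
)
print(sorted(by_query))
print(sorted(by_query_and_list))
```

In this model, entities/ID9 is dropped because no query returns it, and entities/ID4 is dropped because it is on the excludeList.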

Searching Entities

If you specify the query filter but do not specify an explicit list of entities to include or exclude, an additional task is created to search for entities by the filter. This task, named QueryObjectsTask, queries the object URIs by using the filter criteria. The entity URIs are written to a line-break-delimited file, with one entity URI per line, as follows:

entities/id1
entities/id2
…
entities/idN

When multiple queries are specified, the same entity URI can be returned by more than one query, so you might encounter duplicate entity URIs. If the number of entity URIs exceeds one million, then duplicate entity URIs are also included in the file.
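
A consumer of such a file may want to tolerate duplicates. The following Python sketch reads line-break-delimited URIs and keeps the first occurrence of each. The read_uri_lines function is a hypothetical name, and the sample content is made up for illustration.

```python
from typing import Iterable, List


def read_uri_lines(lines: Iterable[str]) -> List[str]:
    """Return entity URIs in file order, dropping blank lines and duplicates."""
    seen = set()
    uris: List[str] = []
    for raw in lines:
        uri = raw.strip()
        if not uri or uri in seen:
            continue
        seen.add(uri)
        uris.append(uri)
    return uris


# Made-up file content with one duplicate and one blank line.
sample = "entities/id1\nentities/id2\nentities/id1\n\nentities/id3\n"
unique = read_uri_lines(sample.splitlines())
print(unique)
```

The function accepts any iterable of lines, so an open file object can be passed instead of the in-memory sample. The set of seen URIs grows with the file, which is worth keeping in mind for files with more than a million entries.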

The filename is unique and autogenerated. It is stored in the uriList field of both the QueryObjectsTask and the RebuildMatchTableTask.

The file is placed in the external storage under the following path structure:

configuredBucket/<tenantId>/<filename>
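
As an illustration, the path can be derived from a task's parameters section. In the example responses in this topic, uriList is a JSON array serialized as a string, so the sketch decodes it first. The uri_list_paths function is a hypothetical name, and the bucket value is a placeholder for whatever bucket is configured.

```python
import json
from typing import List


def uri_list_paths(parameters: dict, bucket: str) -> List[str]:
    """Build <bucket>/<tenantId>/<filename> for each file named in uriList."""
    # uriList holds a JSON array encoded as a string, for example "[\"name\"]".
    filenames = json.loads(parameters["uriList"])
    tenant_id = parameters["tenantId"]
    return [f"{bucket}/{tenant_id}/{name}" for name in filenames]


# Values copied from the example response; the bucket name is a placeholder.
params = {
    "tenantId": "Merill",
    "uriList": "[\"bc88e2dc-ade0-43df-861b-890cd7bb61c5\"]",
}
paths = uri_list_paths(params, bucket="configuredBucket")
print(paths[0])
```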

Example Response of a RebuildMatchTableTask

[
  {
    "id": "aab5b838-cead-45db-8615-b97469b2e90f",
    "groupId": "b55bec18-bf60-41ec-aa94-ca0ca65af80a",
    "createdTime": 1566901266053,
    "createdBy": "admin",
    "updatedTime": 1566901266053,
    "updatedBy": "admin",
    "type": "com.reltio.businesslogic.tasks.common.QueryObjectsTask",
    "status": "SCHEDULED",
    "name": "Query Objects",
    "createdOnHost": "RLT-L-IM01",
    "parallelExecution": false,
    "nodesGroup": "test",
    "parameters": {
      "tenantId": "Merill",
      "uriList": "[\"bc88e2dc-ade0-43df-861b-890cd7bb61c5\"]",
      "excludeUriList": null,
      "queryObjectsSpecifications": "[{\"filter\":\"equals(attributes.Name,'Name')\",\"activeness\":\"ACTIVE\"}]",
      "s3Bucket": null,
      "s3Region": null
    },
    "currentState": {},
    "duration": "0s"
  },
  {
    "id": "f1d19fe6-33ad-47e7-aa30-87cb07b1dd0e",
    "createdTime": 1566901266053,
    "createdBy": "admin",
    "updatedTime": 1566901266053,
    "updatedBy": "admin",
    "type": "com.reltio.businesslogic.tasks.matching.RebuildMatchTableTask",
    "status": "SCHEDULED",
    "name": "Rebuild match table process for all types in tenant Merill",
    "createdOnHost": "RLT-L-IM01",
    "parallelExecution": false,
    "nodesGroup": "test",
    "parameters": {
      "tenantId": "Merill",
      "entityType": "all types",
      "skipEntitiesCount": "0",
      "entitiesLimit": "-1",
      "uriList": "[\"bc88e2dc-ade0-43df-861b-890cd7bb61c5\"]",
      "rebuildMatchTableTaskVersion": "v1",
      "excludeUriList": null,
      "queryObjectsSpecifications": "[{\"filter\":\"equals(attributes.Name,'Name')\",\"activeness\":\"ACTIVE\"}]",
      "fileFormat": "line_break",
      "deleteOldCF": "false",
      "s3Bucket": null,
      "s3Region": null,
      "uriListType": "SINGLE_S3",
      "dependedTasksIds": "aab5b838-cead-45db-8615-b97469b2e90f"
    },
    "currentState": {},
    "duration": "0s",
    "dependsOnTasksIds": "aab5b838-cead-45db-8615-b97469b2e90f"
  }
]
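
In the response above, the RebuildMatchTableTask refers to the QueryObjectsTask through dependsOnTasksIds. The following Python sketch orders tasks so that each one follows the task it depends on. The execution_order function is a hypothetical name, and treating dependsOnTasksIds as a single task id is an assumption based on this sample only.

```python
from typing import Dict, List


def execution_order(tasks: List[dict]) -> List[str]:
    """Order task ids so that each task comes after the task it depends on."""
    by_id: Dict[str, dict] = {task["id"]: task for task in tasks}
    ordered: List[str] = []
    visited = set()

    def visit(task_id: str) -> None:
        if task_id in visited or task_id not in by_id:
            return
        visited.add(task_id)
        # Assumption: dependsOnTasksIds holds a single task id, as in the sample.
        dependency = by_id[task_id].get("dependsOnTasksIds")
        if dependency:
            visit(dependency)
        ordered.append(task_id)

    for task in tasks:
        visit(task["id"])
    return ordered


# Trimmed copies of the two tasks from the response, listed in reverse order.
tasks = [
    {
        "id": "f1d19fe6-33ad-47e7-aa30-87cb07b1dd0e",
        "type": "com.reltio.businesslogic.tasks.matching.RebuildMatchTableTask",
        "dependsOnTasksIds": "aab5b838-cead-45db-8615-b97469b2e90f",
    },
    {
        "id": "aab5b838-cead-45db-8615-b97469b2e90f",
        "type": "com.reltio.businesslogic.tasks.common.QueryObjectsTask",
    },
]
order = execution_order(tasks)
print(order)
```

The QueryObjectsTask id comes first in the result, which reflects that the rebuild waits for the URI file to be produced.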
      

Example of the Completed Periodic Tasks

[
  {
    "id": "aab5b838-cead-45db-8615-b97469b2e90f",
    "groupId": "b55bec18-bf60-41ec-aa94-ca0ca65af80a",
    "createdTime": 1566901266053,
    "createdBy": "admin",
    "updatedTime": 1566901266053,
    "updatedBy": "admin",
    "type": "com.reltio.businesslogic.tasks.common.QueryObjectsTask",
    "status": "COMPLETED",
    "name": "Query Objects",
    "createdOnHost": "RLT-L-IM01",
    "executedOnHost": "RLT-L-IM01",
    "parallelExecution": false,
    "nodesGroup": "test",
    "startTime": 1566901270780,
    "endTime": 1566901271896,
    "parameters": {
      "tenantId": "Merill",
      "uriList": "[\"bc88e2dc-ade0-43df-861b-890cd7bb61c5\"]",
      "excludeUriList": "",
      "queryObjectsSpecifications": "[{\"filter\":\"equals(attributes.Name,'Name')\",\"activeness\":\"ACTIVE\"}]",
      "s3Bucket": "",
      "s3Region": ""
    },
    "currentState": {
      "lastHourThroughput": 0,
      "numberOfProcessedObjects": 75,
      "numberOfFailedToPublishEvents": 0,
      "statistics": {
        "queries": [
          {
            "query": {
              "filter": "equals(attributes.Name,'Name')",
              "options": "",
              "activeness": "ACTIVE"
            },
            "objectsCount": 75,
            "fetchTime": 245
          }
        ],
        "totalUrisFound": 75
      },
      "status": "Completed"
    },
    "throughput": 75,
    "duration": "1s"
  },
  {
    "id": "f1d19fe6-33ad-47e7-aa30-87cb07b1dd0e",
    "createdTime": 1566901266053,
    "createdBy": "admin",
    "updatedTime": 1566901266053,
    "updatedBy": "admin",
    "type": "com.reltio.businesslogic.tasks.matching.RebuildMatchTableTask",
    "status": "COMPLETED",
    "name": "Rebuild match table process for all types in tenant Merill",
    "createdOnHost": "RLT-L-IM01",
    "executedOnHost": "RLT-L-IM01",
    "parallelExecution": false,
    "nodesGroup": "test",
    "startTime": 1566901271917,
    "endTime": 1566901272970,
    "parameters": {
      "tenantId": "Merill",
      "entityType": "all types",
      "skipEntitiesCount": "0",
      "entitiesLimit": "-1",
      "uriList": "[\"bc88e2dc-ade0-43df-861b-890cd7bb61c5\"]",
      "rebuildMatchTableTaskVersion": "v1",
      "excludeUriList": "",
      "queryObjectsSpecifications": "[{\"filter\":\"equals(attributes.Name,'Name')\",\"activeness\":\"ACTIVE\"}]",
      "fileFormat": "line_break",
      "deleteOldCF": "false",
      "s3Bucket": "",
      "s3Region": "",
      "uriListType": "SINGLE_S3",
      "dependedTasksIds": "aab5b838-cead-45db-8615-b97469b2e90f"
    },
    "currentState": {
      "lastHourThroughput": 0,
      "numberOfProcessedObjects": 75,
      "entityType": "all types",
      "numberOfFailedToPublishEvents": 0,
      "status": "Completed"
    },
    "throughput": 75,
    "duration": "1s",
    "dependsOnTasksIds": "aab5b838-cead-45db-8615-b97469b2e90f"
  }
]

  • For the QueryObjectsTask, numberOfProcessedObjects is set to the number of objects found.
  • In the currentState section of the QueryObjectsTask, the statistics section reports the number of objects found and the execution time for each query, as well as the total count of objects.
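
These fields can be read out of a completed task as follows. This is a minimal Python sketch: the summarize function is a hypothetical name, and it assumes that startTime and endTime are epoch timestamps in milliseconds, which is consistent with the "1s" duration in the example but not stated explicitly.

```python
def summarize(task: dict) -> dict:
    """Collect the counters of a completed task into a flat summary."""
    state = task.get("currentState", {})
    stats = state.get("statistics", {})
    return {
        # Short class name, for example "QueryObjectsTask".
        "type": task["type"].rsplit(".", 1)[-1],
        # Assumption: both timestamps are epoch milliseconds.
        "elapsedMs": task["endTime"] - task["startTime"],
        "processed": state.get("numberOfProcessedObjects"),
        "perQuery": [
            (entry["query"]["filter"], entry["objectsCount"], entry["fetchTime"])
            for entry in stats.get("queries", [])
        ],
        "totalUrisFound": stats.get("totalUrisFound"),
    }


# Trimmed copy of the completed QueryObjectsTask from the example.
query_task = {
    "type": "com.reltio.businesslogic.tasks.common.QueryObjectsTask",
    "startTime": 1566901270780,
    "endTime": 1566901271896,
    "currentState": {
        "numberOfProcessedObjects": 75,
        "statistics": {
            "queries": [
                {
                    "query": {"filter": "equals(attributes.Name,'Name')"},
                    "objectsCount": 75,
                    "fetchTime": 245,
                }
            ],
            "totalUrisFound": 75,
        },
    },
}
summary = summarize(query_task)
print(summary)
```

For a RebuildMatchTableTask, which has no statistics section in the example, perQuery is empty and totalUrisFound is None.
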
Note: The rebuild process triggers the match service event for entities that meet the search filter criteria. Because these filtered entities may match other entities, the match service event is also triggered for the entities that match them.