Reindex Data Task

Reindexes entity data in a tenant.

Run this task after a data load or when data synchronization issues are detected. This job does not rebuild match tables, so it is recommended that you also run the rebuild match tables jobs after reindexing data in your tenant.

This task refreshes the index that supports ElasticSearch queries against a tenant's Entity data. To speed up searches of a large database, the database management system builds one or more database indexes. Like a book index, a database index is a list of paired items: one item is a significant term from the database, and the other is the address of that term in the data storage (like the page number in a large book), where the program can go to find that term. When you type in a search term, it is much faster for the program to scan the index for the term and go directly to the indexed address than it is to search through the entire database. Indexes keep searches over large databases fast.
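The book-index analogy above can be sketched in a few lines of Python. This is a toy illustration only, not how ElasticSearch or Reltio stores data: the record ids and text are made up, and the index is a plain in-memory dictionary. It also shows why an index has to be rebuilt when the underlying data changes without the index being updated.

```python
from collections import defaultdict

# Toy records standing in for entity data; ids and text are hypothetical.
records = {
    "entities/001": "john smith new york",
    "entities/002": "jane doe boston",
    "entities/003": "john doe chicago",
}

def build_index(docs):
    """Map each term to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in docs.items():
        for term in text.split():
            index[term].add(record_id)
    return index

def search_by_scan(docs, term):
    """Full scan: inspect every record on every search."""
    return {rid for rid, text in docs.items() if term in text.split()}

def search_by_index(index, term):
    """Index lookup: go straight to the ids stored for the term."""
    return set(index.get(term, set()))

index = build_index(records)
print(sorted(search_by_index(index, "john")))  # ['entities/001', 'entities/003']

# The data changes but the index is not updated: lookups are now stale.
records["entities/004"] = "john roe denver"
stale = search_by_index(index, "john")

# Rebuilding the index ("reindexing") brings lookups back in line with the data.
index = build_index(records)
fresh = search_by_index(index, "john")
```

After the rebuild, `fresh` contains `entities/004` while `stale` does not, which is the kind of synchronization gap that reindexing closes.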

Reindex the data only when there is a specific reason to rebuild the index. Reindexing should not be a regular, scheduled, or periodic task. Reindex the data in the following cases:

  • Metadata Configuration/L3 changes such as:

    • Existing attribute changes (type, label)

    • Existing attribute is removed

    • Existing source is removed

    • OV rules are modified

    • Cleanse configuration is modified

    • Lookups (RDM) mappings are modified

    • MatchFieldUris are modified

    • Surrogate crosswalks settings are modified

    • Reference attribute settings are modified

    • Sub-nested attributes are added or removed

  • Tenant's physical configuration changes in:

    • SurvivorshipAdvancedBehavior

    • indexOvStrategy

Reltio's enterprise-scale database system includes a primary Entity database which is searched by a utility called ElasticSearch. In addition to the ElasticSearch index, the system is supported by a sub-database that captures the history of all database transactions, an index of all matching records from various sources, and tables that store the output of Reltio's analytics layer (RI). Each of these supporting data structures needs to be updated when the primary database changes. The Reindex Data task updates the ElasticSearch index and updates the supporting structures as well.

Important: This work can put a significant strain on the system's resources. In addition, each change to the database and any supporting structure generates a database event which is streamed to recording servers. This sudden surge of new events can overwhelm these streaming channels. To avoid overwhelming your databases and reporting channels when reindexing, you can tweak the Reindex Data task in the following ways:
  • Instruct the task to reindex the ElasticSearch index but leave the match tables, history, and analytics data alone.
  • Restrict the reindexing process to a subset of the Entity database.
  • Instruct the process to skip some number of Entities.
  • Cap the number of Entities to be reindexed.
  • Distribute the reindexing task among a number of processing nodes.
  • Run the reindexing job against a limited list of Entities (by appending a list of URLs in the body of the Reindex Data request).
Note: It is not recommended to update the business configuration of the tenant while the reindex data task is running.
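As one illustration of combining the skip and cap options, the sketch below splits a reindex into consecutive windows using the skipEntitiesCount and entitiesLimit parameters. It only builds URLs and sends nothing. The base URL, tenant name, and entity type are hypothetical, the total entity count must be known in advance, and the approach assumes that the order in which entities are processed is stable between runs, which this page does not state.

```python
from urllib.parse import urlencode

def reindex_windows(total_entities, window_size):
    """Yield (skipEntitiesCount, entitiesLimit) pairs that cover total_entities."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for skip in range(0, total_entities, window_size):
        yield skip, min(window_size, total_entities - skip)

def window_urls(base_url, total_entities, window_size, entity_type=None):
    """Build one reindex URL per window; entity_type optionally narrows the scope."""
    urls = []
    for skip, limit in reindex_windows(total_entities, window_size):
        params = {"skipEntitiesCount": skip, "entitiesLimit": limit}
        if entity_type is not None:
            params["entityType"] = entity_type
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls

# Hypothetical base URL; substitute {ApplicationURL}/api/{tenantId}/reindex.
for url in window_urls("https://example.invalid/api/myTenant/reindex", 25, 10):
    print(url)
```

Running the windows one after another, rather than all at once, is one way to spread the load over time.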

The table below provides instructions on how to specify one choice or any combination of choices when you run the Reindex Data task.

Note: This task also starts PotentialMatchesReindexTask (see Potential Matches Reindex Task) to reindex potential matches information for the entire tenant. However, when you reindex specific entities, the PotentialMatchesReindexTask is not started.

Stop and Pause are supported.

Requests

Admin role is required:

POST {ApplicationURL}/reindex?tenantId={tenantId}

Tenant admin role is required:

POST {ApplicationURL}/api/{tenantId}/reindex
File mode example:
POST {ApplicationURL}/api/{tenantId}/reindex?bucket={bucket}

Headers:
Authorization: Bearer {accessToken} 
awsAccessKey: {awsAccessKey} 
awsSecretKey: {awsSecretKey}

Body: ["path/file01.json", "path/file02.json", "path/file03.json"]
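The file mode request above could be assembled in Python roughly as follows. This is a minimal sketch using only the standard library: the function name and all argument values are placeholders, the Content-Type header is an assumption that this page does not mention, and the request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_file_mode_request(application_url, tenant_id, bucket,
                            access_token, aws_access_key, aws_secret_key,
                            file_names):
    """Build (but do not send) a file mode Reindex Data request."""
    url = f"{application_url}/api/{tenant_id}/reindex?{urlencode({'bucket': bucket})}"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "awsAccessKey": aws_access_key,
        "awsSecretKey": aws_secret_key,
        # Assumption: the body is sent as JSON; this header is not in the source.
        "Content-Type": "application/json",
    }
    body = json.dumps(file_names).encode("utf-8")
    # Note: urllib stores header names with normalized casing; HTTP header
    # names are case-insensitive, so this does not change their meaning.
    return Request(url, data=body, headers=headers, method="POST")

request = build_file_mode_request(
    "https://example.invalid", "myTenant", "my-bucket",
    "ACCESS_TOKEN", "AWS_ACCESS_KEY", "AWS_SECRET_KEY",
    ["path/file01.json", "path/file02.json", "path/file03.json"],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```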
Table 1. Parameters

| Parameter | Required | Description |
|---|---|---|
| tenantId | Yes | ID of the tenant whose entities are reindexed. |
| updateEntities | No | If set to true (default), in addition to refreshing the ElasticSearch index, the task also updates history, match tables, and the analytics layer (RI). This ensures that all indexes and supporting structures are as up-to-date as possible. As explained above, however, triggering all these activities may decrease the overall performance of the database system for business work and overwhelm the event streaming channels. If set to false, the task updates ElasticSearch data only. It does not perform rematching or update history or analytics. These other activities can be performed at different times to spread out the performance impact. Note: See the Reindexing and Streaming section for details. |
| entityType | No | If provided, restricts the reindexing scope to Entities of the specified type. |
| skipEntitiesCount | No | If provided, sets the number of Entities that are skipped during reindexing. Default value: 0. |
| entitiesLimit | No | If provided, sets the maximum number of Entities that are reindexed. Default value: infinity. |
| updatedSince | No | Timestamp in Unix format. If provided, only entities with a timestamp greater than or equal to this value are reindexed. This is a good way to limit the reindexing to newer records. |
| checkCrosswalksConsistency | No | Specify true to reindex each Entity, whether it has changed or not. This operation ensures that each Entity in the database is processed. Reltio does not recommend this option because it decreases the performance of the reindex task dramatically and may overload the server, which interferes with all database operations. Default value: false. |
| distributed | No | If set to true, the task runs in distributed mode, which is a good way to take advantage of a networked or clustered computing environment to spread the performance demands of reindexing over several nodes. Default value: false. For details, refer to Distributed Mode. |
| taskPartsCount | No | The number of tasks that are created for distributed reindexing. Each task reindexes its own subset of Entities. Each task may be executed on a different API node, so that all tasks can run in parallel. Recommended value: the number of API nodes that can execute the tasks. Default value: 2. Note: This parameter is used only in distributed mode (distributed=true); otherwise, it is ignored. |
| forceIgnoreInStreaming | No | If set to true, only events produced by the task are ignored in streaming. Default value: false. |
| enableSeparateIndexing | No | If set to true, the Reindex Data task builds a new index. When the task is completed, the task replaces the old index with the new one. Run the Reindex Data task with this parameter to continue your activities in Reltio Cloud without waiting for the task to finish. Default value: false. |
| bucket | No | Bucket name. |
| s3Region | No | AWS S3 region. |

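To show how the parameters in Table 1 combine, here is a small sketch that assembles a tenant-level reindex URL. The helper name, the host, the tenant name, and the entity type are hypothetical. Rejecting taskPartsCount when distributed is not true is a choice made for this sketch; per the table, the task itself simply ignores the parameter in that case.

```python
from urllib.parse import urlencode

# Query parameters listed in Table 1 (tenantId is part of the path here).
KNOWN_PARAMS = {
    "updateEntities", "entityType", "skipEntitiesCount", "entitiesLimit",
    "updatedSince", "checkCrosswalksConsistency", "distributed",
    "taskPartsCount", "forceIgnoreInStreaming", "enableSeparateIndexing",
    "bucket", "s3Region",
}

def build_reindex_url(application_url, tenant_id, **params):
    """Return {ApplicationURL}/api/{tenantId}/reindex with the given parameters."""
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    if "taskPartsCount" in params and params.get("distributed") is not True:
        # Sketch-level guard: flag a setting that would otherwise be ignored.
        raise ValueError("taskPartsCount is only used when distributed=True")
    # Render Python booleans as lowercase true/false, as written in this page.
    encoded = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in params.items()
    }
    url = f"{application_url}/api/{tenant_id}/reindex"
    return f"{url}?{urlencode(encoded)}" if encoded else url

print(build_reindex_url(
    "https://example.invalid", "myTenant",
    updateEntities=False, entityType="Individual",
))
print(build_reindex_url(
    "https://example.invalid", "myTenant",
    distributed=True, taskPartsCount=4,
))
```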
Table 2. Header Parameters

| Header Key Name | Required | Value Description |
|---|---|---|
| awsAccessKey | No | S3 access key for the Amazon bucket. |
| awsSecretKey | No | S3 secret key for the Amazon bucket. |
| googleCredentials | No | Google credentials. |

Body (optional)

A JSON array of entity URIs to reindex. If you send awsAccessKey and awsSecretKey, or googleCredentials, the body is instead a JSON array of the names of files that contain entity URIs.

Each file contains a JSON array of entity URIs:
["entities/aaaaaaa", "entities/bbbbbbb", "entities/ccccccc"]
Note: If a JSON array of entity URIs is specified, then only the specified entities are reindexed. If the header parameters are specified, then only file URIs are accepted in the request body.
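The sketch below shows one way to prepare content for file mode: it splits a list of entity URIs into several files, each holding a JSON array in the format shown above, and produces the matching request body of file names. Everything stays in memory. Uploading the files to the bucket is not shown, the helper name and the `path/file` prefix are illustrative, and whether splitting into several files is needed for a given tenant is an assumption.

```python
import json

def chunk_entity_uris(entity_uris, chunk_size, prefix="path/file"):
    """Return {file name: JSON text} with at most chunk_size URIs per file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    files = {}
    starts = range(0, len(entity_uris), chunk_size)
    for number, start in enumerate(starts, start=1):
        name = f"{prefix}{number:02d}.json"
        files[name] = json.dumps(entity_uris[start:start + chunk_size])
    return files

uris = [
    "entities/aaaaaaa", "entities/bbbbbbb", "entities/ccccccc",
    "entities/ddddddd", "entities/eeeeeee",
]
files = chunk_entity_uris(uris, chunk_size=2)

# Request body for file mode: a JSON array of the file names.
request_body = json.dumps(list(files))
print(request_body)
```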

Reindexing and Streaming

The Reindex Data task can do the following:

  • Rebuild the index for the tenant's Entities (to do this, it creates an ENTITY_CHANGED event for each one)
  • Update the history table
  • Update the match tables to reflect each Entity change/creation/deletion caused by reindexing
  • Repopulate the analytics layer (RI)

Each of these activities can generate a large volume of events, which may overwhelm the tenant's event queues.

There are two ways to prevent this problem:

  • Run reindex with updateEntities set to false, as in the following example:

    POST /reltio/{tenantId}/reindex?updateEntities=false

    When run with this option, the task refreshes only the ElasticSearch index; the History, Match, and Analytics processors are not triggered.

  • Run reindex with forceIgnoreInStreaming set to true, as in the following example:

    POST /reltio/{tenantId}/reindex?forceIgnoreInStreaming=true

    With this option, the events produced by the reindex job are ignored in streaming.
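The choice between the two options can be written as a small helper. This is a sketch only: the function names are hypothetical, the path is copied from the examples above, and the reading that forceIgnoreInStreaming leaves updateEntities at its default of true follows from the parameter table rather than from an explicit statement.

```python
from urllib.parse import urlencode

def streaming_safe_query(keep_supporting_updates):
    """Pick one of the two options described above.

    keep_supporting_updates=True:  leave updateEntities at its default and
        set forceIgnoreInStreaming=true, so the task's events are ignored
        in streaming.
    keep_supporting_updates=False: set updateEntities=false, so only the
        ElasticSearch index is refreshed.
    """
    if keep_supporting_updates:
        params = {"forceIgnoreInStreaming": "true"}
    else:
        params = {"updateEntities": "false"}
    return urlencode(params)

def reindex_path(tenant_id, keep_supporting_updates):
    """Build the request path in the form used by the examples above."""
    return f"/reltio/{tenant_id}/reindex?{streaming_safe_query(keep_supporting_updates)}"

print(reindex_path("myTenant", keep_supporting_updates=False))
print(reindex_path("myTenant", keep_supporting_updates=True))
```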

Note: For reindexing of analytics attributes, run the Reindex Analytics Attributes Task.