Reindex Data Task

Learn about reindexing entity data in a tenant.

Note: This job does not rebuild match tables. It is recommended that you also run the rebuild match table jobs after reindexing data in your tenant.

This task refreshes the index that supports Elasticsearch searches of a tenant's Entity data. To speed up searches of a large database, the database management system builds one or more database indexes. Like a book index, a database index is a list of paired items: one item is a significant term from the database, and the other is the address of that term in data storage (like a page number in a large book), where the program can go to find that term. When you search for a term, it is much faster for the program to scan the index and go directly to the indexed address than to search through the entire database. Indexes keep searches on large databases fast.
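The book-index analogy above can be illustrated with a toy example. This is only a sketch of the general idea of a term-to-address index; the records, function names, and data layout are invented for illustration and do not describe how Elasticsearch or the Reltio platform stores data internally.

```python
from collections import defaultdict

# Toy records standing in for entity data; the values are invented for illustration.
records = [
    "Alice Smith Boston",
    "Bob Jones Chicago",
    "Alice Brown Denver",
]

def build_index(rows):
    """Map each lowercase term to the positions of the rows that contain it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        for term in set(row.lower().split()):
            index[term].append(position)
    return index

def search_with_index(index, term):
    """Look the term up directly instead of scanning every row."""
    return index.get(term.lower(), [])

def search_by_scan(rows, term):
    """Baseline: inspect every row, as a search without an index would."""
    wanted = term.lower()
    return [i for i, row in enumerate(rows) if wanted in row.lower().split()]

index = build_index(records)
print(search_with_index(index, "alice"))  # [0, 2]
```

Both search functions return the same positions, but the indexed lookup does not touch rows that do not contain the term. The index must be rebuilt when the rows change, which is the role reindexing plays.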

Reindexing should not be a routine, scheduled, or periodic task.

Reindex your data with enableSeparateIndexing=true when any of the changes listed below occur. When you run the Reindex task, Elasticsearch stores values based on the current survivorship strategies. If you update survivorship strategies after reindexing, those changes do not take effect until you run the Reindex task again with enableSeparateIndexing=true.

Note: Do not set enableSeparateIndexing to true for a subset of your data. Use it only when reindexing all of your data.

Metadata Configuration/L3 changes such as:
- Existing attribute changes (type, label)
- Existing attribute is removed
- Existing source is removed
- Survivorship rules are modified
- Cleanse configuration is modified
- Lookups (RDM) mappings are modified
- MatchFieldUris are modified
- Surrogate crosswalks settings are modified
- Reference attribute settings are modified
- Sub-nested attributes are added or removed
Tenant's physical configuration changes in:
- survivorshipAdvancedBehavior
- indexOvStrategy
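The guidance on enableSeparateIndexing can be turned into a small pre-flight check before a request is sent. The sketch below is hypothetical: the function name and the choice of which parameters count as "narrowing the run to a subset" (entityType, skipEntitiesCount, entitiesLimit, updatedSince, or a body of entity URIs) are assumptions made for illustration, not behavior confirmed by the platform. The second rule reflects the statement in Table 1 that taskPartsCount is used only in distributed mode.

```python
# Assumption: these are the query parameters that narrow a run to a subset of the tenant.
SCOPE_LIMITING_PARAMS = ("entityType", "skipEntitiesCount", "entitiesLimit", "updatedSince")

def check_reindex_params(params, entity_uris=None):
    """Return warnings for parameter combinations the documentation advises against.

    params      -- dict of query parameters for the Reindex Data request
    entity_uris -- optional list of entity URIs that would be sent in the body
    """
    warnings = []

    separate = params.get("enableSeparateIndexing") is True
    limiting = [name for name in SCOPE_LIMITING_PARAMS if params.get(name) is not None]
    if entity_uris:
        limiting.append("request body of entity URIs")
    if separate and limiting:
        warnings.append(
            "enableSeparateIndexing=true should cover all data, but the run is "
            "narrowed by: " + ", ".join(limiting)
        )

    if "taskPartsCount" in params and params.get("distributed") is not True:
        warnings.append("taskPartsCount is ignored unless distributed=true")

    return warnings

# Example: a full-tenant run raises no warnings; a narrowed one does.
print(check_reindex_params({"enableSeparateIndexing": True}))
print(check_reindex_params({"enableSeparateIndexing": True, "entityType": "Person"}))
```

The check only reports warnings and never blocks a request, since the final decision belongs to the tenant administrator.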

Reltio's enterprise-scale database system includes a primary Entity database, which is searched by Elasticsearch. In addition to the Elasticsearch index, the system is supported by a sub-database that captures the history of all database transactions, an index of all matching records from various sources, and tables that store the output of Reltio's analytics layer (RI). Each of these supporting data structures needs to be updated when the primary database changes. The Reindex Data task updates the Elasticsearch index and the supporting structures as well.

Important: This work can put a significant strain on the system's resources. In addition, each change to the database and any supporting structure generates a database event which is streamed to recording servers. This sudden surge of new events can overwhelm these streaming channels. To avoid overwhelming your databases and reporting channels when reindexing, you can tweak the Reindex Data task in the following ways:

- Instruct the task to reindex the Elasticsearch index but leave the match tables, history, and analytics data alone.
- Restrict the reindexing process to a subset of the Entity database.
- Instruct the process to skip some number of Entities.
- Cap the number of Entities to be reindexed.
- Distribute the reindexing task among a number of processing nodes.
- Run the reindexing job against a limited list of Entities (by appending a list of URIs in the body of the Reindex Data request).

Note: Avoid updating the business configuration of the tenant while the Reindex Data task is running.

The table below provides instructions on how to specify one choice or any combination of choices when you run the Reindex Data task.

Note: This task also starts PotentialMatchesReindexTask (see Potential Matches Reindex Task) to reindex potential matches information for the entire tenant. However, when you reindex specific entities, the PotentialMatchesReindexTask is not started.

Important: The number of processed objects reported in the Reindex Data task may vary. This variation occurs because the Reindex Data task triggers the PotentialMatchesReindexTask, which reindexes data from match tables and updates it in Elasticsearch. During this process, entities with potential matches are reprocessed.

The PotentialMatchesReindexTask generates events for changes in potential matches. Since it can run in distributed mode and execute in multiple parts, the processed object count may differ depending on the dataset range handled during each execution.

Stop and Pause are supported.

Requests

The Tenant admin role is required for both request forms:

POST {ApplicationURL}/reindex?tenantId={tenantId}

POST {ApplicationURL}/api/{tenantId}/reindex

File mode example:

POST {ApplicationURL}/api/{tenantId}/reindex?bucket={bucket}
Headers:
  Authorization: Bearer {accessToken}
  awsAccessKey: {awsAccessKey}
  awsSecretKey: {awsSecretKey}
Body:
  ["path/file01.json", "path/file02.json", "path/file03.json"]

Table 1. Parameters
Parameter	Required	Description
`tenantId`	Yes	ID of the tenant whose entities are reindexed.
`updateEntities`	No	If set to `true` (default), in addition to refreshing the Elasticsearch index, the task also updates history, match tables, and the analytics layer (RI). This ensures that all indexes and supporting structures are as up to date as possible. As explained above, however, triggering all these activities may reduce the performance of the database system for business work and overwhelm the event streaming channels. If set to `false`, the task updates Elasticsearch data only. It does not perform rematching or update history or analytics. These other activities can be performed at different times to spread out the performance impact. Note: See the Reindexing and Streaming section for details.
`entityType`	No	If provided, restricts the reindexing scope to the specified entity type(s). Supports multiple entity types as comma-separated values (e.g., entityType=Person,Organization).
`skipEntitiesCount`	No	If provided, sets the number of Entities which are skipped during reindexing. Default value: `0`.
`entitiesLimit`	No	If provided, sets the maximum number of Entities that are reindexed. Default value: `infinity`.
`updatedSince`	No	Timestamp in Unix format. If this parameter is provided, then only entities with a greater or equal timestamp are reindexed. This is a good way to limit the reindexing to newer records. Note: The task must still scan the entire database even when this parameter limits the number of indexed entities. As a result, the task execution time will generally remain the same as running it without this parameter.
`checkCrosswalksConsistency`	No	Specify `true` to reindex each Entity, whether it has changed or not. This operation ensures that each Entity in the database is processed. Reltio does not recommend this option – it decreases the performance of the reindex task dramatically, and may overload the server, which will interfere with all database operations. Default value: `false`
`distributed`	No	If set to `true`, the task runs in distributed mode, which is a good way to take advantage of a networked or clustered computing environment to spread the performance demands of reindexing over several nodes. Default value is `false`. For details, refer to Distributed Mode.
`taskPartsCount`	No	The number of tasks that are created for distributed reindexing. Each task reindexes its own subset of Entities. Each task may be executed on a different API node, so that all tasks can run in parallel. Recommended value: the number of API nodes that can execute the tasks. Default value: `2`. Note: This parameter is used only in distributed mode (`distributed=true`); otherwise, it is ignored.
`forceIgnoreInStreaming`	No	If set to `true`, the events produced by this task, and only those events, are ignored in streaming. Default value is `false`. Note: When you set this parameter to true, the events are generated, but not streamed to external queues. The generated events are used by the internal queue to rebuild the index. This will not affect performance in any way, since the current index remains available throughout the process. For more information about internal and external queues, see the topic Queues at a glance.
`enableSeparateIndexing`	No	If set to `true`, the Reindex Data task builds a new index. When the task is completed, the task replaces the old index with the new one. Run the Reindex Data task with this parameter to continue your activities in Reltio Cloud without waiting for the task to finish. Default value: `false`.
`bucket`	No	Name of the bucket that holds the files listed in the request body (file mode).
`s3Region`	No	AWS S3 region
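The parameters in Table 1 are passed as query parameters, so a request URL can be assembled programmatically. The following is a minimal sketch, assuming the `{ApplicationURL}/api/{tenantId}/reindex` form shown earlier. The function name `build_reindex_url`, the host `https://example.invalid`, and the tenant ID `myTenant` are placeholders invented for this example, and rendering booleans as lowercase `true`/`false` follows the examples on this page rather than a confirmed requirement.

```python
from urllib.parse import quote, urlencode

def build_reindex_url(application_url, tenant_id, **params):
    """Build a Reindex Data request URL from Table 1 parameters.

    Parameters set to None are omitted. Booleans are rendered as
    'true'/'false' and lists as comma-separated values, as in
    entityType=Person,Organization.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            query[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            query[name] = ",".join(str(item) for item in value)
        else:
            query[name] = str(value)

    base = f"{application_url.rstrip('/')}/api/{quote(tenant_id, safe='')}/reindex"
    # safe="," keeps the comma in multi-value parameters readable.
    return f"{base}?{urlencode(query, safe=',')}" if query else base

# Hypothetical host and tenant ID, used only to show the output shape.
print(build_reindex_url(
    "https://example.invalid",
    "myTenant",
    updateEntities=False,
    entityType=["Person", "Organization"],
))
```

The function only builds the URL. Sending the request, including the Authorization header, is left to whatever HTTP client is already in use.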

Table 2. Header Parameters
Header Key Name	Required	Value Description
`awsAccessKey`	No	Access key for the Amazon S3 bucket.
`awsSecretKey`	No	Secret key for the Amazon S3 bucket.
`googleCredentials`	No	Google credentials

Body (optional)

A JSON array of entity URIs to reindex or, if you send awsAccessKey and awsSecretKey or googleCredentials, a JSON array of names of files that contain entity URIs.

Each file must contain a JSON array of entity URIs: ["entities/aaaaaaa", "entities/bbbbbbb", "entities/ccccccc"]

Note: If a JSON array of entity URIs is specified, then only the specified entities are reindexed. If the header parameters are specified, then only file URIs are accepted in the request body.
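The two body modes described above can be sketched with Python's standard library. This example only constructs the request object and does not send it. The function name `build_reindex_request`, the host, the token, and the key values are placeholders, and the `Content-Type: application/json` header is an assumption that this page does not state.

```python
import json
import urllib.request

def build_reindex_request(url, access_token, body=None,
                          aws_access_key=None, aws_secret_key=None):
    """Prepare (but do not send) a POST request for the Reindex Data task.

    body is a list of entity URIs, or a list of file names when the
    AWS header parameters are supplied (file mode).
    """
    headers = {
        "Authorization": f"Bearer {access_token}",
        # Assumption: the body is sent as JSON.
        "Content-Type": "application/json",
    }
    if aws_access_key and aws_secret_key:
        headers["awsAccessKey"] = aws_access_key
        headers["awsSecretKey"] = aws_secret_key

    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers, method="POST")

# Entity mode: reindex only the listed entities (placeholder values).
entity_request = build_reindex_request(
    "https://example.invalid/api/myTenant/reindex",
    "placeholder-token",
    body=["entities/aaaaaaa", "entities/bbbbbbb"],
)

# File mode: the body lists files in the bucket named in the query string.
file_request = build_reindex_request(
    "https://example.invalid/api/myTenant/reindex?bucket=my-bucket",
    "placeholder-token",
    body=["path/file01.json", "path/file02.json"],
    aws_access_key="placeholder-access-key",
    aws_secret_key="placeholder-secret-key",
)
```

Passing either object to `urllib.request.urlopen` would send it. Note that `urllib.request` normalizes the capitalization of header names, which is harmless because HTTP header names are case-insensitive.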

Reindexing and Streaming

The Reindex Data task can do the following:

- Rebuild the index for the tenant's Entities (to do this, it creates an ENTITY_CHANGED event for each one)
- Update the history table
- Update the match tables to reflect each Entity change, creation, or deletion caused by reindexing
- Repopulate the analytics layer (RI)

Each of these activities can generate a large volume of events, which may overwhelm the tenant's event queues.

To optimize performance and reduce event processing load, consider the following options:

- Limit the scope of reindexing by specifying one or more entity types in the entityType parameter (e.g., entityType=Person,Organization).
- Run reindex with updateEntities set to false, as in the following example: POST /reltio/{tenantId}/reindex?updateEntities=false
  With this option, the task refreshes only the Elasticsearch index; the History, Match, and Analytics processors are not triggered.
- Run reindex with forceIgnoreInStreaming set to true, as in the following example: POST /reltio/{tenantId}/reindex?forceIgnoreInStreaming=true
  With this option, the events generated by the reindex job are not streamed to external queues.

Note: For reindexing of analytics attributes, run the Reindex Analytics Attributes Task.
