Reindex Data Task
Learn about reindexing entity data in a tenant
You must run this task if any data synchronization issues are detected. This job does not rebuild match tables. It is recommended that you also run the rebuild match tables job after reindexing data in your tenant.
This task refreshes the index that supports Elasticsearch searches of a tenant's Entity data. To speed up the search of a large database, the database management system builds one or more database indexes. Like a book index, a database index is a list of paired items: one item is a significant term from the database, and the other is the address of that term in the data storage (like the page number in a large book), where the program can go to find that term. When you type in a term to search for, it is much faster for the program to scan the index for the term and go directly to the indexed address than it is to search through the entire database. Database indexes keep large databases tuned up, so that searches can be performed quickly.
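The book-index analogy can be illustrated with a toy example. The sketch below is only a conceptual illustration and not how Elasticsearch or Reltio implement indexing; the function names and sample records are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each term to the sorted positions of the records that contain it."""
    index = defaultdict(set)
    for position, text in enumerate(records):
        for term in text.lower().split():
            index[term].add(position)
    return {term: sorted(positions) for term, positions in index.items()}


def search(index, term):
    """Look the term up in the index instead of scanning every record."""
    return index.get(term.lower(), [])


# Hypothetical sample data: each string stands in for one stored record.
records = ["Alice Smith Boston", "Bob Jones Denver", "Alice Jones Austin"]
index = build_index(records)

print(search(index, "alice"))   # [0, 2]
print(search(index, "denver"))  # [1]
```

If the records change and the index is not rebuilt, lookups return stale positions, which is the kind of drift a reindex corrects.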
Reindexing should not be a routine, scheduled, or periodic task. You must reindex data after loading it, if data synchronization issues are observed, or after making specific changes.
enableSeparateIndexing=true: When the Reindex task is executed, the values are stored in Elasticsearch based on the current survivorship strategies. However, if there are any changes to the survivorship strategies, these changes will not be reflected in the values until the Reindex task is executed again with the enableSeparateIndexing parameter set to true.
- Metadata Configuration/L3 changes such as:
  - Existing attribute changes (type, label)
  - Existing attribute is removed
  - Existing source is removed
  - Survivorship rules are modified
  - Cleanse configuration is modified
  - Lookups (RDM) mappings are modified
  - MatchFieldUris are modified
  - Surrogate crosswalks settings are modified
  - Reference attribute settings are modified
  - Sub-nested attributes are added or removed
- Tenant's physical configuration changes in:
  - survivorshipAdvancedBehavior
  - indexOvStrategy
Reltio's enterprise-scale database system includes a primary Entity database which is searched by ElasticSearch. In addition to the ElasticSearch index, the system is supported by a sub-database that captures the history of all database transactions, an index of all matching records from various sources, and tables that store the output of Reltio's analytics layer (RI). Each of these supporting data structures needs to be updated when the primary database changes. The Reindex Data task updates the ElasticSearch index and updates the supporting structures as well.
When you run the Reindex Data task, you can:
- Instruct the task to reindex the ElasticSearch index but leave the match tables, history, and analytics data alone.
- Restrict the reindexing process to a subset of the Entity database.
- Instruct the process to skip some number of Entities.
- Cap the number of Entities to be reindexed.
- Distribute the reindexing task among a number of processing nodes.
- Run the reindexing job against a limited list of Entities (by sending a list of entity URIs in the body of the Reindex Data request).
The table below provides instructions on how to specify one choice or any combination of choices when you run the Reindex Data task.
The Reindex Data task starts the PotentialMatchesReindexTask (see Potential Matches Reindex Task) to reindex potential matches information for the entire tenant. However, when you reindex specific entities, the PotentialMatchesReindexTask is not started.
Stop and Pause are supported.
Requests
Tenant admin role is required:
POST {ApplicationURL}/reindex?tenantId={tenantId}
Tenant admin role is required:
POST {ApplicationURL}/api/{tenantId}/reindex
POST {ApplicationURL}/api/{tenantId}/reindex?bucket={bucket}
Headers:
Authorization: Bearer {accessToken}
awsAccessKey: {awsAccessKey}
awsSecretKey: {awsSecretKey}
Body: ["path/file01.json", "path/file02.json", "path/file03.json"]
Parameter | Required | Description |
---|---|---|
tenantId | Yes | ID of the tenant whose entities are reindexed. |
updateEntities | No | If set to true (default), in addition to refreshing the ElasticSearch index, the task also updates history, match tables, and the analytics layer (RI). This ensures that all indexes and supporting structures are as up-to-date as possible. As explained above, however, triggering all these activities may decrease the overall performance level of the database system for business work, and overwhelm the event streaming channels. If set to false, the task updates ElasticSearch data only. It does not perform rematching, or update history or analytics. These other activities can be performed at different times to spread out the performance impact. Note: See the Reindexing and Streaming section for details. |
entityType | No | If provided, the task restricts the reindexing scope to Entities of the specified type. |
skipEntitiesCount | No | If provided, sets the number of Entities that are skipped during reindexing. Default value: 0. |
entitiesLimit | No | If provided, sets the maximum number of Entities that are reindexed. Default value: infinity. |
updatedSince | No | Timestamp in Unix format. If provided, only entities with a timestamp greater than or equal to this value are reindexed. This is a good way to limit the reindexing to newer records. |
checkCrosswalksConsistency | No | Specify true to reindex each Entity, whether it has changed or not. This operation ensures that each Entity in the database is processed. Reltio does not recommend this option: it decreases the performance of the reindex task dramatically and may overload the server, which interferes with all database operations. Default value: false. |
distributed | No | If set to true, the task runs in distributed mode, which is a good way to take advantage of a networked or clustered computing environment to spread the performance demands of reindexing over several nodes. Default value: false. For details, refer to Distributed Mode. |
taskPartsCount | No | The number of tasks that are created for distributed reindexing. Each task reindexes its own subset of Entities. Each task may be executed on a different API node, so that all tasks can run in parallel. Recommended value: the number of API nodes that can execute the tasks. Default value: 2. Note: This parameter is used only in distributed mode (distributed=true); otherwise, it is ignored. |
forceIgnoreInStreaming | No | If set to true, only events produced by the task are ignored in streaming. Default value: false. Note: When you set this parameter to true, the events are generated, but not streamed to external queues. The generated events are used by the internal queue to rebuild the index. This will not affect performance in any way, since the current index remains available throughout the process. For more information about internal and external queues, see the topic Queues at a glance. |
enableSeparateIndexing | No | If set to |
bucket | No | Bucket name |
s3Region | No | AWS S3 region |
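The parameters in the table can be combined in a single query string. The following is a minimal sketch of one way to assemble such a URL, assuming the `{ApplicationURL}/api/{tenantId}/reindex` form shown under Requests. The helper name `build_reindex_url`, the host `example.invalid`, and the tenant ID `myTenant` are hypothetical. The helper only builds a string and sends nothing.

```python
from urllib.parse import urlencode


def build_reindex_url(application_url, tenant_id, **options):
    """Build a reindex URL; keyword options become query parameters."""
    if "taskPartsCount" in options and not options.get("distributed"):
        # Per the table, taskPartsCount is ignored unless distributed=true,
        # so this sketch drops it rather than sending a no-op parameter.
        options = {k: v for k, v in options.items() if k != "taskPartsCount"}
    # Render Python booleans as lowercase "true"/"false".
    query = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in options.items()
    }
    url = f"{application_url}/api/{tenant_id}/reindex"
    return f"{url}?{urlencode(query)}" if query else url


# Search index only, capped at 1000 Entities (hypothetical host and tenant).
print(build_reindex_url("https://example.invalid", "myTenant",
                        updateEntities=False, entitiesLimit=1000))

# Distributed mode with four task parts.
print(build_reindex_url("https://example.invalid", "myTenant",
                        distributed=True, taskPartsCount=4))
```

Dropping `taskPartsCount` when `distributed` is not set is a choice made for this sketch. Raising an error instead would surface the mistake to the caller earlier.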
Header Key Name | Required | Value Description |
---|---|---|
awsAccessKey | No | S3 access key for the Amazon bucket. |
awsSecretKey | No | S3 secret key for the Amazon bucket. |
googleCredentials | No | Google credentials |
Body (optional)
A JSON array of entity URIs to reindex. If you send awsAccessKey and awsSecretKey, or googleCredentials, send a JSON array of the names of files that contain entity URIs instead.
["entities/aaaaaaa", "entities/bbbbbbb", "entities/ccccccc"]
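As a rough sketch of how the headers and body described above fit together, the helper below creates a POST request object without sending it. The function name `build_reindex_request`, the URL, the token, and the key values are all hypothetical. The `Content-Type: application/json` header is an assumption that the source does not state.

```python
import json
from urllib.request import Request


def build_reindex_request(url, access_token, items, aws_keys=None):
    """Create (but do not send) a POST request whose body is a JSON array.

    items is a list of entity URIs or, when aws_keys is given, a list of
    file names that contain entity URIs.
    """
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",  # assumption, not from the source
    }
    if aws_keys is not None:
        access_key, secret_key = aws_keys
        headers["awsAccessKey"] = access_key
        headers["awsSecretKey"] = secret_key
    body = json.dumps(list(items)).encode("utf-8")
    # urllib stores header names with normalized capitalization; HTTP header
    # names are case-insensitive, so this does not change their meaning.
    return Request(url, data=body, headers=headers, method="POST")


request = build_reindex_request(
    "https://example.invalid/api/myTenant/reindex",  # hypothetical URL
    "token123",                                       # hypothetical token
    ["entities/aaaaaaa", "entities/bbbbbbb"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To send the request, pass the object to `urllib.request.urlopen`; the sketch stops short of that so it can be inspected safely.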
Reindexing and Streaming
The ReindexData task can do the following:
- Rebuild the index for the tenant's Entities (to do this, it creates an ENTITY_CHANGED event for each one)
- Update the history table
- Update the match tables to reflect each Entity change/creation/deletion caused by reindexing
- Repopulate the analytics layer (RI)
Each of these activities can generate a large volume of events, which may overwhelm the tenant's event queues.
There are two ways to prevent this problem:
- Run reindex with updateEntities set to false, as in the following example:
  POST /reltio/{tenantId}/reindex?updateEntities=false
  When run with this option, reindex only reindexes the ElasticSearch index; the History, Match, and Analytics processors are not triggered.
- Run reindex with forceIgnoreInStreaming set to true, as in the following example:
  POST /reltio/{tenantId}/reindex?forceIgnoreInStreaming=true
  This option forces reindex to ignore streaming when running a reindex job.
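The two options can be captured in a small helper that returns the matching request path. This is a sketch only: the function name and the strategy labels `search-only` and `ignore-streaming` are hypothetical, and the path prefix simply mirrors the `/reltio/{tenantId}/reindex` form used in the examples above.

```python
from urllib.parse import urlencode


def streaming_safe_reindex_path(tenant_id, strategy):
    """Return the reindex path for one of the two options described above."""
    options = {
        # Refresh the search index only; skip history, match, and analytics.
        "search-only": {"updateEntities": "false"},
        # Generate events but keep them out of streaming.
        "ignore-streaming": {"forceIgnoreInStreaming": "true"},
    }
    if strategy not in options:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return f"/reltio/{tenant_id}/reindex?{urlencode(options[strategy])}"


print(streaming_safe_reindex_path("myTenant", "search-only"))
print(streaming_safe_reindex_path("myTenant", "ignore-streaming"))
```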