Reindex Data Task

Learn more about how to reindex entity data in a tenant and control the scope, execution mode, and follow-up processing for this background task.

The Reindex Data task refreshes the Elasticsearch index for a tenant's entity data. Use it when existing tenant data must be rebuilt into search structures after specific configuration changes or other maintenance scenarios.

Note: This task does not rebuild match tables. To ensure match integrity, run the rebuild match table task after reindexing your tenant data.

When to run this task

Reindexing should not be a scheduled or routine task. Reltio automatically maintains Elasticsearch indexes during normal operations, such as:

  • Loading or updating entities
  • Creating relationships or interactions

Reindex your data only when one of the following changes occurs, and set enableSeparateIndexing=true to ensure a complete index refresh.

Note: Set enableSeparateIndexing to true only when reindexing all your entity data. Do not use it for a subset of records.

Changes that call for a reindex:

  • Metadata Configuration/L3 changes such as:
    • Existing attribute changes (type, label)
    • Existing attribute is removed
    • Existing source is removed
    • Survivorship rules are modified
    • Cleanse configuration is modified
    • Lookups (RDM) mappings are modified
    • MatchFieldUris are modified
    • Surrogate crosswalk settings are modified
    • Reference attribute settings are modified
    • Sub-nested attributes are added or removed
  • Tenant physical configuration changes in:
    • survivorshipAdvancedBehavior
    • indexOvStrategy

HTTP method and endpoint

Use the following HTTP method and endpoint to start the task:

Administrator role is required:

POST {ApplicationURL}/reindex?tenantId={tenantId}

Tenant admin role is required:

POST {ApplicationURL}/api/{tenantId}/reindex
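
The two endpoint forms above can be assembled as in the following minimal sketch. The function name build_reindex_url, the host, and the tenant ID are hypothetical and used for illustration only; the two path shapes are the only parts taken from this page.

```python
from urllib.parse import urlencode


def build_reindex_url(application_url: str, tenant_id: str, tenant_admin: bool = True) -> str:
    """Return the Reindex Data endpoint URL for a tenant.

    tenant_admin=True  -> {ApplicationURL}/api/{tenantId}/reindex
    tenant_admin=False -> {ApplicationURL}/reindex?tenantId={tenantId}
    """
    base = application_url.rstrip("/")
    if tenant_admin:
        return f"{base}/api/{tenant_id}/reindex"
    return f"{base}/reindex?{urlencode({'tenantId': tenant_id})}"


# Hypothetical host and tenant ID, for illustration only.
print(build_reindex_url("https://example.invalid/reltio", "myTenant"))
print(build_reindex_url("https://example.invalid/reltio", "myTenant", tenant_admin=False))
```

Both URLs are used with the POST method. Which form applies depends on the role of the caller, as described above.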

Request headers

Include the following headers when you run the task in file mode.

| Name | Required | Description |
| --- | --- | --- |
| awsAccessKey | No | S3 access key for the Amazon bucket. |
| awsSecretKey | No | S3 secret key for the Amazon bucket. |
| googleCredentials | No | Google credentials used for file-based input. |

Query parameters

The following table describes the supported query parameters.

| Parameter | Required | Description |
| --- | --- | --- |
| tenantId | Yes | ID of the tenant whose entities are reindexed. |
| updateEntities | No | If set to true (default), the task refreshes the Elasticsearch index and also updates history, match tables, and the analytics layer (RI). If set to false, the task updates Elasticsearch data only. It does not perform rematching or update history or analytics. |
| entityType | No | Restricts the reindexing scope to the specified entity type or types. Separate multiple values with commas, for example entityType=Person,Organization. |
| skipEntitiesCount | No | Number of entities to skip during reindexing. Default value: 0. |
| entitiesLimit | No | Maximum number of entities to reindex. Default value: no limit. |
| updatedSince | No | Timestamp in Unix format. Only entities with a timestamp greater than or equal to this value are reindexed. Note: The task must still scan the entire database even when this parameter limits the number of indexed entities. As a result, the execution time generally remains similar to running the task without this parameter. |
| checkCrosswalksConsistency | No | Specify true to reindex each entity, whether it has changed or not. Reltio does not recommend this option because it can decrease performance dramatically and overload the server. Default value: false. |
| distributed | No | If set to true, the task runs in distributed mode. Default value: false. For more information, see Distributed mode. Note: For large tenants, set distributed=true and configure taskPartsCount to match the number of available API nodes. This enables parallel execution across nodes and can improve task performance. |
| taskPartsCount | No | Maximum number of sub-tasks for distributed execution. The platform determines the optimal number based on performance limits. Default value: 2. Note: This parameter applies only when distributed=true. Otherwise, it is ignored. |
| forceIgnoreInStreaming | No | If set to true, the events produced by this task, and only those events, are ignored in streaming. Default value: false. Note: When you set this parameter to true, events are generated but not streamed to external queues. The generated events are still used by the internal queue to rebuild the index. For more information about internal and external queues, see Queues at a glance. |
| enableSeparateIndexing | No | If set to true, the Reindex Data task builds a new Elasticsearch index for the tenant's entity data. When the task completes, it replaces the old index with the new one. Run the Reindex Data task with this parameter to continue your activities in Reltio Cloud without waiting for the task to finish. Note: When enableSeparateIndexing=true, PotentialMatchesReindexTask is automatically triggered upon completion of the entity reindex because the new index must contain potential matches data before the tenant switches to the new index. Default value: false. |
| bucket | No | Bucket name. |
| s3Region | No | AWS S3 region. |
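
As a rough illustration of how these parameters combine into a query string, the sketch below encodes booleans as true/false and joins multiple entity types with commas. The helper name reindex_query is hypothetical. Raising an error when taskPartsCount is passed without distributed=true is a client-side choice made here; according to the table, the service itself ignores the parameter in that case.

```python
from urllib.parse import urlencode


def reindex_query(**params) -> str:
    """Encode Reindex Data query parameters into a query string."""
    if "taskPartsCount" in params and not params.get("distributed"):
        # Client-side guard (an assumption of this sketch): the service would
        # ignore taskPartsCount unless distributed=true.
        raise ValueError("taskPartsCount only applies when distributed=True")
    encoded = {}
    for name, value in params.items():
        if isinstance(value, bool):
            encoded[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multiple values are comma-separated, e.g. entityType=Person,Organization
            encoded[name] = ",".join(value)
        else:
            encoded[name] = str(value)
    # Keep commas unescaped so the result matches the documented form.
    return urlencode(encoded, safe=",")


print(reindex_query(entityType=["Person", "Organization"], distributed=True, taskPartsCount=4))
```

The resulting string is appended to the endpoint URL after a question mark, or after an ampersand when the URL already carries tenantId.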

Request body

The request body is optional.

Use a JSON array of entity URIs to reindex only the specified entities. Alternatively, when you provide awsAccessKey and awsSecretKey or googleCredentials, use a JSON array of file names, where each file contains entity URIs.

The following example shows the file format for a JSON array of entity URIs:

["entities/aaaaaaa", "entities/bbbbbbb", "entities/ccccccc"]

Note: If a JSON array of entity URIs is specified, only the specified entities are reindexed. If the header parameters are specified, only file URIs are accepted in the request body.
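
A body in this format can be produced from a list of entity IDs, as in the minimal sketch below. The helper name entity_uri_body is hypothetical, and prefixing bare IDs with entities/ is a convenience assumed here, based only on the example above.

```python
import json


def entity_uri_body(entity_ids):
    """Serialize entity IDs as a JSON array of entity URIs."""
    uris = [
        eid if eid.startswith("entities/") else f"entities/{eid}"
        for eid in entity_ids
    ]
    return json.dumps(uris)


print(entity_uri_body(["aaaaaaa", "entities/bbbbbbb"]))
# prints ["entities/aaaaaaa", "entities/bbbbbbb"]
```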

Example request

The following example shows how to run the task in file mode:

POST {ApplicationURL}/api/{tenantId}/reindex?bucket={bucket}

Headers:
Authorization: Bearer {accessToken}
awsAccessKey: {awsAccessKey}
awsSecretKey: {awsSecretKey}

Body:
["path/file01.json", "path/file02.json", "path/file03.json"]
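
The same file-mode request could be prepared with the Python standard library, as sketched below. The function name file_mode_request and all argument values are hypothetical placeholders, and the Content-Type header is an assumption that does not appear in the example above. The sketch only builds the request object and does not send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def file_mode_request(application_url, tenant_id, token, bucket, files,
                      aws_access_key, aws_secret_key):
    """Build (but do not send) a file-mode Reindex Data request."""
    query = urlencode({"bucket": bucket})
    url = f"{application_url.rstrip('/')}/api/{tenant_id}/reindex?{query}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",  # assumption: body is sent as JSON
        "awsAccessKey": aws_access_key,
        "awsSecretKey": aws_secret_key,
    }
    # urllib normalizes the capitalization of header names;
    # HTTP header names are case-insensitive.
    body = json.dumps(files).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder values, for illustration only.
req = file_mode_request(
    "https://example.invalid/reltio", "myTenant", "ACCESS_TOKEN", "my-bucket",
    ["path/file01.json", "path/file02.json"], "AWS_KEY", "AWS_SECRET",
)
print(req.get_method(), req.full_url)
```

To actually start the task, the request object would be passed to urllib.request.urlopen with real credentials.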

Reindexing and streaming

The Reindex Data task rebuilds the Elasticsearch index for tenant entities and, depending on your configuration, updates history, match tables, and analytics data.

Because this task can process a large volume of data, system load and event volume can increase significantly. Plan the task to minimize disruption to search, matching, analytics processing, and streaming.

To control task scope and reduce processing overhead, you can:

  • Reindex only Elasticsearch data by setting updateEntities=false. In this mode, the task does not update history, match tables, or analytics data.
  • Restrict the task to specific entity types with entityType.
  • Skip a number of entities with skipEntitiesCount.
  • Limit the number of processed entities with entitiesLimit.
  • Run the task in distributed mode with distributed=true and taskPartsCount.
  • Run the task only for selected entities by passing entity URIs in the request body.
  • Prevent task-generated events from being sent to external queues by setting forceIgnoreInStreaming=true.
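
Several of these options can be combined. The sketch below generates one query string per batch window for an Elasticsearch-only reindex of a single entity type, using skipEntitiesCount and entitiesLimit. The helper name scoped_batches and the entity total are hypothetical. Splitting a run into windows also assumes that entity ordering is stable between runs, which this page does not state.

```python
from urllib.parse import urlencode


def scoped_batches(total_entities, batch_size, entity_type):
    """Yield one query string per (skipEntitiesCount, entitiesLimit) window."""
    for skip in range(0, total_entities, batch_size):
        yield urlencode({
            "updateEntities": "false",         # Elasticsearch data only
            "entityType": entity_type,         # restrict scope to one type
            "skipEntitiesCount": skip,
            "entitiesLimit": min(batch_size, total_entities - skip),
            "forceIgnoreInStreaming": "true",  # keep events off external queues
        })


for query in scoped_batches(2500, 1000, "Person"):
    print(query)
```

With 2,500 entities and a batch size of 1,000, this yields three windows: two of 1,000 entities and a final one of 500.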

PotentialMatchesReindexTask is triggered automatically only when the Reindex Data task creates a new index, such as when enableSeparateIndexing=true is specified. For all other reindex scenarios, run PotentialMatchesReindexTask separately if needed. See Potential Matches Reindex Task.

Do not update the tenant business configuration while the Reindex Data task is running.

The reported numberOfProcessedObjects can vary between task executions because entities may be added to or deleted from the tenant during processing. When the task runs in distributed mode, add the numberOfProcessedObjects values from all child tasks to calculate the total.
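
The distributed-mode total can be computed as in the following minimal sketch. The shape of the child task records, including the id field, is an assumption made for illustration; only the numberOfProcessedObjects field name comes from this page.

```python
def total_processed(child_tasks):
    """Sum numberOfProcessedObjects across the child tasks of a distributed run."""
    return sum(int(task.get("numberOfProcessedObjects", 0)) for task in child_tasks)


# Hypothetical child task records, for illustration only.
children = [
    {"id": "part-1", "numberOfProcessedObjects": 1200},
    {"id": "part-2", "numberOfProcessedObjects": 1315},
]
print(total_processed(children))  # prints 2515
```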

To reindex analytics attributes, run the Reindex Analytics Attributes Task.