Run Reltio Embedded Entity Resolution in Databricks
Learn how to run Reltio Embedded Entity Resolution in Databricks by opening the solution notebook, attaching the configured cluster, executing the notebook cells in sequence, and sharing telemetry data with Reltio.
After you complete the step Install the required library on the cluster, run the Entity Resolution pipeline notebook from the cloned repository in your Databricks workspace to process your data and generate Entity Resolution output.
After the first successful run, you must share telemetry data with Reltio for operational monitoring and support.
Prerequisites
Run the pipeline notebook
- In Databricks, go to the cloned solution repository in your workspace (created in the step Clone the repository into your Databricks workspace), navigate to the src/tasks/home_page path, and open the notebook.
- Locate the compute selector on the notebook toolbar. It appears on the right side of the toolbar, between Run all and Schedule. Select the cluster that you configured earlier in the step Install the required library on the cluster.
- Follow the instructions in the notebook and run each cell in sequence. Provide the required inputs in each cell as prompted.
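If you prefer to automate the steps above rather than run the notebook interactively, you can submit it as a one-time job with the Databricks SDK for Python. This is a hedged sketch, not part of the official solution: the repository root and cluster ID are placeholders, the notebook folder name is taken from the path shown above, and notebooks that prompt for inputs via widgets may still be easier to run manually.

```python
# Hedged sketch: submit the pipeline notebook as a one-time job run.
# Replace repo_root and cluster_id with your own values.

def er_notebook_path(repo_root: str) -> str:
    """Build the workspace path to the Entity Resolution notebook.

    The src/tasks/home_page folder name comes from the documented path;
    adjust it if your cloned repository differs.
    """
    return f"{repo_root.rstrip('/')}/src/tasks/home_page"

def run_er_notebook(repo_root: str, cluster_id: str) -> None:
    # Imported lazily so the path helper works without databricks-sdk installed.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()  # reads credentials from env vars or ~/.databrickscfg
    w.jobs.submit(
        run_name="reltio-er-pipeline",
        tasks=[
            jobs.SubmitTask(
                task_key="run_er",
                existing_cluster_id=cluster_id,  # the cluster configured earlier
                notebook_task=jobs.NotebookTask(
                    notebook_path=er_notebook_path(repo_root)
                ),
            )
        ],
    ).result()  # blocks until the run finishes

print(er_notebook_path("/Repos/me/reltio-er/"))
```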
Mandatory: Share operational telemetry with Reltio
After a successful run, you must share operational telemetry with Reltio for monitoring and support. The pipeline creates the telemetry table automatically during the first run.
The telemetry data does not include personally identifiable information (PII) or your source data. The telemetry table contains only operational execution metrics, such as:
- Session ID
- Workspace ID
- Record counts
- Execution duration
- Job status
The telemetry table is created at reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry.
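A row in the telemetry table can be pictured as a record like the following. This is only an illustrative sketch based on the metric list above; the field names are hypothetical, and the actual schema of reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry may differ.

```python
# Illustrative sketch of an operational telemetry record.
# Field names are hypothetical, not the table's actual schema.
from dataclasses import dataclass

@dataclass
class PipelineTelemetry:
    session_id: str          # unique ID of the pipeline run
    workspace_id: str        # Databricks workspace the run executed in
    record_count: int        # number of records processed
    duration_seconds: float  # wall-clock execution time
    job_status: str          # e.g. "SUCCEEDED" or "FAILED"

row = PipelineTelemetry(
    session_id="3f9c0c1e-0000-0000-0000-000000000000",  # placeholder value
    workspace_id="1234567890",
    record_count=250_000,
    duration_seconds=812.4,
    job_status="SUCCEEDED",
)
print(row.job_status)
```

Note that every field describes the execution itself; no source records or PII appear in the table.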
To enable Reltio to read this metrics data, you must configure a Delta Share by following these steps.
- In Databricks, open Catalog Explorer.
- Navigate to the telemetry table at reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry.
- In the top-right corner, click Share.
- Select Create a new share with the streaming table and enter a name for the Delta Share.
- In the Recipients field, enter the recipient ID aws:us-east-1:a1958d2c-99c9-4481-a52c-f50cd6f003ba, and then click Share.
- Wait for Reltio to confirm that the telemetry share is visible and accessible.
After these steps, the pipeline runs in your Databricks environment and generates the initial Entity Resolution output, and the telemetry data that you share with Reltio enables operational monitoring and support.
Verification
Verify that the pipeline runs successfully:
- All notebook cells run without errors.
- The telemetry sharing step completes successfully.
- Output tables and match results are generated.
Result
The entity resolution pipeline runs successfully, and matched records are generated. You can now review match results, grouped records, and resolution metrics in the output. The telemetry data is shared with Reltio to enable operational monitoring and support.