Unify and manage your data

Run Reltio Embedded Entity Resolution in Databricks

Learn how to run Reltio Embedded Entity Resolution in Databricks by opening the solution notebook, attaching the configured cluster, executing the notebook cells in sequence, and sharing telemetry data with Reltio.

After you complete the step Install the required library on the cluster, run the Entity Resolution pipeline notebook from the cloned repository in your Databricks workspace to process your data and generate Entity Resolution output.

After the first successful run, you must share telemetry data with Reltio for operational monitoring and support.

Prerequisites

  • You have cloned the solution repository into your Databricks workspace (step: Clone the repository into your Databricks workspace).
  • You have installed the required library on the cluster (step: Install the required library on the cluster).

Run the pipeline notebook

  1. In Databricks, go to the cloned solution repository in your workspace, which you created in the step Clone the repository into your Databricks workspace. Navigate to the src/tasks/home_page path and open the notebook.
  2. Locate the compute selector on the notebook toolbar. It appears on the right side, between Run all and Schedule. Select the cluster that you configured earlier in the step Install the required library on the cluster.
  3. Follow the instructions in the notebook and run each cell in sequence. Provide the required inputs in each cell as prompted.
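
If you prefer a non-interactive run, you can also submit the notebook as a one-time job through the Databricks Jobs API (POST /api/2.1/jobs/runs/submit). The sketch below shows the shape of that request; the workspace URL, token, cluster ID, notebook path, and widget parameter are placeholder assumptions for your environment, and this approach only works if the notebook's prompts can be supplied as widget parameters.

```python
# Sketch: submit the solution notebook as a one-time Databricks job run
# instead of running cells interactively. The notebook path, cluster ID,
# and widget parameter below are hypothetical placeholders.
import json
import urllib.request


def build_submit_payload(notebook_path: str, cluster_id: str, params: dict) -> dict:
    """Build the request body for POST /api/2.1/jobs/runs/submit."""
    return {
        "run_name": "reltio-er-pipeline-run",
        "tasks": [
            {
                "task_key": "run_er_notebook",
                "existing_cluster_id": cluster_id,
                "notebook_task": {
                    "notebook_path": notebook_path,
                    # base_parameters supplies values the notebook would
                    # otherwise prompt for via dbutils.widgets.
                    "base_parameters": params,
                },
            }
        ],
    }


def submit_run(host: str, token: str, payload: dict) -> bytes:
    """POST the payload to the Jobs API (requires network and a valid token)."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/submit",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


payload = build_submit_payload(
    "/Repos/your-user/reltio-er/src/tasks/home_page",  # hypothetical repo path
    "0123-456789-abcdefgh",                            # placeholder cluster ID
    {"input_table": "main.default.source_records"},    # hypothetical widget value
)
```

Running interactively remains the documented path; treat the API route as an option only after you know which inputs the notebook expects.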

Mandatory: Share operational telemetry with Reltio

After a successful run, you must share operational telemetry with Reltio for monitoring and support. The pipeline creates the telemetry table automatically during the first run.

The telemetry data does not include personally identifiable information (PII) or your source data. The telemetry table contains only operational execution metrics, such as:

  • Session ID
  • Workspace ID
  • Record counts
  • Execution duration
  • Job status

The telemetry table is created at reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry.
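
To see what will be shared, you can query the telemetry table from a notebook cell. The column names in this sketch (session_id, workspace_id, record_count, duration_seconds, job_status) are assumptions based on the metrics listed above; check the actual schema with DESCRIBE TABLE before relying on them.

```python
# Sketch: inspect recent pipeline runs in the telemetry table.
# Column names are assumed from the metrics listed in this document;
# verify them with: DESCRIBE TABLE <table name>.
TELEMETRY_TABLE = "reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry"

query = f"""
SELECT session_id, workspace_id, record_count, duration_seconds, job_status
FROM {TELEMETRY_TABLE}
LIMIT 10
"""

# In a Databricks notebook cell, run:
# display(spark.sql(query))
```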

To enable Reltio to read this metrics data, configure a data share by following these steps.

  1. In Databricks, open Catalog Explorer.
  2. Navigate to the telemetry table at reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry.
  3. In the top-right corner, click Share > Share via Delta Sharing.
  4. Select Create a new share with the streaming table and enter a name for the Delta Share.
  5. In the Recipients field, enter the recipient ID aws:us-east-1:a1958d2c-99c9-4481-a52c-f50cd6f003ba and click Share.
  6. Wait for Reltio to confirm that the telemetry share is visible and accessible.
CAUTION: Do not modify the telemetry table. Unauthorized changes may constitute a breach of your contract with Reltio.
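
The UI steps above can also be expressed as Databricks Delta Sharing SQL (CREATE SHARE, ALTER SHARE, CREATE RECIPIENT, GRANT). The sketch below builds those statements; the share and recipient names are hypothetical, the recipient sharing identifier is the one given in step 5, and you need the relevant Unity Catalog privileges to run them.

```python
# Sketch: Delta Sharing DDL equivalent of the Catalog Explorer steps.
# The share name (reltio_telemetry_share) and recipient name (reltio)
# are hypothetical; the sharing identifier comes from step 5 above.
RECIPIENT_ID = "aws:us-east-1:a1958d2c-99c9-4481-a52c-f50cd6f003ba"
TELEMETRY_TABLE = "reltio_er_metadata_catalog.metadata_schema.pipeline_telemetry"

statements = [
    "CREATE SHARE IF NOT EXISTS reltio_telemetry_share",
    f"ALTER SHARE reltio_telemetry_share ADD TABLE {TELEMETRY_TABLE}",
    f"CREATE RECIPIENT IF NOT EXISTS reltio USING ID '{RECIPIENT_ID}'",
    "GRANT SELECT ON SHARE reltio_telemetry_share TO RECIPIENT reltio",
]

# In a Databricks notebook cell or SQL editor, run each statement, e.g.:
# for s in statements:
#     spark.sql(s)
```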

With these steps complete, the pipeline runs in your Databricks environment, generates the initial Entity Resolution output, and shares telemetry data with Reltio for operational monitoring and support.

Verification

Verify that the pipeline runs successfully:

  • All notebook cells run without errors.
  • The telemetry sharing step completes successfully.
  • Output tables and match results are generated.

Result

The entity resolution pipeline runs successfully, and matched records are generated. You can now review match results, grouped records, and resolution metrics in the output. The telemetry data is shared with Reltio to enable operational monitoring and support.