Configure and run entity resolution in Databricks

Learn how to configure and run the entity resolution pipeline and review results inside Databricks.

Before you begin, import the ready-to-run notebooks from the repository we provide into your Databricks workspace and complete the setup. For details, see Reltio Entity Resolution in Databricks set up.

To configure and run the entity resolution pipeline, follow these steps:

Select the Delta Table containing your input data
In the notebook, locate step Select Delta Table in Unity Catalog. Use the dropdowns to choose your Unity Catalog, schema, and the Delta Table that holds the records for your Individual entity.
Select the machine learning model catalog
In step Machine Learning Models Catalog Selection, choose the catalog that you created as part of the setup. The models in this catalog power the match logic in the pipeline.
Map matching input columns
In step Column Mapping with Input Data, do the following:
- Preview your table and verify sample records
- Select the primary key column
- Map attributes like name, email, phone, and address to match fields
If your dataset doesn’t contain a suggested input, leave that field blank.
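The mapping step can be pictured as a simple lookup from match fields to table columns. The sketch below is illustrative only: the match field names, the column names, and the `check_mapping` helper are assumptions, not part of the notebook, which collects the mapping through its own widgets.

```python
# Hypothetical mapping from match fields to input table columns.
# An empty string means the field is left blank.
column_mapping = {
    "name": "full_name",
    "email": "email_address",
    "phone": "",  # blank: this example dataset has no phone column
    "address": "mailing_address",
}


def check_mapping(table_columns, primary_key, mapping):
    """Return the non-blank mappings and a list of problems found."""
    problems = []
    if primary_key not in table_columns:
        problems.append(f"primary key column not found: {primary_key}")
    # Blank fields are skipped rather than treated as errors.
    active = {field: col for field, col in mapping.items() if col}
    for field, col in active.items():
        if col not in table_columns:
            problems.append(f"{field} is mapped to a missing column: {col}")
    return active, problems


# Hypothetical column list for the selected Delta Table.
table_columns = ["id", "full_name", "email_address", "mailing_address"]
active, problems = check_mapping(table_columns, "id", column_mapping)
```

Here `active` contains only the three mapped fields and `problems` is empty, which mirrors leaving the phone field blank in the notebook.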
Customize matching parameters
In step Customize Matching Parameters, adjust the behavior of the resolution engine as needed:
- Number of Neighbors: Maximum number of top similar records to retrieve for each input record.
  - Higher values (e.g., 90–100): Increase recall, retrieve more candidates, but may increase cost and runtime.
  - Lower values (e.g., 25–50): Improve performance, reduce cost, but may reduce match coverage.
  - Default and maximum: 100
- Similarity Search Threshold: Minimum similarity score (0 to 1) required to consider a record as a candidate match.
  - Lower values (e.g., 0.3–0.5): More inclusive — higher recall, more weak matches, risk of false positives.
  - Higher values (e.g., 0.6–0.8+): More selective — higher precision, fewer false positives, risk of missing subtle matches.
  - Default: 0.6
- Final Match Threshold: Minimum final score required to confirm a match (e.g., 0.85 for high precision).
Use the default values or tune them for your data, then click Save to apply the configuration.
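To build intuition for how the three parameters interact, the following sketch applies them to a small list of candidates for one input record. It is a conceptual model only and does not reproduce the pipeline's internal logic: the `Candidate` type, the `select_matches` function, and the order of the filtering steps are assumptions. The value 0.85 for the final match threshold is the example from the list above, not a documented default.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    record_id: str
    similarity: float   # similarity search score, 0 to 1
    final_score: float  # final match score (hypothetical values)


def select_matches(candidates, num_neighbors=100,
                   similarity_threshold=0.6, final_match_threshold=0.85):
    """Filter candidates for one input record using the three parameters."""
    if not 1 <= num_neighbors <= 100:
        raise ValueError("num_neighbors must be between 1 and 100")
    # 1. Drop candidates below the similarity search threshold.
    above = [c for c in candidates if c.similarity >= similarity_threshold]
    # 2. Keep at most num_neighbors of the most similar candidates.
    top = sorted(above, key=lambda c: c.similarity, reverse=True)[:num_neighbors]
    # 3. Confirm only candidates that reach the final match threshold.
    return [c for c in top if c.final_score >= final_match_threshold]


candidates = [
    Candidate("a", 0.90, 0.95),
    Candidate("b", 0.70, 0.80),
    Candidate("c", 0.50, 0.99),
    Candidate("d", 0.65, 0.90),
]
default_ids = [c.record_id for c in select_matches(candidates)]
inclusive_ids = [c.record_id
                 for c in select_matches(candidates, similarity_threshold=0.4)]
```

With the defaults, candidate `c` never reaches the final stage because its similarity score is below 0.6. Lowering the similarity search threshold to 0.4 lets it through, which illustrates the recall gain and the extra work described above.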
Run the resolution pipeline
Locate the cell labeled Find Matches. Run the pipeline to execute the full matching logic using the parameters you’ve configured.
The pipeline writes matched groups and scores to a new Delta Table whose name is your input table name with the suffix _matches.
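As a small illustration of the naming rule, the helper below builds the output table name from the input table's parts. The function name and the example catalog, schema, and table names are hypothetical, and the sketch assumes the output table is created in the same catalog and schema as the input table.

```python
def matches_table_name(catalog: str, schema: str, table: str) -> str:
    """Build the three-level name of the output table for an input table."""
    return f"{catalog}.{schema}.{table}_matches"


output_table = matches_table_name("main", "crm", "individuals")

# In a Databricks notebook, where `spark` is predefined, the results
# could then be loaded and previewed with:
#   df = spark.table(output_table)
#   display(df)
```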
Review match results and KPIs
After the run completes, use the steps View Match Results and Entity Resolution Summary to:
- Search by primary key to inspect matched groups
- Analyze KPIs such as total records, matched records, and cluster size
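The KPIs named above can be understood with a short calculation. The sketch below assumes a simplified, hypothetical view of the results in which each primary key maps to a match group ID. The actual schema of the output table and the exact KPI definitions used in the notebook may differ.

```python
from collections import Counter


def summarize(total_records, assignments):
    """Compute summary KPIs from a mapping of primary key -> group ID."""
    sizes = Counter(assignments.values())
    # Treat a group as a cluster only if it contains at least two records.
    clusters = {group: n for group, n in sizes.items() if n >= 2}
    matched = sum(clusters.values())
    return {
        "total_records": total_records,
        "matched_records": matched,
        "clusters": len(clusters),
        "largest_cluster": max(clusters.values(), default=0),
        "average_cluster_size": matched / len(clusters) if clusters else 0.0,
    }


# Hypothetical results: 10 input records, 6 of which received a group ID.
assignments = {"1": "g1", "2": "g1", "3": "g2",
               "4": "g3", "5": "g3", "6": "g3"}
kpis = summarize(10, assignments)
```

In this example, record 3 sits alone in its group, so it is not counted as matched: the summary reports 5 matched records in 2 clusters, with a largest cluster of 3.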

After completing these steps, you'll have a validated set of resolved individual entity records stored in a Delta Table in your Unity Catalog, ready for use in analytics or downstream pipelines.
