Set up your Databricks environment
Learn how to prepare your Databricks environment so that you can set up and run entity resolution.
Reltio Embedded Entity Resolution in Databricks requires both solution access and environment preparation before you can run the pipeline. This process includes preparing your Databricks workspace, accessing shared assets, and configuring compute resources.
Prerequisites
Ensure that you have access to your Databricks workspace.
Enable Delta Sharing
Enable Delta Sharing on a metastore in your Databricks environment to support secure access to ML models and notebooks. For more information, see Enable Delta Sharing on a metastore.
Configure compute cluster
| Requirement | Value |
|---|---|
| Databricks Runtime version | 16.4 LTS or later |
| Performance mode | Machine Learning enabled |
| Photon Acceleration | Disabled |
| Single Node | Disabled |
| Worker nodes | Minimum 2 (increase based on data volume) |
| Autoscaling | Not recommended |
| Worker type | g5.4xlarge [A10G] or equivalent with at least 64 GB memory and 1 GPU |
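The table above can also be expressed as a cluster definition for the Databricks Clusters API. This is an illustrative sketch, not the product's required payload: the cluster name is a placeholder, and the `spark_version` string should be taken from your workspace's list of available ML GPU runtimes (the exact identifier varies by cloud and release).

```json
{
  "cluster_name": "reltio-er-cluster",
  "spark_version": "16.4.x-gpu-ml-scala2.12",
  "node_type_id": "g5.4xlarge",
  "num_workers": 2,
  "runtime_engine": "STANDARD",
  "autoscale": null
}
```

Here `runtime_engine: "STANDARD"` corresponds to Photon Acceleration being disabled, `num_workers: 2` matches the minimum worker count, and omitting an `autoscale` range keeps the cluster at a fixed size, consistent with the recommendation against autoscaling.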
Configure Unity Catalog resources
Create the required Unity Catalog objects and assign permissions for the user or service principal that runs the notebook.
To manage catalogs, schemas, and permissions:
- In the Databricks workspace, click Catalog in the left sidebar to open Catalog Explorer.
- In Catalog Explorer, create a new catalog:
  - Click Create catalog
  - Enter the catalog name (reltio_er_metadata_catalog)
  - Confirm the creation
- Select the created catalog, then create a schema:
  - Click Create schema
  - Enter the schema name (metadata_schema)
  - Confirm the creation
- Assign permissions to the user or service principal:
  - Select the catalog or schema
  - Go to the Permissions tab
  - Grant the required privileges
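If you prefer SQL over Catalog Explorer, the catalog and schema above can also be created from a notebook or SQL editor attached to a Unity Catalog-enabled warehouse. This is an equivalent sketch of the same steps:

```sql
-- Create the metadata catalog and schema used by the entity resolution pipeline
CREATE CATALOG IF NOT EXISTS reltio_er_metadata_catalog;
CREATE SCHEMA IF NOT EXISTS reltio_er_metadata_catalog.metadata_schema;
```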
| Resource type | Name or details | Required permissions |
|---|---|---|
| Catalog | reltio_er_metadata_catalog | USE CATALOG, CREATE SCHEMA |
| Schema | metadata_schema in reltio_er_metadata_catalog | CREATE TABLE, USE SCHEMA, SELECT, READ VOLUME, CREATE VOLUME, MODIFY, WRITE VOLUME |
| Source data | Your source catalogs and tables | USE CATALOG on each source catalog, SELECT on each source table |
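The permissions in the table can be granted with Unity Catalog SQL instead of the Permissions tab. In this sketch, `er-pipeline-sp` is a placeholder for your user or service principal, and `source_catalog.source_schema.source_table` stands in for one of your own source tables; substitute your actual names.

```sql
-- Metadata catalog and schema privileges for the pipeline principal
GRANT USE CATALOG, CREATE SCHEMA
  ON CATALOG reltio_er_metadata_catalog TO `er-pipeline-sp`;
GRANT USE SCHEMA, CREATE TABLE, SELECT, MODIFY,
      CREATE VOLUME, READ VOLUME, WRITE VOLUME
  ON SCHEMA reltio_er_metadata_catalog.metadata_schema TO `er-pipeline-sp`;

-- Repeat for each source catalog and table the pipeline reads from
GRANT USE CATALOG ON CATALOG source_catalog TO `er-pipeline-sp`;
GRANT SELECT ON TABLE source_catalog.source_schema.source_table TO `er-pipeline-sp`;
```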
Verification
Verify that your Databricks environment meets the required prerequisites:
- Delta Sharing is enabled in your Unity Catalog metastore.
- A compute cluster is available with the required configuration.
- The required Unity Catalog objects and permissions are defined.
Result
Your Databricks environment is ready for onboarding Reltio Embedded Entity Resolution.
You can now clone the repository into your Databricks workspace and continue with the setup process.