Configure Databricks pipeline for AWS using Console UI
Learn how to use the Reltio Console to configure a Databricks pipeline for AWS.
The Reltio Data Pipeline for Databricks delivers clean, merged, and validated profile data to your Databricks warehouse, so that you always have accurate data to inform your analytical and reporting decisions. For more information, see topic Reltio Data Pipeline for Databricks.
To configure a new Databricks data pipeline for AWS:
- In the Console, select the Data Pipelines application. The page displays existing pipelines or, if empty, a note pointing to the documentation.
- In Data Pipelines, select +NEW PIPELINE.
- Select Databricks. The Configure Databricks pipeline page displays.
- In the Name field, enter the name of the pipeline. The name is used to refer to the pipeline in the UI. The name must be alphanumeric, must be 3 to 20 characters long, and must not contain spaces.
- In the Authentication section, from the Cloud provider dropdown, select AWS.
- Select Authenticate using role to authenticate requests using the role.
- In the Role field, enter the name of the role assigned to the user. You must use the existing role assigned to you. Note: The existing role assigned to you must have the Reltio ARN role given below as part of the trust relationship. For more information, see topic Create Databricks IAM role in AWS.
- Select Copy to copy the default role, which you can then add to the trust relationship of your role.
- Select Authenticate using key/secret to authenticate requests using a key and secret.
- In the Key field, enter the access key ID.
- In the Secret field, enter the secret access key.
- In the Staging Bucket Name field, enter the name of the bucket on AWS.
- In the Staging Bucket Region field, enter the bucket's region if the bucket is not located in a United States region.
- In the External ID field, enter the unique external ID required to access the role.
- In the Target configuration section, select one of the following options for Databricks storage type:
- If you select Unity Catalog, enter the name of the catalog in the Catalog name field. This catalog is where the Delta Live Tables (DLT) pipelines create the schema.
- If you select Hive metastore, enter the Target bucket name to store the delta tables.
Note: Unity Catalog helps govern data and AI assets on Databricks. It enables you to seamlessly explore datasets, tables, and views within the Databricks environment. It also offers a consolidated view of all available data assets, simplifying data discovery and querying. Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions in a relational database.
- In the Databricks configuration section, enter these details:
- In the Host URL field, enter the Databricks host URL.
- Specify the number of cluster workers that the DLT pipeline runs with. You can either enter the minimum and maximum number of cluster workers in the editable fields or drag the range slider to set the minimum and maximum. The slider and the numbers in the editable fields stay in sync.
- Select the Run continuously checkbox if the DLT pipeline must run continuously and process new files instantly, instead of waiting for you to trigger the process.
- In the Databricks token field, enter the Databricks access token used to authenticate with the workspace at the host URL.
- Select the DLT File Notification checkbox to receive a notification when new files are added to the specified location or directory.
- In the Query URL field, enter the URL of the AWS queue that the DLT pipeline reads to process new files in the staging bucket.
- In the Instance profile ARN field, enter the ARN of the instance profile that AWS uses to associate an IAM role with an EC2 instance.
- In the AWS role ARN field, enter the Amazon Resource Name of the desired IAM role.
- In the External ID field, enter the unique external ID required to access the role.
- In the Data delivery options section, enter these details:
- From the Attribute format dropdown, select the format of the stored attribute, which can be Standard, Flatten, or Flatten single value.
- Select the Storage format to store data, which can be JSON or Columns.
- Select the Transmit OV values only checkbox to sync only the operational values to Databricks.
- Select the Serialize the initial sources in crosswalks checkbox to send initial sources in crosswalks.
- Select Save & Create at the top of the page to save and create the Databricks pipeline for AWS.
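The naming rule from the Name step can be checked before you open the Console. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits only. The function name `is_valid_pipeline_name` is hypothetical and is not part of any Reltio tool or API.

```python
import re

# Assumed reading of the documented rule: ASCII letters and digits only,
# 3 to 20 characters, no spaces. The Console may apply a different check.
PIPELINE_NAME_PATTERN = re.compile(r"[A-Za-z0-9]{3,20}")


def is_valid_pipeline_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rules."""
    return PIPELINE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical candidate names, for illustration only.
    for candidate in ["awsPipeline01", "my pipeline", "ab", "a" * 21]:
        print(candidate, is_valid_pipeline_name(candidate))
```

Names that contain spaces or punctuation, or that fall outside the 3 to 20 character range, are rejected by this check.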
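The trust relationship required in the Role step is a standard AWS IAM trust policy attached to your role. The sketch below builds one in Python so that its shape is visible. It assumes the policy trusts a single principal and is conditioned on the external ID. The ARN and external ID are placeholders, and `build_trust_policy` is a hypothetical helper. Use the role copied from the Console and the policy described in Create Databricks IAM role in AWS as the authoritative values.

```python
import json


def build_trust_policy(reltio_role_arn: str, external_id: str) -> dict:
    """Build an IAM trust policy document that lets one principal assume the role.

    Assumption: access is restricted with an sts:ExternalId condition. Whether
    your setup requires this condition is not confirmed by this page.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": reltio_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        # Placeholder values: replace with the role copied from the Console
        # and the external ID you enter in the External ID field.
        "arn:aws:iam::123456789012:role/example-reltio-role",
        "example-external-id",
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be compared with the trust relationship shown for your role in the AWS IAM console.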