Reltio Data Pipeline for Databricks architecture

Learn about the Reltio Data Pipeline for Databricks architecture

This Reltio Data Pipeline for Databricks architecture diagram illustrates the movement of data among your Reltio tenant, your cloud platform account, and your Databricks environment.

Note: Reltio Data Pipeline for Databricks runs on AWS, Azure, and GCP platforms only.

The Data Pipeline Hub service uploads your raw data as compressed JSON files to file storage in your cloud platform account. The Databricks Delta Live Tables pipeline then reads this data from the staged location and saves it as Delta Live Tables, such as entities, relations, interactions, matches, and merges, in a designated location. Both Hive Metastore and Unity Catalog are supported.

Use a Databricks notebook to access and query these tables for data management and analysis.
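As a rough illustration of how a notebook might address these tables, the sketch below builds the table name for either metastore. It assumes the usual Databricks naming: catalog.schema.table under Unity Catalog and schema.table under Hive Metastore. The helper name and the example catalog and schema names are hypothetical and not part of the Reltio product.

```python
from typing import Optional


def qualified_table_name(table: str, schema: str, catalog: Optional[str] = None) -> str:
    """Build a table name for use in spark.table(...) or a SQL FROM clause.

    Pass a catalog for Unity Catalog (catalog.schema.table).
    Omit it for Hive Metastore (schema.table).
    """
    parts = [catalog, schema, table] if catalog else [schema, table]
    return ".".join(parts)


# Hypothetical names: "main" and "reltio_data" are placeholders.
unity_name = qualified_table_name("entities", "reltio_data", catalog="main")
hive_name = qualified_table_name("entities", "reltio_data")
```

In a notebook, where the spark session is already defined, a call such as spark.table(unity_name) would then load the entities table, provided a table with that name exists in your environment.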

Delta Live Tables Pipeline Modes

You can run the pipeline in one of two modes:

Triggered Mode
This mode is best suited for scenarios where you want to execute your data pipeline in response to specific events or on a scheduled basis. It ensures that your data is processed and updated at the right times, providing timely insights and keeping your data warehouse current.
Continuous Mode
For real-time data processing needs, the Continuous mode allows your data pipeline to run indefinitely, processing new data as it arrives. This mode is ideal for use cases requiring immediate data availability and analysis, ensuring that your data warehouse is always up to date with the latest information.
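In Delta Live Tables pipeline settings, the choice between the two modes is expressed by the boolean continuous field. The sketch below builds a minimal settings document for each mode. The pipeline name and notebook path are hypothetical placeholders, and the Reltio setup steps may create these settings for you, so treat this as an illustration of the setting rather than a required procedure.

```python
import json
from typing import Any, Dict


def pipeline_settings(name: str, notebook_path: str, continuous: bool) -> Dict[str, Any]:
    """Return a minimal Delta Live Tables settings document.

    continuous=False gives Triggered mode (each update runs once, then stops).
    continuous=True gives Continuous mode (the pipeline keeps running).
    """
    return {
        "name": name,
        "continuous": continuous,
        "libraries": [{"notebook": {"path": notebook_path}}],
    }


# Hypothetical name and path, used only for illustration.
triggered = pipeline_settings("reltio-dlt", "/Workspace/example/reltio_pipeline", continuous=False)
continuous = pipeline_settings("reltio-dlt", "/Workspace/example/reltio_pipeline", continuous=True)

# Serialize to the JSON form shown in the pipeline settings editor.
triggered_json = json.dumps(triggered, indent=2)
```

Triggered mode is usually paired with a job schedule or an external trigger, while Continuous mode keeps compute running and therefore tends to cost more.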

File Detection Modes for Delta Live Tables

When you configure Delta Live Tables, you have two main file detection modes to monitor and ingest new data:

Directory Listing Mode
In this mode, the system periodically scans the input directory for new files. It's a straightforward approach that works well for many use cases, especially when dealing with a moderate volume of new data.
Pros
Doesn't require any additional permissions.
Cons
Slower to identify new files.
Slower with larger data sizes because it repeatedly lists the storage contents.

File Notification Mode
Leveraging cloud services' file notification features, this mode provides a more scalable solution for detecting new files. It's particularly effective for high-volume data scenarios, reducing the time and resources needed to identify and process new data.
Pros
Faster and more scalable for large inputs.
Cons
Requires additional permissions to access notification queues.
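These two modes match the file detection modes of Databricks Auto Loader. Assuming the pipeline reads the staged files through Auto Loader's cloudFiles source, the mode is selected with the cloudFiles.useNotifications option. The following sketch builds the reader options for each mode. The helper name is hypothetical, and the Reltio-provided pipeline may expose this choice through its own configuration instead.

```python
from typing import Dict


def auto_loader_options(use_file_notifications: bool) -> Dict[str, str]:
    """Return Auto Loader reader options for JSON input.

    use_file_notifications=False selects Directory Listing mode (the default).
    use_file_notifications=True selects File Notification mode.
    """
    return {
        "cloudFiles.format": "json",
        "cloudFiles.useNotifications": "true" if use_file_notifications else "false",
    }


directory_listing = auto_loader_options(use_file_notifications=False)
file_notification = auto_loader_options(use_file_notifications=True)
```

Inside a Databricks pipeline or notebook, these options could be passed to a streaming read, for example spark.readStream.format("cloudFiles").options(**file_notification).load(staged_path), where staged_path is a placeholder for your staged location.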

Next steps

The following sections show you how to configure the Reltio Data Pipeline for Databricks for use in your cloud platform.