Configure Databricks pipeline for GCP using Console UI

Learn how to configure a Databricks pipeline for GCP in the Console.

By now, you know the configuration prerequisites from Before you start. Let's see how to configure a pipeline using the Console UI.
To configure a new Databricks Data pipeline for GCP:
  1. In the Console, select the Data Pipelines application.
    The page displays existing pipelines or, if empty, a note pointing to the documentation.
  2. In Data Pipelines, select +NEW PIPELINE.
  3. Select Databricks. The Configure Databricks pipeline page displays.
  4. In the Name field, enter the name of the pipeline, which is used to refer to the pipeline in the UI. The name must be alphanumeric, 3 to 20 characters long, and must not contain spaces.
  5. In the Authentication section, from the Cloud provider dropdown, select GCP.
  6. Select Upload service account key file to browse, select, and upload your account key file.
  7. In the Staging Configuration section, enter the following details:
    1. In the Project name field, enter the GCP project name.
    2. In the Staging Bucket Name field, enter the bucket name for GCP.
    3. In the Staging Bucket Region field, enter the bucket region only if the region is not in the United States.
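
    If you want to sanity-check the service account key file (step 6) and the staging bucket before entering these values, a minimal sketch using the google-cloud-storage Python package might look like the following. This is an illustration only, not something the Console requires; key.json, my-project, and my-staging-bucket are placeholders.

    ```python
    # Illustrative sketch only: verify that the service account key can read the
    # staging bucket. Assumes the google-cloud-storage package is installed;
    # "key.json", "my-project", and "my-staging-bucket" are placeholders.
    from google.cloud import storage
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file("key.json")
    client = storage.Client(project="my-project", credentials=credentials)

    # Raises NotFound or Forbidden if the bucket name or permissions are wrong.
    bucket = client.get_bucket("my-staging-bucket")
    print(bucket.name, bucket.location)  # the location should match the region you enter
    ```
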
  8. In the Target configuration section, select one of the following options for Databricks storage type.
    • If you select Unity Catalog, enter the name of the Unity Catalog in the Catalog name field. This is the catalog in which the Delta Live Tables pipelines create the schema.
    • If you select Hive metastore, enter the Target bucket name to store the delta tables.

    Note: Unity Catalog helps govern data and AI assets on Databricks. It enables you to seamlessly explore datasets, tables, and views within the Databricks environment. It also offers a consolidated view of all available data assets, simplifying data discovery and querying. Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions in a relational database.
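
    If you choose Unity Catalog, you can optionally confirm that the catalog exists before you configure the pipeline. A minimal sketch using the databricks-sdk Python package might look like this; it is an illustration only, and my-catalog, the host URL, and the token are placeholders.

    ```python
    # Illustrative sketch only: check that the catalog entered in the Catalog name
    # field exists in the workspace. Assumes the databricks-sdk package is installed.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient(host="https://<your-workspace-url>", token="<databricks-token>")
    catalog = w.catalogs.get("my-catalog")  # raises an error if the catalog does not exist
    print(catalog.name, catalog.owner)
    ```
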
  9. In the Databricks configuration section, enter these details:
    1. In the Host URL field, enter the Databricks host URL.
    2. Specify the number of cluster workers for the DLT pipeline to run.
      You can either enter the minimum and maximum number of cluster workers in the editable fields provided or drag the range slider to set the range. The slider and the values in the editable fields change together.
    3. Select the Run Continuously checkbox if the DLT pipeline must run continuously and process new files instantly, instead of waiting for you to trigger the process.
    4. Select the DLT File Notification checkbox to receive a notification when new files are added to the specified location/directory.
    5. In the Databricks token field, enter the token used to authenticate access to resources and APIs in the Databricks workspace.
    6. In the Client ID field, enter the client ID of the GCP service account.
    7. In the Client Email field, enter the email address of the GCP service account.
    8. In the Project ID field, enter the ID of the project that owns the service account.
    9. In the Secrets scope name field, enter the name of the secret scope you created to store the secrets.
    10. In the Private Key ID Secret Name field, enter the name of the secret that holds the private key ID.
    11. In the Private Key Secret Name field, enter the name of the secret that holds the private key.
    12. In the Subscription ID field, enter the Subscription ID associated with all the storage accounts.
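
    The secret scope and secrets referenced in the Secrets scope name, Private Key ID Secret Name, and Private Key Secret Name fields must already exist in the Databricks workspace. A minimal sketch that creates them with the databricks-sdk Python package might look like this; it is an illustration only, and the scope name, secret names, key.json, host URL, and token are placeholders.

    ```python
    # Illustrative sketch only: create the secret scope and store the private key ID
    # and private key from the GCP service account key file. Assumes the
    # databricks-sdk package is installed; all names below are placeholders.
    import json

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient(host="https://<your-workspace-url>", token="<databricks-token>")

    with open("key.json") as f:
        key = json.load(f)

    w.secrets.create_scope(scope="gcp-pipeline-scope")
    w.secrets.put_secret(scope="gcp-pipeline-scope", key="private-key-id",
                         string_value=key["private_key_id"])
    w.secrets.put_secret(scope="gcp-pipeline-scope", key="private-key",
                         string_value=key["private_key"])
    ```
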
  10. In the Data delivery option section, enter these details:
    1. From the Attribute format dropdown, select the format of the stored attribute, which can be Standard, Flatten, or Flatten single value.
    2. Select the Storage format for the data, which can be JSON or Columns.
    3. Select the Transmit OV values only checkbox if you want to sync only operational values (OV) to Databricks.
    4. Select the Serialize initial sources in crosswalks checkbox to include the initial sources in the crosswalks that are delivered.
  11. Select Save & Create at the top of the page to save and create the Databricks pipeline for GCP.
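
If you want to double-check the host URL and token entered in the Databricks configuration section, a minimal sketch with the databricks-sdk Python package might look like this; it is an illustration only, and the host URL and token are placeholders.

```python
# Illustrative sketch only: confirm that the host URL and token can authenticate
# against the Databricks workspace. Assumes the databricks-sdk package is installed.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host="https://<your-workspace-url>", token="<databricks-token>")
print(w.current_user.me().user_name)  # fails with an authentication error if the token is invalid
```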