Reltio Data Science Spark SDK

Learn about a collection of APIs available for running analytics on Reltio Connected Cloud data in Apache Spark.

Reltio Data Science SDK

The Reltio Data Science Spark SDK consists of Reltio data management APIs exposed as Apache Spark libraries. You can use these APIs to develop custom analytics code that processes Reltio data in Spark.

The SDK and Spark tools enable you to perform the following tasks:

  • Load data from Reltio into Spark Datasets (Data Access API)
  • Develop analytics using Spark tools:
    • Spark SQL: Write and execute SQL queries against Datasets
    • Spark GraphX: Run built-in graph algorithms
    • Spark MLlib: Apply machine learning algorithms
  • Load data back to Reltio (Data Persist API)
  • Delete data from Spark Datasets (Data Delete API)

Note: Before you start using the Reltio Data Science Spark SDK, be aware of the following:
  • Qubole Data Service (QDS) is the recommended environment for the Reltio Data Science Spark SDK.
  • QDS generally supports Python, Scala, SQL, and R. The Reltio Data Science Spark SDK is Scala-based.
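
The Spark SQL step in the task list above can be sketched as follows. This is a minimal illustration that uses only standard Apache Spark calls. It does not call the Data Access API, because its signatures are not shown here; a small in-memory Dataset stands in for data loaded from Reltio. The Customer case class, its fields, and the customers view name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkSqlSketch {
  // Hypothetical flat record used for illustration only;
  // it is not a Reltio entity schema.
  final case class Customer(id: String, country: String, lifetimeValue: Double)

  /** Registers a Dataset as a temporary view and counts customers per country with SQL. */
  def customersPerCountry(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Stand-in for a Dataset that would normally be loaded from Reltio.
    val customers = Seq(
      Customer("c1", "US", 1200.0),
      Customer("c2", "US", 300.0),
      Customer("c3", "DE", 750.0)
    ).toDS()

    // A temporary view makes the Dataset addressable by name in SQL.
    customers.createOrReplaceTempView("customers")

    spark.sql(
      """SELECT country, COUNT(*) AS customer_count
        |FROM customers
        |GROUP BY country
        |ORDER BY country""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; a notebook usually provides a session already.
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    try customersPerCountry(spark).show()
    finally spark.stop()
  }
}
```

In a notebook environment, a SparkSession is typically already available, so only the body of customersPerCountry would be needed.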

Spark Reference

Apache Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster.

Spark Notebook is an open-source notebook aimed at enterprise environments. It provides data scientists and data engineers with an interactive, web-based editor that combines Scala code, SQL queries, markup, and JavaScript, so that teams can collaboratively explore, analyze, and learn from massive data sets.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood. A Dataset is the strongly typed form of the same abstraction; in Scala, a DataFrame is a Dataset of Row objects.
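
The difference between a typed Dataset and a DataFrame can be sketched with standard Spark calls. This is an illustrative example only: the Person case class and the helper names adults and names are hypothetical and are not part of the Reltio SDK.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetSketch {
  // Hypothetical record type; its fields become the named columns.
  final case class Person(name: String, age: Int)

  /** Typed filter: the lambda works on Person objects and is checked at compile time. */
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(_.age >= 18)

  /** Untyped view: columns are referenced by name, as in a relational table. */
  def names(people: Dataset[Person]): DataFrame =
    people.toDF().select("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    try {
      val people = Seq(Person("Ada", 36), Person("Tim", 12)).toDS()
      people.printSchema()   // lists the named columns and their types
      adults(people).show()  // rows that pass the typed filter
      names(people).show()   // a single-column DataFrame
    } finally {
      spark.stop()
    }
  }
}
```

The typed form catches a misspelled field at compile time, while the DataFrame form resolves column names only when the query is analyzed at run time.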