Azure Databricks is an Apache Spark-based analytics platform optimized for Azure. Apache Spark is an open source big data processing engine. Azure Databricks was developed by Microsoft in collaboration with Databricks, the company co-founded by Matei Zaharia, the original creator of Apache Spark.
Databricks sits in the data preparation or processing stage of the data lifecycle. The lifecycle starts with data being ingested into Azure using Data Factory and stored in persistent storage (such as ADLS Gen2 or Blob Storage). Alternatively, data can be streamed in using Kafka, Event Hubs or IoT Hub.
In the next stage, the data is processed in Databricks, for example with machine learning (Databricks uses Apache Spark under the hood), and the extracted insights are then loaded into one of the analytical data stores in Azure (Cosmos DB, Synapse Analytics or SQL Database).
These insights are then ready to be visualized and presented to end users with analytical reporting tools such as Power BI.
Let’s have a look at the Apache Spark modules available in Azure Databricks:
Spark SQL and DataFrames: Spark SQL is the module that enables users to query structured data. A DataFrame is a Spark data structure that is equivalent to a table in SQL: its data is organized into named columns.
Streaming: This module adds real-time data processing capabilities. Data can be ingested from sources such as HDFS, Flume or Kafka.
MLlib: The machine learning library consists of a variety of machine learning tools such as ML algorithms, pipelines and utilities.
GraphX: The GraphX module provides graph computation features for various analytics use cases.
Spark Core API: support for R, SQL, Python, Scala, and Java.