An Azure Databricks cluster is a set of computation resources used to run data engineering and data science workloads, such as ETL pipelines, streaming data processing, and machine learning.
Based on how they are used, Azure Databricks clusters fall into two types:
- Interactive Cluster: This type of cluster is used for collaborative work, where team members need to view and share each other's work. Interactive clusters can be terminated and restarted manually from the Azure Databricks workspace UI, through a CLI command, or through the REST API.
- Automated Cluster: As the name suggests, an automated cluster (also called a job cluster) is created automatically by the Azure Databricks job scheduler when a user runs a job. The cluster stays alive only while the job is running and is terminated automatically when the job finishes.
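The difference between the two cluster types shows up in how a job is defined. The following is a minimal sketch, assuming the field names of the Databricks Jobs API 2.0 (`new_cluster`, `existing_cluster_id`, `notebook_task`). The runtime version, VM size, notebook path, and cluster ID are placeholder values chosen for illustration, and the helper functions are hypothetical. The code only builds the request payloads and sends nothing.

```python
import json

# Placeholder values, not recommendations: adjust to your workspace.
ASSUMED_SPARK_VERSION = "7.3.x-scala2.12"
ASSUMED_NODE_TYPE = "Standard_DS3_v2"


def job_on_automated_cluster(name, notebook_path, num_workers=2):
    """Job settings that ask the job scheduler to create a cluster for each run.

    The cluster described under "new_cluster" exists only for the duration
    of the run and is terminated when the job finishes.
    """
    return {
        "name": name,
        "new_cluster": {
            "spark_version": ASSUMED_SPARK_VERSION,
            "node_type_id": ASSUMED_NODE_TYPE,
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


def job_on_interactive_cluster(name, notebook_path, cluster_id):
    """Job settings that reuse an already running interactive cluster."""
    return {
        "name": name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
    }


if __name__ == "__main__":
    # Hypothetical notebook path and cluster ID.
    print(json.dumps(job_on_automated_cluster("nightly-etl", "/Jobs/etl"), indent=2))
    print(json.dumps(
        job_on_interactive_cluster("adhoc-etl", "/Jobs/etl", "0000-000000-example1"),
        indent=2,
    ))
```

A payload of the first kind matches the automated cluster described above: nobody starts or stops the cluster by hand. The second kind runs the same notebook on a cluster that a team already shares.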
Another important concept for Azure Databricks clusters is the cluster mode. When creating a cluster, two modes are available:
- Standard mode: This is the default cluster mode and is suitable for a single user. It supports R, Python, Scala, and SQL.
- High Concurrency mode: This cluster mode is optimized for a high level of resource utilization by sharing the cluster among multiple users. It provides cost-effective sharing of resources while minimizing latency. This mode supports only R, Python, and SQL, not Scala. Two important concepts of High Concurrency mode are:
  - Pre-emption: Pre-emption prevents one large task from holding compute resources for a long time. High Concurrency mode builds on Spark's fair scheduling to control the level of pre-emption across concurrent jobs, so that every job gets a fair share of compute resources.
  - Fault Isolation: Each notebook (an interactive document that combines user code with its output) runs in its own separate environment. This ensures that each user's code runs independently and is not affected by errors in other users' code.
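The two modes can also be told apart in a cluster specification. The sketch below assumes the legacy Clusters API 2.0 convention, in which a High Concurrency cluster is marked by the Spark configuration keys `spark.databricks.cluster.profile` and `spark.databricks.repl.allowedLanguages` together with a `ResourceClass` tag. Treat these keys as an assumption to verify against the documentation for your workspace version. The `cluster_spec` and `allowed_languages` helpers, the runtime version, and the VM size are hypothetical.

```python
STANDARD_LANGUAGES = ["python", "r", "scala", "sql"]


def cluster_spec(name, mode="standard", num_workers=2):
    """Build a cluster-creation payload for the given mode (nothing is sent)."""
    spec = {
        "cluster_name": name,
        "spark_version": "7.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",   # placeholder VM size
        "num_workers": num_workers,
        "spark_conf": {},
        "custom_tags": {},
    }
    if mode == "high_concurrency":
        # Assumed legacy markers of a High Concurrency cluster.
        spec["spark_conf"]["spark.databricks.cluster.profile"] = "serverless"
        # Scala is left out: this mode supports only SQL, Python, and R.
        spec["spark_conf"]["spark.databricks.repl.allowedLanguages"] = "sql,python,r"
        spec["custom_tags"]["ResourceClass"] = "Serverless"
    elif mode != "standard":
        raise ValueError(f"unknown cluster mode: {mode!r}")
    return spec


def allowed_languages(spec):
    """Return the sorted list of languages a cluster spec permits."""
    raw = spec["spark_conf"].get("spark.databricks.repl.allowedLanguages")
    if raw is None:
        # No restriction configured: assume the Standard mode language set.
        return list(STANDARD_LANGUAGES)
    return sorted(lang.strip() for lang in raw.split(","))


if __name__ == "__main__":
    shared = cluster_spec("team-analytics", mode="high_concurrency")
    print(allowed_languages(shared))
    print(allowed_languages(cluster_spec("solo-dev")))
```

Keeping the mode as an argument of one helper makes the contrast explicit: the two specifications differ only in the sharing-related settings, while the hardware and runtime fields stay the same.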