As we saw in the previous post, Azure Synapse Analytics has a Massively Parallel Processing (MPP) architecture in which a control node coordinates multiple compute nodes. To take advantage of the MPP architecture, data ingestion must be parallelized. PolyBase optimizes data ingestion into the Parallel Data Warehouse (PDW). Another advantage of PolyBase is that it supports T-SQL, which lets developers query external data in supported data stores transparently, irrespective of the storage architecture of the external data store.
PolyBase can be used to access data stored in Azure Blob Storage, Azure Data Lake Storage, or any Hadoop instance such as Azure HDInsight.
PolyBase uses an HDFS bridge to connect to an external data source such as Azure Blob Storage. The connection is bidirectional, so it can be used to move data at high speed in either direction between Azure Synapse Analytics and the external source.
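To make the connection step concrete, here is a minimal sketch of how an external data source is typically registered in a Synapse dedicated SQL pool. The object names (`BlobStorageCredential`, `SalesBlobStorage`, `CsvFormat`) and the bracketed placeholders are assumptions made up for illustration, and the example assumes authentication with a storage account access key; adjust it for your own environment.

```sql
-- A master key must exist before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Credential holding the storage account access key (placeholder values).
-- With access-key authentication the IDENTITY value is not used for
-- authentication, so any string works.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user',
     SECRET   = '<storage account access key>';

-- The external data source points PolyBase at the blob container.
CREATE EXTERNAL DATA SOURCE SalesBlobStorage
WITH (
    TYPE       = HADOOP,
    LOCATION   = 'wasbs://<container>@<storage account>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- The file format describes how the external files are laid out
-- (here: comma-separated text with a header row, an assumption).
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW        = 2
    )
);
```

The data source and file format are defined once and can then be reused by any number of external tables.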
Because it supports T-SQL natively, PolyBase comes in very handy when joining data stored in the SQL Data Warehouse (hosted on Azure Synapse Analytics) with data in an external source such as Azure Blob Storage. This eliminates the need to retrieve the external data separately and load it into the SQL Data Warehouse for further analysis.
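Such a join might look like the sketch below. It assumes that an external data source named `SalesBlobStorage` and a file format named `CsvFormat` have already been created, and that a regular table `dbo.DimCustomer` exists in the warehouse; all table, column, and path names are hypothetical.

```sql
-- External table: only the metadata lives in the warehouse,
-- the rows stay in the files under /webclicks/ in blob storage.
CREATE EXTERNAL TABLE dbo.WebClicksExternal (
    CustomerId INT,
    ClickDate  DATE,
    Url        NVARCHAR(400)
)
WITH (
    LOCATION    = '/webclicks/',
    DATA_SOURCE = SalesBlobStorage,
    FILE_FORMAT = CsvFormat
);

-- Join warehouse data with external data in a single T-SQL query.
SELECT c.CustomerName,
       COUNT(*) AS Clicks
FROM dbo.DimCustomer AS c
JOIN dbo.WebClicksExternal AS w
    ON w.CustomerId = c.CustomerId
GROUP BY c.CustomerName;
```

From the query's point of view the external table behaves like any other table, which is what makes the access transparent to developers.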
Let’s have a look at the use cases for PolyBase:
- Query data stored in Hadoop, Azure Blob Storage, or Azure Data Lake Store from Azure SQL Database or Azure Synapse Analytics – This eliminates the need to import data from the external source. Because PolyBase supports T-SQL, developers can query the external data with the language they already use.
- Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store – There is no need to install a third-party ETL tool; the import can be done with a few simple T-SQL queries.
- Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store – PolyBase supports exporting and archiving data to external data stores.
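The import and export use cases above can be sketched with `CREATE TABLE AS SELECT` (CTAS) and `CREATE EXTERNAL TABLE AS SELECT` (CETAS). This is an illustration for a Synapse dedicated SQL pool, not a definitive recipe: it assumes an external table `dbo.WebClicksExternal`, an external data source `SalesBlobStorage`, and a file format `CsvFormat` already exist, and every name, path, and date in it is hypothetical.

```sql
-- Import: read the external files in parallel and materialize them
-- as a regular distributed table inside the warehouse.
CREATE TABLE dbo.WebClicks
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT CustomerId, ClickDate, Url
FROM dbo.WebClicksExternal;

-- Export: write the result of a query out to external storage
-- and register it as a new external table (e.g. for archiving).
CREATE EXTERNAL TABLE dbo.WebClicksArchive
WITH (
    LOCATION    = '/archive/webclicks/',
    DATA_SOURCE = SalesBlobStorage,
    FILE_FORMAT = CsvFormat
)
AS
SELECT CustomerId, ClickDate, Url
FROM dbo.WebClicks
WHERE ClickDate < '2020-01-01';
```

Round-robin distribution is chosen here only because it is a simple default for a landing table; the right distribution depends on how the table is queried afterwards.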