In the previous post, we learnt the basics of Polybase and how it makes data ingestion much faster. In this post we are going to look at the steps that we need to perform to ingest data into Azure Synapse Analytics.
Step 1 : Create a Database Master key
CREATE MASTER KEY;
GO
Explanation: Creating a database master key helps us encrypt the login credentials later.
Step 2: Create external data source
CREATE EXTERNAL DATA SOURCE <Data _source_ name>
WITH
( LOCATION = '<prefix>://<path>'
[, CREDENTIAL = <database scoped credential> ]
, TYPE = HADOOP
)
Explanation: LOCATION parameter is the path of the stored data that we want to ingest
Step 3: Create external file format
CREATE EXTERNAL FILE FORMAT TextDelimFormat
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
) ;
Explanation: The above code creates an external file format for CSV file with the “|” symbol as the column separator (FIELD_TERMINATOR). Following format types are available PARQUET, ORC (Optimized Row Columnar), RCFILE, DELIMITEDTEXT (CSV), JSON. Arguments will change according to the file format selected.
Step 4: Create external table
CREATE EXTERNAL TABLE [dbo].[DimItemExternal]
( [ItemKey] [int] NOT NULL,
[ItemType] nvarchar NULL,
[ItemName] nvarchar NULL )
WITH
(
LOCATION='/DimItem/'
, DATA_SOURCE = AzureDataLakeStore
, FILE_FORMAT = TextFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
) ;
Explanation: The above example connects to the CSV file stored in Azure Data Lake Store. REJECT_VALUE and REJECT_TYPE parameters are provided for error handling. REJECT_VALUE is the error count and the REJECT_TYPE can be either VALUE or PERCENTAGE. These will be discussed further in a future post about Error Handling.
Step 5: Create Table as Select (CTAS)
CREATE TABLE [dbo].[DimItem]
WITH
(
DISTRIBUTION REPLICATE,
HEAP
)
AS
SELECT DISTINCT
[ITEMNAME]
, [ITEMTYPE]
FROM [dbo].[DimItemExternal]
Explanation: CTAS command creates a table and imports the results of a Transact-SQL SELECT statement from the external table we created in the previous step. One of the important parameters to be specified with CTAS is the table distribution (which is REPLICATE in the above example). To know more about Azure Synapse Analytics table distributions, see this post.
References:
- https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=sql-pool
- https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-as-select-azure-sql-data-warehouse?view=azure-sqldw-latest
- https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15
- https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=azure-sqldw-latest
- https://www.youtube.com/watch?v=GFJnpWufuhU
2 thoughts on “SBS01 : Importing data into Azure Synapse Analytics using PolyBase”