SBS01 : Importing data into Azure Synapse Analytics using PolyBase

In the previous post, we learnt the basics of Polybase and how it makes data ingestion much faster. In this post we are going to look at the steps that we need to perform to ingest data into Azure Synapse Analytics.

Step 1 : Create a Database Master key

CREATE MASTER KEY; 
GO

Explanation: Creating a database master key helps us encrypt the login credentials later.

Step 2: Create external data source

CREATE EXTERNAL DATA SOURCE <Data _source_ name>
	WITH
(    LOCATION      = '<prefix>://<path>'
     [, CREDENTIAL = <database scoped credential> ]
    , TYPE = HADOOP
)

Explanation:  LOCATION parameter is the path of the stored data that we want to ingest

Step 3:  Create external file format

CREATE EXTERNAL FILE FORMAT TextDelimFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT
  , FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
  , STRING_DELIMITER = ''
  , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
  , USE_TYPE_DEFAULT = FALSE

) ;

Explanation: The above code creates an external file format  for CSV file with the “|” symbol as the column separator (FIELD_TERMINATOR). Following format types are available PARQUET, ORC (Optimized Row Columnar), RCFILE, DELIMITEDTEXT (CSV), JSON. Arguments will change according to the file format selected.

Step 4: Create external table

CREATE EXTERNAL TABLE [dbo].[DimItemExternal]
 ( [ItemKey] [int] NOT NULL,
   [ItemType] nvarchar NULL,
   [ItemName] nvarchar NULL )
WITH
(
    LOCATION='/DimItem/' 
  , DATA_SOURCE = AzureDataLakeStore 
  , FILE_FORMAT = TextFileFormat 
  , REJECT_TYPE = VALUE
  , REJECT_VALUE = 0

) ;

Explanation: The above example connects to the CSV file stored in Azure Data Lake Store. REJECT_VALUE and REJECT_TYPE parameters are provided for error handling. REJECT_VALUE is the error count and the REJECT_TYPE can be either VALUE or PERCENTAGE.  These will be discussed further in a future post about Error Handling.

Step 5: Create Table as Select (CTAS)

CREATE TABLE [dbo].[DimItem]
WITH
(
     DISTRIBUTION REPLICATE, 
     HEAP
)
AS 
SELECT DISTINCT 
       [ITEMNAME]
     , [ITEMTYPE]
FROM [dbo].[DimItemExternal]	

Explanation: CTAS command creates a table and imports the results of a Transact-SQL SELECT statement from the external table we created in the previous step. One of the important parameters to be specified with CTAS is the table distribution (which is REPLICATE in the above example). To know more about Azure Synapse Analytics table distributions, see this post.

References:

2 thoughts on “SBS01 : Importing data into Azure Synapse Analytics using PolyBase

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: