This article is part 4 of the series for SQL Server 2019 Enhanced PolyBase. Let quickly recap the previous articles.
- Part 1: We installed SQL Server 2019 PolyBase feature along with Azure Data Studio and SQL Server 2019 preview extension to explore its features
- Part 2: In this part, we learned to create an External table using Azure Data Studio ‘External table wizard’ for the Oracle data source
- Part 3: We learned the useful features of External tables like joins and created an external table using t-SQL instead of the GUI mode for Oracle database in this series article
We have learned earlier that PolyBase in SQL Server 2019 Preview allows access to various data sources such as SQL Server, Oracle, MongoDB, Teradata, and ODBC based sources etc. Azure Data Studio SQL Server 2019 preview extension currently supports for SQL Server and Oracle data sources only from the External table wizard.
In this series, we will create an external table for SQL Server and explore some more features around it.
Launch Azure Data Studio and connect to the SQL Server 2019 preview instance. Right click on the database and launch ‘Create External Table’.
This opens up the wizard to create the external tables. Recently, I faced an issue where the wizard stuck in the ‘step 1’. Progress bar icon keeps rotating and does not show any error message or any progress.
After some time, we get the error message ‘Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.’ This is the general error message and does not point to clear error message.
I tried to close the wizard multiple times and launch it again but it remains the same. It does not show any error message to troubleshoot it further.
Later, during the investigation, I figured out that PolyBase services were in stopped condition.
- SQL Server PolyBase Data Management Service
- SQL Server PolyBase Engine
Now again launch the ‘Create External Table’ wizard in Azure Data Studio. This starts the wizard successfully. Therefore, monitor the service status before launching the wizard to avoid any issues.
External table for SQL Server
In this section, we will use the below source and destination instances.
- Source Instance (here we will create external table): SQL Server 2019 (Named instance – SQL2019)
- Destination Instance (External table will point here): SQL Server 2019 (Default instance – MSSQLSERVER)
Click on the ‘SQL Server’ in the data source type of wizard and proceed to the next step. In the next step, create the Database Master Key to secure the credentials used by the external data source. We should use a complex password with a combination of lower case, upper case, alphanumeric and special characters.
Go to the next step and create data source connection.
Server Name should in the format of [Instance Name IP Address].[Port]
This credential should have permission onto the SQL Server where we will point out external tables.
Connect to the instance, create a login, and provide read permission to the user on WideWorldImporters database.1234567CREATE LOGIN [DemoSQL2019] WITH PASSWORD=N'f3EzbtBSXu7iNaKdtRXd+soU0ab6Pwu6BSfMOI7jqms=',DEFAULT_DATABASE=[master], DEFAULT_LANGUAGE=[us_english], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFFGOCREATE USER [DemoSQL2019] FOR LOGIN [DemoSQL2019]GOALTER ROLE [db_datareader] ADD MEMBER [DemoSQL2019]GO
Now we can go further and see that no login failure issue occurred. Select the table from the database.
We see here that the mapping as below:
- Source Table: Sales.Invoices
- Destination table: dbo.invoices
It automatically selects the destination schema as dbo since sales schema does not exist in our database. Therefore, let us create the schema and refresh the schema list to show it here:
Select the sales schema from the drop-down. Click Next to view summary:
Click on Create to configure an external table. Once the external table is created, we can access the data from it.
Catalog views for PolyBase
We can view the external tables in using the catalog view sys.external_tables. It shows all the external tables in the current database:
SELECT execution_id, status,st.text, dr.total_elapsed_time
FROM sys.dm_exec_distributed_requests dr
cross apply sys.dm_exec_sql_text(sql_handle) st
ORDER BY total_elapsed_time DESC;
We can also get the information about the data source using the catalog view sys.external_data_sources.
Using below query, we can see the name of the data source, location (location contains database and instance IP address along with the instance port address):
SELECT name, location, type FROM sys.external_data_sources
Understand the script generated by the wizard
The Create an External table wizard can also be used to create scripts for the whole process. Let us understand the script generated by the Azure Data Studio for creating an external table in the above example by breaking the query into multiple parts.
Below query creates the Master Key Encryption
In this step, it created the database scoped credential.
Create external data source pointing to SQL Server. In this external data source query, we need to specify the location in the format of <vendor>://<server>[:<port>].
Since we are creating the external table for SQL Server, we need to specify the vendor as ‘sqlserver’.
We also need to specify the port address with the colon. For example, in below query, we specified port number as 5290.
In the below section, we will create an external table. We need to create an external table similar to the relational database table with the column properties. We also need to specify a location for the object along with the data source.
For example, in below query, we specify the location as [WideWorldImporters].[Sales].[Invoices] and DATA_SOURCE as [SQLServer]
PUSHDOWN in PolyBase
We normally use predicates in the query in order to get a subset of the rows from the table. This subset allows pulling the records based on the conditions defined with where clause. These predicates can be as following as per the docs.
- Binary comparison operators ( <, >, =, !=, <>, >=, <= ) for numeric, date, and time values.
- Arithmetic operators ( +, -, *, /, % ).
- Logical operators (AND, OR).
- Unary operators (NOT, IS NULL, IS NOT NULL).
In PolyBase, we can use pushdown to improve the performance of the query for the external table. Mostly, we use this feature for the scale-out cluster cases where we can see significant improvement of the query performance. In this example, we will be using the standalone PolyBase configuration.
When we create an external data source for external table, we have the option to specify the value for PUSHDOWN as ON or OFF. The default value for pushdown is ON. Therefore, we do not need to specify a pushdown value if we want to enable it. Using PUSHDOWN, we can choose to move the computation to source system or not.
The syntax for an external data source with pushdown is as below:
CREATE EXTERNAL DATA SOURCE [DataSourceName]
LOCATION = sqlserver://SqlServer,
-- PUSHDOWN = ON | OFF,
CREDENTIAL = Credentials
We have already created a data source with a default value (Pushdown=ON) for the external table pointing to another SQL Server instance. Therefore, we will run the query with predicate with and without pushdown. To disable, pushdown we can use the predicate OPTION (DISABLE EXTERNAL PUSHDOWN) in the query.
Similarly, while creating the external data source if we disabled the pushdown, we can enable it while running the query as OPTION (FORCE EXTERNALPUSHDOWN);
Let us run the query and see the difference in performance.
Execute query with predicate and enabling Pushdown: In this query, we do not specify OPTION (FORCE EXTERNALPUSHDOWN)since it is by default enabled in the data source.12345678910SELECT [InvoiceID],[CustomerID],[BillToCustomerID],[OrderID],[DeliveryMethodID],[ContactPersonID],[AccountsPersonID],[SalespersonPersonID],[PackedByPersonID] ,[InvoiceDate],[CustomerPurchaseOrderNumber],[IsCreditNote],[CreditNoteReason],[Comments],[DeliveryInstructions],[InternalComments],[TotalDryItems] ,[TotalChillerItems] ,[DeliveryRun],[RunPosition],[LastEditedBy],[LastEditedWhen]FROM [SQLShackDemo].[sales].[Invoices]where InvoiceID>250order by CustomerID
Query with predicate without pushdown: In this query, we disabled the pushdown with predicates OPTION (DISABLE EXTERNALPUSHDOWN):1234567891011SELECT [InvoiceID],[CustomerID],[BillToCustomerID],[OrderID],[DeliveryMethodID],[ContactPersonID],[AccountsPersonID],[SalespersonPersonID],[PackedByPersonID] ,[InvoiceDate],[CustomerPurchaseOrderNumber],[IsCreditNote],[CreditNoteReason],[Comments],[DeliveryInstructions],[InternalComments],[TotalDryItems] ,[TotalChillerItems] ,[DeliveryRun],[RunPosition],[LastEditedBy],[LastEditedWhen]FROM [SQLShackDemo].[sales].[Invoices]where InvoiceID>250order by CustomerIDOPTION(DISABLE EXTERNALPUSHDOWN)
We can see here the query without pushdown took 30.524 seconds while query with pushdown took 19.754 seconds so there is a significant performance improvement with this approach. PUSHDOWN allows moving computation source, which we can see improvement in performance.
In this latest article in our series, we have learned to create an external table for SQL Server data source with the Azure Data Studio Create external table wizard along with T-SQL as well. We also learned about the PushDown approach for computation queries. In the next series of the article, we will explore more on PolyBase for different data sources.
Table of contents
|Enhanced PolyBase SQL 2019 – Installation and basic overview|
|Enhanced PolyBase SQL 2019 – External tables for Oracle DB|
|Enhanced PolyBase SQL 2019 – External tables using t-SQL|
|Enhanced PolyBase SQL 2019 – External tables SQL Server, Catalog view and PushDown|
|Enhanced PolyBase SQL 2019 – MongoDB and external table|
- SUBSTRING, PATINDEX and CHARINDEX string functions in SQL queries - March 1, 2021
- Deploy Azure Data Lake Analytics database using the U-SQL scripts - February 23, 2021
- Join database tables using U-SQL scripts for Azure Data Lake Analytics - February 17, 2021