Prashanth Jayaram
Create mongoDB external table

Data Virtualization with MongoDB using PolyBase in SQL Server 2019

October 9, 2020 by

In this article, you’ll learn the approach to integrate MongoDB data source using data virtualization technique in SQL Server 2019. In this article, you can see how SQL Server 2019 provides a platform to create a modern enterprise data hub using data virtualization technology and the PolyBase technique.

  1. Discuss Data Virtualization
  2. Pre-requisite to setup MongoDB
  3. Set up MongoDB external connection in SQL Server 2019
  4. Many more…

Introduction

The advent of Data virtualization in SQL Server 2019 allows us to solve modern and complex data challenges. Data virtualization with PolyBase in SQL Server 2019 is used as a data hub, and you can directly query the data from several heterogeneous data sources. These data sources include Azure Managed Instance, Oracle, Teradata, SAP HANA, MongoDB, Hadoop clusters, Cosmos DB, and SQL Server. We can query the data source using T-SQL and without separately installing driver software.

The data virtualization in SQL Server 2019 is an improvised solution to the ETL process. The other advantage of Data virtualization is that it allows the integration of data from different sources such as Azure MI, SQL Server, MongoDB, Oracle, DB2, Cosmos DB, and Hadoop-Distributed-File-System (HDFS) without the much data movement around the source and destination. This process is possible with the advent of PolyBase connectors.

    Note:
  • Using T-SQL, we can query heterogeneous data sources using PolyBase connectors. This provides the bridge to query the data from external data sources such as SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources with external tables
  • It also supports the UTF-8 encoding format

Get started:

In this section, you will learn how to create secure data access from the underlying data source.

In this case, PolyBase uses the security model of the MongoDB model to access the data. In most cases, we need permission to read the data. However, the credentials used to read the data and it is stored inside the PolyBase data hub.

To set-up data virtualization, follow the below steps:

  1. Setup database master key
  2. Create database-scoped credentials
  3. PolyBase external tables
  4. Configure External data sources

To configure database virtualization, select the database. Right-click the database and select Create External Table that starts the data virtualization wizard. 

  1. Select a data source
  2. Create a database master key

In this section, we will see how to create a database master key. The master key is created inside the SQL Server database and it acts as a data hub.

The master key is providing a secure way to read data using the credentials in the external data source. It is always recommended to choose a complex password for the master key. In addition, use the BACKUP MASTER KEY command to back up the master key.

    Note:
  1. You can have one master key in a database instance
  2. Use a secure location to backup the master key
  3. If you already have the master key created in your SQL instance—re-use it. The same master key is used to secure data access for data virtualization
  4. To see whether the master key is already created run the following query

    Display Sysmetric_keys details

Syntax to create a database master key is as follows:

Database scoped credentials

Database scoped credentials provide the abstraction layer to secure the model to access the data from the data source. SQL Server 2019 supports the following security models when using PolyBase for data virtualization:

  1. Proxy authentication
  2. Kerberos authentication
  3. Storage access key—shared secret key

The following syntax is used to provide SQL Server with the security context for a connection to a MongoDB external data source. In addition, you will see a shared secret key in use to create a database scoped credential.

  • Note: The identity refers to the username of the MongoDB external data source

To validate database scoped credentials in the datahub run the following T-SQL:

Display database scope credentials

Configure External data sources

External data sources provide a path to virtualize data from other data sources. It collects all the meta-data to access data from the single database object.

Each external data source provides the below options required to access the external source:

  • Location
  • port
  • credential
  • connectivity

The LOCATION parameter defines the communication protocol, server name and port for connectivity.

In this case, the communication protocol is mongodb and server is 127.0.0.1(Localhost) and port is 27017

The CREDENTIAL argument uses the database scoped credential MyMongoDBDemo2019. The external data source will use this database-scoped credential to query the target data sources

  • Note: The Connection_options and Pushdown parameters will be explained later in the article
  • To validate external data sources run the following T-SQL:

    Display external_data_sources

    How to map an external data table

    In this section, we will see how to create external views. On selecting the parent tables, the child tables are also referred by default. In addition, you have the mapping table where you modify the columns to match the target data types. This also gives you the option to modify the name of the external table.

    Let us look at the details of the MongoDB instance. Create the user using the following commands. I will not go in-depth on MongoDB concepts.

    how to connect mongoDB

    PolyBase external tables

    External tables provide SQL Server with the schema to reason over data in the external data source.

    Error Message:

    The T-SQL create external table statement errors out with the following details. You can see that the TLS handshake failed error reported. This can be handled in two ways. You can manage it at the Mongo DB startup parameters or you can turn off the SSL parameter or you can turn off the SSL=false in the T-SQL statement.

    “Msg 105082, Level 16, State 1, Line 16
    105082;Generic ODBC error: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client:
    No suitable servers found (serverSelectionTryOnce set):
    [TLS handshake failed: error:00000000:lib(0):func(0):reason(0) calling ismaster on ‘127.0.0.1:27017’]
    (Error Code: 13053) Additional error <2>:
    ErrorMsg: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: No suitable servers found (serverSelectionTryOnce set)
    : [TLS handshake failed: error:00000000:lib(0):func(0):reason(0) calling ismaster on ‘127.0.0.1:27017’]
    (Error Code: 13053), SqlState: HY000, NativeError: 110 “

    You may receive this error due to the SSL issue. Now, the CONNECTION_OPTIONS parameter plays a role to connect MongoDB data source.

    In this case, set the connection_options to ssl=false and then run the create external statement.

    The CONNECTION_OPTION setting suppresses the connection over SSL, which is a default requirement for communicating with many other data sources. I will discuss more about it in the next article.

    Let us run create external table statement again:

    We can see in the below snippet, the external table created successfully.

    Create mongoDB external table

    Let us validate the external table using the following T-SQL:

    Display the external tables

    Now, Let us browse object explorer to validate the existence of the External Tables and External Data Sources.

    Display external employee table

    Finally, query the external table

    MongoDB data validation in SSMS and MongoDB console

    Summary

    In this article, we discussed how to set up a PolyBase in SQL Server for MongoDB external data source.

    To summarize the steps real quick:

    The data virtualization framework provides a platform to eliminate the use of redundant data copies. To maintain the physical copy always requires additional storage and hardware resources. It incurs to increase an additional overhead on costs. Using the data virtualization technique, organizations can create an abstract layer to create a virtual data copy from a heterogeneous or homogeneous data repositories. The virtual copies define the soft external table links to the physical data repository. As I mentioned, the data virtualization technique not only avoids extra storage but it provides strong consistency between the source and the target in the data retrieval process. In addition, it addresses the data issues which in turn improves the data quality and improves the data query performance.

    That’s all for now. I hope you like this article. Stay tuned for more updates…

    Prashanth Jayaram
PolyBase, SQL Server 2019

About Prashanth Jayaram

I’m a Database technologist having 11+ years of rich, hands-on experience on Database technologies. I am Microsoft Certified Professional and backed with a Degree in Master of Computer Application. My specialty lies in designing & implementing High availability solutions and cross-platform DB Migration. The technologies currently working on are SQL Server, PowerShell, Oracle and MongoDB. View all posts by Prashanth Jayaram

168 Views