In this article, we will learn how to integrate Azure Purview and Azure Synapse Analytics capabilities to access data catalog assets hosted in Purview from Azure Synapse.
Data exists in various formats, on various types of repositories, on different clouds as well as on-premises. As the data landscape grows, two of the most common capabilities required to manage data and extract value from it are data cataloging and data warehousing. Data cataloging, or metadata cataloging, makes it possible to keep track of how metadata evolves, and it acts as a guide for the data pipelines that move data from source to destination. Data warehousing provides an approach and the capabilities to process large volumes of data efficiently when data from across the enterprise is collated to derive insights. If these two capabilities are not integrated, the teams managing them have no view of each other’s landscape. Typically, the data warehouse is one of the biggest consumers of a data catalog. Azure provides Purview for data cataloging and governance and Azure Synapse Analytics for data warehousing. In this article, we will see how to integrate these two capabilities to access data catalog assets hosted in Azure Purview from Azure Synapse.
As we are going to work with Azure Purview as well as Azure Synapse, we need a few things in place before we can start configuring these tools to integrate with each other. It is assumed that one has the required privileges to administer and operate Purview and Azure Synapse services on their Azure account.
First, we need an instance of Purview, which provides access to the Purview Studio tool. Using this tool, some data repositories should be cataloged so that a search for data assets returns results. A good example would be creating an Azure SQL Database with the built-in sample data and cataloging it with Purview. It is assumed that this Azure Purview setup is already in place and data assets are already cataloged.
Next, we need an instance of Azure Synapse Workspace created, which would provide access to the Synapse Studio tool. This is the primary administrative console that facilitates operating the Synapse pool. Once this setup is in place, it would look as shown below and with this, we are ready to start our exercise of integrating Azure Synapse with Azure Purview.
Configuring Azure Purview for integrating with Azure Synapse Analytics
Open Azure Synapse Studio by clicking on the Open Synapse Studio link on the dashboard page of the Azure Synapse Workspace. Click on the Manage blade and you will see Azure Purview (Preview) under the External connections section as shown below. This feature was still in Preview at the time of writing. It allows us to integrate Synapse with Purview.
As shown above, we need to start by connecting our Azure Purview account here. Click on the button named Connect to a Purview account. A screen pops up as shown below. If the Azure Purview account is under the same Azure subscription in which the Azure Synapse Analytics account was created, selecting the “From Azure Subscription” option will list the Purview account name as shown below.
Select the Purview account and click on the Apply button. This registers the Purview account with the Synapse workspace and links the two services. Once done, you will receive a successful registration confirmation as shown below.
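The connection made through the Studio can also be pictured as a small piece of configuration. The following is a minimal sketch, not the documented procedure: it assumes the link is stored on the workspace's Azure Resource Manager resource under a `purviewConfiguration.purviewResourceId` property, which should be verified against the current ARM reference. The subscription ID, resource group, and account name are hypothetical placeholders, and no request is sent.

```python
import json


def purview_resource_id(subscription_id: str, resource_group: str, account_name: str) -> str:
    """Build the ARM resource ID of a Purview account."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Purview/accounts/{account_name}"
    )


def workspace_patch_body(purview_id: str) -> dict:
    """Return a request body that points a Synapse workspace at a Purview account.

    Assumption: the workspace resource accepts the property path
    properties.purviewConfiguration.purviewResourceId.
    """
    return {"properties": {"purviewConfiguration": {"purviewResourceId": purview_id}}}


if __name__ == "__main__":
    # All three values below are made-up placeholders.
    rid = purview_resource_id(
        "00000000-0000-0000-0000-000000000000", "demo-rg", "demo-purview"
    )
    print(json.dumps(workspace_patch_body(rid), indent=2))
```

The sketch only builds the payload; in practice the Apply button in Synapse Studio performs the registration for us.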
The benefit of connecting Azure Synapse with Azure Purview is that we can access the data assets from the catalog right in Azure Synapse Studio, and also use this information to initiate different actions supported by Synapse. To start accessing the Purview catalog from Synapse Studio, navigate to the Data tab and click on the search bar at the top of the screen as shown below. The search bar has a drop-down with two options: Workspace and Purview. Make sure to select Purview as shown below. Now we are ready to start searching the catalog for data assets.
Type the full or partial name of the database object that we intend to search for as shown below, and a list of database objects that match the search criteria is displayed. These search results should not be confused with the database objects hosted in the Synapse pools that are part of the Synapse Workspace. As we are searching the Purview catalog, the results consist only of data assets held in the connected Purview account. If we want to search for items within the workspace, we need to select the Workspace option in the drop-down, which lists objects in Azure Synapse instead.
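The same kind of keyword search can be issued against the catalog programmatically. The sketch below only builds the request and does not send it. It assumes the Purview catalog exposes a search endpoint at `/catalog/api/search/query` that accepts a JSON body with `keywords` and `limit` fields; the account name and the `api_version` value are placeholders to be taken from the current Purview REST documentation.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_search_request(
    account_name: str, keywords: str, api_version: str, limit: int = 10
) -> Tuple[str, Dict[str, object]]:
    """Return the (url, json_body) pair for a catalog keyword search.

    Assumptions: the endpoint path and the body field names follow the
    Purview catalog search API; check them against the current reference.
    """
    cleaned = keywords.strip()
    if not cleaned:
        raise ValueError("keywords must not be empty")
    url = (
        f"https://{account_name}.purview.azure.com/catalog/api/search/query?"
        + urlencode({"api-version": api_version})
    )
    body: Dict[str, object] = {"keywords": cleaned, "limit": limit}
    return url, body


if __name__ == "__main__":
    # "demo-purview" and "<api-version>" are hypothetical placeholders.
    url, body = build_search_request("demo-purview", "Customer", "<api-version>")
    print(url)
    print(body)
```

A real call would also need an Azure AD bearer token in the `Authorization` header, which is omitted here.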
The results are divided into two panes: the filters pane and the results pane. The filters pane shows the data asset type, classification, and other such filters related to cataloged data assets. The results that meet the filter criteria are shown in the right pane. Each result shows the name of the data object, the type of repository that holds it, and its address.
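To make the relationship between the two panes concrete, here is a small local simulation of what the filters do. It does not call Purview; the `Asset` fields and the sample rows are hypothetical stand-ins for cataloged assets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Asset:
    name: str
    asset_type: str  # e.g. the kind of object, such as a table or view
    source: str      # type of repository holding the object
    address: str     # where the object lives
    classifications: List[str] = field(default_factory=list)


def filter_assets(
    assets: List[Asset],
    asset_type: Optional[str] = None,
    classification: Optional[str] = None,
) -> List[Asset]:
    """Keep only assets that satisfy every filter that was supplied."""
    results = []
    for asset in assets:
        if asset_type is not None and asset.asset_type != asset_type:
            continue
        if classification is not None and classification not in asset.classifications:
            continue
        results.append(asset)
    return results


# Hypothetical catalog contents, for illustration only.
CATALOG = [
    Asset("Customer", "table", "Azure SQL Database", "demo-server/salesdb", ["Person Name"]),
    Asset("Address", "table", "Azure SQL Database", "demo-server/salesdb", []),
    Asset("vCustomerSummary", "view", "Azure SQL Database", "demo-server/salesdb", ["Person Name"]),
]

if __name__ == "__main__":
    for asset in filter_assets(CATALOG, asset_type="table", classification="Person Name"):
        print(asset.name, "|", asset.source, "|", asset.address)
```

Each filter narrows the list independently, so leaving a filter unset returns every asset for that dimension, which mirrors how unchecked filters behave in the search results.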
Let’s say that we intend to explore the details of a particular data asset to understand whether it is suitable as a source of data for data warehousing. We can click on the item in the results pane and its details are displayed as shown below. In this case, it’s an Azure SQL Database table, so details like Schema, Lineage, Data Classification, Related database objects, etc. are shown. On the right side of this screen, we can find the hierarchy to which this database object belongs.
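The hierarchy of a table (server, database, schema, table) can also be derived from the asset's fully qualified name. The sketch below assumes the name takes the form `mssql://<server>/<database>/<schema>/<table>`, which is an assumption about how Azure SQL tables are named in the catalog; the sample name is made up.

```python
from typing import List, Tuple
from urllib.parse import urlparse

LEVELS = ("database", "schema", "table")


def parse_hierarchy(qualified_name: str) -> List[Tuple[str, str]]:
    """Split a qualified name into (level, value) pairs, outermost first.

    Assumption: the name looks like mssql://server/database/schema/table.
    """
    parsed = urlparse(qualified_name)
    if not parsed.netloc:
        raise ValueError(f"no server found in {qualified_name!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) > len(LEVELS):
        raise ValueError(f"unexpected extra path segments in {qualified_name!r}")
    hierarchy = [("server", parsed.netloc)]
    hierarchy.extend(zip(LEVELS, parts))
    return hierarchy


if __name__ == "__main__":
    # Hypothetical qualified name, for illustration only.
    name = "mssql://demo-server.database.windows.net/salesdb/SalesLT/Customer"
    for depth, (level, value) in enumerate(parse_hierarchy(name)):
        print("  " * depth + f"{level}: {value}")
```

A name with fewer segments, such as one that stops at the database, simply yields a shorter hierarchy.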
Another interesting and useful feature of these results can be found in the Related tab. At times, the database object we find may not be the exact match for what we need. Finding objects that are similar or related to the object being searched for improves the chances of finding the database object of interest. The Related tab shows database objects like databases, schemas, tables, or views depending on the hierarchy level selected, as shown below.
Once the data object of interest has been discovered, the next step is to take corresponding actions like creating a linked service, an integration dataset, or a new data flow to source the data from the corresponding data repository. The Connect and Develop menu items provide links to initiate such actions as shown below. Clicking on these links opens a new pop-up window or wizard in which the details of the data source and the data object are already pre-populated. We can provide the credentials, build the corresponding artifact in Azure Synapse, and start sourcing the data from the targeted object.
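To give a sense of what the wizard pre-populates, the sketch below assembles a linked service definition from the details of a discovered Azure SQL Database asset. The overall JSON shape (`name`, `properties.type`, `properties.typeProperties.connectionString`) and the connection string format are assumptions based on the usual linked service layout; the server, database, and service names are hypothetical, and credentials are deliberately left out.

```python
import json


def linked_service_definition(name: str, server: str, database: str) -> dict:
    """Build an illustrative linked service definition for an Azure SQL Database.

    Assumption: the definition follows the common linked service JSON layout.
    Credentials are intentionally omitted and would be supplied separately.
    """
    if not server or not database:
        raise ValueError("server and database are required")
    connection_string = f"Server=tcp:{server},1433;Database={database};Encrypt=True;"
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {"connectionString": connection_string},
        },
    }


if __name__ == "__main__":
    # Hypothetical values standing in for what the catalog asset provides.
    definition = linked_service_definition(
        "ls_salesdb", "demo-server.database.windows.net", "salesdb"
    )
    print(json.dumps(definition, indent=2))
```

Because the server and database come straight from the catalog entry, the only information we still need to enter is the authentication details.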
The benefit of this integration is that we no longer need to switch between two sets of services, request access to a catalog that may be maintained by a data steward or data quality team, or port details back and forth between Azure Purview and Azure Synapse. The built-in integration eliminates all this overhead and provides the convenience of a catalog right within the operational console of a data warehousing environment.
In this article, we started with an instance of Azure Synapse and an instance of Azure Purview in which data was already cataloged. We connected the Purview account to the Synapse workspace, searched for data assets in the Purview catalog using Synapse Studio, and learned how to initiate actions in Synapse Studio based on the data asset of choice.