Rajendra Gupta

Join database tables using U-SQL scripts for Azure Data Lake Analytics

February 17, 2021 by

In this article, we will explore joining database tables using U-SQL scripts for Azure Data Lake Analytics.

Introduction

Azure Data Lake Analytics enables you to configure on-demand jobs using U-SQL scripts. These scripts can transform data and extract information without a predefined schema. You can extract data in any format stored in Azure data lake storage, Azure blob storage, Azure SQL database, or Azure SQL on virtual machines. In the article, Deploy Azure Data Lake database using the U-SQL scripts, we explored that ADLA uses a database structure similar to a SQL Server database. It is a container having folders for various objects such as tables, schema, procedures, packages, credentials, data sources. By default, it uses a master database and DBO schema for creating tables.

Azure Data Lake Analytics uses a database structure

We can also design schema, tables in the ADLA database similar to the SQL Server database using the T-SQL script. We use SQL joins to fetch data from multiple tables. It also supports the join operations in the Azure data lake analytics database similar to a SQL database.

We will explore these joins in this article.

Prerequisites

You can follow articles (see TOC at the bottom) to prepare the following environment:

  • Create the Azure Web portal credentials and configure the Azure Data Lake account
  • Install Visual Studio with Azure Data Lake and Stream Analytics Tools
  • Connect Visual Studio with your Azure subscriptions

Joins in U-SQL scripts for Azure Data Lake Analytics (ADLA)

The ADLA database supports the following joins:

  • Inner Join
  • Full | Left | Outer Join
  • Cross Join
  • Left | Right semijoin
  • Left | Right antisemijoin

Before we move forward and explore these joins, we need to prepare a database environment for demonstration purposes. In this article, I use the [Local-Machine] environment for script execution.

Create a new project in the Visual Studio for U-SQL script and do the following:

  • Create a database [ADLADEMODB]

    Submit the Job, and it creates a new database, as shown below

    Create database

  • Create a database schema [Lab] in the [ADLADemoDB]

    Azure Data Lake Analytics: Create a database schema

  • Create the tables [Lab].[Employee] and [Lab].[Departments] in the [ADLADemoDB]

    The table structure looks familiar to the SQL table. You can explore syntax for Azure Data lake Analytics in this article, Deploy Azure Data Lake database using the U-SQL scripts

    We have tables, as shown in the below image

    Create sample tables

  • Insert sample data into the [Lab].[Employee] and [Lab].[Departments]

      For inserting sample data, we extract data from the CSV file stored in the local file system. The high-level steps for the below script are:
    • Define a row variable and extract data into it using the EXTRACTOR.CSV() function
    • Insert data into a SQL table from stored the row-level variable

    To be familiar with the below U-SQL script, refer to Table of contents at the bottom

    The job graph for the above insert statement is as below:

    Azure Data Lake Analytics: Insert sample data

    Right-click on the table in Visual Studio and choose Preview

    Preview data

  • Similarly, insert data in the [Lab].[Departments] table

    The job graph for the above insert script is as below:

    job graph

    You can preview the records to validate the records in the table

    preview the records

Explore Joins in U-SQL scripts

The inner join, Outer join ( Left\Right\Full) and Cross join works similar to the SQL Server joins. You can understand and explore them in the article, SQL Join types overview and tutorial.

However, there is a small difference in the way you write join for the Azure Data Lake Analytics job. For example, let’s say we want to perform an inner join that returns the matching rows between both tables.

Azure Data Lake Analytics: Explore Joins

The inner join script looks like the below for ADLA:

Important points regarding the above script are:

  • The AS keyword is mandatory for table alias. For example, in the script, we used the alias using the Lab.Employee AS E1
  • For the table alias:
    • Either use a single letter such as Lab.Employee AS E
    • Use a numeric value after the first letter, such as Lab.Employee AS E1
    • If you use multiple characters, make sure the only first letter of an alias to be in capital else your script would fail. For example, Lab.Employee AS Ey
  • In the t-SQL script, we use the equality operator (=) in the join condition. Here, we use the C# equality symbol (==). For example, ON D1.DepartID == E1.DepartID;
  • It is a case-sensitive language
    • Use the keywords such as SELECT, FROM, AS, OUTPUT, USING in capital letters
    • Use the column names and table names similar to what you define while creating the table structure. For example, if you defined the table name as [Employee], you cannot refer to it as [EMPLOYEE]

Submit the script to perform the inner join between ADLA tables and write the output in a CSV file. Browse to your directory and open the CSV file to verify the join output.

Output data

In the job graph, it shows the input of 2 streams ( Employee and Department) table and output to the SQLJoin1.CSV file.

input of 2 streams

Semijoins

Semijoins does not perform a complete join. It filters data based on the rows in another row set. For example, in the t-SQL scripts, we write the semijoin script like below.

We have two types of semijoin: left or right semijoin.

  • Left semijoin: A left semijoin returns rows in the left rowset having a matching row in the right rowset. If we specify only semijoin keyword, it also refers to the left semijoin
  • Right semijoin: A right semijoin returns rows in the right rowset having a matching row in the left rowset

In the below U-SQL script, we define a left semijoin between @Product and @Category row variable.

Data in the output CSV file is as below. We do not have a row for the Ring product because it does not have any matching row in the @Category variable.

Semijoins output

Similarly, we use the below U-SQL script for the right semijoin.

It returns the product id 1, 2 in the output because it has a matching row in the left rowset.

right semijoin

Anti-Semijoin

As its name suggests, Anti-semijoin works opposite to semijoin. I

In the t-SQL terms, we saw semijoin returns data from Table A that is available in Table B.

The anti-semijoin, return data from Table A that is not available in table B. Similar to the semijoin, it has two variants left anti-semijoin (antisemijoin keyword) and right anti-semijoin.

The below query performs Left anti-semijoin between the @Product and @category row variable:

It returns the row from the @Product variable that is not available in the @Category variable. It returns the unmatched row – Ring as shown below.

Anti-Semijoin

Joins using Rowsets and Azure Data Lake Analytics tables

In the previous examples, we joined two tables in the ADLA database. Suppose you have another data in the CSV file, and you want to join it with the ADLA database tables. You do not require a predefined schema, the table in the U-SQL script. You can use the row set variable and join it directly with the table.

For example, in the below script, we do the following tasks.

  • Extract data into @employee rowset variable from the CSV file stored locally
  • Join the @employee rowset variable with the [Lab].[Departments] table
  • Apply transformation and save the join query output in a TSV format

In the job graph, you can validate the two input data streams.

  • employee.csv file
  • Database table [Lab].[Departments]

Output data stored in the SQLJoin1.TSV format

Azure Data Lake Analytics: Joins using Rowsets

Alternatively, you can write the EXTRACT statement directly in the FROM clause. For example, previously, we extract data into the row variable @employee and then used it in the JOIN clause.

Now, in the below script, we did not use the row variable and directly used the EXTRACT statement in the FROM clause.

Conclusion

In this article, we performed a U-SQL script join between Azure Data Lake Analytics tables, rowset variables. The query syntax for joins looks similar to T-SQL SQL Server joins. Therefore, you can quickly learn these joins and apply them data extraction, transformations for ADLA.

Table of contents

An overview of Azure Data Lake Analytics and U-SQL
Writing U-SQL scripts using Visual Studio for Azure Data Lake Analytics
Deploy Azure Data Lake Analytics database using the U-SQL scripts
Join database tables using U-SQL scripts for Azure Data Lake Analytics
Rajendra Gupta
Latest posts by Rajendra Gupta (see all)
168 Views