
An overview of Azure Data Lake Analytics and U-SQL

February 10, 2021 by Rajendra Gupta

In this article, we explore Azure Data Lake Analytics and query data using U-SQL.

Introduction to Azure Data Lake Analytics (ADLA)

The Microsoft Azure platform supports big data technologies such as Hadoop, HDInsight, and data lakes. Traditionally, a data warehouse stores data from various data sources, transforms it into a single format, and analyzes it for decision making. Developers write complex queries that can take hours to retrieve data. Organizations are increasing their footprint in cloud infrastructure and leveraging cloud data warehouse solutions such as Amazon Redshift, Azure Synapse Analytics (formerly Azure SQL Data Warehouse), or Snowflake. These cloud solutions are highly scalable and reliable and can support your data storage and query-processing requirements.

The data warehouse follows the Extract-Transform-Load mechanism for data transfer.

  • Extract: Extract data from different data sources
  • Transform: Transform data into a specific format
  • Load: Load data into predefined data warehouse schema, tables

Extract-Transform-Load mechanism

A data lake does not require a rigorous schema or conversion of data into a single format before analysis. It stores data in its original format, such as binary, video, image, text, document, PDF, or JSON, and transforms it only when needed. The data can be structured, semi-structured, or unstructured.

Azure Data Lake

A few useful features of a data lake are:

  • It stores raw data in its original format
  • It does not have any predefined schema
  • You can store unstructured, semi-structured, and structured data in it
  • It can handle data volumes of petabytes, or even hundreds of petabytes
  • A data lake follows the schema-on-read approach, in which data is transformed as and when it is required

At a high level, the Azure data platform architecture looks like the image below. Image reference: Microsoft docs

  • Ingestion: Collect data from various data sources and store it in the Azure data lake in its original format
  • Storage: Store data in Azure Data Lake Storage, AWS S3, or Google Cloud Storage
  • Processing: Process data from the raw storage into a compatible format
  • Analytics: Perform data analysis using the stored and processed data. You can use Azure Data Lake Analytics (ADLA), HDInsight, or Azure Databricks

The Azure data platform architecture

Creating an Azure Data Lake Analytics (ADLA) Account

We need to create an ADLA account in your subscription to process data with it. Log in to the Azure portal using your credentials. In the Azure services, click on Data Lake Analytics.

create an ADLA account

In the New Data Lake Analytics account, enter the following information.

  • Subscription and Resource group: Select your Azure subscription and resource group, if it already exists. You can also create a new resource group from the Data Lake Analytics page
  • Data Lake Analytics Name: Specify a suitable name for the analytics service
  • Location: Select the Azure region from the drop-down
  • Storage subscription: Select the storage subscription from the drop-down list
  • Azure Data Lake Storage Gen1: Create a new Azure Data Lake Storage Gen1 account
  • Pricing package: Select a pay-as-you-go model or a monthly commitment, as per your requirement

Subscription and Resource group

Click on Review + Create, review your configurations, and create the Azure Data Lake Analytics account.

Review your configurations

It starts deploying the Azure resources, as shown below.

Deploying the Azure resources

As shown below, the deployment creates the Azure Data Lake Analytics account and the data lake storage.

data lake storage

Click on Go to Resource, and it opens the analytics account dashboard, as shown below.

account dashboard

Create a sample job for Azure Data Analytics

Azure Data Lake Analytics uses U-SQL as its query and processing language. U-SQL combines SQL with the C# programming language, so SQL Server database professionals can quickly become familiar with it.

In this example, we use a U-SQL script to define a dataset with a few rows and store the output in the Azure Data Lake Storage Gen1 account. It writes the output to a CSV file, Sampledata.CSV. A minimal sketch of such a script is shown below.
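The following sketch illustrates this pattern; the rowset values, column names, and output path are illustrative assumptions, not the exact script from the screenshots. It builds a small in-line rowset and writes it out with the built-in CSV outputter.

    // Define a small rowset with a few sample rows (illustrative values)
    @data =
        SELECT * FROM
            ( VALUES
                (1, "Laptop", "Electronics"),
                (2, "Shirt", "Clothing"),
                (3, "Chair", "Furniture")
            ) AS T(Id, Product, Category);

    // Write the rowset to a CSV file in the Data Lake Storage Gen1 account
    OUTPUT @data
        TO "/Output/Sampledata.csv"
        USING Outputters.Csv(outputHeader : true);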

Click on New Job in the analytics dashboard. Specify a job name and paste your U-SQL script.

In the job configuration, you can choose the minimum Azure Data Lake Analytics Units (AUs) that define the compute resources for the job. It also shows the estimated cost per minute of execution. For example, here it shows USD 0.03 per minute.
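For instance, at that rate, a job that runs for 10 minutes at the selected AU level would cost roughly USD 0.30; increasing the AUs raises the per-minute rate but can shorten the execution time.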

Create a sample job

Click on Submit. The job has four phases:

  • Preparing: It uploads the script to the Azure cloud, and compiles and optimizes it for execution
  • Queued: The job waits in this phase if resources are not available. Usually, if resources are available, it does not spend time in the Queued phase
  • Running: Job execution starts in the Azure Data Lake Analytics account
  • Done: Job execution has finished

The job has four phases

Once the job execution finishes, you see a job graph with various steps. Azure Data Lake Analytics divides a job into multiple pieces of work, and each piece is known as a vertex. The below image highlights the vertices and the job graph. Reference: Microsoft docs.

Vertex and Job graph

The vertex summary gives information about the number of vertices, the average execution time, and the number of rows written. A green vertex indicates success; you might also see orange (retried), red (failed), blue (running), or white (waiting) vertices.

We will cover more about these job statuses and vertices in an upcoming article. However, you can refer to the Microsoft documentation, Job View for Azure Data Lake Analytics, for more details.

job status

Once the process completes, click on the Data tab and navigate to the output. It shows the output CSV file.

the Data tab

You can preview the output file here, download it, or open it in Storage Explorer. As shown below, the CSV file contains the dataset we specified in the U-SQL script.

output file preview

Sample Scripts in Azure Data Lake Analytics

The Azure portal contains several examples for understanding Azure Data Lake Analytics and the U-SQL script for a specific task. Click on Sample scripts in the portal. At first, it asks for your permission to download the sample data into Azure storage.

Sample Scripts

It provides several basic tasks, such as Query a TSV file, Create database and table, Populate table, Query table, and Create rowset in a script.

Let’s click on Query a TSV file. It shows the analytics job and the U-SQL script for the job, along the lines of the sketch below.
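The sample follows the standard SearchLog pattern from the Microsoft documentation: extract the TSV file with the built-in TSV extractor and write it back out as TSV. The exact column list and paths in your portal may differ, so treat this as an approximation rather than the verbatim sample script.

    // Read the sample TSV file with the built-in TSV extractor
    @searchlog =
        EXTRACT UserId      int,
                Start       DateTime,
                Region      string,
                Query       string,
                Duration    int?,
                Urls        string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();

    // Write the rows back out as a TSV file
    OUTPUT @searchlog
        TO "/Output/SearchLog-output.tsv"
        USING Outputters.Tsv();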

Query a TSV file

Submit the analytics job, and once it completes, you can preview data in the Azure portal.

Submit the analytics job

Preview data

Create a Data Lake Analytics job to query a CSV file stored in Azure storage

In the previous example, we did not perform any data transformation. We read the source data and prepared an output in the CSV format.

For this example, download a sample CSV file from the URL

Sample CSV

Upload the files in the Azure storage associated with the data lake analytics account.

Query CSV file stored in the Azure data storage

Create a new analytics job and specify the U-SQL script.

Create a new analytics job

In this example, we apply a quick transformation to the extracted data. The script uses a WHERE clause to filter the records in the input CSV and writes the output to a CSV file.

We can divide the overall query into three sections (a combined sketch follows this list).

  • Extract: It uses the EXTRACT statement, which is similar to a DECLARE statement in T-SQL, where we define the columns and data types. For simplicity, I use the string type for all columns. It reads the CSV file with the predefined extractor, Extractors.Csv()

  • Transform: In this step, we access the columns and apply the transformation. In the command below, we filter the records that satisfy a specific condition in the WHERE clause

  • Output the transformed data in CSV format: Next, we save the output in CSV format and specify the directory and file name. It uses the OUTPUT statement in U-SQL, and the USING clause specifies the built-in outputter for a CSV file, Outputters.Csv()
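Putting the three sections together, a sketch of the full script might look like the following. The file paths, column names, and filter value are illustrative assumptions based on the description above, not the exact script from the screenshots.

    // Extract: define the columns (all as string for simplicity) and read the input CSV,
    // skipping the header row of the uploaded file
    @input =
        EXTRACT Id       string,
                Category string,
                Name     string
        FROM "/Input/SampleCSVFile.csv"
        USING Extractors.Csv(skipFirstNRows : 1);

    // Transform: filter the records with a WHERE clause on the category column (column 2)
    @filtered =
        SELECT Id, Category, Name
        FROM @input
        WHERE Category == "Clothing";

    // Output: write the filtered rows to a CSV file with the built-in outputter
    OUTPUT @filtered
        TO "/Output/FilteredData.csv"
        USING Outputters.Csv(outputHeader : true);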

Submit the job and view the progress. You can view the job graph on completion.

View the job graph on completion

It creates the output file, as shown in the data tab.

Output data tab

Below is the file preview; you can see that it filters records for the category defined in the WHERE clause (column 2).

Records for category

Conclusion

In this article, we explored Azure Data Lake Analytics and used U-SQL scripts to query data stored in the Azure data lake. The next articles in this series will explore more features of U-SQL for data analytics.

Table of contents

An overview of Azure Data Lake Analytics and U-SQL
Writing U-SQL scripts using Visual Studio for Azure Data Lake Analytics
Deploy Azure Data Lake Analytics database using the U-SQL scripts
Join database tables using U-SQL scripts for Azure Data Lake Analytics