Azure Data Factory Interview Questions and Answers

In this article, we will discuss a number of questions about the Azure Data Factory service that you may be asked when applying to an Azure Data Engineer role.

Q1: Briefly describe the purpose of the ADF Service

ADF is used mainly to orchestrate the data copying between different relational and non-relational data sources, hosted in the cloud or locally in your datacenters. Also, ADF can be used for transforming the ingested data to meet your business requirements. It is ETL, or ELT tool for data ingestion in most Big Data solutions.

For more information, check Starting your journey with Microsoft Azure Data Factory

Q2: Data Factory consists of a number of components. Mention these components briefly

Pipeline: The activities logical container
Activity: An execution step in the Data Factory pipeline that can be used for data ingestion and transformation
Mapping Data Flow: A data transformation UI logic
Dataset: A pointer to the data used in the pipeline activities
Linked Service: A descriptive connection string for the data sources used in the pipeline activities
Trigger: Specify when the pipeline will be executed
Control flow: Controls the execution flow of the pipeline activities

For more information, check Starting your journey with Microsoft Azure Data Factory

Q3: What is the difference between the Dataset and Linked Service in Data Factory?

Linked Service is a description of the connection string that is used to connect to the data stores. For example, when ingesting data from a SQL Server instance, the linked service contains the name for the SQL Server instance and the credentials used to connect to that instance.

Dataset is a reference to the data store that is described by the linked service. When ingesting data from a SQL Server instance, the dataset points to the name of the table that contains the target data or the query that returns data from different tables.

For more information, check Starting your journey with Microsoft Azure Data Factory

Q4: What is Data Factory Integration Runtime?

Integration Runtime is a secure compute infrastructure that is used by Data Factory to provide the data integration capabilities across the different network environments and make sure that these activities will be executed in the closest possible region to the data store.

For more information, check Copy data between Azure data stores using Azure Data Factory

Q5: Data Factory supports three types of Integration Runtimes. Mention these supported types with a brief description for each

Azure Integration Runtime: used for copying data from or to data stores accessed publicly via the internet
Self-Hosted Integration Runtime: used for copying data from or to an on-premises data store or networks with access control
Azure SSIS Integration Runtime: used to run SSIS packages in the Data Factory

For more information, check Copy data between Azure data stores using Azure Data Factory

Q6: When copying data from or to an Azure SQL Database using Data Factory, what is the firewall option that we should enable to allow the Data Factory to access that database?

Allow Azure services and resources to access this server firewall option.

Q7: If we need to copy data from an on-premises SQL Server instance using Data Factory, which Integration Runtime should be used and where should it be installed?

Self-Hosted Integration Runtime should be installed on the on-premises machine where the SQL Server instance is hosted.

For more information, check Copy data from On-premises data store to an Azure data store using Azure Data Factory.

Q8: After installing the Self-Hosted Integration Runtime to the machine where the SQL Server instance is hosted, how could we associate the SH-IR created from the Data Factory portal?

We need to register it using the authentication key provided by the ADF portal.

For more information, check Copy data from On-premises data store to an Azure data store using Azure Data Factory

Q9: What is the difference between the Mapping data flow and Wrangling data flow transformation activities in Data Factory?

Mapping data flow activity is a visually designed data transformation activity that allows us to design a graphical data transformation logic without the need to be an expert developer, and executed as an activity within the ADF pipeline on an ADF fully managed scaled-out Spark cluster.

Wrangling data flow activity is a code-free data preparation activity that integrates with Power Query Online in order to make the Power Query M functions available for data wrangling using spark execution.

For more information, check Transform data using a Mapping Data Flow in Azure Data Factory

Q10: Data Factory supports two types of compute environments to execute the transform activities. Mention these two types briefly

On-demand compute environment, using a computing environment fully managed by the ADF. In this compute type, the cluster will be created to execute the transform activity and removed automatically when the activity is completed.

Bring Your Own environment, in which the used compute environment is managed by you and ADF.

For more information, check Transform data using a Mapping Data Flow in Azure Data Factory

Q11: What is Azure SSIS Integration Runtime?

A fully managed cluster of virtual machines hosted in Azure and dedicated to run SSIS packages in the Data Factory. The SSIS IR nodes can be scaled up, by configuring the node size, or scaled out by configuring the number of nodes in the VMs cluster.

For more information, check Run SSIS packages in Azure Data Factory

Q12: What is required to execute an SSIS package in Data Factory?

We need to create an SSIS IR and an SSISDB catalog hosted in Azure SQL Database or Azure SQL Managed Instance.

For more information, check Run SSIS packages in Azure Data Factory

Q13: Which Data Factory activity is used to run an SSIS package in Azure?

Execute SSIS Package activity.

For more information, check Run SSIS packages in Azure Data Factory

Q14: Which Data Factory activity can be used to get the list of all source files in a specific storage account and the properties of each file located in that storage?

Get Metadata activity.

For more information, check How to use iterations and conditions activities in Azure Data Factory

Q15: Which Data Factory activities can be used to iterate through all files stored in a specific storage account, making sure that the files smaller than 1KB will be deleted from the source storage account?

ForEach activity for iteration
Get Metadata to get the size of all files in the source storage
If Condition to check the size of the files
Delete activity to delete all files smaller than 1KB

For more information, check How to use iterations and conditions activities in Azure Data Factory

Q16: Data Factory supports three types of triggers. Mention these types briefly

The Schedule trigger that is used to execute the ADF pipeline on a wall-clock schedule
The Tumbling window trigger that is used to execute the ADF pipeline on a periodic interval, and retains the pipeline state
The Event-based trigger that responds to a blob related event, such as adding or deleting a blob from an Azure storage account

For more information, check How to schedule Azure Data Factory pipeline executions using Triggers

Q17: Any Data Factory pipeline can be executed using three methods. Mention these methods

Under Debug mode
Manual execution using Trigger now
Using an added scheduled, tumbling window or event trigger

Q18: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed regardless of the status of the previous activity?

Completion dependency.

Q19: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed only if the previous activity is not executed?

Skipped dependency.

For more information, check Dependencies in ADF

Q20: From where we can monitor the execution of a pipeline that is executed under the Debug mode?

The Output tab of the pipeline, without the ability to use the Pipeline runs or Trigger runs under ADF Monitor window to monitor it.

Author
Recent Posts

Ahmad Yaseen

Ahmad Yaseen is a Microsoft Big Data engineer with deep knowledge and experience in SQL BI, SQL Server Database Administration and Development fields.

He is a Microsoft Certified Solution Expert in Data Management and Analytics, Microsoft Certified Solution Associate in SQL Database Administration and Development, Azure Developer Associate and Microsoft Certified Trainer.

Also, he is contributing with his SQL tips in many blogs.

View all posts by Ahmad Yaseen

Related posts: