
Azure Data Factory Interview Questions and Answers

February 11, 2021 by Ahmad Yaseen

In this article, we will discuss a number of questions about the Azure Data Factory service that you may be asked when applying to an Azure Data Engineer role.

Q1: Briefly describe the purpose of the ADF Service

ADF is used mainly to orchestrate data copying between different relational and non-relational data sources, hosted in the cloud or locally in your data centers. ADF can also be used to transform the ingested data to meet your business requirements. It is the ETL or ELT tool used for data ingestion in most Big Data solutions.

Q2: Data Factory consists of a number of components. Mention these components briefly

  • Pipeline: The logical container for the activities
  • Activity: An execution step in the Data Factory pipeline that can be used for data ingestion and transformation
  • Mapping Data Flow: A visually designed data transformation logic
  • Dataset: A pointer to the data used in the pipeline activities
  • Linked Service: A descriptive connection string for the data stores used in the pipeline activities
  • Trigger: Specifies when the pipeline will be executed
  • Control flow: Controls the execution flow of the pipeline activities
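These components show up in the raw JSON that ADF stores behind the authoring UI. The sketch below, with hypothetical names (`CopyOrdersPipeline`, `OrdersSourceDataset`, etc.), illustrates how a pipeline acts as the logical container for an activity, which in turn references datasets:

```python
import json

# A minimal, hypothetical sketch of an ADF pipeline definition.
# The pipeline is the logical container; the Copy activity is one
# execution step inside it; the datasets are referenced by name.
pipeline = {
    "name": "CopyOrdersPipeline",                 # Pipeline: logical container
    "properties": {
        "activities": [                            # Activity: an execution step
            {
                "name": "CopyOrders",
                "type": "Copy",
                "inputs":  [{"referenceName": "OrdersSourceDataset",
                             "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OrdersSinkDataset",
                             "type": "DatasetReference"}],
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```

The authoring UI generates this JSON for you; viewing it (via the "Code" button on a pipeline) is a quick way to see how the components reference each other.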

Q3: What is the difference between the Dataset and Linked Service in Data Factory?

Linked Service is a description of the connection string that is used to connect to the data stores. For example, when ingesting data from a SQL Server instance, the linked service contains the name for the SQL Server instance and the credentials used to connect to that instance.

Dataset is a reference to the data store that is described by the linked service. When ingesting data from a SQL Server instance, the dataset points to the name of the table that contains the target data or the query that returns data from different tables.
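The split between the two can be seen side by side in their definitions. In this hypothetical sketch (server, database, and table names are made up), the linked service carries the connection details while the dataset only points at a table through it:

```python
# Linked Service: HOW to connect to the data store.
linked_service = {
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            # Connection details live here (credentials would normally
            # come from a secure reference, not plain text).
            "connectionString": "Server=myserver;Database=Sales;User ID=etl_user;"
        },
    },
}

# Dataset: WHAT data to use inside that store.
dataset = {
    "name": "OrdersDataset",
    "properties": {
        "type": "SqlServerTable",
        # The dataset references the linked service by name...
        "linkedServiceName": {"referenceName": "SqlServerLinkedService",
                              "type": "LinkedServiceReference"},
        # ...and identifies the specific table.
        "typeProperties": {"tableName": "dbo.Orders"},
    },
}
```

One linked service can back many datasets, which is why ADF keeps them as separate components.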

Q4: What is Data Factory Integration Runtime?

Integration Runtime is the secure compute infrastructure used by Data Factory to provide data integration capabilities across different network environments, while making sure that these activities are executed in the region closest to the data store.

Q5: Data Factory supports three types of Integration Runtimes. Mention these supported types with a brief description for each

  • Azure Integration Runtime: used for copying data from or to data stores accessed publicly via the internet
  • Self-Hosted Integration Runtime: used for copying data from or to an on-premises data store or networks with access control
  • Azure SSIS Integration Runtime: used to run SSIS packages in the Data Factory

Q6: When copying data from or to an Azure SQL Database using Data Factory, what is the firewall option that we should enable to allow the Data Factory to access that database?

The "Allow Azure services and resources to access this server" firewall option.

Q7: If we need to copy data from an on-premises SQL Server instance using Data Factory, which Integration Runtime should be used and where should it be installed?

Self-Hosted Integration Runtime should be installed on the on-premises machine where the SQL Server instance is hosted.

Q8: After installing the Self-Hosted Integration Runtime on the machine where the SQL Server instance is hosted, how can we associate it with the SH-IR created in the Data Factory portal?

We need to register it using the authentication key provided by the ADF portal.

Q9: What is the difference between the Mapping data flow and Wrangling data flow transformation activities in Data Factory?

The Mapping data flow activity is a visually designed data transformation activity that lets us build graphical data transformation logic without needing to be expert developers. It is executed as an activity within the ADF pipeline on a fully managed, scaled-out Spark cluster.

The Wrangling data flow activity is a code-free data preparation activity that integrates with Power Query Online, making the Power Query M functions available for data wrangling using Spark execution.

Q10: Data Factory supports two types of compute environments to execute the transform activities. Mention these two types briefly

The On-demand compute environment is fully managed by ADF: the cluster is created to execute the transform activity and removed automatically when the activity completes.

In the Bring Your Own environment, you manage the compute environment yourself and register it in ADF as a linked service.

Q11: What is Azure SSIS Integration Runtime?

A fully managed cluster of Azure virtual machines dedicated to running SSIS packages in the Data Factory. The SSIS IR nodes can be scaled up, by configuring the node size, or scaled out, by configuring the number of nodes in the VM cluster.
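The two scaling knobs map directly onto the SSIS IR compute settings. A hypothetical sketch of that definition (the IR name, node size, and node count are made up for illustration):

```python
# Sketch of an Azure-SSIS Integration Runtime definition, showing
# the two independent scaling settings.
ssis_ir = {
    "name": "MySsisIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "nodeSize": "Standard_D4_v3",   # scale UP: bigger nodes
                "numberOfNodes": 4,             # scale OUT: more nodes
            }
        },
    },
}
```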

Q12: What is required to execute an SSIS package in Data Factory?

We need to create an SSIS IR and an SSISDB catalog hosted in Azure SQL Database or Azure SQL Managed Instance.

Q13: Which Data Factory activity is used to run an SSIS package in Azure?

Execute SSIS Package activity.

Q14: Which Data Factory activity can be used to get the list of all source files in a specific storage account and the properties of each file located in that storage?

Get Metadata activity.

Q15: Which Data Factory activities can be used to iterate through all files stored in a specific storage account, making sure that the files smaller than 1KB will be deleted from the source storage account?

  • ForEach activity for iteration
  • Get Metadata to get the size of all files in the source storage
  • If Condition to check the size of the files
  • Delete activity to delete all files smaller than 1KB
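The control flow those four activities implement can be sketched in plain Python. This is only a simulation of the logic, not ADF code, and the file list is invented for illustration:

```python
# Simulating the pipeline logic: iterate over files (ForEach), read each
# file's size (Get Metadata), test it (If Condition), and delete it when
# it is smaller than 1 KB (Delete).
files = {                      # name -> size in bytes (hypothetical)
    "orders_2021.csv": 52_428,
    "empty_marker.txt": 12,
    "partial_load.csv": 900,
    "customers.csv": 8_192,
}

deleted = []
for name, size in list(files.items()):           # ForEach activity
    metadata = {"itemName": name, "size": size}  # Get Metadata activity
    if metadata["size"] < 1024:                  # If Condition activity
        del files[name]                          # Delete activity
        deleted.append(name)

print(sorted(deleted))   # -> ['empty_marker.txt', 'partial_load.csv']
```

In the real pipeline, the Get Metadata activity's `childItems` output feeds the ForEach activity's items, and the inner activities reference each item through the ForEach iterator.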

Q16: Data Factory supports three types of triggers. Mention these types briefly

  • The Schedule trigger, used to execute the ADF pipeline on a wall-clock schedule
  • The Tumbling window trigger, used to execute the ADF pipeline on a periodic interval while retaining the pipeline state
  • The Event-based trigger, which responds to a blob-related event, such as adding or deleting a blob in an Azure storage account
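As with pipelines, each trigger is stored as JSON. A hypothetical sketch of a Schedule trigger that runs a pipeline once a day (the trigger name, cadence, and pipeline reference are made up):

```python
# Sketch of a Schedule trigger definition: a wall-clock recurrence
# attached to one or more pipelines.
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",                  # wall-clock cadence
                "interval": 1,                       # every 1 day
                "startTime": "2021-02-11T00:00:00Z",
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyOrdersPipeline",
                                   "type": "PipelineReference"}}
        ],
    },
}
```

A Tumbling window trigger differs in that its windows are contiguous and non-overlapping, and each window's state is retained, which enables backfilling past intervals.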

Q17: Any Data Factory pipeline can be executed using three methods. Mention these methods

  • Under Debug mode
  • Manual execution using Trigger now
  • Using an added Schedule, Tumbling window, or Event-based trigger

Q18: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed regardless of the status of the previous activity?

Completion dependency.

Q19: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed only if the previous activity is not executed?

Skipped dependency.

  • For more information, check Dependencies in ADF

Q20: From where can we monitor the execution of a pipeline that is executed under the Debug mode?

From the Output tab of the pipeline. Debug runs cannot be monitored from the Pipeline runs or Trigger runs pages under the ADF Monitor window.


About Ahmad Yaseen

Ahmad Yaseen is a Microsoft Big Data engineer with deep knowledge and experience in the SQL BI, SQL Server database administration, and development fields. He is a Microsoft Certified Solution Expert in Data Management and Analytics, a Microsoft Certified Solution Associate in SQL Database Administration and Development, an Azure Developer Associate, and a Microsoft Certified Trainer. He also contributes SQL tips to many blogs.