Importing and Exporting data using SSIS Hadoop components

May 11, 2020 by Hadi Fadlallah

In the previously published article, we talked briefly about Hadoop and gave an overview of the SSIS Hadoop components added in the SQL Server 2016 release, focusing on the Hadoop Connection Manager and the Hadoop File System Task.

As illustrated, there are two SSIS Hadoop components at the package data flow level:

  • HDFS File Source
  • HDFS File Destination

Figure 1 – SSIS Hadoop components within the toolbox

In this article, we will briefly explain the Avro and ORC Big Data file formats. Then, we will talk about the Hadoop data flow components and how to use them to import data into and export data from the Hadoop cluster. Finally, we will compare these components with the Hadoop File System Task.

  • Note: To run the examples, we will use the AdventureWorks2017 database, which can be downloaded from the following link.

Big Data file formats

There are many file formats developed for Big Data, but we will talk only about Avro and ORC since they are the ones supported by the HDFS source and destination components.

Avro

Apache Avro is a row-oriented data serialization system developed within the Hadoop framework. It relies mainly on JSON to define data types and structures, and it serializes data in a compact binary format. It can also be used for communication between Hadoop nodes. You can read more about this technology in the official documentation.
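
For illustration, here is a minimal sketch of producing an Avro file similar to the ones the SSIS Hadoop components read and write. It assumes the Python fastavro package, and the schema, field names, and file name are our own examples rather than anything generated by SSIS:

```python
from fastavro import parse_schema, writer

# Hypothetical Avro schema describing a few columns of the Person table;
# an Avro schema is a plain JSON document like this one.
schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "BusinessEntityID", "type": "int"},
        {"name": "FirstName", "type": "string"},
        {"name": "LastName", "type": "string"},
    ],
})

records = [
    {"BusinessEntityID": 1, "FirstName": "Ken", "LastName": "Sanchez"},
    {"BusinessEntityID": 2, "FirstName": "Terri", "LastName": "Duffy"},
]

# Rows are serialized in a compact binary form, with the schema embedded
# in the file header.
with open("Persons.avro", "wb") as out:
    writer(out, schema, records)
```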

Optimized Row Columnar (ORC)

Apache ORC is a column-oriented data storage format developed for the Hadoop framework. It was announced in 2013 by Hortonworks in collaboration with Facebook. This format is mainly used with Apache Hive, and it offers better performance than row-oriented formats. You can read more about this technology in the official documentation.
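
Similarly, the following sketch writes and reads a small ORC file using the pyarrow package (an assumption on our side; it requires a pyarrow build with ORC support, and the column names are again our own examples):

```python
import pyarrow as pa
import pyarrow.orc as orc

# Build a small columnar table; ORC stores data column by column,
# which is what makes it efficient for analytical queries in Hive.
table = pa.table({
    "BusinessEntityID": [1, 2],
    "FirstName": ["Ken", "Terri"],
    "LastName": ["Sanchez", "Duffy"],
})

orc.write_table(table, "Persons.orc")

# Read the file back to verify its content.
print(orc.read_table("Persons.orc").to_pydict())
```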

HDFS File Destination

The HDFS File Destination is a component used to store tabular data within the Hadoop cluster. It supports text, Avro, and ORC files.

Figure 2 – HDFS file destination editor

If we open the HDFS file destination editor, we can see that it contains two tab pages:

  1. Connection Manager: Where we should specify the Hadoop connection and the destination file format:
    1. Hadoop Connection Manager: We should select the relevant Hadoop connection manager (creating and configuring an SSIS Hadoop connection manager is illustrated in the previously published article in this series, SSIS Hadoop Connection Manager and related tasks)
    2. Location (File Path): The file path within the Hadoop cluster (it must start with a slash “/”; for example: “/Test/Persons.avro”). You don’t need to create directories manually since they are generated automatically before data is inserted (a WebHDFS-based sketch of this behavior is shown after this list)
    3. File Format: Text, Avro, or ORC
    4. Column delimiter character (only available for the Text file format)
    5. Column names in the first data row (only available for the Text file format)
  2. Mappings: Where we should set the column mappings
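
Outside of SSIS, the same kind of write can be reproduced against the WebHDFS REST API, which helps to understand what the destination component produces. The following is a minimal sketch assuming a NameNode reachable on the default Hadoop 3.x WebHDFS port (9870) and simple (pseudo) authentication; the host, user name, and file content are hypothetical:

```python
import requests

NAMENODE = "http://localhost:9870"   # assumed NameNode address and WebHDFS port
USER = "hadoop"                      # hypothetical user for simple authentication
PATH = "/TestSSIS/Persons.txt"

# A small pipe-delimited payload with column names in the first row,
# mirroring the Text format options of the HDFS File Destination.
payload = "BusinessEntityID|FirstName|LastName\n1|Ken|Sanchez\n2|Terri|Duffy\n"

# Step 1: ask the NameNode where to write. It replies with a 307 redirect
# to a DataNode; missing parent directories are created automatically.
create = requests.put(
    f"{NAMENODE}/webhdfs/v1{PATH}",
    params={"op": "CREATE", "overwrite": "true", "user.name": USER},
    allow_redirects=False,
)

# Step 2: send the file content to the DataNode URL returned in the redirect.
requests.put(create.headers["Location"], data=payload.encode("utf-8"))
```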

Additional properties

Besides these properties, there are some additional properties that are not shown in this component editor. You can find them in the Properties tab (select the HDFS File Destination component and press F4).

Figure 3 – HDFS File destination properties

These properties are:

  1. IsBatchMode: Specify whether data is imported in batch mode
  2. BatchSize: Specify the batch size
  3. ValidateExternalMetadata: Select whether metadata is validated before data is inserted (this property is common to all SSIS source and destination components)

Expressions

There are some properties of the HDFS file destination that can be evaluated as an expression. These properties are:

  1. BatchSize
  2. ColumnDelimiter
  3. FileFormat
  4. FilePath
  5. FirstRowAsColumnHeader
  6. IsBatchMode

To set these properties, select the data flow task, and in the Properties tab, click on Expressions. Then, you will find a list of these properties within the Property Expressions Editor form, as shown below:

Figure 4 – HDFS file destination expressions

HDFS File Source

The HDFS File Source is a component used to read tabular data stored within the Hadoop cluster. It supports text and Avro files.

If we open the HDFS file source editor, we can see that it contains three tab pages:

Figure 5 – HDFS File source editor

  1. Connection Manager: Where we should specify the Hadoop connection and the source file format:
    1. Hadoop Connection Manager: We should select the relevant Hadoop connection manager
    2. Location (File Path): The file path within the Hadoop cluster (it must start with a slash “/”)
    3. File Format: Text or Avro
    4. Column delimiter character (only available for the Text file format)
    5. Column names in the first data row (only available for the Text file format)
  2. Columns: Where we should select the input columns and add aliases
  3. Error output: Where we can configure the error output (like other source components)

The HDFS File Source differs from the HDFS File Destination component in that it doesn’t have any custom properties hidden from the editor. Besides, only two of its properties can be evaluated as expressions, as shown in the screenshot below:

  1. FilePath
  2. FirstRowAsColumnName

Figure 6 – Evaluating HDFS File Source properties as expressions
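
As a point of comparison with what the HDFS File Source consumes, reading an Avro file back is straightforward because the schema is embedded in the file itself. A minimal sketch, again assuming the fastavro package and the hypothetical Persons.avro file from the earlier example:

```python
from fastavro import reader

# The schema travels inside the Avro file, so no external definition is
# needed to discover the column names and types.
with open("Persons.avro", "rb") as avro_file:
    for record in reader(avro_file):
        print(record["FirstName"], record["LastName"])
```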

Example

To test these components, we will create an SSIS package and add three connection managers:

  1. Hadoop Connection Manager: to connect with the Hadoop cluster (check the previous article)
  2. OLE DB Connection Manager: to connect to the SQL Server instance where the AdventureWorks2017 database is stored
  3. Flat File Connection Manager: We will use it to export the data read from the HDFS File Source:

Figure 7 – Connection managers

We will add two data flow tasks:

Figure 8 – Package control flow

  1. Import data into HDFS: it reads data from the [Person].[Person] table and writes it to the following file path: “/TestSSIS/Persons.txt”:

Figure 9 – First data flow task

  2. Export data from HDFS: it reads data from the file stored in HDFS at the following path: “/TestSSIS/Persons.txt” and writes it into a flat file:

Figure 10 – Second data flow task

We run the SSIS package, and after the execution finishes successfully, we can verify from the Hadoop web interface that the data was imported successfully.

Figure 11 – Data import verified using the Hadoop web interface
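
The same check can also be scripted against the WebHDFS REST API instead of the web interface. A minimal sketch, assuming the hypothetical NameNode address and port used in the earlier sketch:

```python
import requests

NAMENODE = "http://localhost:9870"   # assumed NameNode address and WebHDFS port

# List the target directory and print the files the package created there.
resp = requests.get(f"{NAMENODE}/webhdfs/v1/TestSSIS", params={"op": "LISTSTATUS"})
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["length"], "bytes")
```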

Also, the text file is exported successfully into the local file system.

Figure 12 – File exported successfully into the local file system

Hadoop File System Task vs. Data Flow Components

As we mentioned in the previously published article in this series, there are three SSIS Hadoop components at the package control flow level:

  • Hadoop File System Task
  • Hadoop Hive Task
  • Hadoop Pig Task

Since the data flow Hadoop components are used to import or export data from HDFS, one main question arises: when should we use these components, especially since importing and exporting data can also be done using the Hadoop File System Task?

There are many differences between the two approaches concerning the supported data sources and other features, as shown in the following table.

| Feature | Hadoop File System Task | SSIS Hadoop components |
| --- | --- | --- |
| Package level | Control Flow | Data Flow |
| Data warehousing approach | ELT | ETL |
| Data sources | File Connection | All connections |
| Supported connection manager | Hadoop Connection Manager | Hadoop Connection Manager |
| Supported types | Files (any extension) + directories | Text, Avro, and ORC files |
| Available transformations | No transformations can be applied | All data flow transformations can be applied (filtering, adding and removing columns, aggregations) |
| Supported operations | Copy from HDFS, copy to HDFS, copy within HDFS | Copy from HDFS, copy to HDFS, copy within HDFS |
| Performance | Higher, since operations are high-level and data is not validated | Lower than the Hadoop File System Task |
Figure 13 – Comparison table between the Hadoop File System Task and the data flow components

Conclusion

In this article, we illustrated the SSIS Hadoop components at the data flow level and how to use them to import data into and export data from an on-premises Hadoop cluster. Then, we made a small comparison between them and the Hadoop File System Task.

Table of contents

SSIS Hadoop Connection Manager and related tasks
Importing and Exporting data using SSIS Hadoop components
Connecting to Apache Hive and Apache Pig using SSIS Hadoop components

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He is one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com. He has also published several article series about Biml, SSIS features, search engines, Hadoop, and many other technologies. Besides working with SQL Server, he has worked with different data technologies such as NoSQL databases, Hadoop, and Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science, focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things every day and sharing his knowledge. You can reach him on his personal website.
