Connecting to Apache Hive and Apache Pig using SSIS Hadoop components

May 14, 2020 by Hadi Fadlallah

In the previous articles of this series, we talked about several SSIS Hadoop components added in SQL Server 2016, such as the Hadoop connection manager, the Hadoop file system task, the HDFS file source, and the HDFS file destination.

In this article, we will be talking about the Hadoop Hive and Hadoop Pig tasks. We will first give a brief overview of Apache Hive and Apache Pig. Then, we will illustrate the related SSIS Hadoop components and an alternative to them.

Apache Hive

Apache Hive is open-source data warehousing software, originally developed by Facebook, that is built on top of Hadoop. It allows querying data using a SQL-like language called HiveQL, or using Apache Spark SQL. Hive can store data within its own repository, and it also allows building external tables on top of data stored outside Hive repositories.
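For instance, an external table can be declared on top of files that already exist in HDFS. The following is a minimal sketch; the table name, columns, and HDFS path are illustrative and not taken from this article:

CREATE EXTERNAL TABLE logs (log_date STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

Dropping such a table removes only its metadata; the underlying files remain in HDFS.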

In general, Hive works smoothly under Linux operating systems, but it can also be installed on Windows using the Cygwin tool (check the external links section for more information).

Apache Pig

Apache Pig is an open-source framework developed by Yahoo for writing and executing Hadoop MapReduce jobs. It is designed to facilitate writing MapReduce programs through a high-level language called Pig Latin, instead of complicated Java code. It can also be extended with user-defined functions.

Apache Pig converts Pig Latin scripts into MapReduce jobs through a wrapper layer, in an optimized way that reduces the need to tune scripts manually for efficiency.

Similar to Apache Hive, this technology works best on Linux-based operating systems, although it can also be installed on Windows (check the external links section for more information).

WebHDFS and WebHCat services

As we mentioned in the first article of this series, the Hadoop connection manager supports two types of connections: (1) WebHDFS, used for HDFS commands, and (2) WebHCat, used for Apache Hive and Pig tasks.

It is worth mentioning that WebHDFS and WebHCat are two REST APIs used to communicate with Hadoop components. These APIs allow us to execute Hadoop-related commands regardless of the operating system and of whether Hadoop can be accessed via shell commands. Note that WebHDFS is installed with Hadoop, while WebHCat is installed with Apache Hive.

Figure 1 – WebHDFS and WebHCat APIs (Reference)
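
For illustration, a WebHDFS call is a plain HTTP request, so it can be issued from any client. A minimal sketch, assuming a local single-node cluster with the default Hadoop 2.x NameNode web port:

curl "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"

This returns a JSON listing of the HDFS root directory.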

To start the WebHCat API, you should run the following command:

$HIVE_HOME/hcatalog/sbin/webhcat_server.sh start
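
Once started, the service can be verified with a simple REST call (assuming it runs locally on the default port):

curl http://localhost:50111/templeton/v1/status

A healthy server should return a JSON response similar to {"status":"ok","version":"v1"}.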

Connecting to WebHCat using Hadoop connection manager

Connecting to WebHCat is very similar to connecting to WebHDFS, as explained in the first article. Note that the default port is 50111.

Figure 2 – Configuring WebHCat connection

To make sure that the connection is configured correctly, we can use the “Test connection” button.

Figure 3 – Testing connection

Hadoop Hive Task

The SSIS Hadoop component related to Hive is called the “Hadoop Hive Task”. This component is designed to execute HiveQL statements; it uses a WebHCat Hadoop connection to send them to the Apache Hive server.

This Hadoop component is very simple; as shown in the screenshot below, its editor contains only a few parameters to configure:

Figure 4 – Hadoop Hive Task editor

  • Name: The task name
  • Description: The task description
  • HadoopConnection: The Hadoop connection manager to use
  • SourceType: There are two choices:
    • DirectInput: Write a HiveQL script manually
    • ScriptFile: Use a script file stored within Hadoop
  • InlineScript (available for the DirectInput source type): The HiveQL statement to execute
  • HadoopScriptFilePath (available for the ScriptFile source type): The Hadoop path of the script file
  • TimeoutInMinutes: The command timeout in minutes; if zero is entered, the command runs asynchronously

Example

To run an example, we used the following HiveQL statement to create a Hive table:

Figure 5 – HiveQL script
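
The exact script is shown in the figure above; a minimal statement along the same lines would be the following (only the employee table name is taken from the result below, the column list is illustrative):

CREATE TABLE employee (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';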

After executing the package, we started the Hive shell from a command prompt and executed the following command to show the available tables within the default database:
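
show tables;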

The result shows that the employee table is created:

Figure 6 – Created table shown from the Hive shell

Synchronous vs. Asynchronous commands

To illustrate the difference between synchronous and asynchronous commands, we ran the following experiment:

First, we set the “TimeoutInMinutes” property to 1440 (the default) and executed the package. As shown in the screenshot below, the immediate window keeps showing the execution information sent from the Hive server.

Figure 7 – Synchronous task execution

If we set the TimeoutInMinutes property to 0 and execute the package, the task reports successful completion once the job is scheduled in the Hadoop cluster.

Figure 8 – Asynchronous task execution

Hadoop Pig Task

The Hadoop component related to Apache Pig is called the “Hadoop Pig Task”. This component is almost the same as the Hadoop Hive Task, since it has the same properties and also uses a WebHCat connection. The only difference is that it executes a Pig Latin script rather than a HiveQL statement.

Figure 9 – Hadoop Pig Task editor
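
For reference, the following is a minimal sketch of the kind of Pig Latin script this task can execute; the file path and schema are illustrative:

employees = LOAD '/user/hadoop/employee.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:float);
high_paid = FILTER employees BY salary > 1000.0;
STORE high_paid INTO '/user/hadoop/high_paid_employees';

When submitted, Pig compiles these three statements into one or more MapReduce jobs on the cluster.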

SSIS Hive Hadoop component alternative: Microsoft Hive ODBC driver

There is another method available to connect to the Apache Hive server in SSIS besides the SSIS Hadoop components: the Microsoft Hive ODBC Driver. It allows creating an ODBC connection with Apache Hive; the driver connects directly to the running Hive server (HiveServer1 or HiveServer2).

First, we should download the driver from the official Microsoft download link. Note that there are two drivers (32-bit and 64-bit). After downloading and installing the driver, we should add an ODBC data source following these steps:

  1. Navigate to Control Panel > System and Security > Administrative Tools
  2. Open the ODBC Data Sources (32-bit or 64-bit)

    Figure 10 – ODBC Data Sources shortcuts

  3. We should add a User or System DSN (note that a sample System DSN was created during installation)

    Figure 11 – Sample Microsoft Hive DSN

  4. After clicking on the Add button, we should select the Microsoft Hive ODBC driver from the drivers’ list

    Figure 12 – Selecting Microsoft Hive ODBC driver

  5. Now we should configure the Hive ODBC driver DSN:
    1. Host: the Hive server host address
    2. Port: the Hive server port number
    3. Database: the database name

      Figure 13 – Microsoft Hive ODBC DSN setup

  6. We should test the connection before creating the ODBC DSN

    Figure 14 – Testing connection

  7. After creating the ODBC DSN, we should create a new ODBC connection manager in SSIS:

    Figure 15 – Adding ODBC connection manager

  8. Then we should select the ODBC DSN we created while configuring the connection manager:

    Figure 16 – Using the created DSN in the connection manager
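
Alternatively, a DSN-less connection string can be typed directly in the connection manager instead of selecting a DSN. The following is a sketch, assuming HiveServer2 with no authentication; the host name is illustrative, and the exact key names should be checked against the driver’s documentation:

Driver={Microsoft Hive ODBC Driver};Host=myhiveserver;Port=10000;Schema=default;HiveServerType=2;AuthMech=0;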

Advantages

Using the Microsoft Hive ODBC driver has many benefits:

  1. It can be used with earlier versions of SQL Server (before 2016)
  2. We can use Apache Hive as a source or destination within the data flow task
  3. It can be used for cloud-based and on-premises Hadoop clusters
  4. Many SSIS components can use ODBC connections (for example, the Execute SQL Task)

External Links

Conclusion

In this article, we talked about Apache Hive and Apache Pig. Then, we explained what the WebHDFS and WebHCat services are, and we illustrated the Hive and Pig related Hadoop components in SSIS. Finally, we showed how to use the Microsoft Hive ODBC driver as an alternative to the Hive Hadoop component.

Table of contents

SSIS Hadoop Connection Manager and related tasks
Importing and Exporting data using SSIS Hadoop components
Connecting to Apache Hive and Apache Pig using SSIS Hadoop components

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He is one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com. He has also published several article series about Biml, SSIS features, search engines, Hadoop, and many other technologies. Besides working with SQL Server, he has worked with different data technologies such as NoSQL databases, Hadoop, and Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science, focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things every day and sharing his knowledge. You can reach him on his personal website.
