Hadi Fadlallah
Merge transformation Inputs and output rows count

SSIS Merge Join vs. Merge Transformation

September 28, 2021 by

SSIS Merge Join and Merge transformations are two similar components used to combine two input data sources into one output. This article explains both transformations and illustrates when and how to use each one of them.

This article is the fourteenth of the SSIS features face-to-face series that aims to compare similar SSIS components in order to remove any confusion for the data integration developers that use this technology.

  • Note: All examples in this article are made using the Stack overflow 2013 database, SQL Server 2019, Visual Studio 2019, SQL Server Integration Services projects extension version 3.4.

SSIS Merge Transformation

As mentioned in the SSIS toolbox, SSIS Merge transformation “combines rows from multiple sorted data flows into one sorted data flow”.

Merge Transformation description in the SSIS toolbox

Figure 1 – Merge Transformation description in the SSIS toolbox

Merge transformation is very similar to the Union All Transformation, which combines rows from different sources into one output. The main difference is that Union All doesn’t require that the data sources are sorted, nor does its output. Besides, Union All accepts more than two inputs while Merge transformation doesn’t.

Before illustrating how to use the Merge transformation, we will explain several approaches to sort data sources in SSIS.

Using Sort Component

Sort Component is a transformation that sorts input data in ascending or descending order “when it cannot be sorted at its source”. This means that – in general – sorting data at its source is preferable to using this component. The reason is that sorting data at the source level decreases the workload on the ETL server. Besides, it can benefit from indexes or other helpers found at the data source.

Sort Component description in the SSIS toolbox

Figure 2 – Sort Component description in the SSIS toolbox

One important note mentioned in the SSIS toolbox is that the Sort component is not recommended for large data since it requires loading all data in memory before generating the sorted output.

The Sort component is very simple and easy to use; you need to specify the columns used in the sort operation with the sorting order for each one of them.

Sort Component editor

Figure 3 – Sort Component editor

Caution: Removing duplicates can result in a data loss since duplicates are compared based on the sorting columns, not the whole data row.

Using the Sort component with the Merge transformation

Figure 4 – Using the Sort component with the Merge transformation

Using Advanced Editor

As we mentioned before, when it can sort the data at its source, there is no need to use the Sort component. Assuming that we are reading from a SQL Server table having a clustered index (which means that its primary key columns sort data), or we used an ORDER BY clause in the SQL command. SSIS engine does not automatically detect our data source is sorted, and the user must tell it so.

To handle a data source as it is sorted, we should open the source component’s advanced editor (Right-click > Show Advanced Editor…).

Opening source's advanced editor

Figure 5 – Opening source’s advanced editor

In the source component advanced editor, go to the “Input and Output Properties” tab. Click on the “OLE DB Source Output” and change the “IsSorted” property to “True”.

Changing the "IsSorted" property to "True"

Figure 6 – Changing the “IsSorted” property to “True”

The next step is to configure the sorting key position for all columns used in the sort operation.

Configuring the sort key position

Figure 7 – Configuring the sort key position

As mentioned in the official documentation, the value of the “Sort Key Position” property should be configured as follows:

  • The integer value must represent a numeric sequence, starting with 1 and incremented by 1
  • A positive integer value indicates an ascending sort order
  • A negative integer value indicates a descending sort order. (If set to a negative number, the absolute value of the number determines the column’s position in the sort sequence)
  • The default value of 0 indicates that the column is not sorted. Leave the value of 0 for output columns that do not participate in the sort

Using the SSIS Merge transformation

To illustrate the SSIS Merge transformation, let us assume that we need to insert all data from the “posts” and “comments” table into one table. To do so, we created an SSIS package, added a “Data Flow Task”, and configured an OLE DB connection to the Stack Overflow 2013 database. Next, we added two OLE DB Source components and a Sort component for each one. Then, we add a Merge transformation that takes both Sort components’ output as input.

As shown in the image below, the merge transformation editor is very simple. You only need to specify the component output columns and their corresponding column from each input.

Merge transformation editor

Figure 8 – Merge transformation editor

As shown in the image below, the “ID” output column is the sort key for the component output, and it is mapped to the “ID” columns of each data source (the sorting keys).

Data flow task screenshot

Figure 9 – Data flow task screenshot

If we open the OLE DB destination advanced editor, we can check that the Id column sorts the Merge transformation output.

OLE DB Destination Input properties

Figure 10 – OLE DB Destination Input properties

SortKeyPosition property for the ID column

Figure 11 – SortKeyPosition property for the ID column

After executing the package, we can note that the count of Merge transformation output rows is equal to the sum of both input rows.

Merge transformation Inputs and output rows count

Figure 12 – Merge transformation Inputs and output rows count

SSIS Merge Join Transformation

The SSIS Merge Join Transformation combines two sorted inputs data flows into one output using a join operation. Three join operations are supported: Full, Left, and inner join.

SSIS Merge Join transformation description in the SSIS toolbox

Figure 13 – SSIS Merge Join transformation description in the SSIS toolbox

The image below shows that the SSIS Merge Join transformation joins both inputs using the sorting key used in each data source. You can select the output columns from both sources.

SSIS Merge Join transformation editor

Figure 14 – SSIS Merge Join transformation editor

Since right join is not listed in the “Join Type” drop-down list, you can choose the left join option and click on the “Swap Inputs” button to swap the input from the right to the left side, which is equivalent to the RIGHT OUTER JOIN operation.

The SSIS Merge Join transformation output rows differ based on the join type selected:

  • INNER JOIN: This type of join returns those rows which have matching values in both data sources
  • LEFT OUTER JOIN: This type of join returns all rows from the left-side data source with all matching rows from the right-side source
  • FULL JOIN: This type of join returns all matching rows from left and right-side data sources

SSIS Merge Join vs. Merge transformation

The difference between the SSIS Merge Join and the Merge transformation is like the difference between the UNION and JOIN operators in SQL Server. Both are used to combine rows from two data sources, but each has its own way of merging them. While Merge transformation is used to combine rows (such as UNION operation), SSIS Merge Join transformation is used to combine columns between different rows (such as SQL Joins).

Summary

In this article, we explained both SSIS Merge Join and Merge transformations. We also illustrated how to configure them. Besides, we explained how to sort data sources since it is a requirement for merging transformations in SSIS. Finally, we explained the main difference between both merging components in SSIS.

Table of contents

SSIS OLE DB Source: SQL Command vs Table or View
SSIS Expression Tasks vs Evaluating variables as expressions
SSIS OLE DB Destination vs SQL Server Destination
Execute SQL Task in SSIS: SqlStatementSource Expressions vs Variable Source Types
Execute SQL Task in SSIS: Output Parameters vs Result Sets
SSIS Derived Columns with Multiple Expressions vs Multiple Transformations
SSIS Data types: Change from the Advanced Editor vs Data Conversion Transformations
SSIS Connection Managers: OLE DB vs ODBC vs ADO.NET
SSIS Flat Files vs Raw Files
SSIS Foreach Loop vs For Loop Container
SSIS: Execute T-SQL Statement Task vs Execute SQL Task
SSIS Lookup transformation vs. Fuzzy Lookup transformation
SSIS Pivot transformation vs. Unpivot transformation
SSIS Merge Join vs. Merge Transformation
Data Access Modes in SSIS OLE DB Destination: SQL Command vs. Table or View
SSIS XML Source vs XML task
SSIS Script task vs. Script Component
SSIS term extraction vs. term lookup
Hadi Fadlallah
ETL, Integration Services (SSIS)

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com . Also, he published several article series about Biml, SSIS features, Search engines, Hadoop, and many other technologies. Besides working with SQL Server, he worked with different data technologies such as NoSQL databases, Hadoop, Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things everyday and sharing his knowledge. You can reach him on his personal website. View all posts by Hadi Fadlallah

168 Views