SSIS Merge Join and Merge transformations are two similar components used to combine two input data sources into one output. This article explains both transformations and illustrates when and how to use each one of them.
This article is the fourteenth of the SSIS features face-to-face series that aims to compare similar SSIS components in order to remove any confusion for the data integration developers that use this technology.
- Note: All examples in this article are made using the Stack overflow 2013 database, SQL Server 2019, Visual Studio 2019, SQL Server Integration Services projects extension version 3.4.
SSIS Merge Transformation
As mentioned in the SSIS toolbox, SSIS Merge transformation “combines rows from multiple sorted data flows into one sorted data flow”.
Figure 1 – Merge Transformation description in the SSIS toolbox
Merge transformation is very similar to the Union All Transformation, which combines rows from different sources into one output. The main difference is that Union All doesn’t require that the data sources are sorted, nor does its output. Besides, Union All accepts more than two inputs while Merge transformation doesn’t.
Before illustrating how to use the Merge transformation, we will explain several approaches to sort data sources in SSIS.
Using Sort Component
Sort Component is a transformation that sorts input data in ascending or descending order “when it cannot be sorted at its source”. This means that – in general – sorting data at its source is preferable to using this component. The reason is that sorting data at the source level decreases the workload on the ETL server. Besides, it can benefit from indexes or other helpers found at the data source.
Figure 2 – Sort Component description in the SSIS toolbox
One important note mentioned in the SSIS toolbox is that the Sort component is not recommended for large data since it requires loading all data in memory before generating the sorted output.
The Sort component is very simple and easy to use; you need to specify the columns used in the sort operation with the sorting order for each one of them.
Figure 3 – Sort Component editor
Caution: Removing duplicates can result in a data loss since duplicates are compared based on the sorting columns, not the whole data row.
Figure 4 – Using the Sort component with the Merge transformation
Using Advanced Editor
As we mentioned before, when it can sort the data at its source, there is no need to use the Sort component. Assuming that we are reading from a SQL Server table having a clustered index (which means that its primary key columns sort data), or we used an ORDER BY clause in the SQL command. SSIS engine does not automatically detect our data source is sorted, and the user must tell it so.
To handle a data source as it is sorted, we should open the source component’s advanced editor (Right-click > Show Advanced Editor…).
Figure 5 – Opening source’s advanced editor
In the source component advanced editor, go to the “Input and Output Properties” tab. Click on the “OLE DB Source Output” and change the “IsSorted” property to “True”.
Figure 6 – Changing the “IsSorted” property to “True”
The next step is to configure the sorting key position for all columns used in the sort operation.
Figure 7 – Configuring the sort key position
As mentioned in the official documentation, the value of the “Sort Key Position” property should be configured as follows:
- The integer value must represent a numeric sequence, starting with 1 and incremented by 1
- A positive integer value indicates an ascending sort order
- A negative integer value indicates a descending sort order. (If set to a negative number, the absolute value of the number determines the column’s position in the sort sequence)
- The default value of 0 indicates that the column is not sorted. Leave the value of 0 for output columns that do not participate in the sort
Using the SSIS Merge transformation
To illustrate the SSIS Merge transformation, let us assume that we need to insert all data from the “posts” and “comments” table into one table. To do so, we created an SSIS package, added a “Data Flow Task”, and configured an OLE DB connection to the Stack Overflow 2013 database. Next, we added two OLE DB Source components and a Sort component for each one. Then, we add a Merge transformation that takes both Sort components’ output as input.
As shown in the image below, the merge transformation editor is very simple. You only need to specify the component output columns and their corresponding column from each input.
Figure 8 – Merge transformation editor
As shown in the image below, the “ID” output column is the sort key for the component output, and it is mapped to the “ID” columns of each data source (the sorting keys).
Figure 9 – Data flow task screenshot
If we open the OLE DB destination advanced editor, we can check that the Id column sorts the Merge transformation output.
Figure 10 – OLE DB Destination Input properties
Figure 11 – SortKeyPosition property for the ID column
After executing the package, we can note that the count of Merge transformation output rows is equal to the sum of both input rows.
Figure 12 – Merge transformation Inputs and output rows count
SSIS Merge Join Transformation
The SSIS Merge Join Transformation combines two sorted inputs data flows into one output using a join operation. Three join operations are supported: Full, Left, and inner join.
Figure 13 – SSIS Merge Join transformation description in the SSIS toolbox
The image below shows that the SSIS Merge Join transformation joins both inputs using the sorting key used in each data source. You can select the output columns from both sources.
Figure 14 – SSIS Merge Join transformation editor
Since right join is not listed in the “Join Type” drop-down list, you can choose the left join option and click on the “Swap Inputs” button to swap the input from the right to the left side, which is equivalent to the RIGHT OUTER JOIN operation.
The SSIS Merge Join transformation output rows differ based on the join type selected:
- INNER JOIN: This type of join returns those rows which have matching values in both data sources
- LEFT OUTER JOIN: This type of join returns all rows from the left-side data source with all matching rows from the right-side source
- FULL JOIN: This type of join returns all matching rows from left and right-side data sources
SSIS Merge Join vs. Merge transformation
The difference between the SSIS Merge Join and the Merge transformation is like the difference between the UNION and JOIN operators in SQL Server. Both are used to combine rows from two data sources, but each has its own way of merging them. While Merge transformation is used to combine rows (such as UNION operation), SSIS Merge Join transformation is used to combine columns between different rows (such as SQL Joins).
In this article, we explained both SSIS Merge Join and Merge transformations. We also illustrated how to configure them. Besides, we explained how to sort data sources since it is a requirement for merging transformations in SSIS. Finally, we explained the main difference between both merging components in SSIS.