Hadi Fadlallah
data flow task screenshot after execution

SSIS Balanced Data Distributor Overview

November 22, 2019 by

In this article, we will give a brief overview of SSIS Balanced Data Distributor (BDD). In addition, we will give some examples and illustrate alternatives.

Introduction

SSIS Balanced Data Distributor is a transform tool added in SQL Server 2008 that takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading.

Before SQL Server 2016, this tool was not part of the SSIS toolbox; rather it was provided in a standalone installer hosted on the Microsoft website. As of SQL Server 2016, the tool is installed with SSIS.

When using BDD, it is recommended that all outputs be identical because if one output contains a transformation, it will affect the performance of all outputs. Having the same transformation components on each output will improve the performance.

Before deciding to use BDD, there are several things you should check:

  1. The destination supports parallelism: if you are importing data to one destination table, BDD may be useless. However, if you are importing data to a staging database, it may improve the data import performance
  2. Machine resources: The SSIS package must be executed on a multi-processor and multi-core server
  3. A large amount of data: This tool is useless when handling small amounts of data
  4. Ordering is not required: SSIS Balanced Data Distributor is used when data ordering is ignored since data is distributed without any consideration of row order

SSIS Balanced Data Distributor

Balanced Data Distributor is a very simple component that doesn’t need configuration. You can simply add it into your data flow task:

the description of SSIS balanced data distributor from the toolbox

Figure 1 – Description of SSIS Balanced Data Distributor from the toolbox

BDD does not have an editor form, so when a double-click is registered, nothing happens. On the other hand, there are some properties that can be edited from the properties tab:

the properties of SSIS balanced data distributor

Figure 2 – Properties of Balanced Data Distributor

From the image above, there are 4 properties that can be edited:

  • Description: A description of the component
  • Name: The name of the component
  • LocaleID: The locale the component uses to interpret locale-sensitive data, such as date and time data
  • ValidateExternalMetadata: Whether the component is validated using external data sources at design time

These properties are not related to the BDD, but they are found in any data flow component. In general, you don’t need to change them.

To add output from the BDD, drag the output arrow (connector) and link it to another component:

SSIS balanced data distributor component with the output connector

Figure 3 – Balanced Data Distributor component with connector

Example

In this section, I will provide an example of using SSIS Balanced Data Distributor to import data from a relational database (SQL Server AdventureWorks2017 database) into 5 flat files.

First, we have to add an OLE DB connection manager to connect to the SQL Server instance:

ole db connection manager used to connect to the source table

Figure 4 – OLE DB Connection Manager

Within the control flow, we add a Data Flow task. Within this data flow, we add an OLE DB Source and a balanced data distributor. In the OLE DB Source, we selected Table or View name as data access mode, then we selected [Person].[Person] table from the drop-down list (this table contains 19,972 rows):

ole db source configuration

Figure 5 – OLE DB Source Editor

In this example, uncheck AdditionalContactInfo and Demographics columns since they are binary columns. After configuring the OLE DB Source, we link it to the BDD.

Now, we add 5 Flat File destinations, and link these destinations to the BDD, as shown in the image below:

data flow task screenshot

Figure 6 – Data Flow task screenshot

Next, we open each Flat File Destination and create a Flat File connection manager for each one. As a result, we will have one OLE DB Connection manager for the source and 5 Flat File connection managers for the destination:

connection managers tab showing one ole db connection manager and 5 flat file connection managers

Figure 7 – Connection managers tab

After executing the package, note that rows are distributed into only 3 outputs while it should be 5 as shown in the image below:

data flow task screenshot after execution

Figure 8 – Data Flow task screenshot after execution

After checking the number of rows distributed in each output, we see that the first two outputs contain 9,947 rows while the third one contains only 3 rows and the other outputs contain no rows.

This behavior is due to the data flow task DefaultBufferMaxRows and DefaultBufferSize properties:

data flow task properties

Figure 9 – Data Flow task properties tab

Since the number of rows (9,947) is less than the DefaultBufferMaxRows property (10,000), this means that the size of the row has reached the DefaultBufferSize value. To change this behavior, we will change the DefaultMaxRows to 3,995 and re-execute this package.

The new behavior of the Balanced Data Distributor is shown in the image below:

data flow task screenshot after changing the DefaultBufferMaxRows  property

Figure 10 – Data Flow task screenshot after changing the DefaultBufferMaxRows property value

From this example, we conclude that to obtain an optimal situation, you must configure the data flow task-related properties such as DefaultBufferMaxRows, DefaultBufferSize, and EngineThreads based on the number of rows, row size, number of outputs and other factors.

Alternatives

If you are using a previous version of SQL Server 2016, and you need a solution to distribute rows into multiple outputs without using Balanced Data Distributor, you can use an alternative. In this section, I will provide two alternatives to SSIS Balanced Data Distributor. Note that all alternatives require that the data contains a row number column or it must be generated using a script component. Refer to the following article for more information: How to add a row number to a data flow task in SSIS 2012

Using conditional split

To split a single input into multiple outputs, you can add a conditional split, then add an expression for each output where you use a modulo operator to split rows using the following expression pattern:

[IdentityColumn] % <number of outputs needed> == <output number (zero based)>

As an example, if you need to split one input into 5 outputs then you must add the following expressions within the conditional split:

  • [IdentityColumn] % 5 == 0
  • [IdentityColumn] % 5 == 1
  • [IdentityColumn] % 5 == 2
  • [IdentityColumn] % 5 == 3
  • [IdentityColumn] % 5 == 4

See the image below:

conditional split used to distributed on input into different outputs

Figure 11 – Conditional split with 5 outputs

Or, you can add only 4 expressions and use the default output as the last output.

Using script component

Similar to the previous approach, you can use a script component to split one input into multiple outputs. There are two approaches that you can use:

  1. Add a multiple output buffer within the script component and use a script to distribute rows over outputs. More details can be found at: Conditional Multicast Script
  2. Add a new column in the output buffer and use a script to store the column index within this column. Then, outside the script component, use a conditional split to distribute the script component output into multiple outputs

Balanced Data Distributor installation error

We’ve often received questions about SSIS Balanced Data Distributor installation error since the standalone package throws the following error:

Problem with this Windows installer package. A program run as part of the installation did not finish as expected. Contact your support personnel or package vendor

In general, the package installation throws this exception in two cases:

  1. The installation requires administrative privileges (can be solved by running the package as an administrator)
  2. The assemblies cannot be registered in the global cache assembly and need to be registered manually. In this case, refer to this answer on Stack Overflow since it contains a step-by-step solution: SSIS Balanced Data Distributor Installation Error

Conclusion

SSIS Balanced Data Distributor is an efficient solution to implement multi-threading within SSIS packages introduced with SQL Server 2012. In order to maximize the BDD performance, we have to well configure DefaultBufferMaxRows and DefaultBufferSize properties in the Data Flow Task based on the machine resources, the data volume and number of BDD outputs.

When using the previous version of SQL Server 2012, there are many alternatives to implement multi-threading, such as using a conditional split or a script component transformation.

Hadi Fadlallah
ETL, Integration Services (SSIS)

About Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com . Also, he published several article series about Biml, SSIS features, Search engines, Hadoop, and many other technologies. Besides working with SQL Server, he worked with different data technologies such as NoSQL databases, Hadoop, Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional. On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science focusing on Big Data quality assessment techniques. Hadi really enjoys learning new things everyday and sharing his knowledge. You can reach him on his personal website. View all posts by Hadi Fadlallah

168 Views