In this article, we will give a brief overview of SSIS Balanced Data Distributor (BDD). In addition, we will give some examples and illustrate alternatives.
SSIS Balanced Data Distributor is a transform tool added in SQL Server 2008 that takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading.
Before SQL Server 2016, this tool was not part of the SSIS toolbox; rather it was provided in a standalone installer hosted on the Microsoft website. As of SQL Server 2016, the tool is installed with SSIS.
When using BDD, it is recommended that all outputs be identical because if one output contains a transformation, it will affect the performance of all outputs. Having the same transformation components on each output will improve the performance.
Before deciding to use BDD, there are several things you should check:
- The destination supports parallelism: if you are importing data to one destination table, BDD may be useless. However, if you are importing data to a staging database, it may improve the data import performance
- Machine resources: The SSIS package must be executed on a multi-processor and multi-core server
- A large amount of data: This tool is useless when handling small amounts of data
- Ordering is not required: SSIS Balanced Data Distributor is used when data ordering is ignored since data is distributed without any consideration of row order
SSIS Balanced Data Distributor
Balanced Data Distributor is a very simple component that doesn’t need configuration. You can simply add it into your data flow task:
Figure 1 – Description of SSIS Balanced Data Distributor from the toolbox
BDD does not have an editor form, so when a double-click is registered, nothing happens. On the other hand, there are some properties that can be edited from the properties tab:
Figure 2 – Properties of Balanced Data Distributor
From the image above, there are 4 properties that can be edited:
- Description: A description of the component
- Name: The name of the component
- LocaleID: The locale the component uses to interpret locale-sensitive data, such as date and time data
- ValidateExternalMetadata: Whether the component is validated using external data sources at design time
These properties are not related to the BDD, but they are found in any data flow component. In general, you don’t need to change them.
To add output from the BDD, drag the output arrow (connector) and link it to another component:
Figure 3 – Balanced Data Distributor component with connector
In this section, I will provide an example of using SSIS Balanced Data Distributor to import data from a relational database (SQL Server AdventureWorks2017 database) into 5 flat files.
First, we have to add an OLE DB connection manager to connect to the SQL Server instance:
Figure 4 – OLE DB Connection Manager
Within the control flow, we add a Data Flow task. Within this data flow, we add an OLE DB Source and a balanced data distributor. In the OLE DB Source, we selected Table or View name as data access mode, then we selected [Person].[Person] table from the drop-down list (this table contains 19,972 rows):
Figure 5 – OLE DB Source Editor
In this example, uncheck AdditionalContactInfo and Demographics columns since they are binary columns. After configuring the OLE DB Source, we link it to the BDD.
Now, we add 5 Flat File destinations, and link these destinations to the BDD, as shown in the image below:
Figure 6 – Data Flow task screenshot
Next, we open each Flat File Destination and create a Flat File connection manager for each one. As a result, we will have one OLE DB Connection manager for the source and 5 Flat File connection managers for the destination:
Figure 7 – Connection managers tab
After executing the package, note that rows are distributed into only 3 outputs while it should be 5 as shown in the image below:
Figure 8 – Data Flow task screenshot after execution
After checking the number of rows distributed in each output, we see that the first two outputs contain 9,947 rows while the third one contains only 3 rows and the other outputs contain no rows.
This behavior is due to the data flow task DefaultBufferMaxRows and DefaultBufferSize properties:
Figure 9 – Data Flow task properties tab
Since the number of rows (9,947) is less than the DefaultBufferMaxRows property (10,000), this means that the size of the row has reached the DefaultBufferSize value. To change this behavior, we will change the DefaultMaxRows to 3,995 and re-execute this package.
The new behavior of the Balanced Data Distributor is shown in the image below:
Figure 10 – Data Flow task screenshot after changing the DefaultBufferMaxRows property value
From this example, we conclude that to obtain an optimal situation, you must configure the data flow task-related properties such as DefaultBufferMaxRows, DefaultBufferSize, and EngineThreads based on the number of rows, row size, number of outputs and other factors.
If you are using a previous version of SQL Server 2016, and you need a solution to distribute rows into multiple outputs without using Balanced Data Distributor, you can use an alternative. In this section, I will provide two alternatives to SSIS Balanced Data Distributor. Note that all alternatives require that the data contains a row number column or it must be generated using a script component. Refer to the following article for more information: How to add a row number to a data flow task in SSIS 2012
Using conditional split
To split a single input into multiple outputs, you can add a conditional split, then add an expression for each output where you use a modulo operator to split rows using the following expression pattern:
[IdentityColumn] % <number of outputs needed> == <output number (zero based)>
As an example, if you need to split one input into 5 outputs then you must add the following expressions within the conditional split:
- [IdentityColumn] % 5 == 0
- [IdentityColumn] % 5 == 1
- [IdentityColumn] % 5 == 2
- [IdentityColumn] % 5 == 3
- [IdentityColumn] % 5 == 4
See the image below:
Figure 11 – Conditional split with 5 outputs
Or, you can add only 4 expressions and use the default output as the last output.
Using script component
Similar to the previous approach, you can use a script component to split one input into multiple outputs. There are two approaches that you can use:
- Add a multiple output buffer within the script component and use a script to distribute rows over outputs. More details can be found at: Conditional Multicast Script
- Add a new column in the output buffer and use a script to store the column index within this column. Then, outside the script component, use a conditional split to distribute the script component output into multiple outputs
Balanced Data Distributor installation error
We’ve often received questions about SSIS Balanced Data Distributor installation error since the standalone package throws the following error:
Problem with this Windows installer package. A program run as part of the installation did not finish as expected. Contact your support personnel or package vendor
In general, the package installation throws this exception in two cases:
- The installation requires administrative privileges (can be solved by running the package as an administrator)
- The assemblies cannot be registered in the global cache assembly and need to be registered manually. In this case, refer to this answer on Stack Overflow since it contains a step-by-step solution: SSIS Balanced Data Distributor Installation Error
SSIS Balanced Data Distributor is an efficient solution to implement multi-threading within SSIS packages introduced with SQL Server 2012. In order to maximize the BDD performance, we have to well configure DefaultBufferMaxRows and DefaultBufferSize properties in the Data Flow Task based on the machine resources, the data volume and number of BDD outputs.
When using the previous version of SQL Server 2012, there are many alternatives to implement multi-threading, such as using a conditional split or a script component transformation.