This article explores Row Sampling Transformations in SSIS and Percentage Sampling Transformations in SSIS packages
Sometimes we have a large number of rows in database tables. We want to perform some calculations on data, but we do not require the complete data from the table. You can understand it like a blood test in the medical laboratory. It requires you to give a sample of your blood so that they can perform lab test on it and give you results.
Suppose we have customer’s sales record and it contains millions of records. You want to perform aggregations or calculations on those data. If you take complete data, it will consume many system resources and package execution will also take time. We want a mechanism that should fetch 20% records from the table, and it can be used for further work.
Another example you can consider is the lottery system. Suppose you want to select 20 customers from the entire data set randomly.
We can use data sampling transformation in SSIS to fulfil our purpose.
SSIS package provides several useful transformations. In the previous articles such as Term Extraction Transformation in SSIS, An Overview of the LOOKUP TRANSFORMATION in SSIS, we learned such transformations.
SSIS package provides data sampling transformations in SSIS that can solve the problem described earlier. We can see two kinds of data sampling transformations:
- Row Sampling Transformation
- Percentage Sampling Transformation in SSIS package
- SQL Server data tools
- AdventureWorks sample database
- SQL Server integration service
- You should have basic knowledge of SSIS. If you are a beginner, please follow the articles in the SSIS category.
Let’s create an SSIS package for data sampling in the SSIS package.
The following table in AdventureWorks database contains 19820 rows.
In the SSIS Control flow window, add a data flow task and rename it to Data Sampling Transformation in SSIS.
Right-click on this data flow task and Edit. It takes you to data flow page. In this page, you can see that we are in this particular data flow task.
Add an OLE DB source and rename the task as appropriate. This task should point to SQL instance and Sales.Customers table in AdventureWorks2014 database.
Right-click on a blank area in the data flow task and click on Add Annotation. Annotation is similar to a text box that does not execute, and we use to print messages that help to understand the SSIS package.
I specified 19820 rows in this annotation box.
We have prepared the base of the SSIS package in this step. Let’s move forward with data sampling transformations in SSIS.
Row Sampling Transformation in SSIS
We use Row sampling transformation to retrieve a specified random number of data rows from the source data table. It gives random data every time we execute the SSIS package. You get two outputs from this transformation.
- Random data based on a specified number of rows
- Rest of the data that is not available in step 1
Add a Row Sampling transformation from the SSIS toolbox and drag the blue arrow from source to transformation, as shown below.
Double click on Row Sampling, and it opens the row sampling transformation editor.
We have the following options in this editor window.
- The number of rows: We can specify the number of random rows we want from the transformation. The default value is 10. Let’s modify it to 1000 rows
- Sample Output name: It shows the name of the output that we get from the transformation as specified by the number of rows parameter. The default name is Sampling Selected Output. Let’s modify it to Row Sampling Match output
- Unselected output name: It is the name of the output that contains data excluded from row transformation. We get the total number of rows in the table – the number of rows specified in this output. Let’s modify the name to Excluded data
Let’s skip the option ‘User the following random seed’ as of now. We will cover it in the latter part of the article.
Click on Columns, and it shows all available source columns.
Now, add two SSIS Multicast transformations and rename them as follows.
- Multicast – Matched
- Multicast – Unmatched
Join the output from Row Sampling to Multicast – Matched, and it opens the input-output selection window. In the output column, select the output – Row Sampling Match output.
Similarly, take the second output from Row Sampling and join to Multicast – unmatched transformation. It will automatically take another available output, as shown below.
We added SSIS Multicast operator here to display the data. If you want to insert data into SQL table, configure OLE DB destination as well.
Right-click on the arrow between Row Sampling and Multicast- Matched and enable data viewer.
It shows the following symbol on the arrow.
Press F5 to execute the SSIS package. It opens the data viewer, and you can see the 1000 rows in the output.
Close this data viewer and package execution complete. In the output, we can see that
- Multicast – Matched gets 1000 rows
- Multicast – Unmatched gets 18,820 rows
Let’s do the following configurations in the data sampling
- Number of rows: 10
- Use single-column AccountNumber
Execute the SSIS package twice and note the output.
You can compare the output in both the executions. In both the executions, it gives random account numbers and it different in both execution. It might also pick certain account number again in second execution in random pick up.
Suppose we want to get similar records on each execution. It should give us the output as per specified record count, but records should not be random.
In the configured SSIS package, open the properties again for row sampling transformation and set the random seed value to 1. It is recommended only for testing purpose.
Execute the package again twice and observe the output.
You get same data in both the executions. It picks the random data at once and does not change data in the next execution.
Percentage sampling transformation in SSIS
In the previous section, we discussed the Row Sampling Transformation in SSIS. Percentage sampling configuration is similar to row sampling.
In row sampling, we specify the number of rows that we want in the output, such as 500 rows or 1000 rows. I Percentage sampling, we specify the percentage of rows. For example, if total incoming rows are 1000 and we specify 10% percentage sample, we get approximately 100 rows in the matched output. Remaining rows get to unmatched row output.
Similar to row sampling transformation, it picks random sampling data, and you might get a completely different result set on each execution. We can specify random seed value to get similar data on each execution.
Let’s do the configuration for Percentage sampling transformation in SSIS package.
- Remove the row sampling and underlying multicast operators
- Drag a percentage sampling from the SSIS toolbox and join arrow between source data and percentage sampling
- In percentage sampling, specify the percentage of rows, output column names
- Add two SSIS Multicast transformations
- First Multicast transformation gets the desired percentage sample of rows
- Other Multicast transformation gets unmatched rows
Execute the SSIS package. In the following screenshot, we can see that percentage sampling transformation in SSIS does the following tasks
- Total number of rows in source table – 19820
- Specify percentage sampling – 5%
- First, Multicast gets 1015 rows that are approximately 5% of the sample data
Let’s specify the random seed value 1 in the percentage sampling transformation and execute the SSIS package.
In both executions, it gets similar account number in the output.
In this article, we explored data sampling technique – Row Sampling transformation and Percentage Sampling transformation in SSIS package. You can use these transformations to test package against a different set of data and analyze results.