In this article, I am going to explain in detail about SSIS memory usage and how can we optimize out tasks and the data flow to leverage maximum benefits from the in-memory operating tool. As you might be aware, SSIS also known as SQL Server Integration Services is a data integration tool, provided by Microsoft which comes shipped with the SQL Server editions. SSIS is an enterprise-scale, in-memory data integration tool which can be used to move data between different databases or different servers in a comfortable yet manageable way.
How much memory is needed for SSIS?
In ETL design, the often most important question that I face is how much memory we should put in for our production environment. While we work on configuring the environments or creating a new instance of SSIS, we must understand the jobs that are going to be set up, how many times are these jobs going to be executed etc. These parameters will help a BI Engineer to calculate and decide a correct figure for setting up the required memory requirements for the instance.
Since SSIS is an enterprise tool, it can consume as much RAM as you can allocate to it. However, allocating more RAM also increases the budget on the infrastructure of your projects which might not be an ideal case always. So, you should consider the package design, whether it is ETL or ELT, if there is any parallelism involved within the packages, if there are any expensive operations like sort or merge within the packages and so on. All these factors contribute to the SSIS memory usage and we need to understand these in detail.
Memory consumption by Data Flow Tasks (DFT)
Data Flow Tasks or DFTs as most of us call it, are one of the most commonly used tasks and we need to understand its SSIS memory usage. Using a Data Flow Task, we can quickly load data from a flat file to a table in SQL Server or vice versa. SSIS packages, if executed from the SSIS catalog, are executed on a separate process known as ISServerExec.exe. This process allocates most of the available memory from the system for each of the package that is being executed and release this memory shortly after the execution has been completed.
For example, if a package uses a Data Flow Task which loads 10M records from a flat file to a SQL Server table, then all these 10M records need to pass through the memory space that has been allocated to the SSIS server during execution.
Figure 1 – SSIS Memory Usage by Data Flow Tasks
However, loading all the records into SSIS memory while execution does not mean that all the 1M records will be loaded into the memory at once. If this was the case, then we would require a lot more memory to support 1M records at a single time. Instead, what SSIS does is, it creates much smaller chunks of data from the 1M records and then loads these chunks one after another. These chucks of data, also known as a buffer in SSIS are used so that we can optimize the memory and load a smaller amount of data and process it to the database before loading the next buffer.
Figure 2 – SSIS Buffer Example
Memory Consumption by Lookup Transformations
As explained in the previous example about the buffers in SSIS which releases data as soon as they are loaded into the database, there are also scenarios in SSIS where data resides in the memory for the entire duration of the package execution. In a lookup transformation, the data is usually cached in the memory during the package execution. This enables SSIS to execute the package much faster because the lookup data is directly available in the memory and the SSIS engine does not need to make queries to the source data to fetch the lookup values.
Figure 3 – SSIS Memory Usage by Lookup Transformation
We can alter this behavior of the lookup transformation by modifying the cache mode of the transformation while designing the package. There are three modes for defining the cache in a lookup transformation – full cache, partial cache, or no cache. By default, the full cache mode is active, which loads all the data into memory during the execution phase. Setting this to partial or no cache may reduce the load on the memory, however, it might increase the number of queries generated to fetch the lookup values from the data source every time.
Fully blocking SSIS Memory Usage
There are a few operations in SSIS like the Sort transformation, aggregate, the fuzzy lookup, etc., which loads the entire set of records into memory before performing the operation. These are exceptions to the buffering technique as I have explained in the first example. This is because the operation is performed on a set of data rather than the individual records. For example, in a sort operation, the SSIS engine needs to know what are all the values of a possible column on which the records are to be sorted before moving forward with the operation. For this reason, all the records are loaded into memory at the same time and the transformation is applied in memory. Once the transformation is completed, the data is then written to a database and the memory is released.
Figure 4 – Fully blocked memory consumption by Sort operator
Although the above-explained sort operation is good for small-sized flat files, it is advised not to apply the same principle to a large data source. For large data sources, which require a hefty amount of memory to process such information, a good idea would be to use an ELT architecture instead of an ETL, where the data will be first loaded into a staging area and from there it can be finally loaded into the database. In such an operation, the sort transformation will be handled by the database engine and not by SSIS. This will also help to reduce some load on the memory as the processing will be shifted to the database.
Memory allocation in SQL Server and SSIS
Most of us, developers, and database administrators (DBAs) might be already aware of the maximum and minimum memory options that can be configured for a SQL Server instance. Please note, that this memory which is configured for SQL Server has got nothing to do with the SSIS memory. SSIS packages are executed on the different processes from the database engine and the memory is allocated from the available system memory during runtime. For example, let us consider an environment which has 150GB of RAM, out of which 128GB is being allocated to the SQL Server instance. Out of the remaining 22GB, let us assume that 8GB will be consumed by the Windows process which leaves us to 14GB of available memory. When the SSIS process starts, it will allocate this 14GB of memory to the SSIS engine and do all the processing using this.
If you consider increasing the performance of your SSIS packages on such a machine without increasing the memory, you can deallocate some memory from the SQL Server instance, and then it will be available for the SSIS engine to use when required.
In this article, we have understood the concept of SSIS memory usages. I have explained about the various execution tasks that are available within SSIS and how these tasks consume memory. There is no hard and fast rule to define how much memory is optimal for SSIS. It always depends on the volume of data that is being moved across different systems.
Usually, higher memory helps to achieve faster transformation as much more data can be loaded into memory while processing and thus helps in completing the job faster. As you grow as a developer, with your experience, you will be able to determine the correct required memory for an SSIS instance and configure the environment accordingly.