Bhavesh Patel

Hash partitions in SQL Server

June 28, 2018 by

In SQL Server, when talking about table partitions, basically, SQL Server doesn’t directly support hash partitions. It has an own logically built function using persisted computed columns for distributing data across horizontal partitions called a Hash partition.

For managing data in tables in terms storage, performance or maintenance, there is a need to keep data allocation units in large chunks. For the purpose of speeding up loading and archiving of data, tables may need to be divided into multiple units that can reside in multiple filegroups. When referring to partitions, there are two types; Vertical and Horizontal partitions.

Note: This concept was explained nicely by Milica Medic in the article Database table partitioning in SQL Server.

First, I will describe partitions at a high level.

Vertical partitions

Generally, Vertical partitioning means a semi data normalization process based on data use vs meaning within the database design. It involves removing redundant columns from the table and organizing a second table connected via a relationship. Suppose, we have a “heavy” table in terms of size; large columns may exist in the tables for an example, nvarchar, xml etc. Data might be rarely accessed in those large columns. When performing an operation in heavy tables, issue may arise with related IO cost, locking, latch contention etc. Vertical portioning is meant to address this by offloading lesser used columns and their resulting data to another table.

Vertical partitioning should be considered carefully, though, because it might affect performance if the partition is very large because there may be an increased number of joins in queries used for getting the records, when all of the columns (vs just a sub-set) are required.

Horizontal partitions

In this approach, the tables are divided into multiple tables, but with each table having the same columns. The partitioning, in this case, is that the rows are divided, amongst these tables, as per applied rules.

Hash partitions in SQL Server

In simple terms, a Hash partition is a Horizontal partition. It’s needed to logically build a hashing function and with it, ensure data is randomly distributed across the partition. This means that the partition key functions as an identifier and any new row are allocated to a particular partition based on the result of passing the partition key into a hashing algorithm. Due to this, data will be inserted randomly across the partition. Compared with a Horizontal Partition the only difference is that, here there is a need to add persisted computed columns into the index keys to deal with the partitioning scheme.

Why we need Hash partitions in SQL Server

In my previous article, I have discussed about Hot latches. You can refer this for more clarification about “The last page insert contention issue”.

When concurrency insertion happens then query queues are generated on the last page. This means, in attempting to perform this insertion request latch contention issue is generated. For the solution of this latch contention, there are multiple options available. Out of those, a Hash Partition is one of the best technique for reducing latch contention.

Building Hash partition with the help of computed column

As a part of the horizontal partition, let’s discuss in steps, how to organize and distribute data using a hashing function. Currently, for the demonstration purposes, I am going to create a new database with steps.

Step1: Create a new DB

We need an additional filegroup and file for holding partitions and to ensure that the filegroup resides in an optimal drive location in term of IO cost. We can add filegroup at the time of database creation.

Now we have the filegroup DistData with the new file DistDatafile added into the database Hashpartition.

Step2: Creating a new partition function

Partition function need to create in the current database. it maps the rows of a table or index into partitions based on the values of a specified column.

Syntax

A boundary value needs to be assigned. it specifies the boundary value of each partition of a partition table or index that uses partition_function_name. If the boundary_value is empty the partition function maps the whole table or index using partition_function_name into a single partition. There is only one partitioning column used which is specified in a CREATE_TABLE OR CREATE_INDEX statement. It needs to specify which side of each boundary value interval, left or right, the boundary_value [ ,..n ] belongs, when interval values are sorted by the Database Engine in ascending order from left to right. If not specified then LEFT is the default.

For an example, Here I will create table “customer_invoice”. Using this function, I will distribute the value in 8 boundaries.

For introducing the function, execute as follows.

Step3: introduce a new partition scheme and integrate to object in steps.

A Partition Scheme need to be created in the current database. It maps the partitions of a partitioned table or index to filegroups. The number and domain of the partitions of a partitioned table or index are determined in a partition function. It means, first need to create partition function Before creating a partition scheme.

Syntax

We have already created partition function “Hashing”. As per mentioned above syntax, I will go for creating a new partition scheme with using partition function “Hashing”. Partitions will be created by the partition function are mapped to the filegroups specified in the partition scheme. A single partition cannot contain both FILESTREAM and non-FILESTREAM filegroups.

For introducing the partition scheme, execute as follows.

Scheme “ps_hashing” is introduced. Currently, its bounded with function Hashing and storage is residing with this filegroup “DistData”. Now I am going to integrate table object “customer_invoice” which is available for me in my environment, but here for demonstration purpose, I will mention object script here, execute as follows.

Still, all steps are performed with aligning horizontal partition steps. As a part of scatter, the step is distributing data with generating dynamic key value; “Hash value”. For the purpose of dynamic distribution, Here I am going to add persisted computed column into this table “customer_invoice”. Execute as follows.

After adding this step, a Computed column is added. We need to add index part latter on. As for its functionally, it’s getting the checksum value of the column invoiceDetail and finalize the value with modulo operation. We need to make sure for calculating the hash value as per of data flow. Here For demonstration purposes, I introduced an example for distributing key part of the data.

In addition, I need to add a unique index with the leading column key hashvalue in this table; “customer_invoice”. I will go to add with the combination of the unique value column id with aligning partition scheme “ps_Hashing”. Execute as follows.

After adding this index, now Data distribution is organized randomly on this table. Now the data insert is going to the end of the logical range, but the hash value module’s operation is divided this across the B-tree structure though we can solve out last page insertion contention/Hot latch issue.

Step4: Demonstration of distributing the data flow.

Now, I have a database ready for execution, I will add sample chunk records with the help of the script, execute as follows.

Data is inserted across the B-Tree structure randomly, as a part of viewing range wise distribution, I have prepared script. Execute as follows.

Kindly review above mentioned result, data is distributed randomly as we insert before. Here we have created own filegroup for demonstration purpose, we can organize data in multiple file group wise as per business use case. Kindly insuring the index should be the light weight in terms of key-length because of, due to heavy concurrency, it might be the case Page-split found. I suggest, need to study about hash value and observe your business use-case, then initiate appropriate action into it.

Trade-offs when using hash partitioning

  • Due to random insertions, within a concurrent environment, it might be the case that heavy page splitting occurs.
  • SELECT query performance might be affected due to hash partition bad query plan estimation.
  • It is more difficult to maintain referential integrity while increasing key combinations of unique CLUSTERED index.
  • Due to large size of the index, it is difficult to maintain maintenance as well.

Benefits of partitioning

  • You can transfer or get subsets of data in a quick and efficient manner, while maintaining the integrity of data collection. For OLTP workloads you can get the data in quickly if data is available in a partitioned manner.
  • The Data buffer is made smaller while data is screened with an appropriate partition, due to the fact that will often reduce the IO operational and data sorting cost.
  • You may improve query performance, based on the types of queries for an example, the query optimizer can process equi-join queries between two or more partitioned tables faster when the partitioning columns in the tables are the same, because the partitions themselves can be joined.
  • In addition, you can improve performance by enabling lock escalation at the partition level in lieu of a whole table. This can reduce lock contention at the table level.
  • We can solve the concurrency insertion Hot-latches (PAGELATCH-EX) issue with Hash Partitions.

References



Bhavesh Patel
General database design, Partitioning

About Bhavesh Patel

Bhavesh Patel is a SQL Database professional with 9+ years of experience. During his career, he strongly workes on his DBA development and has been working with SQL Server 2000, 2005, 2008, 2012, 2014 and 2016. Currently, his main role is query tuning with respect to optimization and server performance. He is interested in constant learning and likes to face challenges. In his spare time, he spends time with his family.

168 Views