
The benefits, costs, and documentation of database constraints

April 7, 2016 by Ed Pollack

Constraints exist as a way to enforce or document rules within the database. How do they affect performance, and what are the benefits of using them?

Introduction

A variety of constraints can be implemented in a SQL Server database, each providing its own functionality. Some are so familiar that we forget their presence, such as NOT NULL constraints, while others may not be used as often, such as check constraints or default constraints.

This is an opportunity to examine each type of constraint: after introducing its purpose and usage, we'll look at its storage, performance, and optimal use, as well as common pitfalls. Please note that all demos were created and tested in SQL Server 2016 RC1.

Purpose

Constraints may be used for a variety of purposes. The following is a short list of what I consider the most common reasons that constraints are created:

  • Prevent bad data from being entered into tables of interest.
  • Enforce business logic at the database-level.
  • Documentation of important database rules.
  • Enforce relational integrity between any number of tables.
  • Improve database performance.
  • Enforce uniqueness.

There are others, of course. The reasons for using constraints are quite varied, and they can be a helpful tool for database administrators and developers alike. Typically, it is recommended to use constraints over more complex, procedural logic enforcement, such as triggers, rules, stored procedures, or jobs.

Let's review each type of constraint, one at a time, along with some examples of its usage.

NOT NULL constraints

This is so common that we may forget that a column that does not allow NULLs is technically in possession of a NOT NULL constraint. Its function is straightforward: a column with this constraint will not allow a NULL to be entered, and any attempt to do so will throw an error.

NULLability may be defined at the time a table is created, like this:


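A minimal table of that shape works for this demo; the VARCHAR lengths and exact data types below are my own choices:

CREATE TABLE dbo.NullDemo
(
    Null_Id INT NOT NULL,
    Null_Data VARCHAR(50) NULL,
    Not_Null_Data VARCHAR(50) NOT NULL
);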
The syntax is simple: in this table, Null_Id and Not_Null_Data will not allow any NULLs to be entered, while Null_Data will allow NULL in addition to standard string values. When creating a table, you can choose not to specify NULL or NOT NULL at all; if you omit it, the column will default to being NULLable. Since NOT NULL is an explicit constraint, the default setting for any column is to allow NULLs.

Using this table, let’s insert a few rows into it:


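Any two rows will do; something along these lines (the dinosaur names are arbitrary):

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (1, 'Stegosaurus', 'Brachiosaurus');

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (2, 'Velociraptor', 'Triceratops');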
Both of these inserts succeed without incident, adding two dinosaurs into our demo table.


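Next, an insert that explicitly puts a NULL into Null_Data:

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (3, NULL, 'Ankylosaurus');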
This also works without a problem. Inserting a NULL is OK since we explicitly defined Null_Data to allow NULLs.


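Finally, an insert that tries to put a NULL into Not_Null_Data (the flyer's name here is just a placeholder):

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (4, 'Pterodactyl', NULL);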
Our poor dino-flyer never makes it into the database as we tried to insert a NULL into the last column, Not_Null_Data, which we applied the NOT NULL constraint to. The result is the expected error:

Msg 515, Level 16, State 2, Line 27
Cannot insert the value NULL into column ‘Not_Null_Data’, table ‘AdventureWorks2014.dbo.NullDemo’; column does not allow nulls. INSERT fails.
The statement has been terminated.

NOT NULL is exceptionally useful when we have a table with columns that are critical to the definition of that data structure. For example, an account without a name would be potentially meaningless, whereas an employee with no start date would cause us to scratch our heads and ask why no one entered it in the first place.

NOT NULL guards against mistakes by the user as well as mistakes by a software or database developer. If a column is required, then we want to ensure that it is always entered, no matter what. The application should tell a user that they need to enter a required field, returning a friendly message if they forget. In the event that the application somehow allows a NULL where one does not belong, an error will be returned that the developers can easily troubleshoot and fix. This scenario is preferable to allowing data to be created with critical missing elements.

There are potentially dangerous or inappropriate uses for NOT NULL. While we tend to prefer not having NULLs to deal with in our code, forcing arbitrary values into our data can be dangerous. A NOT NULL string that defaults to a blank is relatively harmless, but what about a NOT NULL date or a NOT NULL integer? How many times have we run into a date column that was peppered with “1/1/1900” within its data? Or the integer that uses “-1” to fill in the blanks? This data can become confusing, as the columns appear to have valid values when in fact they contain dummy data. Special application or database code needs to check for these values and act appropriately when they exist. In addition, what happens if a user deliberately enters a value that matches a dummy value? Will the application disallow this? Will the database consider their entry irrelevant and the same as taking some default instead? What if your great grandfather was actually born on 1/1/1900? As of the writing of this article, there are people older than 116 who have birthdays that fall around this oft-used dummy date. Improbable, sure, but if I were 116 years old and got an error while signing up for Snapchat, I’d be pretty pissed off 🙂

We could avoid this problem in a variety of ways:

  • Allow NULL in those columns
  • Create a BIT column that signifies the special case of when data doesn’t exist, such as if a foreign national doesn’t have a US driver’s license, or if a newborn doesn’t have a social security number (yet).
  • Create a separate table to store this data when it exists, thereby eliminating the need for odd/NULL data in the interim. This is a potentially good solution when a particular column is very sparse, i.e. it is very often not entered or rarely required or present.
  • Concatenate the oft-missing data with another column. For example, address lines 1, 2, and 3 could be combined in order to avoid the need for the additional address lines that may or may not have valid data in them.

A column can be changed to NULL or NOT NULL at any time in the future using an ALTER TABLE ALTER COLUMN statement. If we attempt to change a column to NOT NULL that contains any NULLs, then that TSQL will fail:

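For example, reusing the assumed VARCHAR(50) definition from above:

ALTER TABLE dbo.NullDemo
ALTER COLUMN Null_Data VARCHAR(50) NOT NULL;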
We previously entered a single row into this table with a NULL in this column. When we try to change it to NOT NULL, the error we receive is similar to what we saw earlier:

Msg 515, Level 16, State 2, Line 32
Cannot insert the value NULL into column ‘Null_Data’, table ‘AdventureWorks2014.dbo.NullDemo’; column does not allow nulls. UPDATE fails.
The statement has been terminated.

The schema change of the column to NOT NULL is seen as an update to all values within, which fails due to a single row containing a NULL.

NULL is interesting because it is not stored in the same manner as the rest of your data. Since NULL is not a value, but an indicator of non-existence, it is tracked in a separate part of each row in a structure called the NULL bitmap. This structure allows SQL Server to determine the NULLability of a column without the need to read the data itself in order to figure it out. The NULL bitmap exists for all columns in a table, NULLable or not, and therefore a table full of NOT NULL columns will not use less disk space via the absence of the NULL bitmap. The exception to this is sparse columns: if a table is made up entirely of sparse columns, then no NULL bitmap will be present. We'll call that an edge case, but there is no harm in being thorough here.

The performance benefits of NOT NULL are subtle, but they do exist. The most straightforward benefit appears if you query a NOT NULL column for NULL:


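For example, against AdventureWorks (the column list is arbitrary):

SELECT ProductID, Name, ReorderPoint
FROM Production.Product
WHERE ReorderPoint IS NULL;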
It turns out that ReorderPoint is a NOT NULL column, and since SQL Server knows this via the table metadata, it can bypass the need to read any actual data at all and return an empty result set very efficiently. When run, this query returns no results (no surprise there), but can do so with zero logical reads and practically no query cost:

The constant scan operator in the execution plan indicates that no tables were accessed, and scanning internal metadata about Production.Product was all that was needed to complete the query. The query cost is tiny, as is usage of CPU and memory when performing what is a very simple task.

Otherwise, reads on NULL or NOT NULL columns will be similar. If we were to create two versions of a column, one that allows NULLs and another that does not and query against each one, the results would be virtually the same. Some experiments I ran with a variety of queries against NULL and NOT NULL columns produced a very small amount of variation, but I was able to ultimately chalk it up to index fragmentation.

Unique constraints

A unique constraint allows us to take any column or set of columns and enforce uniqueness on their contents. This allows us to apply business or common-sense rules to our data, for example:

  • Ensure that no two people have the same social security number.
  • Prevent any accounts from sharing a name.
  • Verify that a set of data elements are unique, such as credit card type, credit card number, and expiration date.
  • Make sure that a relationship between two entities is never repeated.

Unique constraints are physically implemented as unique indexes, and provide the performance benefit of an index on the unique columns, while also enforcing uniqueness. There are a number of ways to define a unique index. If we were looking to add a unique index on Null_Id in our previous example, we could do so with the following syntaxes:


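Here is one plausible set of five variants; the constraint/index name matches the one used throughout this section, while the exact arrangement of the snippets is an educated guess:

-- 1: unique constraint (non-clustered index by default)
ALTER TABLE dbo.NullDemo ADD CONSTRAINT UX_NullDemo_Null_Id UNIQUE (Null_Id);

-- 2: unique constraint, explicitly non-clustered
ALTER TABLE dbo.NullDemo ADD CONSTRAINT UX_NullDemo_Null_Id UNIQUE NONCLUSTERED (Null_Id);

-- 3: unique constraint, clustered
ALTER TABLE dbo.NullDemo ADD CONSTRAINT UX_NullDemo_Null_Id UNIQUE CLUSTERED (Null_Id);

-- 4: unique non-clustered index
CREATE UNIQUE NONCLUSTERED INDEX UX_NullDemo_Null_Id ON dbo.NullDemo (Null_Id);

-- 5: unique index (non-clustered by default)
CREATE UNIQUE INDEX UX_NullDemo_Null_Id ON dbo.NullDemo (Null_Id);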
These all accomplish similar goals, but there are differences. The 1st and 2nd options will create unique constraints and do so as non-clustered indexes. The 3rd will do the same, but as a clustered index. Remember that you can only have a single clustered index per table, so if one already exists, creating another will throw an error. The 4th and 5th code snippets also create unique indexes, but there is an interesting, subtle difference in how these constraints are represented in SQL Server. After executing the 1st code snippet, Management Studio shows the following objects within the NullDemo table:

Note that a non-clustered index was created, in addition to a unique key. Now, let’s drop this index using the standard syntax:


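Using the usual form of the command:

DROP INDEX UX_NullDemo_Null_Id ON dbo.NullDemo;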
The result is an error:

Msg 3723, Level 16, State 5, Line 38
An explicit DROP INDEX is not allowed on index ‘dbo.NullDemo.UX_NullDemo_Null_Id’. It is being used for UNIQUE KEY constraint enforcement.

Hmmm, my attempt to drop a unique index failed as it’s considered a constraint. Let’s try the drop constraint syntax:

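That is:

ALTER TABLE dbo.NullDemo DROP CONSTRAINT UX_NullDemo_Null_Id;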
This time, the constraint drops as expected:

Command(s) completed successfully.

Strange, but we’ll move on for now and come back to this in a bit. Let’s create the index using the 4th snippet and examine the contents of the table via SQL Server Management Studio:

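Assuming the 4th snippet is the CREATE UNIQUE NONCLUSTERED INDEX form sketched earlier:

CREATE UNIQUE NONCLUSTERED INDEX UX_NullDemo_Null_Id ON dbo.NullDemo (Null_Id);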
After refreshing the tree above, we notice that there is nothing under the “Keys” section! We were being led to believe that a unique index and a unique constraint are the same, but clearly there are differences. Let's drop the constraint as we did before:


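In other words, the same ALTER TABLE…DROP CONSTRAINT statement as before:

ALTER TABLE dbo.NullDemo DROP CONSTRAINT UX_NullDemo_Null_Id;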
Running this yields an error (again!?):

Msg 3728, Level 16, State 1, Line 41
‘UX_NullDemo_Null_Id’ is not a constraint.
Msg 3727, Level 16, State 0, Line 41
Could not drop constraint. See previous errors.

SQL Server can’t seem to make up its mind! Let’s use the alternate DROP INDEX syntax:


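That is, dropping it as an index rather than as a constraint:

DROP INDEX UX_NullDemo_Null_Id ON dbo.NullDemo;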
Once again, we have success:

Command(s) completed successfully.

Let’s check out entries in SQL Server’s catalog views for constraints and indexes to verify how these objects are perceived there:


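A pair of simple queries filtered on the object name is enough for this check:

SELECT * FROM sys.key_constraints WHERE name = 'UX_NullDemo_Null_Id';

SELECT * FROM sys.indexes WHERE name = 'UX_NullDemo_Null_Id';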
For the first constraint creation, using the ALTER TABLE…ADD CONSTRAINT syntax, a row is returned from each view:

That's as expected; now let's run the exact same queries after using the CREATE UNIQUE INDEX syntax:

sys.key_constraints returns nothing, implying that SQL Server does distinguish between a unique index and a unique constraint, despite the fact that their functionality appears identical.

The only difference (aside from the various syntax above) that I have arrived at between unique constraints and unique indexes is that unique constraints fit the ANSI standard for unique constraints, whereas a unique index technically does not. Performance between these alternatives is identical in all tests I could dream up, and the behavior when attempting to violate uniqueness is the same. To test this, let’s create a unique constraint and try to insert a duplicate value into our table:


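Recreating the constraint and then reusing a Null_Id that is already in the table (3, in the sample data assumed above):

ALTER TABLE dbo.NullDemo ADD CONSTRAINT UX_NullDemo_Null_Id UNIQUE (Null_Id);

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (3, 'Diplodocus', 'Allosaurus');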
The result is the expected error:

Msg 2627, Level 14, State 1, Line 80
Violation of UNIQUE KEY constraint ‘UX_NullDemo_Null_Id’. Cannot insert duplicate key in object ‘dbo.NullDemo’. The duplicate key value is (3).
The statement has been terminated.

Let’s repeat the experiment with a unique index:


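Swapping the constraint for an index and retrying the same insert:

ALTER TABLE dbo.NullDemo DROP CONSTRAINT UX_NullDemo_Null_Id;

CREATE UNIQUE NONCLUSTERED INDEX UX_NullDemo_Null_Id ON dbo.NullDemo (Null_Id);

INSERT INTO dbo.NullDemo (Null_Id, Null_Data, Not_Null_Data)
VALUES (3, 'Diplodocus', 'Allosaurus');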
Again, we get an error, but the text is slightly different, further differentiating between a unique index and a unique constraint:

Msg 2601, Level 14, State 1, Line 91
Cannot insert duplicate key row in object ‘dbo.NullDemo’ with unique index ‘UX_NullDemo_Null_Id’. The duplicate key value is (3).
The statement has been terminated.

To summarize everything above, a unique constraint is a constraint and an index, whereas a unique index is only an index. Despite the differences, the behavior of each is identical. Given the choice, I prefer to stick to standards and increased documentation, and would lean towards implementing unique constraints instead of unique indexes. Neither is incorrect and Microsoft makes no formal declaration for or against either, but the differing visibility of each is meaningful and inclusion in two sets of useful metadata can be handy when managing schema and metadata in the future.

As an aside, a unique clustered index provides functionality very similar to that of a clustered primary key, whereas a unique non-clustered index is very similar to a non-clustered primary key. Keep this in mind as we move forward.

Unique constraints can also be created in-line, such as in this example:


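A sketch of such a definition (the second table's name, extra columns, and types are my own choices):

CREATE TABLE dbo.NullDemo2
(
    Null_Id2 INT NOT NULL CONSTRAINT UX_NullDemo2_Null_Id2 UNIQUE NONCLUSTERED,
    Null_Data2 VARCHAR(50) NULL,
    Not_Null_Data2 VARCHAR(50) NOT NULL
);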
In this similar table, we define a unique (non-clustered) constraint on Null_Id2 within the table definition. If you know that a unique constraint will be needed up-front when the table is defined, you can take advantage of this simplified syntax and include it, instead of defining the constraint later on in your script.

How do unique constraints affect performance? The most significant way is in how they affect cardinality. For a column with a unique constraint, we know that there can never be more than one of any given value. Statistics will often provide the necessary information to determine this, but in the event that statistics are unavailable or incomplete, this little bit of information provided by the unique index can assist in building a more efficient execution plan.

The work of a unique index occurs on an insert or update, when it is necessary to check and see if a value already exists prior to making the change. What does this cost? To investigate, we’ll look at the unique index AK_SalesOrderDetail_rowguid on Sales.SalesOrderDetail. Let’s run a simple update on the table:


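A sketch of that kind of update (the row chosen and the new value are arbitrary):

SET STATISTICS IO ON;

UPDATE Sales.SalesOrderDetail
SET rowguid = NEWID()          -- assign a new value to the uniquely-indexed column
WHERE SalesOrderDetailID = 1;  -- any single existing row will do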
Once executed, we can review the STATISTICS IO output and execution plan in order to see the work that was performed:

Table ‘SalesOrderDetail’. Scan count 0, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

We can see that the index had to be checked (via an index seek) in order to verify that the value being assigned does not already exist. Of course, the index had to be read anyway in order to locate the row that needed updating. Let's replace the unique index with a standard non-clustered index:

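Something along these lines (the replacement index name is my own):

DROP INDEX AK_SalesOrderDetail_rowguid ON Sales.SalesOrderDetail;

CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_rowguid
ON Sales.SalesOrderDetail (rowguid);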
Now let’s run a similar update to above:


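Again, updating a single arbitrary row:

UPDATE Sales.SalesOrderDetail
SET rowguid = NEWID()
WHERE SalesOrderDetailID = 2;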
The results are similar to before:

Table ‘SalesOrderDetail’. Scan count 1, logical reads 25, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

It turns out that the effort required to enforce uniqueness is the same effort needed to locate the rows for update anyway. Let’s perform the same experiment with an INSERT, which will not require us to locate any rows prior to the operation:


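A sketch of such an insert, with the unique index assumed to be back in place; the order, product, and price values below are placeholders modeled on a typical AdventureWorks order row, so adjust them to values that exist in your copy:

INSERT INTO Sales.SalesOrderDetail
    (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID,
     SpecialOfferID, UnitPrice, UnitPriceDiscount)
VALUES
    (43659, '4911-403C-98', 1, 776, 1, 2024.994, 0.00);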
With the unique constraint in place, the performance is as follows:

Table ‘SalesOrderDetail’. Scan count 0, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now we’ll replace the unique constraint with a non-unique non-clustered index and repeat the experiment:


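As before, the replacement index name is my own, and the insert simply repeats the previous placeholder values:

DROP INDEX AK_SalesOrderDetail_rowguid ON Sales.SalesOrderDetail;

CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_rowguid
ON Sales.SalesOrderDetail (rowguid);

INSERT INTO Sales.SalesOrderDetail
    (SalesOrderID, CarrierTrackingNumber, OrderQty, ProductID,
     SpecialOfferID, UnitPrice, UnitPriceDiscount)
VALUES
    (43659, '4911-403C-98', 1, 776, 1, 2024.994, 0.00);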
The results end up being exactly the same as before. The presence of an index is all that is required in order to validate uniqueness, and since a unique constraint always creates an underlying index, this requirement is fulfilled automatically.

While we may have thought that the need to validate uniqueness would be expensive or a burden to our write operations, these experiments (as well as any more we can dream up) show that the performance difference of writes with a unique constraint versus a non-clustered index is insignificant. Of course, the addition of a unique constraint to a column that previously had no index at all would benefit from the addition of the index in the same way that any indexed column will speed up reads at the expense of the write operations needed to maintain it.

Primary Key constraints

A primary key is THE uniquely identifying metadata for a table. This is a critical part of database design, and it is generally important to define a primary key on all tables within a database. A primary key serves as a unique constraint and a unique index, and it is the ANSI-standard way of defining uniqueness in a table. You may only define a single primary key per table, regardless of whether it is clustered or non-clustered. A primary key may be created in-line with table creation or via an ALTER TABLE statement.

Primary keys carry special significance in database design. Anyone that is looking at a table for the first time will be checking to see what the primary key is defined as for the table. Is it an integer ID? Is it an account name? Is it a combination of several important data elements? Does it auto-number as an IDENTITY, or are values inserted somehow via a programmatic or user-driven process?

A primary key by definition is a NOT NULL column. Attempts to insert NULL into a primary key column will result in an error. Similarly, adding a primary key on a column with duplicate values or NULLs will also generate an error.

In our previous example, we could have created the table with a primary key on Null_Id:


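Reusing the assumed column definitions from earlier, with the key declared in-line:

CREATE TABLE dbo.NullDemo
(
    Null_Id INT NOT NULL CONSTRAINT PK_NullDemo PRIMARY KEY CLUSTERED,
    Null_Data VARCHAR(50) NULL,
    Not_Null_Data VARCHAR(50) NOT NULL
);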
This table definition declares, up-front, that Null_Id is the logical unique identifier for any row within the NullDemo table. Any other tables that reference NullDemo via a foreign key will need to reference this primary key. It is also likely that most joins to NullDemo will do so via Null_Id, as it guarantees a predictable cardinality (one row per Null_Id).

A primary key can also be added to a table later on, assuming that the contents of the column meet the prerequisites of a primary key: uniqueness and no NULLs:


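For example:

ALTER TABLE dbo.NullDemo
ADD CONSTRAINT PK_NullDemo PRIMARY KEY CLUSTERED (Null_Id);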
Within SQL Server Management Studio, a primary key will be given a special notation, separate from all other indexes, keys, and constraints:

The yellow key signifies a primary key, and as discussed earlier, there can only be one per table. As with a unique index, a primary key will require an index to be maintained in order to enforce uniqueness. A primary key may be clustered or non-clustered, depending on whether this column is the most common or efficient organizing column for the table.

Typically, the clustered index on a table is also the primary key, though there is no requirement or enforcement of this convention. Special cases may arise in which it makes sense to define a primary key separately from a clustered index. A plausible scenario for this would be when the primary key on a table is an auto-numbering integer identity, but the clustered index is on a pair of critical identifiers. When designing a table, consider carefully whether it makes sense for the clustered index to be separate from the primary key, or if having them be one and the same is simpler and more efficient.

Note that in order for a primary key to be dropped, any foreign keys on other tables that reference it must be dropped first. In addition, if the primary key is also the clustered index, then dropping it will force every non-clustered index on the table to be rebuilt, since non-clustered indexes reference the clustered index keys and SQL Server will not leave them pointing at storage that no longer exists.

The performance of primary keys is based on the performance of a handful of components:

  • The unique index created for use by the primary key.
  • Any foreign keys that can benefit from referencing the primary key.
  • If the primary key is also the clustered index, then all benefits of a clustered index will also be realized.
  • The NOT NULL constraint that is required by a primary key.

A primary key in and of itself does not provide any performance magic, but the building blocks that are created for and alongside a primary key each provide their own benefits. In the same way that all tables should have a primary key, tables should also have a carefully chosen clustered index. Reading and writing to tables without a clustered index will be expensive and incur a variety of penalties due to the inefficient nature of heaps.

When chosen wisely, a clustered primary key can provide the benefits of a variety of constraints: NOT NULL, uniqueness, clustered index, and foreign keys. If you’re omitting a primary key (clustered or otherwise) from a table and believe it to be the right choice, be sure to thoroughly test and verify that such a hypothesis is indeed true. I’ve heard many arguments over the years as to why certain types of tables benefit from the omission of a primary key or clustered index, but rarely have those debates been based on fact. If you can prove that either does not help performance, and never will in the future, then you may consider omitting them, though I suspect that to be a very rare scenario!

Default Constraints

A default constraint allows you to automatically populate a predetermined value for a column when none is provided upon insert. This is handy when a column is optional, but in the absence of a user-defined value, another is used instead. Some common uses of default constraints include:

  • An active bit for accounts that always defaults to 1 for new rows.
  • A notes column is populated with an empty string when no notes are entered.
  • The current date and time are entered into a last updated date/time column.
  • The user performing an action is logged whenever a row is inserted.
  • To differentiate between data that is user-provided and that which is system generated.

Default constraints allow you to maintain a NOT NULL constraint on a column, even when a user or application may not provide a value for it. Be careful not to use default constraints to eliminate NULLable columns when they make the most sense for your data model. For example, a default of “1/1/1900” for a date column may seem nonsensical enough to avoid confusion, but could result in application complexity in the future if any developers forget about this default, or if they attempt to perform date math on the column without checking for the default dummy value.

In order to demonstrate default constraints, we’ll create a new table to experiment on:


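A sketch of such a table; aside from the column names mentioned below, the names, types, and default values here are my own choices:

CREATE TABLE dbo.Accounts
(
    Account_Id INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_Accounts PRIMARY KEY CLUSTERED,
    Account_Name VARCHAR(100) NOT NULL,
    Account_Notes VARCHAR(1000) NULL CONSTRAINT DF_Accounts_Account_Notes DEFAULT (''),
    Is_Active BIT NOT NULL CONSTRAINT DF_Accounts_Is_Active DEFAULT (1),
    Created_By_User VARCHAR(128) NOT NULL CONSTRAINT DF_Accounts_Created_By_User DEFAULT (SUSER_SNAME()),
    Created_Date_Utc DATETIME NOT NULL CONSTRAINT DF_Accounts_Created_Date_Utc DEFAULT (GETUTCDATE())
);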
I've included plenty of default constraints here, using a variety of data types. Note that for these and all other constraints created in this article, I've consistently included constraint names. It is a good practice to ALWAYS name any objects that you create. If you create one without a name, SQL Server will automatically generate a name that will be neither elegant nor helpful. For example, if I left the name off of the default on Created_By_User, it automatically gets named DF__Accounts__Create__6521F869. If I drop and recreate the table, it is named DF__Accounts__Create__6ADAD1BF. These names are not very meaningful, nor are they consistent, so always name everything; your life will be easier that way 🙂

Let’s say I insert a row into the table using this TSQL:


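For example (the values are arbitrary):

INSERT INTO dbo.Accounts (Account_Name, Account_Notes)
VALUES ('Ed''s Dino Emporium', 'A demo account for constraint testing.');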
While I only provided the account name and notes, the remaining columns were populated for me using the defaults that we expect and would want. This is a great way to audit data as it is created and simplify INSERT statements.

Default constraints will automatically fill in the blanks when you do not provide a value, but they will not override an explicit NULL. If you insert a row and intentionally attempt to put a NULL into a NOT NULL column with a default, then you'll get an error.


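For example, explicitly passing NULL for Is_Active:

INSERT INTO dbo.Accounts (Account_Name, Account_Notes, Is_Active)
VALUES ('Another Account', 'More notes.', NULL);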
The TSQL above will throw the following error:

Msg 515, Level 16, State 2, Line 152
Cannot insert the value NULL into column ‘Is_Active’, table ‘AdventureWorks2014.dbo.Accounts’; column does not allow nulls. INSERT fails.

Default constraints do not override explicit values, even if they are NULL. Similarly, an attempt to insert a NULL into a column that allows NULLs will do so and ignore the default:


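For example, assuming the NULLable Account_Notes column with a default of an empty string:

INSERT INTO dbo.Accounts (Account_Name, Account_Notes)
VALUES ('Yet Another Account', NULL);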
This TSQL executes successfully, resulting in the following data:

Note that the NULL in the INSERT statement overrides the default constraint on that column.

Any constant may be used in a default constraint, as may scalar functions whose values are resolved at the time the row is created. The current date or time will default to the date/time at the moment the row is created, whereas the login of the user will default to the user at that same moment, and will not change later. Defaults cannot reference other tables, contain subqueries, reference other columns in the same table, or otherwise require any information beyond that static value. Computed columns are a nice way to gain some of those benefits, if there is no need for a combination of user-provided and system-generated values.

The performance impact of default constraints will only be felt if a new column is added and there is a need to backfill all existing rows with the default value for that new column. Otherwise, you will not see any increased IO as a result of the use of default constraints: the cost to insert an explicit value into a column with a default constraint is the same as the cost to let the constraint do the work for you.

Check Constraints

Sometimes we want to enforce specific logic on our data, but doing so through stored procedures or application logic would be extremely complex or risky. Check constraints allow us to validate data whenever it is written, checking for specific values or comparing different columns in a table. These constraints can verify the value of a single column, or they can compare any number of columns to ensure data integrity.

For example, here is a table with some check constraints defined on it:


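A sketch of such a table; the constraint names match those referenced below, but the four allowed dinosaur types and the exact herbivore rule are my own guesses:

CREATE TABLE dbo.Dinosaur
(
    Dinosaur_Id INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_Dinosaur PRIMARY KEY CLUSTERED,
    Dinosaur_Name VARCHAR(50) NOT NULL,
    Dinosaur_Type VARCHAR(50) NOT NULL
        CONSTRAINT CK_Dinosaur_Dinosaur_Type
        CHECK (Dinosaur_Type IN ('Theropod', 'Sauropod', 'Ornithopod', 'Stegosaur')),
    Is_Herbivore BIT NOT NULL
);

-- Multi-column rule (assumed logic): a theropod cannot be flagged as an herbivore
ALTER TABLE dbo.Dinosaur
ADD CONSTRAINT CK_Dinosaur_Is_Herbivore
CHECK (Is_Herbivore = 0 OR Dinosaur_Type <> 'Theropod');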
Here, we define a check constraint on a single column in the Dinosaur table, and then create an additional one afterwards that references multiple columns. If we insert a row into the table and it meets both conditions, then we’ll have no problem:


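For example:

INSERT INTO dbo.Dinosaur (Dinosaur_Name, Dinosaur_Type, Is_Herbivore)
VALUES ('Stegosaurus', 'Stegosaur', 1);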
This dinosaur meets the criteria we imposed on the table, so it is inserted normally:

How cute! Let’s say we tried to add a row for T-Rex and accidentally set him as an herbivore…


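Something like this:

INSERT INTO dbo.Dinosaur (Dinosaur_Name, Dinosaur_Type, Is_Herbivore)
VALUES ('Tyrannosaurus Rex', 'Theropod', 1);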
When you violate a check constraint, the result is an error that tells you specifically which constraint was the cause of the error, as well as where it is located:

Msg 547, Level 16, State 0, Line 196
The INSERT statement conflicted with the CHECK constraint “CK_Dinosaur_Is_Herbivore”. The conflict occurred in database “AdventureWorks2014”, table “dbo.Dinosaur”.

We immediately know that the error was caused by CK_Dinosaur_Is_Herbivore and can quickly look at the definition of that constraint to verify whether the insert statement was no good or the check constraint itself was erroneous. This raises an important point: check constraints require maintenance. Any check constraint that references specific data that could change will require review in the same way that application or database code is reviewed. If a new dinosaur of a type not listed here is discovered tomorrow, we would need to update the check constraint to include it. Alternatively, if a new variety of herbivorous cryolophosaurus is discovered, we'd need to adjust our rules to account for it. This can result in check constraints becoming very complex, hard to maintain, and error-prone. For scenarios where lots of metadata is involved or it changes often, consider creating a table to store valid types, rather than trying to squeeze all that logic into a check constraint.

One additional note on check constraints: NULL is not a value in the same way that a number or string is. As a result, if a check constraint exists on a NULLable column and the constraint doesn't explicitly validate NULL in some fashion, then a NULL will not fail the check. For example, if our Dinosaur_Type column from above were NULLable, then an operation that puts NULL into that column would not violate the check constraint. Remember that any comparison against a NULL evaluates to unknown, which is not false. This is a similar effect to using an equality in a WHERE clause against a column with NULLs in it. If a check constraint exists on a column that allows NULL, be sure to account for it by either including a NULL check in the constraint, or by changing the column to NOT NULL.

There are other reasons to be cautious with check constraints. While they are an effective way to bolster data integrity, it is easy to go a bit wild and apply excessive amounts of business logic directly to tables within the database. This can result in lazy development and a failure to rigorously check and sanitize inputs before those values reach the database. A well-written application will validate data thoroughly in code and then in the database for critical scenarios. We cannot check for every possible bit of bad data, so it is up to us to determine the most important scenarios and protect against those, if needed. Sometimes excessive use of check constraints is indicative of design flaws elsewhere that we are trying to patch over after-the-fact, rather than fixing the underlying design.

In no way am I discouraging the use of check constraints, but be mindful of the best way to enforce data integrity without creating technical debt. Definitely encourage the use of dinosaurs in your data as needed, though!

How do check constraints affect performance? In terms of write operations, the impact is generally negligible. I attempted to create a variety of tables and apply all sorts of check constraints to them in order to generate latency, but was unable to move the needle much. In general, constraints are checked against the explicit scalar values prior to writing to the database. In other words, set-based logic does not revert to iterative approaches as a result of check constraints, even if they are complex, for example using math or string functions. If a check constraint includes a function that accesses other tables/views/objects, then the cost of those operations will weigh heavily on write operations, though.

Read operations can benefit from check constraints by providing the query optimizer valuable information about a table and its contents. For example, what if I write a query against our dinosaur table that looks for dinosaurs of a type that are not allowed by the CK_Dinosaur_Dinosaur_Type constraint? To mimic a more realistic query scenario, I’ll add an index on the Dinosaur_Type column first:


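The index name is my own; the query then filters on a type the constraint forbids:

CREATE NONCLUSTERED INDEX IX_Dinosaur_Dinosaur_Type
ON dbo.Dinosaur (Dinosaur_Type);

SELECT Dinosaur_Name, Dinosaur_Type
FROM dbo.Dinosaur
WHERE Dinosaur_Type = 'Ceratopsian';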
Well, there are no Ceratopsian dinosaurs in our table because the check constraint simply doesn’t allow them. The column is NOT NULL, so NULL isn’t valid either. The result set is empty, but the execution plan and IO statistics are interesting:

For this query, there were zero reads against the table and the execution plan shows a constant scan. Similar to when we tried to pull NULL from a NOT NULL column, the check constraint provided valuable information to SQL Server about the Dinosaur_Type column and what values it is allowed to contain. Without accessing the table, SQL Server was able to deduce from table metadata and the WHERE clause of my query that what I was looking for was not possible, and knowing that information, it could complete query optimization without any significant load on the server.

Check constraints are utilized early in the query optimization process and provide a very fast way to rule out invalid values. If a column only allows four dinosaur types, then we can deduce that all other types will not exist in the table. In a very large table, this performance benefit can be significant. Be sure to weigh the documentation and maintenance costs of a check constraint versus data integrity and performance to make a smart decision as to whether or not a given check constraint makes sense.

Foreign Key constraints

An important part of data modelling and enforcing data integrity comes via foreign key constraints. These constraints link a column within one table to a primary key (or unique key) in another table, and the foreign key column must always contain either a valid value from that parent key column or NULL. As with check constraints, whenever data is changed such that a foreign key column would contain an invalid value, an error will be thrown. Foreign keys can reference multiple columns if the parent table has a compound primary key.

To demo foreign keys, we can use the table Production.Product in AdventureWorks, which contains four different foreign keys:

The definitions of these keys are as follows:


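Reconstructed from the constraint names, the definitions look roughly like this (check your copy of AdventureWorks for the exact scripted versions):

ALTER TABLE Production.Product WITH CHECK ADD CONSTRAINT FK_Product_ProductModel_ProductModelID
    FOREIGN KEY (ProductModelID) REFERENCES Production.ProductModel (ProductModelID);

ALTER TABLE Production.Product WITH CHECK ADD CONSTRAINT FK_Product_ProductSubcategory_ProductSubcategoryID
    FOREIGN KEY (ProductSubcategoryID) REFERENCES Production.ProductSubcategory (ProductSubcategoryID);

ALTER TABLE Production.Product WITH CHECK ADD CONSTRAINT FK_Product_UnitMeasure_SizeUnitMeasureCode
    FOREIGN KEY (SizeUnitMeasureCode) REFERENCES Production.UnitMeasure (UnitMeasureCode);

ALTER TABLE Production.Product WITH CHECK ADD CONSTRAINT FK_Product_UnitMeasure_WeightUnitMeasureCode
    FOREIGN KEY (WeightUnitMeasureCode) REFERENCES Production.UnitMeasure (UnitMeasureCode);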
For each product, the product model, subcategory, size, and weight are all tied directly to other tables. Whenever new rows are inserted or existing ones are updated, the values for these columns will be validated against their parent tables. Consider the following UPDATE statement:


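For example (the ProductID is arbitrary):

UPDATE Production.Product
SET WeightUnitMeasureCode = 'UM'   -- 'UM' does not exist in Production.UnitMeasure
WHERE ProductID = 776;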
This results in an error:

Msg 547, Level 16, State 0, Line 1
The UPDATE statement conflicted with the FOREIGN KEY constraint “FK_Product_UnitMeasure_WeightUnitMeasureCode”. The conflict occurred in database “AdventureWorks2014”, table “Production.UnitMeasure”, column ‘UnitMeasureCode’.
The statement has been terminated.

The code “UM” is not present in the Production.UnitMeasure table, and therefore any attempt to set WeightUnitMeasureCode to this value will immediately fail. On the other hand, assigning a value that is present will work normally:


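For example, ‘LB’ (pounds) is a valid UnitMeasureCode:

UPDATE Production.Product
SET WeightUnitMeasureCode = 'LB'
WHERE ProductID = 776;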
A foreign key column can allow NULL, and if so, will let you enter NULL instead of a valid value, if that makes sense:


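For example:

UPDATE Production.Product
SET WeightUnitMeasureCode = NULL
WHERE ProductID = 776;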
This is completely valid, but only if the column does not possess the NOT NULL constraint.

Foreign keys can be created with the NOCHECK keyword, which will cause the key to enforce data integrity in the future, but will allow any existing bad data to continue to reside in the table. This is generally frowned upon as it allows bad data to exist in a column where we clearly want to maintain a certain level of integrity. While there are some valid use-cases for NOCHECK, typically temporary in nature, I’d advise against using it. Constraints are a form of database documentation, and the presence of a NOCHECK constraint provides mixed information about the contents of the column(s) it applies to.

There is a performance cost to pay for writing to columns that are part of a foreign key. In order to validate the values being inserted or updated, it is necessary to read the parent table in order to ensure the value being added is present. If the parent column is indexed (which it should be!), then the cost is at best a clustered index seek. If the parent column is not indexed (it’s a heap), then a table scan is the result.

To see this in action, let's consider an update of the SizeUnitMeasureCode column in Production.Product:


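Again, the ProductID is arbitrary, and ‘CM’ is an existing UnitMeasureCode:

UPDATE Production.Product
SET SizeUnitMeasureCode = 'CM'
WHERE ProductID = 776;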
Without a foreign key, we would simply read Production.Product in order to find the row we want to update, and then update it with the new value as we requested. With the foreign key, though, here is what we get for IO and execution plan:

Table ‘UnitMeasure’. Scan count 0, logical reads 2, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Product’. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

In order to check the foreign key for the value we are assigning, a seek against the UnitMeasure table was required. Since this is a small lookup table that contains 38 rows, the cost of this operation is negligible. The Assert operator is what validates that the values we are writing to Production.Product exist in Production.UnitMeasure. If this were not the case, then the query would fail here with a foreign key conflict error that would look something like this:

Msg 547, Level 16, State 0, Line 357
The UPDATE statement conflicted with the FOREIGN KEY constraint “FK_Product_UnitMeasure_SizeUnitMeasureCode”. The conflict occurred in database “AdventureWorks2014”, table “Production.UnitMeasure”, column ‘UnitMeasureCode’.

When adding a foreign key to an important table, consider this write cost when doing so. Make sure your server and storage systems can handle the additional load that will result from the read operations needed to check that foreign key constraint. This is especially important if the parent table contains a very large amount of data. Whereas a seek against a 38 row table was very fast, the same seek against a table with a hundred million rows could be quite expensive!

Some companies with exceptionally large tables will consider removing foreign keys in order to improve performance on write-heavy tables. This is a valid approach, but one in which the need for database integrity must be weighed against performance and then the best decision made based on those criteria. If foreign keys are omitted, what protects against bad data? If invalid values are created, how will this affect the application? This is not a simple decision to make, and is one that should be handled on a case-by-case basis. There are other tools available that can greatly help in offsetting the costs of foreign keys, such as partitioning, indexed views, and in-memory OLTP. In general, any tool that helps alleviate the performance cost of reading very large tables will help in managing the costs of foreign keys. Performance optimization strategies that help speed up foreign key checks are likely beneficial to any environment in which the database administrators are considering dropping foreign keys for performance reasons anyway.

Performance issues due to foreign keys are not a problem that everyone will face, so please don't consider foreign keys a performance hog, as they certainly are not. They are exceptional tools for enforcing data integrity, and the costs are very often worth the benefits they provide. Managing big data is a broader challenge with many effective solutions available to alleviate performance problems. Omitting foreign keys should be a last resort, and one I would not recommend without a serious design discussion to ensure that removing one problem does not create a larger one in the future.

Conclusion

Constraints provide a variety of methods for enforcing data and relational integrity. They allow us to model our schema after business needs and allow database objects to provide some level of built-in documentation that helps explain what tables relate to which other tables, what columns are required, and what significant rules must be followed at all times.

Constraints have performance implications, both improving performance in some cases and consuming resources in others. Most of the time, the costs are worth it in that we can improve the quality of our data without the need to build complex application logic or database scripts that try to keep data in order. A well thought out database design will often answer constraint questions up-front. This makes it so that we do not need to address technical debt years later when we suddenly realize that one column should always have been populated, or that dinosaurs shouldn’t be assigned to non-existent species. Constraints are valuable tools, and ones that can improve documentation, performance, data integrity, and make sense of what would otherwise be complex data.

References and further reading

Microsoft has done a good job of documenting constraints, their syntax and usage, and demos of how to create and use them. The internet as a whole is on a bit shakier ground. I’ve read too many articles or Q&A sites where constraints are completely shunned, where performance data is completely wrong, or there are misunderstandings as to why we use them or how they work.



About Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York's Capital Region. In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
