How to solve the Identity Crisis in SQL Server

November 14, 2017 by Ed Pollack

Description

Identity columns provide a convenient way to auto-number an ID column within a table without the need to manage the sequence. This convenience can save immense amounts of time, but also presents a single challenge: What happens when an identity column runs out of space within the data type chosen?

In this article, we will answer this question, as well as explore a variety of ways to detect this problem before it arises and solve it without causing disruption to important tables.

Introduction

Identity columns are often used as a way to auto-number some data element when we have no need to assign any specific values to it. Either the values are arbitrary, the column is a surrogate key, or we wish to generate numbers for use in other processes downstream.

For most applications, identity columns are set-it-and-forget-it. For an integer, which can contain 2,147,483,647 positive values, it’s easy to assume that a number so large is out-of-reach for a fledgling table. As time goes on, though, and applications grow larger, two billion can quickly seem like a much smaller number. It’s important to understand our data usage, predict identity consumption over time, and proactively manage data type changes before an emergency arises.

What happens when an Identity Column is exhausted

When an identity column reaches its limit, all insert operations will fail. We can test this easily by creating a table with an identity column, reseeding it to its limit, and then trying to insert a new row. Here’s a table with only two columns: an identity seeded near its maximum value, and a string:
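A table along these lines fits the bill; the table name matches the one referenced later in this article, while the column names are assumptions used throughout these demos:

```sql
-- dbo.Identity_Test is the table name used later in this article; the
-- Identity_Test_ID and Dinosaur column names are assumptions for illustration.
CREATE TABLE dbo.Identity_Test
(
    Identity_Test_ID INT NOT NULL IDENTITY(2147483645, 1)
        CONSTRAINT PK_Identity_Test PRIMARY KEY CLUSTERED,
    Dinosaur VARCHAR(50) NOT NULL
);
```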

With the table created, we can begin inserting some data:
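Assuming the table definition above (dbo.Identity_Test with a Dinosaur column), three rows will walk the identity up to the INT maximum; the first two names are assumptions:

```sql
-- Three rows: with a seed of 2,147,483,645, the third row consumes the
-- final INT value, 2,147,483,647.
INSERT INTO dbo.Identity_Test (Dinosaur)
VALUES ('Tyrannosaurus Rex'), ('Stegosaurus'), ('Triceratops');
```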

After these inserts, we can view the data in our table:
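Something like the following, assuming the demo column names Identity_Test_ID and Dinosaur:

```sql
SELECT Identity_Test_ID, Dinosaur
FROM dbo.Identity_Test;
```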

Note that the identity value for our Triceratops is at the highest allowed by an integer data type. Now, let’s insert one more row:
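The new row's name is an assumption; any insert will do:

```sql
-- The identity is already at 2,147,483,647: no further value can be allocated.
INSERT INTO dbo.Identity_Test (Dinosaur)
VALUES ('Velociraptor');
```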

The result of this INSERT statement is an arithmetic overflow error (Msg 8115: “Arithmetic overflow error converting IDENTITY to data type int.”).

SQL Server provides no built-in warning. When an identity column is exhausted, we receive the same arithmetic overflow error that would be returned if we tried to store too large a number in an INT column. Until we take action to resolve the limit we have hit, inserts will continue to fail. All other operations on this table will execute normally, including DELETE, UPDATE, and SELECT.

The remainder of this article will deal with detecting a data type that is getting full early enough that our response can be calm and planned, rather than based on panic 🙂

How to determine remaining space in an Identity Column

SQL Server provides two tools that allow us to accurately and quickly research identity column usage:

  1. System views
  2. IDENT_CURRENT

System views provide us with information about tables, schemas, columns, and other objects in SQL Server. We can use these to generate lists of different types of objects that can then be used for reporting or further research. For our work here, we will use system views to collect a list of all identity columns in a given database. Once we have this list, we can begin researching each column and understanding how far along we are within each column’s data type. The following views will be used in our work:

  • sys.tables: A view with a row per user table in a given database.
  • sys.columns: Returns a row for each column in any view or table.
  • sys.types: Contains a row for each data type defined in a given database. This includes both user and system data types.
  • sys.schemas: Returns a row for each schema defined within a database.

By joining these views together, we can create a query that will provide details about each identity column within a database:
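A query along these lines joins the four views; the output column aliases are illustrative:

```sql
-- One row per identity column in the current database.
SELECT schemas.name AS SchemaName,
       tables.name AS TableName,
       columns.name AS ColumnName,
       types.name AS DataType
FROM sys.tables
INNER JOIN sys.columns ON columns.object_id = tables.object_id
INNER JOIN sys.types ON types.user_type_id = columns.user_type_id
INNER JOIN sys.schemas ON schemas.schema_id = tables.schema_id
WHERE columns.is_identity = 1;
```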

When run in the WideWorldImporters Microsoft demo database, we get the following result set:

This database contains three identity columns, including the one we created earlier in this article. Our system views return the schema, table, and column name, as well as the data type. Filtering on columns.is_identity = 1, we can limit our result set from all columns to only those that are defined with the IDENTITY attribute.

The next tool at our disposal is the built-in function IDENT_CURRENT, which returns the last identity value that was allocated for a given identity column. This function accepts a table name (including the schema name) and returns the last identity value used:
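For example, run against a BIGINT identity column in WideWorldImporters (Warehouse.VehicleTemperatures is an assumption about which table was queried here):

```sql
SELECT IDENT_CURRENT('Warehouse.VehicleTemperatures') AS LastIdentityValue;
```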

This query returns the following results:

Since that column is a BIGINT, we have a long way to go before we exhaust the ~9.2 quintillion values available to it.

This is a good start, but we do not want to run a separate query on every single table. The schema name parameter is optional; if we omit it, IDENT_CURRENT defaults to the dbo schema, which will not work for tables in any other schema. To cover all schemas, we will need to write dynamic SQL that inserts both the schema and table name into the IDENT_CURRENT function:
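A sketch of that dynamic SQL follows. For brevity it covers only the fixed-size integer types; the DECIMAL and NUMERIC handling described below would add a maximum-value calculation based on the column's precision. The output column aliases are illustrative:

```sql
DECLARE @sql NVARCHAR(MAX) = N'';

-- Build one SELECT per identity column and chain them with UNION ALL.
SELECT @sql = @sql
    + N'UNION ALL SELECT '
    + N'''' + schemas.name + N''' AS SchemaName, '
    + N'''' + tables.name + N''' AS TableName, '
    + N'''' + columns.name + N''' AS ColumnName, '
    + N'''' + types.name + N''' AS DataType, '
    + max_value.MaxValueString + N' AS MaxDataTypeValue, '
    + N'IDENT_CURRENT(''' + schemas.name + N'.' + tables.name
    + N''') AS CurrentIdentityValue, '
    + N'CAST(IDENT_CURRENT(''' + schemas.name + N'.' + tables.name
    + N''') AS FLOAT) / ' + max_value.MaxValueString
    + N' * 100 AS PercentOfRangeConsumed '
FROM sys.tables
INNER JOIN sys.columns ON columns.object_id = tables.object_id
INNER JOIN sys.types ON types.user_type_id = columns.user_type_id
INNER JOIN sys.schemas ON schemas.schema_id = tables.schema_id
CROSS APPLY (SELECT CASE types.name
                        WHEN 'tinyint'  THEN N'255'
                        WHEN 'smallint' THEN N'32767'
                        WHEN 'int'      THEN N'2147483647'
                        WHEN 'bigint'   THEN N'9223372036854775807'
                    END AS MaxValueString) AS max_value
WHERE columns.is_identity = 1
  AND types.name IN (N'tinyint', N'smallint', N'int', N'bigint');

-- Strip the leading 'UNION ALL ' and execute.
SET @sql = STUFF(@sql, 1, 10, N'');
EXEC sp_executesql @sql;
```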

This script adds additional elements to our previous query, including:

  • The size of a column. For TINYINT, SMALLINT, INT, and BIGINT, this is a fixed value. For DECIMAL and NUMERIC, it is based on the definition of the column and how much space is allocated to the non-decimal component of the data type.
  • The current identity value, which is the number returned by IDENT_CURRENT.
  • The percent of the column consumed, as determined by dividing the current identity value by the maximum value and multiplying by 100.

When run on WideWorldImporters, we get the following output:

These results provide us with everything we need to understand how much of an identity column’s data space has been consumed. We can see that the first two columns are safe for now, as they have an immense amount of identity values to churn through before being in danger of running out. Our test table, on the other hand, is at 100% capacity. As we recall from the demo earlier, this column has no space left and INSERT operations against it will fail due to this limit.

If we run the query above against AdventureWorks2016, we can see a larger result set:

We can see here that nearly all columns are under 1% used, with the exception of the TINYINT identity on HumanResources.Shift.

Microsoft’s various demo databases, which are used throughout this article, are freely available for download from Microsoft.

Resolving Identity Fullness

Now that we have the ability to quickly research identity column utilization, we can take the TSQL presented above and move it into a stored procedure that gets executed via a job or some other process semi-regularly. We can then choose some threshold that we need to act on. For example, we may decide that any identity column that is over 85% consumed will be the target of DBA actions in order to prevent it from exhausting the range of its data type.

Once we have identified a table that needs to be acted upon, we have a variety of choices in how to solve the problem. Determining the best option for resolving an identity column’s fullness will require us to consider the table, its usage, contention, and resources available. Here are some ways that we can solve our problem:

Alter an identity column in-place

The simplest way to increase the range of an identity column is to alter it to a larger data type. For example, we can change a SMALLINT into an INT, or an INT into a BIGINT. This option requires very little TSQL, but will lock the table while it is executing. In addition, quite a bit of log space can be used to complete the operation.

Pros

  • Simple TSQL, minimal effort to implement. Application may need to be updated to address the new data type, but no significant code changes should be required.
  • No need to rename objects or start a new table for our data.
  • The risk of overflowing data decreases drastically after this change.

Cons

  • Will lock the table while executing. For a large table, this can take a long time and prevent applications and processes from using the table.
  • Can cause significant log growth if the table is large.
  • Any objects that depend on the column, such as primary keys, foreign keys, and indexes will need to be dropped prior to the operation, and re-added afterwards. This may result in reduced performance until the index is recreated.

Altering a column in place is good for a smaller table or in scenarios where an application outage is tolerable. Consider that every row in the table needs to be updated as additional space needs to be allocated to the column in each row. For a table with a billion rows with an integer identity column, altering to a BIGINT will require approximately 4 billion additional bytes. That’s about 4GB of data that needs to be written to the table, at minimum. In practice, the space used will likely be much higher as the data will not all be compactly squeezed into consecutive, full pages. Instead, there will be page splits and the index (or heap) will temporarily take up more space, until index maintenance is performed.

To test a column alter on our table from earlier, we can implement it as follows:
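Assuming the demo table dbo.Identity_Test with an INT identity column named Identity_Test_ID:

```sql
ALTER TABLE dbo.Identity_Test ALTER COLUMN Identity_Test_ID BIGINT NOT NULL;
```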

Executing this results in an error message: Msg 5074 tells us that the primary key constraint is dependent on the column, so the ALTER TABLE ALTER COLUMN statement fails.

Ah, we need to drop our clustered primary key on this column before we can alter its data type:
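Again assuming the names dbo.Identity_Test, Identity_Test_ID, and PK_Identity_Test:

```sql
-- Drop the dependent primary key, then widen the column.
ALTER TABLE dbo.Identity_Test DROP CONSTRAINT PK_Identity_Test;

ALTER TABLE dbo.Identity_Test ALTER COLUMN Identity_Test_ID BIGINT NOT NULL;
```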

This succeeds, and we can view the new data type in the table:
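For example, via the same catalog views used earlier:

```sql
SELECT columns.name AS ColumnName,
       types.name AS DataType
FROM sys.columns
INNER JOIN sys.types ON types.user_type_id = columns.user_type_id
WHERE columns.object_id = OBJECT_ID('dbo.Identity_Test');
```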

Now, we can add back the clustered primary key that originally was defined on the column:
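Assuming the same constraint and column names as before:

```sql
ALTER TABLE dbo.Identity_Test
ADD CONSTRAINT PK_Identity_Test PRIMARY KEY CLUSTERED (Identity_Test_ID);
```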

And, we’re done! Of course, this table only had a handful of rows, so we expected this to be a quick and painless operation.

Reseed the Identity Column

An additional option to resolve an identity column that is running out of space is to reseed the column to a very low number and allow it to begin growing naturally again. This option requires no application-facing schema changes, but it will force the reuse of old IDs, which may not be an acceptable option. In addition, if the column is truly growing fast, this may only serve as a temporary solution. If the identity column fills up again soon, then we’ll be forced to revisit this problem again (and again).

Pros

  • No application-facing schema changes.
  • Will not block applications.
  • Will not cause any significant log growth.
  • Will not take long to execute.

Cons

  • Reuses old identity values, which may be unacceptable for tables in which the ID values are important and must remain unique for all time.
  • If the table is very fast-growing, this will only be a temporary solution that we will need to revisit again soon.
  • Need to confirm existing column values to ensure that we will not duplicate a value in the near future.

This solution truly feels like we are kicking the can down the road a bit, and in many scenarios that will be the case. For a table in which the ID values are disposable and the data short-lived, though, this can be a quick and easy way to resolve the problem without having to resort to lengthening the column.

Let’s demonstrate how to reseed an identity column to a new value:
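Assuming the demo table has been recreated with its original INT identity, a reseed to the bottom of the INT range looks like this (the exact reseed value used in the original demo is not shown; note that when rows are already present, the next insert uses the reseed value plus the increment, not the reseed value itself):

```sql
DBCC CHECKIDENT ('dbo.Identity_Test', RESEED, -2147483648);
```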

Executing this is quick and painless, and there’s no need to drop dependencies, since the schema itself isn’t changing. Now if we insert a row, we’ll note that it now has an ID that’s a very, very negative number:
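For example, assuming the demo column names used earlier:

```sql
INSERT INTO dbo.Identity_Test (Dinosaur)
VALUES ('Pterodactyl');

SELECT Identity_Test_ID, Dinosaur
FROM dbo.Identity_Test;
```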

The results of the SELECT are as follows:

Note that the Pterodactyl has the most negative value allowed by an integer data type. We’re now all set until we’ve exhausted about 4.2 billion integers!

If you choose to reseed the column on a semi-regular basis, be sure to validate that ID values will never, ever be reused. What happens if we reuse an identity value? We can test that here:
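One way to test this is to reseed to just below an identity value that is already in use; assuming the demo rows inserted earlier, 2,147,483,644 sits one below the first of them:

```sql
-- With an increment of 1, the next insert will try to use 2,147,483,645,
-- which already exists in the table.
DBCC CHECKIDENT ('dbo.Identity_Test', RESEED, 2147483644);
```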

Now we can try once again to add another dinosaur to our table:
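Any new row will do; the name here is an assumption, since it is the identity value that collides:

```sql
INSERT INTO dbo.Identity_Test (Dinosaur)
VALUES ('Brontosaurus');
```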

The result is an error: a violation of the primary key constraint (Msg 2627).

Note that once again, the error did not mention that an identity column crashed into itself or even that the identity itself was to blame. It simply tried to insert a value that was already there and threw a primary key violation as a result. Because of this, be careful when reseeding identity columns and ensure that numbers cannot be accidentally reused. Also, keep in mind that you may reseed an INT or BIGINT to a negative value as we did earlier in order to increase the range of numbers that can be used before the column’s data type is exhausted.

Create a New Table

A more complex, but more robust, solution for tables with far more data is to create a new table with a correctly sized identity column, slowly sync the data up, and then switch over to it at a future time. This allows us to stage our data and prepare the new column at our leisure, without causing any significant blocking or application outage while we are updating our schema.

Pros

  • We avoid identity value reuse or the churn of altering a column in-place.
  • The new column can be backfilled slowly, preventing the need for a long outage while the new data is populated.
  • Scales, even for very large tables.

Cons

  • A small outage should still be planned for while the tables are swapped.
  • More complex solution than previous solutions.
  • A trigger (or some other maintenance logic) is required on the source table to keep both tables in sync.

There are a number of ways of accomplishing this, each with varying degrees of rigidness in the release process. The method demonstrated here will ensure that all data is copied to a new table and kept in sync until we are ready to swap. If your needs are less stringent, feel free to simplify the process to reduce complexity and/or work. Here is the starting point for our efforts:
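A sketch of that starting point follows; the trigger name and the column names are assumptions carried over from the earlier demos:

```sql
-- New table with the wider BIGINT key.
CREATE TABLE dbo.Identity_Test_BIGINT
(
    Identity_Test_ID BIGINT NOT NULL
        CONSTRAINT PK_Identity_Test_BIGINT PRIMARY KEY CLUSTERED,
    Dinosaur VARCHAR(50) NOT NULL
);
GO

-- Mirror all writes against the old table onto the new one.
CREATE TRIGGER TR_Identity_Test_Sync
ON dbo.Identity_Test
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Remove the old image of any updated or deleted rows.
    DELETE Identity_Test_BIGINT
    FROM dbo.Identity_Test_BIGINT
    INNER JOIN deleted
        ON deleted.Identity_Test_ID = Identity_Test_BIGINT.Identity_Test_ID;

    -- Write the new image of inserted or updated rows.  An UPDATE against a
    -- row that was never backfilled lands here too, copying it over.
    INSERT INTO dbo.Identity_Test_BIGINT (Identity_Test_ID, Dinosaur)
    SELECT Identity_Test_ID, Dinosaur
    FROM inserted;
END
```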

This script creates a new table called Identity_Test_BIGINT and adds a trigger onto Identity_Test that will mirror all operations onto the new table: INSERT, UPDATE, and DELETE. In addition, if an update occurs on a row that is not in the new table, we’ll also catch that and copy it over. Here are some demos of the operations and results:
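Statements along these lines exercise each path; the dinosaur names are assumptions, and each statement fires the trigger against dbo.Identity_Test_BIGINT:

```sql
INSERT INTO dbo.Identity_Test (Dinosaur)
VALUES ('Ankylosaurus');

UPDATE dbo.Identity_Test
SET Dinosaur = 'Ankylosaurus Rex'
WHERE Dinosaur = 'Ankylosaurus';

DELETE FROM dbo.Identity_Test
WHERE Dinosaur = 'Ankylosaurus Rex';
```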


We can see that the new row was correctly inserted into the new table.


The update was also correctly applied to both the existing table and the new one.


Similarly, the delete operation removes the row from both tables. Lastly, what happens if we update a row in Identity_Test that does not exist in our new table?
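For example, against a row that was never copied into dbo.Identity_Test_BIGINT (the values in this statement are illustrative):

```sql
UPDATE dbo.Identity_Test
SET Dinosaur = 'Triceratops Prime'
WHERE Dinosaur = 'Triceratops';
```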


The last section of the trigger ensured that the update resulted in an insertion into the new table. At this point, we can backfill our data using whatever method we’d like. The following will batch IDs into groups of 2000, allowing for a safe and slow backfill that should have minimal impact on production operations. The batch sizes can be increased, but use caution to ensure that no negative impact results from writing too much data at once.
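A sketch of such a batched backfill, assuming the table and column names used earlier:

```sql
DECLARE @batch_size INT = 2000;
DECLARE @rows_copied INT = 1;

WHILE @rows_copied > 0
BEGIN
    -- Copy the next batch of rows that are not yet in the new table,
    -- walking the identity values in order.
    INSERT INTO dbo.Identity_Test_BIGINT (Identity_Test_ID, Dinosaur)
    SELECT TOP (@batch_size) Identity_Test.Identity_Test_ID, Identity_Test.Dinosaur
    FROM dbo.Identity_Test
    WHERE NOT EXISTS
        (SELECT 1
         FROM dbo.Identity_Test_BIGINT
         WHERE Identity_Test_BIGINT.Identity_Test_ID = Identity_Test.Identity_Test_ID)
    ORDER BY Identity_Test.Identity_Test_ID;

    SET @rows_copied = @@ROWCOUNT;
END
```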

This runs quickly enough on our test data, and we can now confirm that both tables are synced up:
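A quick check, assuming the same names; both queries returning no rows means the tables match:

```sql
-- Rows in the old table that are missing or different in the new table:
SELECT Identity_Test_ID, Dinosaur FROM dbo.Identity_Test
EXCEPT
SELECT Identity_Test_ID, Dinosaur FROM dbo.Identity_Test_BIGINT;

-- And the reverse direction:
SELECT Identity_Test_ID, Dinosaur FROM dbo.Identity_Test_BIGINT
EXCEPT
SELECT Identity_Test_ID, Dinosaur FROM dbo.Identity_Test;
```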

Once the tables are identical, the last step is a one-time release to swap our tables:
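The swap itself, with the trigger name and the _Old suffix as assumptions:

```sql
BEGIN TRANSACTION;
    -- Stop mirroring writes; the new table is about to become the live one.
    DROP TRIGGER dbo.TR_Identity_Test_Sync;

    EXEC sp_rename 'dbo.Identity_Test', 'Identity_Test_Old';
    EXEC sp_rename 'dbo.Identity_Test_BIGINT', 'Identity_Test';
COMMIT TRANSACTION;
```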

Because this is a DDL operation in which no data is being modified, it will execute very quickly. The now-old objects are renamed and the new table is given the original table name of Identity_Test. The old table can be dropped at a later date, whenever is convenient or when the release is deemed a success.

As a result, the timing of this release can be made to occur in an extremely short window. Realistically, a very short outage for the database would be the safe way to deploy, but even without an outage, the interruption would be extremely brief. An application with only modest resilience would likely see little in the way of interruption.

This process can easily be tweaked, as needed, in order to accommodate tables of different sizes or complexities, but the general approach applies regardless.

Conclusion

Identity columns are useful ways to generate sequential values with minimal effort or intervention. Checking periodically to ensure that identities have sufficient room to grow can help avert disaster and allow the problem to be solved preemptively without the need for last-minute panic.

Once identified, increasing the size of an identity column isn’t difficult, and we can choose from a variety of solutions based on the size, usage, and availability required of the table. Like many maintenance tasks, being able to perform this work on our terms saves immense time, effort, and stress. This not only makes us happier, but frees up more time so we can work on more important tasks!


Ed Pollack

Ed has 20 years of experience in database and systems administration, developing a passion for performance optimization, database design, and making things go faster. He has spoken at many SQL Saturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQL Saturday Albany, which has become an annual event for New York’s Capital Region.

In his free time, Ed enjoys video games, sci-fi & fantasy, traveling, and being as big of a geek as his friends will tolerate.
