Dmitry Piliugin

Ascending Key and CE Model Variation

April 5, 2018 by

In this note, I’m going to discuss one of the most useful and helpful cardinality estimator enhancements – the Ascending Key estimation.

We should start with defining the problem with the ascending keys and then move to the solution, provided by the new CE.

Ascending Key is a common data pattern and you can find it in an almost every database. These might be: identity columns, various surrogate increasing keys, date columns where some point in time is fixed (order date or sale date, for instance) or something like this – the key point is, that each new portion of such data has the values that are greater than the previous values.

As we remember, the Optimizer uses base statistics to estimate the expected number of rows returned by the query, distribution histogram helps to determine the value distribution and predict the number of rows. In various RDBMS various types of histograms might be used for that purpose, SQL Server uses a Maxdiff histogram. The histogram building algorithm builds histogram’s steps iteratively, using the sorted attribute input (the exact description of that algorithm is beyond the scope of this note, however, it is curious, and you may read the patent US 6714938 B1 – “Query planning using a maxdiff histogram” for the details, if interested). What is important that at the end of this process the histogram steps are sorted in ascending order. Now imagine, that some portion of the new data is loaded, and this portion is not big enough to exceed the automatic update statistic threshold of 20% (especially, this is the case when you have a rather big table with several millions of rows), i.e. the statistics are not updated.

In the case of the non-ascending data, the newly added data may be more or less accurate considered by the Optimizer with the existing histogram steps, because each new row will belong to some of the histogram’s steps and there is no problem.

If the data has ascending nature, then it becomes a problem. The histogram steps are ascending and the maximum step reflects the maximum value before the new data was loaded. The loaded data values are all greater than the maximum old value because the data has ascending nature, so they are also greater than the maximum histogram step, and so will be beyond the histogram scope.

The way how this situation is treated in the new CE and in the old CE is a subject of this note. Now, it is time to look at the example.

We will use the AdventureWorks2012 database, but not to spoil the data with modifications, I’ll make a copy of the tables of interest and their indexes.

Now, let’s make a query, that asks for some order information for the last month, together with the customer and some other details. I’ll also turn on statistics time metrics, because we will see the performance difference, even in such a small database. Pay attention, that TF 9481 is used to force the old cardinality estimation behavior.

The query took 250 ms on average on my machine, and produced the following plan with Hash Joins:

Now, let’s emulate the data load, as there were some new orders for the next month saved.

Not too much data was added: 939 rows for orders and 2130 rows for order details. That is not enough to exceed the 20% threshold for auto-update statistics.

Now, let’s repeat the previous query, and ask the orders for the last month (that would be the new added orders).

That took 17 500 ms on average on my machine, more than 50X times slower! If you look at the plan, you’ll see that a server now using the Nested Loops Join:

The reason for that plan shape and slow execution is the 1 row estimate, whereas 939 rows actually returned. That estimate skewed the next operator estimates. The Nested Loops Join input estimate is one row, and the optimizer decided to put the SalesOrderDetail table on the inner side of the Nested Loops – which resulted in more than 100 million of rows to be read!

CE 7.0 Solution (Pre SQL Server 2014)

To address this issue Microsoft has made two trace flags: TF 2389 and TF 2390. The first one enables statistic correction for the columns marked ascending, the second one adds other columns. More comprehensive description of those flags is provided in the post Ascending Keys and Auto Quick Corrected Statistics by Ian Jose. To see the column’s nature, you may use the undocumented TF 2388 and DBCC SHOW_STATISTICS command like this:

In this case, no surprise, the column leading type is Unknown, 3 other inserts and update statistics should be done to brand the column.

You may find a good description of this mechanism in the blog post Statistics on Ascending Columns by Fabiano Amorim. As the column branded Unknown we should use both TFs in the old CE to solve the ascending key problem.

This query took the same 250 ms on average on my machine and resulted in the similar plan shape (won’t provide it here, for the space saving). Cool, isn’t it? Yes, it is, in this synthetic example.

If you are persistent enough, try to re-run the whole example from the very beginning, commenting the index ix_OrderDate creation (the one marked with the * symbol in the creation script). You will be quite surprised, that those TFs are not helpful in case of the missing index! This is a documented behavior (KB 922063):

That means, that automatically created statistics (and I think in most of the real world scenarios the statistics are created automatically) won’t benefit from using these TFs.

CE 12.0 Solution (SQL Server 2014)

To address the issue of Ascending Key in SQL Server 2014 you should do… nothing! This model enhancement is turned on by default, and I think it is great! If we simply run the previous query without any TF, i.e. using the new CE, it will run like a charm. Also, no restriction of having a defined index on that column.

The plan would be the following (adjusted a little bit to fit the page):

You may see that the estimated number of rows is not 1 row any more. It is 281.7 rows. That estimate leads to an appropriate plan with Hash Joins, that we saw earlier. If you wonder how this estimation was made – the answer is that in CE 2014 the “out-of-boundaries” values are modeled to belong an average histogram step (trivial histogram step with a uniform data distribution) in case of equality – it is well described in Joe Sack blog post mentioned above. In case of inequality the 30% guess, over the added rows is made (common 30% guess was discussed earlier).

The result is 939*0.3 = 281.7 rows. Of course a server uses another, per-column counters, but in this case it doesn’t matter. What is matter that this really cool feature is present in the new CE 2014!

Another interesting thing to note is some internals. If you run the query with the TF 2363 (and the TF 3604 of course) to view diagnostic output, you’ll see that the specific calculator CSelCalcAscendingKeyFilter is used.

According to this output, at first the regular calculator for an inequality (or equality with non-unique column) was used. When it estimated zero selectivity, the estimation process realized that some extra steps should be done and re-planed the calculation. I think this is a result of separating the two processes, the planning for computation and the actual computation, however, I’m not sure and need some information from the inside about that architecture enhancement. The re-planed calculator is CSelCalcAscendingKeyFilter calculator that models “out-of-histogram-boundaries” distribution. You may also notice the guess argument, that stands for the 30% guess.

The Model Variation

The model variation in that case would be to turn off the ascending key logic. Besides, this is completely undocumented and should not be used in production, I strongly don’t recommend to turn off this splendid mechanism, it’s like buying a ticket and staying at home.

However, maybe this opportunity will be helpful for some geeky people (like me=)) in their optimizer experiments. To enable the model variation and turn off the ascending key logic you should run the query together with TF 9489.

And with TF 9489 we are now back to the nasty Nested Loops plan. I’m sure, due to the statistical nature of the estimation algorithms you may invent the case where this TF will be helpful, but in the real world, please, don’t use it, of course, unless you are guided by Microsoft support!

That’s all for that post! Next time we will talk about multi-statement table valued functions.

Table of contents

Cardinality Estimation Role
Cardinality Estimation Place in the Optimization Process
Cardinality Estimation Concepts
Cardinality Estimation Process
Cardinality Estimation Framework Version Control
Filtered Stats and CE Model Variation
Join Containment Assumption and CE Model Variation
Overpopulated Primary Key and CE Model Variation
Ascending Key and CE Model Variation
MTVF and CE Model Variation

References


See more

To view and analyze SQL Server query execution plans for free, check out ApexSQL Plan


Dmitry Piliugin

Dmitry Piliugin

Dmitry is a SQL Server enthusiast from Russia, Moscow. He started his journey to the world of SQL Server more than ten years ago. Most of the time he was involved as a developer of corporate information systems based on the SQL Server data platform.

Currently he works as a database developer lead, responsible for the development of production databases in a media research company. He is also an occasional speaker at various community events and tech conferences. His favorite topic to present is about the Query Processor and anything related to it. Dmitry is a Microsoft MVP for Data Platform since 2014.
Dmitry Piliugin

Latest posts by Dmitry Piliugin (see all)

Performance, Query analysis

About Dmitry Piliugin

Dmitry is a SQL Server enthusiast from Russia, Moscow. He started his journey to the world of SQL Server more than ten years ago. Most of the time he was involved as a developer of corporate information systems based on the SQL Server data platform. Currently he works as a database developer lead, responsible for the development of production databases in a media research company. He is also an occasional speaker at various community events and tech conferences. His favorite topic to present is about the Query Processor and anything related to it. Dmitry is a Microsoft MVP for Data Platform since 2014.

32 Views