Dmitry Piliugin

Overpopulated Primary Key and CE Model Variation in SQL Server

April 4, 2018 by

In this blog post, we are going to talk about another cardinality estimation model enhancement in SQL Server 2014 – Overpopulated Primary Key (OPK).

Consider a fact table that contains information about some sales, for example, and a date dimension table. Usually, a fact table contains the data about the current year and past years, but a dimension table usually contains the data for the next few years also.

If we are joining two tables fact and dimension and filtering on dimension table we usually demand the rows that do exist in a fact table, for example, we ask for the last month sales, but not for the next year sales. We also remember that a filter influences a join estimation, the join column statistics information should be modified according to the filter selectivity and cardinality. I don’t know the precise formula that is used for that in SQL Server, however, the algorithm is called Selection Without Replacement and is taken from the probability theory (the exact method doing maths in SQL Server is sqllang!CCardUtilSQL12::ProbSampleWithoutReplacement). You may read some details about the similar algorithm in Oracle further in useful links.

What is important for that post, that the number of distinct values in the join columns is used to calculate the join selectivity after filtering by another column. As we said before, a dimension table usually contains more distinct values than a fact table, because it is filled with the data for the next few years also, but we usually select only the relevant data, that is present in a fact table, i.e. those overpopulated distinct values should be somehow compensated and not concerned when we estimating the join over filtered dimension. That is what Overpopulated Primary Key model assumption about.

Let’s see this in action. We will use AdventureWorksDW2012 database and issue two identical queries, the first one – uses the new CE framework (I also add TF 2363, to view some diagnostic output), the second query uses the old framework (because we run it with TF 9481 – that forces the old CE behavior).

The actual number of rows for both of the queries equals 34 229, let’s look at the estimates.

The estimation with the new CE framework is much closer to the reality (26 154 new vs. 10 047 old vs. 34 229 actual). This is because the join selectivity was scaled to compensate the key overpopulation. If you look at the diagnostic output, you’ll see the message that tells you, that this action was taken.

The Model Variation

As you may suppose in this case the model variation would be not to use the OPK assumption and do not adjust the selectivity. To do it, you should run the query with TF 9482.

If we now examine the estimates we will see that they are very close, you also won’t see the “Scaling join selectivity…” output in the diagnostic console messages.

Again, I should warn everybody. This is undocumented and should not be used in production.

The final thing to mention, that there is nothing special about a fact and a dimension table for the OPK to be used. The decision to adjust the selectivity is made based on the difference between the distinct value counts. That means that you may observe this behavior, not only in DW databases, but in OLTP also, depending on your data distribution.

That’s all for this post. Next, we will talk about one of my favorite and really helpful changes in the new CE – the Ascending Key situation.

Table of contents

Cardinality Estimation Role in SQL Server
Cardinality Estimation Place in the Optimization Process in SQL Server
Cardinality Estimation Concepts in SQL Server
Cardinality Estimation Process in SQL Server
Cardinality Estimation Framework Version Control in SQL Server
Filtered Stats and CE Model Variation in SQL Server
Join Containment Assumption and CE Model Variation in SQL Server
Overpopulated Primary Key and CE Model Variation in SQL Server
Ascending Key and CE Model Variation in SQL Server
MTVF and CE Model Variation in SQL Server

References

Dmitry Piliugin
Execution plans, Performance, Query analysis

About Dmitry Piliugin

Dmitry is a SQL Server enthusiast from Russia, Moscow. He started his journey to the world of SQL Server more than ten years ago. Most of the time he was involved as a developer of corporate information systems based on the SQL Server data platform. Currently he works as a database developer lead, responsible for the development of production databases in a media research company. He is also an occasional speaker at various community events and tech conferences. His favorite topic to present is about the Query Processor and anything related to it. Dmitry is a Microsoft MVP for Data Platform since 2014. View all posts by Dmitry Piliugin

168 Views