CTEs in SQL Server; Querying Common Table Expressions

Common table expressions (CTEs) in SQL Server provide us with a tool that allows us to design and organize queries in ways that may allow faster development, troubleshooting, and improve performance. In the first part of this series, we’ll look at querying against these with a practice data set. From examples of wrapped query checks to organization of data to multiple structured queries, we’ll see how many options we have with this tool and where it may be useful when we query data.

Test Data Set

The data set in these examples I use queries 362 values of a date range between December 1988 to January 2019 of almond values measured by a unit of output. The “AlmondValue” is not important to this tip, while the dates are for the examples. Any monthly stored values will work for these exercises with table name and column name substitutions. For readers who may want to copy queries directly, I’ve provided the below loop which will create these values for you to run these examples – the AlmondValue and Base10AlmondValue will differ in the output when you run these query examples, but the logical design of each query in the examples won’t.

CREATE TABLE tbAlmondData (AlmondDate DATE, AlmondValue DECIMAL(13,2))

DECLARE @startdate DATE = '1988-12-01', @value DECIMAL(13,2)


WHILE @startdate <= '2019-01-01'
BEGIN
  SET @value = ((RAND()*100))

  INSERT INTO tbAlmondData VALUES (@startdate,@value)

  SET @startdate = DATEADD(MM,1,@startdate)
END

SELECT MAX(AlmondDate) LatestDate, MIN(AlmondDate) EarliestDate, COUNT(AlmondDate) CountofValues
FROM tbAlmondData

Your result after running the above loop – remember that the AlmondValue in your table will differ since these are randomly generated.

The Basics of CTE and Select Statements

The basic part of a common table expression is that we can wrap a query and run CRUD operations with the wrapped query. Let’s look at a simple example of wrapping a query and changing column names, then selecting from these new columns. In the below image, we’ll see the highlighted CTE in SQL Server, even though we ran the full query and this shows a quick debugging option that we have – we can select the wrapped query to do a quick check before running the query and this is very useful with CRUD operations that require validation, or where we may be trying to debug a transaction and need to run the first part of a statement that our CRUD operation will run again. This is an important part a CTE in SQL Server: what we see in the wrapped query is what our operation is applied against; in this case, a select statement.

;WITH GroupAlmondDates AS(
  SELECT 
    YEAR(AlmondDate) AlmondYear
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
)
SELECT *
FROM GroupAlmondDates

We’ve highlighted the inner part of the CTE query, which we could execute. In this case, it would generate the same output. Your AlmondValue and Base10AlmondValue will differ.

One important point when creating these is that we will get an error if we don’t explicitly name our columns in the wrapped query. Let’s take the same above example and see the error when we remove the column name AlmondYear in the wrapped query:

;WITH GroupAlmondDates AS(
  SELECT 
    YEAR(AlmondDate)
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
)
SELECT *
FROM GroupAlmondDates

The common table expression expects each column to be named and the name must also be unique.

This is a derived column, meaning that it comes from the values within another column, but as we see, a name is required. We can alternatively create the names explicitly and ignore naming the columns in the wrapped query and the explicit names will override any column name in the query – though I prefer to name in the wrapped query for faster troubleshooting purposes. Notice what happens to the explicit name of Base10 versus Base10AlmondValue.

;WITH GroupAlmondDates (AlmondYear,AlmondDate,AlmondValue,Base10) AS(
  SELECT 
    YEAR(AlmondDate)
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
)
SELECT *
FROM GroupAlmondDates

Explicit naming is another approach we can use with CTEs.. Your AlmondValue and Base10AlmondValue will differ.

We see the Base10 in the output, so if we have a mixture of explicit column names that differ from the column names within the wrapped query, we should remember to use the explicit column names in references.

Another technique we can use with common table expressions is creating multiple CTEs together, similar to creating multiple subqueries that reference other subqueries (a common query output from Entity Framework). Although we can do this, this does not mean these are the most optimal solution, relative to what we’re trying to query. In the below query, we use a combination of these to query our data like we’ve queried above this, and in the second CTE, we look at the average of the AlmondValue for the year. From there, we join the GroupAlmondDates with GetAverageByYear on the AlmondYear and compare the AlmondValue with the average of the year. Our join here allows us to avoid creating another independent common table expression and save to a table, or use a temp table through creating a table, saving data, then removing it later.

;WITH GroupAlmondDates AS(
  SELECT 
    YEAR(AlmondDate) AlmondYear
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
), GetAverageByYear AS(
  SELECT 
    AlmondYear
    , AVG(AlmondValue) AvgAlmondValueForYear
  FROM GroupAlmondDates
  GROUP BY AlmondYear
)
SELECT t.AlmondDate
  , (t.AlmondValue - tt.AvgAlmondValueForYear) ValueDiff
FROM GroupAlmondDates t
  INNER JOIN GetAverageByYear tt ON t.AlmondYear = tt.AlmondYear

Our ValueDiff column comes from two CTEs joined. Your ValueDiff will differ.

This can be a useful technique provided that we consider we can’t troubleshoot the GetAverageByYear in the way that we can with GroupAlmondDates – meaning that we can’t highlight the wrapped query in the second CTE since it requires the first to exist. Should we use combinations? Sometimes, provided that debugging is either not required (intuitive design) or it’s the only option we have that performs the best.

In our next few queries, we’ll apply two SQL Server functions – ROW_NUMBER() and DENSE_RANK() – to view how we can extend analysis in our select statement. Both of these functions involve ordering data, one which orders rows of data by the row number of the data demarcated with the order by clause and the other which ranks data delineated with the order by clause. In the below code, I order five date combinations using descending and ascending with some of these – the row number of the date, the row number of the date specified by the year of the date, and the rank of the year.

;WITH GroupAlmondDates AS(
  SELECT 
    ROW_NUMBER() OVER (ORDER BY AlmondDate DESC) DateDescId
    , ROW_NUMBER() OVER (ORDER BY AlmondDate ASC) DateAscId
    , ROW_NUMBER() OVER (PARTITION BY YEAR(AlmondDate) ORDER BY YEAR(AlmondDate) ASC) YearAscId
    , DENSE_RANK() OVER (ORDER BY YEAR(AlmondDate) DESC) YearOrderedDescId
    , DENSE_RANK() OVER (ORDER BY YEAR(AlmondDate) ASC) YearOrderedAscId
    , YEAR(AlmondDate) AlmondYear
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
)
SELECT 
  AlmondYear
  , AlmondDate
  , DateDescId
  , DateAscId
  , YearAscId
  , YearOrderedDescId
  , YearOrderedAscId
FROM GroupAlmondDates

We can compare the date and years with the output of the five ordered columns.

An overview of what we’re seeing in the above output:

DateDescId/DataAscId: we see that we have 362 records because we’re getting the row number by the AlmondDate column. The DateAscId orders the date from 1 to 362, while the DateDescId orders the date from 362 to 1
YearAscId: these fields are ordering the rows by date of the year. Because these are monthly values, starting in 1989, we see the order matching the months for the. Since there is only one month in 1988, we see the YearAscId of 1 before it starts over at the beginning of 1989
YearOrderedDescId/YearOrderedAscId: this orders the years by their group – for an example, 1988 and 1989 are in different groups (ranks) – rank 32 and 31. We can see that we have a total number of 32 years

Ordering and partitioning values like this can provide us with tools to organize our data to apply mathematical functions, get values of data from sets within sets, etc. Using the same data set, suppose that we wanted to know what the median almond value of the first four years 1988, 1989, 1990, 1991 was. Since we know that we have 1 value in 1988 and 12 values in each of the other years, we’d need to find the 19^th value. Using a derivative of the above common table expression (removing unnecessary columns and adding the value), we can get this value:

;WITH GroupAlmondDates AS(
  SELECT 
    ROW_NUMBER() OVER (ORDER BY AlmondDate ASC) DateAscId
    , YEAR(AlmondDate) AlmondYear
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
)
SELECT *
FROM GroupAlmondDates
WHERE DateAscId = 19

The median value for the first 37 values. Your AlmondValue and Base10AlmondValue will differ.

Using the order of our years from the previous query, we could get the average almond value of the median year from our years. In the below query, we filter out the 1988 date and get the median of the other 31 values, seeing that the year 2004 returns.

;WITH GroupAlmondDates AS(
  SELECT 
    DENSE_RANK() OVER (ORDER BY YEAR(AlmondDate) ASC) YearOrderedAscId
    , YEAR(AlmondDate) AlmondYear
    , AlmondDate
    , AlmondValue
    , (AlmondValue*10) Base10AlmondValue
  FROM tbAlmondData
  WHERE AlmondDate > '1988-12-31'
)
SELECT 
  YearOrderedAscId
  , AlmondYear
  , AlmondDate
  --AVG(AlmondValue) AvgAlmondValue
FROM GroupAlmondDates
WHERE YearOrderedAscId = 16

The median year is 2004, from here we could run the average and get the average of this year.

I’ve left the average value commented out so that we see the data set that we would be running our average value. We can use this same logic for other aggregates where we may want an aggregate for a specific year based out of the data set, like median is the middle of a data set.

Conclusion

As we see, common table expressions (CTE in SQL Server) can provide us with a convenient way to query data similar to using tools like temp tables and subqueries. We have quick troubleshooting options with these where we can run select statements inside the wrapping of the CTE and we can create multiple statements to join together. Are these faster? In some cases, they may be, or they may be faster to troubleshoot – which could be important. Still, we should always test our queries to ensure that the way we’re running our query is the most optimal route.

CTEs in SQL Server; Querying Common Table Expressions

Inserts and Updates with CTEs in SQL Server (Common Table Expressions)

CTE SQL Deletes; Considerations when Deleting Data with Common Table Expressions in SQL Server

CTEs in SQL Server; Using Common Table Expressions To Solve Rebasing an Identifier Column

Author
Recent Posts

Timothy Smith

Tim manages hundreds of SQL Server and MongoDB instances, and focuses primarily on designing the appropriate architecture for the business model.

He has spent a decade working in FinTech, along with a few years in BioTech and Energy Tech.He hosts the West Texas SQL Server Users' Group, as well as teaches courses and writes articles on SQL Server, ETL, and PowerShell.

In his free time, he is a contributor to the decentralized financial industry.

View all posts by Timothy Smith

SQLShack

CTEs in SQL Server; Querying Common Table Expressions

Test Data Set

The Basics of CTE and Select Statements

Conclusion

Table of contents

Test Data Set

The Basics of CTE and Select Statements

Conclusion

Table of contents

Related posts: