Dejan Sarka

Data science in SQL Server: Data analysis and transformation – Using SQL pivot and transpose

October 11, 2018 by

In data science, understanding and preparing data is critical, such as the use of the SQL pivot operation. In this article, in the series, we’ll discuss understanding and preparing data by using SQL transpose and SQL pivot techniques.

Transposing a matrix means reversing rows and columns. The data frames in R and Python are actually matrices, and therefore transposing is not a problem. A SQL Server table is a slightly different structure, where rows and columns are not equivalent and interchangeable. A row represents an entity, and a column an attribute of an entity. However, you might get data in SQL Server form a matrix from other systems, and meet the need to transpose it. Transposing data does not involve aggregations.

Pivoting, using SQL pivot is a similar operation. You need three columns for the operation. The first column is giving the row groups. The second column is transposed; its values are used for the new columns’ names. The third column gives the values, which are aggregated over rows and columns.

SQL Pivot operator

As you are probably already used to in my data science articles, I will start with preparing some data. I am using the data from the AdventureWorksDW2017 database. Note that I switched from the SQL Server 2016 version of this database I used in the previous articles to the SQL Server 2017 version. Don’t worry, the structure of the database and the data is nearly the same.

You can make a quick overview of the data with the following two queries. Note the distinct years returned by the second query.

The distinct years with sales are from 2010 to 2014. Besides years, I aggregated sales over countries and states as well. I also added a combined column CountryState in the table dbo.SalesGeoYear I will use further in this article.

Let me start with the SQL PIVOT operator. The following query calculates the sum of the sales over countries and years. Please note the syntax for the SQL PIVOT operator. The sales column is used for the aggregation, and the CYear column for the labels of the new pivoted columns. Grouping is implicit; all other columns, not used for pivoting or aggregation, are used in an implicit GROUP BY.

Here is the result of the query.

Of course, you can change the aggregate function. For example, the following query calculates the count of the sales over countries and years.

You probably noticed that I used a common table expression to prepare the rowset for pivoting; I am not using the table directly. This is due to the implicit grouping. Somebody that defined the syntax for the SQL PIVOT operator wanted to make the code shorter; however, because of that, you need to write more code to be on the safe side. Columns that are not used for pivoting and aggregating are used for grouping. What happens if you read a column more, like in the following query?

I read also the column State in the CTE. I am not using it in the outer query. However, the result is quite different from the previous one.

The query did implicit grouping over two columns, Country and State.

The SQL PIVOT operator is not really intended for transposing the table. You always need to have an aggregate function. However, you can simulate transposing when you have a single value over rows and pivoted columns with the MIN() or MAX() aggregate functions. For example, the following query does not work:

But, as mentioned, it is easy to change it to a query that does work, the query that just transposes the data, without aggregation. Or, to be precise, the aggregation exists, with the MAX() function, on a single value, returning the value itself.

The SQL PIVOT operator is T-SQL proprietary operator. It is not part of the ANSI SQL standard. You can write pivoting queries with ANSI standard SQL as well, using the CASE expression, like the following query shows.

Besides implicit grouping, there is another problem with the SQL PIVOT operator. You can’t get the list of the distinct values of the pivoted column dynamically, with a subquery. You need to use dynamic SQL for this task. I am showing how to create a pivoting query dynamically in the following code. Note that I create the concatenated list of pivoted column names with the STRING_AGG() function, which is new in SQL Server 2017.

This will be enough T_SQL for this article, I am switching to R now.

Transposing and Pivoting in R

As always, I am starting with reading the data from SQL Server in an R data frame. I am also using the View() function to show it immediately.

The simplest way to transpose the data is with the t() function from the basic package:

The transposed matrix is not very readable. Here is the partial result.

12
Country “Australia” “Australia”
State“Queensland”“South Australia”
CountryState“Australia Queensland”“Australia South Australia”
CYear“2013”“2010”
Sales“985381.550”“3578.270”

I didn’t define the row names, which serve as the keys in a SQL Server table. I know that the CountryState and CYear columns together uniquely identify each row in my data. I am creating a new data frame with row names in the following code, and then transposing this new data frame.

Here is the partial result, which makes much more sense now.

Australia Queensland, 2013 Australia South Australia, 2010
Country“Australia”“Australia”
State“Queensland”“South Australia”
Sales“985381.550”“3578.270”

There are numerous ways for pivoting the data in R. Some useful functions for pivoting, or crosstabulation, are already in the basic package, including the table(), xtabs(), and tapply() functions. The first one returns the counts, that second on the sums, and you can use easily the third one with any aggregate function.

For the sake of brevity, I am showing only the results of the two tapply() function calls, with counts and sums of sales over countries and years.

20102011201220132014
Australia35555
Canada12231
20102011201220132014
Australia20909.782563732.22128407.543394438507.72
Canada3578.27571571.8307604.510856339457.62

The Microsoft RevoScaleR library has many scalable functions, including functions for pivoting. The following code uses the rxCube() function.

Here is the result, showing both counts and sums in a single table.

CountryF_CYearSalesCounts
1 Australia 201020909.783
2 Canada 20103578.271
3 Australia 20112563732.255
4 Canada 2011571571.802
5 Australia 20122128407.465
6 Canada 2012307604.522
7 Australia 20134339443.385
8 Canada 20131085632.653
9 Australia 20148507.725
10 Canada 20149457.621

With the rxCrossTabs() function, you can calculate different statistics in a single call, store everything in an object, and then just read the part that you need, like the following code shows.

It is always nice to present the results graphically as well. You can use the rxHistogram(), another RevoScaleR scalable function, for showing the counts of the sales over countries and years.

The result is the following graph.

You are probably already used to the fact that in r you can always do things in many ways. The following example shows how to do pivoting with the cast() function from the reshape package.

You can use the spread() function from the tidyr package for transposing the data.

You can combine the spread() function with the group_by() and summarise() functions from the dplyr package for pivoting.

There are probably many more options in R; however, I am switching to Python now.

Python Pandas Transposing and Pivoting

As always, I need to start with importing the necessary libraries and reading the data.

In Python, all of the functions you need for transposing and pivoting data exist in the pandas package. For transposing the data, you can use the transpose() pandas data frame object method. You can also use the property T, which is the accessor to the method transpose(). The following two lines of code are equivalent.

Here is the partial result.

01
CountryAustraliaAustralia
StateQueenslandSouth Australia
CountryStateAustralia QueenslandAustralia South Australia
CYear20132010
Sales9853823578.27

Similarly, like defining the row names in R, it makes sense to define the row indexes for a Python data frame. The following code defines the index on a copy of the original data frame and then transposes it.

The results are more readable.

CountryStateAustralia QueenslandAustralia South Australia
CYear20132010
CountryAustraliaAustralia
StateQueenslandSouth Australia
Sales9853823578.27

The SQL pivot() function pivots the data without aggregations. You can achieve the same result with the unstack() function, as the following code shows.

Here is the partial last result of the previous code.

CYear201020112012
CountryState
Australia New South Wales6978.261.120237e+06933532.4239
Australia Queensland10353.255.327788e+05457701.5613
Australia South Australia3578.271.574316e+05169368.5881
Australia TasmaniaNaN5.966091e+0466831.4633
Australia VictoriaNaN6.936243e+05500973.4185
Canada AlbertaNaN7.156540e+034624.9125
Canada British Columbia3578.275.644153e+05302979.6112
Canada OntarioNaN NaN NaN

The pivot_table() function pivots the data and introduces the aggregations.

And here is the result for the counts.

CYear20102011201220132014
Country
Australia35555
Canada12231

Finally, let me also create a graph in Python.

The following figure shows the graph.

Conclusion

As you probably noticed from my data science articles, there is always the highest number of possibilities for nearly any kind of task in R. Nevertheless, you can do efficient pivoting and transposing in T-SQL and Python as well. And you can imagine what comes next: unpivoting.

Table of contents

Introduction to data science in SQL Server
Data science in SQL Server: ordinal variables and dummies
Data science in SQL Server: binning a continuous variable
Data science in SQL Server: entropy of a discrete variable
Data science in SQL Server: basic work with datasets
Data science in SQL Server: grouping and aggregating data I
Data science in SQL Server: grouping and aggregating data II
Data science in SQL Server: SQL pivot and data transposition
Interview questions and answers about data science in SQL Server

Downloads


Dejan Sarka

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.

View all posts by Dejan Sarka
Dejan Sarka
Data science

About Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications. Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight. View all posts by Dejan Sarka

804 Views