Data science in SQL Server: Data analysis and transformation

Previously, in this Data science series, I already tacitly did quite a few aggregations over the whole dataset and aggregations over groups of data. Of course, the vast majority of the readers here is familiar with the GROUP BY clause in the T-SQL SELECT statement and with the basic aggregate functions. Therefore, in this article, I want to show some advanced aggregation options in T-SQL and grouping in aggregations of data in an R or a Python data frame.

Aggregating data might be a part of data preparation in data science, a part of data overview, or even already a part of the finaly analysis of the data.

T-SQL Grouping Sets

The T-SQL language does not have a plethora of different aggregate functions. However, you can do a lot of different ways of grouping of the same rows with just a few lines of code in T-SQL.

Let me start with the basic aggregation over groups in T-SQL. The following code calculates the number of customers per country ordered by the country name, using the COUNT(*) aggregate function. You define the grouping variables in the GROUP BY clause.

USE AdventureWorksDW2016;

SELECT g.EnglishCountryRegionName AS Country,

COUNT(*) AS CustomersInCountry

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey

GROUP BY g.EnglishCountryRegionName

ORDER BY g.EnglishCountryRegionName;

The next step is to filter the results on the aggregated values. You can do this with the HAVING clause, like the following query shows. The query keeps in the result only the countries where there are more than 3,000 customers.

SELECT g.EnglishCountryRegionName AS Country,

COUNT(*) AS CustomersInCountry

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey

GROUP BY g.EnglishCountryRegionName

HAVING COUNT(*) > 3000

ORDER BY g.EnglishCountryRegionName;

Besides ordering on the grouping expression, you can also order on the aggregated value, like you can see in the following query, which orders descending the states from different countries by the average income.

SELECT g.EnglishCountryRegionName AS Country,

g.StateProvinceName AS State,

AVG(c.YearlyIncome) AS AvgIncome

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey

GROUP BY g.EnglishCountryRegionName,

g.StateProvinceName

ORDER BY AvgIncome DESC;

I guess there was not much new info in the previous examples for the vast majority of SQL Server developers. However, there is an advanced option of grouping in T-SQL that is not so widely known. You can perform many different ways of grouping in a single query with the GROUPING SETS clause. In this clause, you define different grouping variables in every single set. You could do something similar by using multiple SELECT statements with a single GROUP BY clause and then by unioning the results of them all. Having an option to do this in a single statement gives you the opportunity to make a lot of different grouping with only a few lines of code. In addition, this syntax enables SQL Server to better optimize the query. Look at the following example:

SELECT g.EnglishCountryRegionName AS Country,

GROUPING(g.EnglishCountryRegionName) AS CountryGrouping,

c.EnglishEducation AS Education,

GROUPING(c.EnglishEducation) AS EducationGrouping,

STDEV(c.YearlyIncome) / AVG(c.YearlyIncome)

AS CVIncome

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey

GROUP BY GROUPING SETS

(

(g.EnglishCountryRegionName, c.EnglishEducation),

(g.EnglishCountryRegionName),

(c.EnglishEducation),

()

)

ORDER BY NEWID();

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
Country only – expression (g.EnglishCountryRegionName)
Education only – expression (c.EnglishEducation)
Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example, aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

In the first row, you can see that Country is NULL, so this is an aggregate over Education only. In the second row, you can see that both Country and Education are NULL, meaning that this is an aggregate over the whole dataset. I will leave you to interpret the other results by yourself.

Aggregating in R

As you can imagine, there are countless possibilities for aggregating the data in R. In this data science article, I will show you only the options from the base installation, without any additional package. Let me start by reading the data from SQL Server.

library(RODBC)

con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")

TM <- as.data.frame(sqlQuery(con,

"SELECT g.EnglishCountryRegionName AS Country,

c.EnglishEducation AS Education,

c.YearlyIncome AS Income

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey;"),

stringsAsFactors = TRUE)

close(con)

The first function I want to mention is the summarize() function. It is the simplest way of getting the basic descriptive statistics for a variable, or even for a complete data frame. In the following example, I am analyzing the Income and the Education variables from the TM data frame.

summary(TM$Income)

summary(TM$Education)

Here are the results.

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
10000	30000	60000	57310	70000	170000
Bachelors	Graduate Degree	High School
5356	3189	3294
Partial College	Partial High School
5064	1581

You can see that the summary() function returned for a continuous variable the minimal and the maximal values, the values on the first and on the third quartile, and the mean and the median. For a factor, the function returned the absolute frequencies.

There are many other descriptive statistics functions already in base R. The following code shows quite a few of them by analyzing the Income variable.

cat(' ','Mean :', mean(TM$Income))

cat('\n','Median:', median(TM$Income))

cat('\n','Min :', min(TM$Income))

cat('\n','Max :', max(TM$Income))

cat('\n','Range :', range(TM$Income))

cat('\n','IQR :', IQR(TM$Income))

cat('\n','Var :', var(TM$Income))

cat('\n','StDev :', sd(TM$Income))

The code uses the cat() function which concatenates and prints the parameters. In the result, you can se the mean, the median, the minimal and the maximal value, the range and the inter-quartile range, and the variance and the standard deviation for the Income variable. Here are the results.

Mean :	57305.78
Median:	60000
Min :	10000
Max :	170000
Range :	10000 170000
IQR :	40000
Var :	1042375574
StDev :	32285.84

You can use the aggregate() function to aggregate data over groups. Note the three parameters in the following example. The first one is the variable to be aggregated, the second is a list of variables used for grouping, and the third is the aggregate function. In the following example, I am calculating the sum of the income over countries.

1	aggregate(TM$Income, by = list(TM$Country), FUN = sum)

As mentioned, the second parameter is a list of variables used to define the grouping. Let me show you an example with two variables in the list by calculating the mean value for the income over countries and education.

aggregate(TM$Income,

by = list(TM$Country, TM$Education),

FUN = mean)

The third parameter is the name of the function. This could also be a custom function. There is no built-in function in the base R installation for calculating the third and the fourth population moments, the skewness and the kurtosis. No problem, it is very easy to write a custom function. The following code shows the function, and then calculates the skewness and the kurtosis for the Income variable over the whole dataset and per country.

skewkurt <- function(p) {

avg <- mean(p)

cnt <- length(p)

stdev <- sd(p)

skew <- sum((p - avg) ^ 3 / stdev ^ 3) / cnt

kurt <- sum((p - avg) ^ 4 / stdev ^ 4) / cnt - 3

return(c(skewness = skew, kurtosis = kurt))

};

skewkurt(TM$Income);

# Aggregations in groups with a custom function

aggregate(TM$Income, by = list(TM$Country), FUN = skewkurt)

Here are the results.

skewness	kurtosis
0.8219881	0.6451128
	Group.1	x.skewness	x.kurtosis
1	Australia	-0.05017649	-0.50240998
2	Canada	0.95311355	3.19841011
3	France	1.32590706	0.56945630
4	Germany	1.32940136	0.35521164
5	United Kingdom	1.34856013	0.27294523
6	United States	1.17370389	2.08329296

If you need to filter the dataset on the aggregated values, you can store the result of the aggregation in a new data frame, and then filter the new data frame, like the following code shows.

TMAGG <- aggregate(list(Income = TM$Income),

by = list(Country = TM$Country), FUN = mean)

TMAGG[TMAGG$Income > 60000,]

Finally, many of the graphical functions do the grouping and the aggregation by themselves. The barplot() function in the following example plots the bars showing the mean value of the income in each country.

barplot(TMAGG$Income,

legend = TMAGG$Country,

col = c('blue', 'yellow', 'red', 'green', 'magenta', 'black'))

The result is the following graph.

Python Pandas Aggregations

As I explained in my previous article, in Python you get the data frame object when you load the pandas library. This library has also all of the methods you need to aggregate and groups the data and to create simple graphs. The following code imports the necessary libraries and reads the data from SQL Server.

# Imports needed

import numpy as np

import pandas as pd

import pyodbc

import matplotlib.pyplot as plt

# Connecting and reading the data

con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')

query = """

SELECT g.EnglishCountryRegionName AS Country,

c.EnglishEducation AS Education,

c.YearlyIncome AS Income

FROM dbo.DimCustomer AS c

INNER JOIN dbo.DimGeography AS g

ON c.GeographyKey = g.GeographyKey;"""

TM = pd.read_sql(query, con)

The pandas data frame describe() method is quite similar to the R summary() function – provides you a quick overview of a variable. Besides this function, there are many separate descriptive statistics functions implemented as data frame methods as well. The list of these functions includes the functions that calculate the skewness and the kurtosis, as you can see from the following code.

# Summary

TM.describe()

# Count and variance

TM['Income'].count(), TM['Income'].var()

# Skewness and kurtosis

TM['Income'].skew(), TM['Income'].kurt()

Here are the results – descriptive statistics for the Income variable. You should be able to interpret the results by yourself without any problem.

>>> # Summary

Income

count 18484.000000

mean 57305.777970

std 32285.841703

min 10000.000000

25% 30000.000000

50% 60000.000000

75% 70000.000000

max 170000.000000

>>> # Count and variance

(18484, 1042375574.4691552)

>>> # Skewness and kurtosis

(0.82212157713655321, 0.64600652578292239)

The pandas data frame groupby() method allows you to calculate the aggregates over groups. You can see in the following example calculation of counts, medians, and descriptive statistics of the Income variable over countries.

TM.groupby('Country')['Income'].count()

TM.groupby('Country')['Income'].median()

TM.groupby('Country')['Income'].describe()

For the sake of brevity, I am showing only the counts here.

Country
Australia	3591
Canada	1571
France	1810
Germany	1780
United Kingdom	1913
United States	7819

If you need to filter the dataset based on the aggregated values, you can do it in the same way as in R. You store the results in a new data frame, and then filter the new data frame, like the following code shows.

1 2	TMAGG = pd.DataFrame(TM.groupby('Country')['Income'].median()) TMAGG.loc[TMAGG.Income > 50000]

Like the R graphing functions also the pandas data frame plot() function groups the data for you. The following code generates a bar chart with the medain income per country.

ax = TMAGG.plot(kind = 'bar',

colors = ('b', 'y', 'r', 'g', 'm', 'k'),

fontsize = 14, legend = False,

use_index = True, rot = 1)

ax.set_xlabel('Country', fontsize = 16)

ax.set_ylabel('Income Median', fontsize = 16)

plt.show()

And here is the bar chart as the final result of the code in this data science article. Compare the median and the mean (shown in the graph created with the R code) of the Income in different countries.

Conclusion

I am not done with grouping and aggregating in this series of Data science articles yet. In the following article, I plan to show you some advanced examples of GROUPING SETS clause in T-SQL. I will show you how to use some efficient grouping and aggregating methods in R by using external packages. I will do an additional deep dive in pandas data frame options for grouping and aggregating data.

Introduction to data science, data understanding and preparation

Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies

Data science in SQL Server: Data analysis and transformation – binning a continuous variable

Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable

Data understanding and preparation – basic work with datasets

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II

Interview questions and answers about data science, data understanding and preparation

Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications.Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight.

View all posts by Dejan Sarka

SQLShack

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I

T-SQL Grouping Sets

Aggregating in R

Python Pandas Aggregations

Conclusion

Table of contents

References

T-SQL Grouping Sets

Aggregating in R

Python Pandas Aggregations

Conclusion

Table of contents

References

Related posts: