Dejan Sarka

Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I

July 10, 2018 by

Previously, in this Data science series, I already tacitly did quite a few aggregations over the whole dataset and aggregations over groups of data. Of course, the vast majority of the readers here is familiar with the GROUP BY clause in the T-SQL SELECT statement and with the basic aggregate functions. Therefore, in this article, I want to show some advanced aggregation options in T-SQL and grouping in aggregations of data in an R or a Python data frame.

Aggregating data might be a part of data preparation in data science, a part of data overview, or even already a part of the finaly analysis of the data.

T-SQL Grouping Sets

The T-SQL language does not have a plethora of different aggregate functions. However, you can do a lot of different ways of grouping of the same rows with just a few lines of code in T-SQL.

Let me start with the basic aggregation over groups in T-SQL. The following code calculates the number of customers per country ordered by the country name, using the COUNT(*) aggregate function. You define the grouping variables in the GROUP BY clause.

The next step is to filter the results on the aggregated values. You can do this with the HAVING clause, like the following query shows. The query keeps in the result only the countries where there are more than 3,000 customers.

Besides ordering on the grouping expression, you can also order on the aggregated value, like you can see in the following query, which orders descending the states from different countries by the average income.

I guess there was not much new info in the previous examples for the vast majority of SQL Server developers. However, there is an advanced option of grouping in T-SQL that is not so widely known. You can perform many different ways of grouping in a single query with the GROUPING SETS clause. In this clause, you define different grouping variables in every single set. You could do something similar by using multiple SELECT statements with a single GROUP BY clause and then by unioning the results of them all. Having an option to do this in a single statement gives you the opportunity to make a lot of different grouping with only a few lines of code. In addition, this syntax enables SQL Server to better optimize the query. Look at the following example:

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

  • Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
  • Country only – expression (g.EnglishCountryRegionName)
  • Education only – expression (c.EnglishEducation)
  • Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example, aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

In the first row, you can see that Country is NULL, so this is an aggregate over Education only. In the second row, you can see that both Country and Education are NULL, meaning that this is an aggregate over the whole dataset. I will leave you to interpret the other results by yourself.

Aggregating in R

As you can imagine, there are countless possibilities for aggregating the data in R. In this data science article, I will show you only the options from the base installation, without any additional package. Let me start by reading the data from SQL Server.

The first function I want to mention is the summarize() function. It is the simplest way of getting the basic descriptive statistics for a variable, or even for a complete data frame. In the following example, I am analyzing the Income and the Education variables from the TM data frame.

summary(TM$Income)

summary(TM$Education)

Here are the results.

Min. 1st Qu. Median Mean 3rd Qu. Max.
10000 30000 60000 57310 70000 170000
Bachelors Graduate Degree High School
5356 3189 3294
Partial College Partial High School
5064 1581

You can see that the summary() function returned for a continuous variable the minimal and the maximal values, the values on the first and on the third quartile, and the mean and the median. For a factor, the function returned the absolute frequencies.

There are many other descriptive statistics functions already in base R. The following code shows quite a few of them by analyzing the Income variable.

The code uses the cat() function which concatenates and prints the parameters. In the result, you can se the mean, the median, the minimal and the maximal value, the range and the inter-quartile range, and the variance and the standard deviation for the Income variable. Here are the results.

Mean : 57305.78
Median: 60000
Min : 10000
Max : 170000
Range : 10000 170000
IQR : 40000
Var : 1042375574
StDev : 32285.84

You can use the aggregate() function to aggregate data over groups. Note the three parameters in the following example. The first one is the variable to be aggregated, the second is a list of variables used for grouping, and the third is the aggregate function. In the following example, I am calculating the sum of the income over countries.

As mentioned, the second parameter is a list of variables used to define the grouping. Let me show you an example with two variables in the list by calculating the mean value for the income over countries and education.

The third parameter is the name of the function. This could also be a custom function. There is no built-in function in the base R installation for calculating the third and the fourth population moments, the skewness and the kurtosis. No problem, it is very easy to write a custom function. The following code shows the function, and then calculates the skewness and the kurtosis for the Income variable over the whole dataset and per country.

Here are the results.

skewness kurtosis
0.8219881 0.6451128
Group.1 x.skewness x.kurtosis
1 Australia -0.05017649 -0.50240998
2 Canada 0.95311355 3.19841011
3 France 1.32590706 0.56945630
4 Germany 1.32940136 0.35521164
5 United Kingdom 1.34856013 0.27294523
6 United States 1.17370389 2.08329296

If you need to filter the dataset on the aggregated values, you can store the result of the aggregation in a new data frame, and then filter the new data frame, like the following code shows.

Finally, many of the graphical functions do the grouping and the aggregation by themselves. The barplot() function in the following example plots the bars showing the mean value of the income in each country.

The result is the following graph.

Python Pandas Aggregations

As I explained in my previous article, in Python you get the data frame object when you load the pandas library. This library has also all of the methods you need to aggregate and groups the data and to create simple graphs. The following code imports the necessary libraries and reads the data from SQL Server.

The pandas data frame describe() method is quite similar to the R summary() function – provides you a quick overview of a variable. Besides this function, there are many separate descriptive statistics functions implemented as data frame methods as well. The list of these functions includes the functions that calculate the skewness and the kurtosis, as you can see from the following code.

Here are the results – descriptive statistics for the Income variable. You should be able to interpret the results by yourself without any problem.

The pandas data frame groupby() method allows you to calculate the aggregates over groups. You can see in the following example calculation of counts, medians, and descriptive statistics of the Income variable over countries.

For the sake of brevity, I am showing only the counts here.

Country
Australia 3591
Canada 1571
France 1810
Germany 1780
United Kingdom 1913
United States 7819

If you need to filter the dataset based on the aggregated values, you can do it in the same way as in R. You store the results in a new data frame, and then filter the new data frame, like the following code shows.

Like the R graphing functions also the pandas data frame plot() function groups the data for you. The following code generates a bar chart with the medain income per country.

And here is the bar chart as the final result of the code in this data science article. Compare the median and the mean (shown in the graph created with the R code) of the Income in different countries.

Conclusion

I am not done with grouping and aggregating in this series of Data science articles yet. In the following article, I plan to show you some advanced examples of GROUPING SETS clause in T-SQL. I will show you how to use some efficient grouping and aggregating methods in R by using external packages. I will do an additional deep dive in pandas data frame options for grouping and aggregating data.

Table of contents

Introduction to data science, data understanding and preparation
Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies
Data science in SQL Server: Data analysis and transformation – binning a continuous variable
Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable
Data understanding and preparation – basic work with datasets
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II
Interview questions and answers about data science, data understanding and preparation

References


Dejan Sarka
Data science

About Dejan Sarka

Dejan Sarka, MCT and Data Platform MVP, is an independent trainer and consultant that focuses on development of database & business intelligence applications. Besides projects, he spends about half of the time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of sixteen books about databases and SQL Server. He also developed many courses and seminars for Microsoft, SolidQ and Pluralsight. View all posts by Dejan Sarka

168 Views