An overview of the SQL GROUP BY clause

This article briefly explains the SQL group by clause, when it should be used, and what we should consider when using it.

Side Note: All code examples in this article are made using SQL Server 2019 and Stack Overflow 2010 database.

What is “Grouping” in SQL, and why is it needed?

Consider that we are analyzing the Stack overflow QA website database. This database contains several tables that store information about the website users, posted questions, answers, comments, and awarded badges.

For example, let’s take the “Posts” table. This table contains all information about different types of posts on the QA website; questions, answers, wiki, moderators’ nominations… If we are looking to count the number of each type of post, using a simple SELECT statement can return the number of rows of a single type by using the COUNT() function besides filtering the result using the WHERE clause:

Figure 1 – Calculating the number of rows for one post’s type

If we try to add the PostTypeId column before the COUNT(*), the SQL command will not be executed and will throw the following exception to notify the user that aggregation is required to perform this operation:

Column ‘StackOverflow2010.dbo.Posts.PostTypeId’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

The SQL GROUP BY clause arranges similar data stored within one or several columns into groups, where an aggregate function is applied to produce summaries. For example, calculating the number of posts for each user.

Figure 2 – Calculating the number of each type of post

SQL query execution order

Before explaining the SQL GROUP BY clause and when we should use it, we need to know how is the SQL query executed by the database engine. Once asking to execute a SQL command, the database engine parses the different parts of it in the following order:

FROM / JOIN
WHERE
GROUP BY
HAVING
SELECT
DISTINCT
ORDER BY
OFFSET

GROUP BY clause

As defined in the Microsoft official documentation, a SELECT – GROUP BY command is a SELECT statement clause that divides the query result into groups of rows, usually to perform one or more aggregations on each group. The SELECT statement returns one row per group”.

The syntax of the GROUP BY clause is as follows:

GROUP BY {

column-expression

| ROLLUP ( <group_by_expression> )

| CUBE ( <group_by_expression> )

| GROUPING SETS ( <grouping_set> )

| HAVING ( <filter expression> )

}

A SQL GROUP BY clause can be used to perform aggregations over each group or even to remove duplicated rows based on the grouping expression. For example, assume that we need to extract all the distinct locations of the Stack Overflow users. We can simply add the DISTINCT keyword after the SELECT term.

SELECT DISTINCT [Location]

FROM [StackOverflow2010].[dbo].[Users]

Or, we can use a SELECT – GROUP BY command to achieve the same thing:

SELECT [Location]

FROM [StackOverflow2010].[dbo].[Users]

GROUP BY [Location]

It is worth mentioning that both queries have the same execution plan.

Figure 3 – Execution plans comparison

Now, we can add an aggregation function in the SELECT clause to perform it per each group.

SELECT [Location], COUNT(*)

FROM [StackOverflow2010].[dbo].[Users]

GROUP BY [Location]

As shown in the image below, all NULL values are considered equal and collected into a single group.

Figure 4 – Adding an aggregate function to the SELECT – GROUP BY command

Checking the command execution plan, we can see that after aggregating the data within groups, a scalar function is applied per each group.

Figure 5 – Execution plan showing the aggregation function computation

GROUP BY columns

The simplest way to use the SQL GROUP BY clause is to select the columns needed for the grouping operation. All columns specified in the SELECT clause – except the aggregation functions – should be specified in the GROUP BY clause. For example, if we execute the following query:

SELECT [Id], [Location],COUNT(*)

FROM [StackOverflow2010].[dbo].[Users]

GROUP BY [Location];

The following exception is thrown:

Column ‘StackOverflow2010.dbo.Users.Id’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Another thing worth mentioning is that column aliases cannot be used in the SQL GROUP BY clause since it is computed before the SELECT clause by the SQL engine.

GROUP BY user-defined functions

We can also use user-defined scalar functions alongside the columns specified in the SELECT and GROUP BY clauses. For example, we created the following function to get whether a question has an answer or not:

CREATE FUNCTION isSolved

(

@PostID INT)

RETURNS BIT

BEGIN

DECLARE @Exists BIT

SELECT @Exists = CASE WHEN AcceptedAnswerId =0 THEN 0 ELSE 1 END

FROM [StackOverflow2010].[dbo].[Posts] WHERE [Id] = @PostID

RETURN @Exists

END

The following query can be executed successfully.

SELECT dbo.IsSolved([Id]), COUNT(*)

FROM [StackOverflow2010].[dbo].[Posts]

GROUP BY dbo.IsSolved([Id])

Figure 6 – Grouping by a user-defined function

GROUP BY column expressions

Another way to group the result is by using column expressions. For example, if we are looking to group the columns based on a specific computation such as a mathematical expression or a CASE WHEN expression, we can simply use it similar to a single column. For example, assume that we want to count the number of questions being solved and the number of open issues on the Stack Overflow website. Noting that the Posts table only contains a column named AccetpedAnswerId that contains the identifier of the answer.

SELECT CASE WHEN AcceptedAnswerId =0 THEN 0 ELSE 1 END , COUNT(*)

FROM [StackOverflow2010].[dbo].[Posts]

GROUP BY CASE WHEN AcceptedAnswerId = 0 THEN 0 ELSE 1 END

Figure 7 – Grouping by a CASE WHEN expression

GROUP BY HAVING

We cannot use the WHERE clause in this operation to filter the query result based on the group aggregated function result since the database engine executes the WHERE clause before applying the aggregate function. This is why the HAVING clause was found.

The HAVING clause can only be used with a SQL GROUP BY clause. For example, we need to get the locations mentioned in more than 1000 and less than 10000 user profiles.

SELECT [Location],COUNT(*)

FROM [StackOverflow2010].[dbo].[Users]

GROUP BY [Location]

HAVING COUNT(*) BETWEEN 1000 AND 10000

ORDER BY COUNT(*) DESC;

Figure 8 – Using the HAVING keyword to filter the result of a grouping operation

GROUP BY ROLLUP, CUBE, and GROUPING SETS

To explain the ROLLUP, CUBE, and GROUPING SETS options, we create a view from the Posts and Users tables with the following structure:

Posts.Id

Posts.CreationDate

Posts.PostType

Users.Location

GROUP BY ROLLUP

Let’s assume that we want to generate a report showing the number of posts in each year, quarter, and month, for each post’s type. The result should be as the following:

Creation_year

Creation_quarter

Creation_Month

PostType

PostsCount

The SQL GROUP BY ROLLUP lets us create the combinations that exist in the data in addition to rolling up within the hierarchical order we define in the ROLLUP statement. For example, let us try the following query:

SELECT YEAR([CreationDate]) as Creation_year,

DATEPART(QUarter,[CreationDate]) as Creation_quarter,

MONTH([CreationDate]) as Creation_month,

COUNT(*) as Number_of_posts

FROM [StackOverflow2010].[dbo].[vPosts]

GROUP BY ROLLUP(YEAR([CreationDate]) , DATEPART(QUarter,[CreationDate]) , MONTH([CreationDate]))

As shown in the image below, the result includes three levels of aggregations:

The number of posts per month
The number of posts per quarter
The number of posts per year

Figure 9 – Using the ROLLUP statement to apply aggregation on different hierarchical levels

GROUP BY CUBE

Now, let’s assume that we are asked to generate a report that shows the number of posts per user location, the post type, or both.

THE SQL GROUP BY CUBE statement produces every possible combination between the columns mentioned in the CUBE statement. For example, let’s try the following command:

SELECT [Location], [Type],

COUNT(*) as Number_of_posts

FROM [StackOverflow2010].[dbo].[vPosts]

GROUP BY CUBE([Location], [Type])

As shown in the image below, the results contain three groups:

The number of posts per location
The number of posts per type
The number of posts per location and type

Figure 10 – Using the GROUP BY CUBE statement

GROUP BY GROUPING SETS

Sometimes we will be asked to only generate a report for only some specific combination of columns and expressions. In this case, using the CUBE or ROLLUP may be inefficient and time-consuming.

For this reason, we can use the GROUPING SETS statement, where we should define each combination explicitly. For example, we only want to generate a report showing the number of posts per location, per type, and the total number of posts.

SELECT [Location],

[Type],

COUNT(*) as Number_of_posts

FROM [StackOverflow2010].[dbo].[vPosts]

GROUP BY GROUPING SETS(

(), -- Total number of posts

([Location]),

([Type]))

As shown in the image below, the first five rows show the number of posts for each type, the 6^th row shows the total number of posts, and the rest shows the number of posts per location.

Figure 11 – using GROUPING SETS statement

Summary

This article briefly explained the SQL GROUP BY clause and how to use it to perform aggregate functions on the data. We also demonstrated the options available, such as grouping over a set of columns, expressions, and user-defined functions. In addition, we explained how to use the HAVING keyword for filtering and the ROLLUP, CUBE, and GROUPING SETS options for reporting purposes.

To learn more about the SQL GROUP BY function, you can refer to the following articles previously published on SQL Shack:

Author
Recent Posts

Hadi Fadlallah

Hadi is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com . Also, he published several article series about Biml, SSIS features, Search engines, Hadoop, and many other technologies.

Besides working with SQL Server, he worked with different data technologies such as NoSQL databases, Hadoop, Apache Spark. He is a MongoDB, Neo4j, and ArangoDB certified professional.

On the academic level, Hadi holds two master's degrees in computer science and business computing. Currently, he is a Ph.D. candidate in data science focusing on Big Data quality assessment techniques.

Hadi really enjoys learning new things everyday and sharing his knowledge. You can reach him on his personal website.

View all posts by Hadi Fadlallah

Latest posts by Hadi Fadlallah (see all)

An overview of SQL Server monitoring tools - December 12, 2023
Different methods for monitoring MongoDB databases - June 14, 2023
Learn SQL: Insert multiple rows commands - March 6, 2023

What is “Grouping” in SQL, and why is it needed?

SQL query execution order

GROUP BY clause

GROUP BY columns

GROUP BY user-defined functions

GROUP BY column expressions

GROUP BY HAVING

GROUP BY ROLLUP, CUBE, and GROUPING SETS

GROUP BY ROLLUP

GROUP BY CUBE

GROUP BY GROUPING SETS

Summary

Related posts: