Haroon Ashraf
Creating staging table id based on unique student name and subject: Adil-Business Intelligence Adil-Data Structure

Common use cases of SQL SELECT Distinct

February 2, 2023 by

This article talks about the commonly used scenarios of SQL Select Distinct in an easily understandable format equally suitable for beginners and professionals.

The article also brings into light some of the data analysis performed based on the Distinct keyword.

Additionally, the readers of the article are going to get hands-on experience in implementing the common use cases discussed in this article by simply copying and running the scripts against the sample database of their choice.

Understanding SQL Select Distinct

Let us try to get a clear understanding of using distinct with SELECT statements in SQL with the help of simple examples first.

Setup Sample Database

Please build a simple sample database called StudentExamDemo currently with one table named Student to see even the simplest of our examples in action.

We are using an on-premises SQL instance to build the database and its objects (tables) while you are free to build an Azure SQL database or on-premises SQL database.

Run the following script against the master database (to create a database):

Next, create a Student table by running the following script against the sample database:

SELECT Statement vs SELECT ALL Statement

Before we jump to SQL SELECT Distinct, we must understand that the SELECT statement itself simply returns the rows of a table based on the columns you mention with your select statement.

However, there is one more important thing to know about the SELECT statement, it, by default returns all the records of a table including duplicate values of the columns.

So, technically in all SELECT statements by default the keyword ALL is included unless stated otherwise.

Let us view all the rows of the Student table (which we just build earlier) by using the SELECT statement as follows:

The output is as follows:

Student table with StudentId: 1,2,3, Name Atif,Atif,Atif, Detail...

Now we will run the same script with an additional ALL keyword at the start:

The results are as follows:

Student table results stay same with using SELECT ALL

So, we can see that SELECT without the mention of any specific clause is using the ALL keyword as we have seen in the above two examples.

We also discovered one more thing a student record seems to be a duplicate as we assume it was inserted twice by mistake.

SQL SELECT Distinct Statement vs SELECT ALL Statement

The Distinct keyword in the SELECT statement can help us to show only distinct records based on a column and this means we are not going to see duplicate rows that we saw earlier in the first example.

Let us see it in action by running the following script:

The query output is as follows:

Using SELECT Distinct shows only distinct Names Atif and Sarah

We don’t see the student Atif record twice this time because we are looking at distinct names (Name column values) and as a result, we have temporarily prevented the duplicate row to show up in the result set.

About the position of the distinct keyword

Please note that in a T-SQL script (SQL query) Distinct should be the first word after SELECT as you cannot use distinct with the second column leaving the first column as it is.

For example, running the following script shows an error:

Error if Distinct is used with second column rather than first column

Common use cases

Let us now discuss the common use cases of SQL SELECT Distinct statement.

Analyzing and Improving data accuracy based on distinct values

Suppose we are doing some crucial analysis of the data and during the analysis, we would like to understand the accuracy of the data that is being analyzed.

Now let us say our accuracy depends on the uniqueness of column Name in the Student table so we are going to make use of the SQL SELECT Distinct command to understand how good our data is in terms of accuracy.

One way is to count distinct values against the total values of the table and if they are not the same that means some duplicate records are present which can impact our analysis results.

Let us run the following script to prove this:

The queries output is as follows:

Total rows of Student table: 3 Total Names: 3 Total Unique Names: 2

From the output it is clear that we have fewer unique names (two) than expected (three) so this can affect the results of our analysis if we think that this is because a record has been inserted twice by mistake.

Similarly, we can use this information to remove duplicates as a means to improve our data analysis outcome. Let us fix the issue by replacing the duplicate name with another name to ensure that we have three unique rows and there is no duplicate data in our table.

The final output is as follows:

Duplicate row for Atif has been replaced with new name Adil

Finding Unique References of Data

We can use SQL SELECT Distinct to find unique references of data. For example, we have a result set that shows different rows of a table and each row consists of columns. Now sometimes we are more interested in unique patterns of the data rather than the data itself which is one of the key things in investigative analysis.

We need to build the Subject and StudentSubject table by running the following script:

Let us now focus on the StudentSubject table which contains the records of all the students who obtained marks in the exam against their subjects.

So, we have the following items of interest in this scenario:

  1. Students
  2. Subjects
  3. Marks obtained

Please note that students are already defined in a reference table named Student and likewise we also have a reference table for Subjects. Now keeping in mind the reference tables we can say students showing up in the StudentSubject table actually originated in their reference table.

Let us run the script to view the StudentSubject data first:

StudentSubject table output with duplicate Atif, Adil and Sarah names

However, we are interested in a slightly different piece of information which we referred to as finding unique references or patterns of data. This means we want to know how many distinct marks were obtained by the students so far against any subject which will give us a distinct pattern of the obtained marks.

Let us find it out with the help of the SQL SELECT Distinct command as follows:

The query output is as follows:

Unique marks obtained: 70, 80 and 90

This information not only tells us students are brilliant but also see that the Marks column follows a unique pattern of 70, 80, 90. We can then use this pattern to build another table called Unique_Marks_Pattern and then investigate further.

Staging Data with Multiple Unique Columns in a Data Warehouse

This one is rather a rare or less common scenario but shows the tremendous power of SQL SELECT Distinct that changes your script to serve the purpose.

Assume you are a SQL/BI developer who is working on a staging data logic. Staging data logic means you are building SQL scripts to successfully pull the data from the sources for your data warehouse to be loaded initially into a staging environment which is quite a common practice in data warehouse business intelligence solutions.

The problem is you need to stage data from a table with multiple unique columns whereas in a standard practice you cannot introduce unique constraints on the staging tables as they should be free from any constraints for straightforward data transfer.

Let us see it in action by running the following script:

— Adding more rows to the StudentSubject table with existing subjects for the same students

Now view the StudentSubject data:

The results are shown below:

Duplicate student and subject names for Adil, Atif and Sarah

Now we are only keen to stage the data for student and subject from this table ignoring the marks obtained so a simple query will get us duplicates as follows:

The query output is as follows: Duplicate student and subject names for Adil, Atif and Sarah

However, we need to build a staging script to only pull both unique student and subject and this is possible through SQL SELECT Distinct as follows:

The result set is as follows:

Unique Student with Unique Subject
Adil with BI
Adil with Data Structure
Atif with Data Structure
Atif with Modern Databases

Finally, we can build a staging id column for further processing as follows:

The results are shown below:

Creating staging table id based on unique student name and subject:
Adil-Business Intelligence
Adil-Data Structure

Conclusion

Congratulations, you have just learned the common use case scenarios of SQL SELECT Distinct with the help of examples.

Haroon Ashraf
Latest posts by Haroon Ashraf (see all)
Data Warehouse, Development, T-SQL

About Haroon Ashraf

Haroon’s deep interest in logic and reasoning at an early age of his academic career paved his path to become a data professional. He holds BSc and MSc Degrees in Computer Science and also received the OPF merit award. He began his professional life as a computer programmer more than a decade ago, working on his first data venture to migrate and rewrite a public sector database driven examination system from IBM AS400 (DB2) to SQL Server 2000 using VB 6.0 and Classic ASP along with developing reports and archiving many years of data. His work and interest revolves around Database-Centric Architectures and his expertise include database and reports design, development, testing, implementation and migration along with Database Life Cycle Management (DLM). He has also received passing grade to earn DevOps for Databases verified certificate, an area in which he finds particular interest and potential. View all posts by Haroon Ashraf

168 Views