T-SQL commands performance comparison - NOT IN vs NOT EXISTS vs LEFT JOIN vs EXCEPT

This articles gives you a performance comparison for NOT IN, SQL Not Exists, SQL LEFT JOIN and SQL EXCEPT.

The T-SQL commands library, available in Microsoft SQL Server and updated in each version with new commands and enhancements to the existing commands, provides us with different ways to perform the same action. In addition to an ever-evolving toolkit of commands, different developers will apply different techniques and approaches to the same problem sets and challenges

For example, three different SQL Server developers can get the same data using three different queries, with each developer having his own approach to writing the T-SQL queries to retrieve or modify the data. But the database administrator will not necessarily be happy with all of these approaches, he is looking to these methods from different aspects that they may not concentrate on. Although all of them may get the same required result, each query will behave in a different way, consume a different amount of SQL Server resources with different execution times. All of these parameters that the database administrator concentrates on shape the query performance. And it is the database administrator’s rule here to tune the performance of these queries and choose the best method with the minimum possible effect on the overall SQL Server performance.

In this article, we will describe the different ways that can be used to retrieve data from a table that does not exist in another table and compare the performance of these different approaches. These methods will use the NOT IN, SQL NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands. Before starting the performance comparison between the different methods, we will provide a brief description of each one of these T-SQL commands.

The SQL NOT IN command allows you to specify multiple values in the WHERE clause. You can imagine it as a series of NOT EQUAL TO commands that are separated by the OR condition. The NO IN command compares specific column values from the first table with another column values in the second table or a subquery and returns all values from the first table that are not found in the second table, without performing any filter for the distinct values. The NULL is considered and returned by the NOT IN command as a value.

The SQL NOT EXISTS command is used to check for the existence of specific values in the provided subquery. The subquery will not return any data; it returns TRUE or FALSE values depend on the subquery values existence check.

The LEFT JOIN command is used to return all records from the first left table, the matched records from the second right table and NULL values from the right side for the left table records that have no match in the right table.

The EXCEPT command is used to return all distinct records from the first SELECT statement that are not returned from the second SELECT statement, with each SELECT statement will be considered as a separate dataset. In other words, it returns all distinct records from the first dataset and removes from that result the records that are returned from the second dataset. You can imagine it as a combination of the SQL NOT EXISTS command and the DISTINCT clause. Take into consideration that the left and the right datasets of the EXCEPT command should have the same number of columns.

Now, let us see, in practical terms, how we could retrieve data from one table that does not exist in another table using different methods and compare the performance of these methods to conclude which one behaves in the best way. We will start by creating two new tables, using the T-SQL script below:

USE SQLShackDemo

CREATE TABLE Category_A

( Cat_ID INT ,

Cat_Name VARCHAR(50)

)

CREATE TABLE Category_B

( Cat_ID INT ,

Cat_Name VARCHAR(50)

)

After creating the tables, we will fill each table with 10K records for testing purposes, using ApexSQL Generate as shown below:

The testing tables are ready now. We will enable the TIME and IO statistics to use these statistics to compare the different methods performance. After that we will prepare the T-SQL queries that are used to pull the data that exists in Category_A table but not exists in Category_B table using four methods; NOT IN command, SQL NOT EXISTS command, LEFT JOIN command and finally EXCEPT command. This can be achieved using the T-SQL script below:

USE SQLShackDemo

SET STATISTICS TIME ON

SET STATISTICS IO ON

--- NOT INT

SELECT Cat_ID

FROM Category_A WHERE Cat_ID NOT IN (SELECT Cat_ID FROM Category_B)

-- NOT EXISTS

SELECT A.Cat_ID

FROM Category_A A WHERE NOT EXISTS (SELECT B.Cat_ID FROM Category_B B WHERE B.Cat_ID = A.Cat_ID)

-- LEFT JOIN

SELECT A.Cat_ID

FROM Category_A A

LEFT JOIN Category_B B ON A.Cat_ID = B.Cat_ID

WHERE B.Cat_ID IS NULL

-- EXCEPT

SELECT A.Cat_ID

FROM Category_A A

EXCEPT

SELECT B.Cat_ID

FROM Category_B B

If you execute the previous script, you will find that the four methods will return the same result, as shown in the below result that contains the number of returned record by each command:

At this step, the SQL Server developer will be happy, as any method he will use, will return the same result for him. But what about the SQL Server database administrator who needs to check the performance of each approach? If we review the IO and TIME statistics that are generated after executing the previous script, you will see that the script that uses NOT IN command performs 10062 logical reads on the Category_B table, takes 228ms to be completed successfully and 63ms from the CPU time as shown below:

On the other hand, the script that uses the SQL NOT EXISTS command performs only 29 logical reads on the Category_B table, takes 154ms to be completed successfully and 15ms from the CPU time, which is much better that the previous method that uses NOT IN from all aspects, as shown below:

For the script that uses the LEFT JOIN command, it performs the same number of logical reads as the previous SQL NOT EXISTS method, which is 29 logical reads, takes 151ms to be completed successfully and 16ms from the CPU time, which is somehow similar to the statistics derived from the previous SQL NOT EXISTS method, as shown below:

Finally, the statistics generated after running the method that uses the EXCEPT command show that it performs again 29 logical reads, takes 218ms to be completed successfully, and consumes 15ms from the CPU time, which is worse than SQL NOT EXISTS and LEFT JOIN methods in term of execution time, as shown below:

Until this step, we can derive from the IO and TIME statistics that the methods that use the SQL NOT EXISTS and LEFT JOIN commands act in the best way, with the best overall performance. But will the queries execution plans tell us the same result? Let us check the execution plans that are generated from the previous queries using ApexSQL Plan, a tool for SQL Server query plan analysis.

The execution plans cost summary window, below, shows that the methods that use the SQL NOT EXISTS and the LEFT JOIN commands has the least execution costs, and the method that uses the NOT IN command has the heaviest query cost, as shown below:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

Let us dive deeply to understand how each method behaves by studying the execution plans for these methods.

The execution plan for the query that uses NOT IN command is a complex plan with number of heavy operators that perform looping and counting operations. What we will concentrate on in here, for performance comparison purposes, are the Nested Loops operators. Under the Nested Loops operators, you can see that these operators are not real join operators, it performs something called Left Anti Semi Join. This partial join operator will return all rows from the first left table with no matching rows in the second right table, skipping all matching rows between the two tables. The heaviest operator in the below execution plan generated by the query using the NOT IN command is the Row Count Spool operator, that performs scans on the unsorted Category_B table, counting how many rows are returned, and returns only the rows count without any data, for rows existence check purposes only. The execution plan will be as follows:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

The below execution plan that is generated by the query using the SQL NOT EXIST command is simpler that the previous plan, with the heaviest operator in that plan is the Hash Match operator that again performs a Left Anti Semi Join partial join operation that checks for unmatched rows existence as described previously. This plan will be as follows:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

Comparing the previous execution plan generated by the query that uses the SQL NOT EXISTS command with the below execution plan that is generated by the query using the LEFT JOIN command, the new plan replaces the partial join with a FILTER operator that performs the IS NULL filtering for the data returned from the Right OUTER JOIN operator, that returns the matching rows from the second table which may include duplicates. This plan will be as shown below:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

The last execution plan generated by the query that uses the EXCEPT command contains also the Left Anti Semi Join partial join operation that checks for unmatched rows existence as shown previously. It also performs an Aggregate operation due to the large size of the table and the unsorted records on it. The Hash Aggregate process creates a hash table in the memory, that makes it a heavy operation, and a hash value will be calculated for each processed row and for each calculated hash value. After that, it checks the rows in the resulting hash bucket for the joining rows. The plan will be as shown below:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

We can conclude again from the previous execution plans generated by each used command that the best two methods are the ones using the SQL NOT EXISTS and LEFT JOIN commands. Recall that the data in the previous tables is not sorted due to the absence of the indexes. So, let us create an index on the joining column, the Cat_ID, in both tables using the T-SQL script below:

USE [SQLShackDemo]

CREATE NONCLUSTERED INDEX [IX_CAT_ID] ON [dbo].[Category_A]

(

[Cat_ID] ASC

)

CREATE NONCLUSTERED INDEX [IX_CAT_ID] ON [dbo].[Category_B]

(

[Cat_ID] ASC

)

The execution plans cost summary window generated using ApexSQL Plan after running the previous SELECT statements, shows that the method that uses the SQL NOT EXISTS command still the best one and the one using the EXCEPT command enhanced clearly after adding the indexes to the tables, as shown below:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

Checking the execution plan for the SQL NOT EXISTS command, the previous partial join operation is eliminated now and replaced by the Merge Join operator, as the data is sorted now in the tables after adding the indexes. The new plan will be as shown below:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

The query that uses the EXCEPT command enhanced clearly after adding the indexes to the tables and become one of the best methods to achieve our goal here. This also appeared in the below query execution plan, in which the previous partial join operation is also replaced by the Merge Join operator, as the data is sorted by adding the indexes. The Hash Aggregate operator is also replaced now with a Stream Aggregate operator as it aggregates a sorted data after adding the indexes.

The new plan will be as follows:

The screenshot is from ApexSQL Plan, a tool to view and analyze SQL Server query execution plans

Conclusion:

SQL Server provides us with different ways to retrieve the same data, leaving it to the SQL Server developer to follow his own development approach to achieve that. For example, there are different ways that can be used to retrieve data from one table that do not exist in another table. In this article, we described how to get such data using NOT IN, SQL NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands after providing a brief description of each command and compare the performance of these queries. We conclude, first, that using the SQL NOT EXISTS or the LEFT JOIN commands are the best choice from all performance aspects. We tried also to add an index on the joining column on both tables, where the query that uses the EXCEPT command enhanced clearly and showed better performance, besides the SQL NOT EXISTS command that still the best choice overall.

Ahmad Yaseen

Ahmad Yaseen is a Microsoft Big Data engineer with deep knowledge and experience in SQL BI, SQL Server Database Administration and Development fields.

He is a Microsoft Certified Solution Expert in Data Management and Analytics, Microsoft Certified Solution Associate in SQL Database Administration and Development, Azure Developer Associate and Microsoft Certified Trainer.

Also, he is contributing with his SQL tips in many blogs.

View all posts by Ahmad Yaseen

Conclusion:

Useful links

Related posts: