Ahmad Yaseen

T-SQL commands performance comparison – NOT IN vs NOT EXISTS vs LEFT JOIN vs EXCEPT

September 22, 2017 by

The T-SQL commands library, available in Microsoft SQL Server and updated in each version with new commands and enhancements to the existing commands, provides us with different ways to perform the same action. In addition to an ever evolving toolkit of commands, different developers will apply different techniques and approaches to the same problem sets and challenges

For example, three different SQL Server developers can get the same data using three different queries, with each developer having his own approach to writing the T-SQL queries to retrieve or modify the data. But the database administrator will not be necessarily be happy with all of these approaches, he is looking to these methods from different aspects that they may not concentrate on. Although all of them may get the same required result, each query will behave in a different way, consume a different amount of SQL Server resources with different execution times. All of these parameters that the database administrator concentrates on shape the query performance. And it is the database administrator’s rule here to tune the performance of these queries and choose the best method with the minimum possible effect on the overall SQL Server performance.

In this article, we will describe the different ways that can be used to retrieve data from a table that does not exist in another table and compare the performance of these different approaches. These methods will use the NOT IN, NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands. Before starting the performance comparison between the different methods, we will provide a brief description of each one of these T-SQL commands.

The NOT IN command allows you to specify multiple values in the WHERE clause. You can imagine it as a series of NOT EQUAL TO commands that are separated by the OR condition. The NO IN command compares specific column values from the first table with another column values in the second table or a subquery, and returns all values from the first table that are not found in the second table, without performing any filter for the distinct values. The NULL is considered and returned by the NOT IN command as a value.

The NOT EXISTS command is used to check for the existence of specific values in the provided subquery. The subquery will not return any data; it returns TRUE or FALSE values depend on the subquery values existence check.

The LEFT JOIN command is used to return all records from the first left table, the matched records from the second right table and NULL values from the right side for the left table records that have no match in the right table.

The EXCEPT command is used to return all distinct records from the first SELECT statement that are not returned from the second SELECT statement, with each SELECT statement will be considered as a separate dataset. In other words, it returns all distinct records from the first dataset and remove from that result the records that are returned from the second dataset. You can imagine it as a combination of the NOT EXISTS command and the DISTINCT clause. Take into consideration that the left and the right datasets of the EXCEPT command should have the same number of columns.

Now, let us see, in practical terms, how we could retrieve data from one table that does not exist in another table using different methods and compare the performance of these methods to conclude which one behaves in the best way. We will start with creating two new tables, using the T-SQL script below:

After creating the tables, we will fill each table with 10K records for testing purposes, using ApexSQL Generate as shown below:

The testing tables are ready now. We will enable the TIME and IO statistics to use these statistics to compare the different methods performance. After that we will prepare the T-SQL queries that are used to pull the data that exists in Category_A table but not exists in Category_B table using four methods; NOT IN command, NOT EXISTS command, LEFT JOIN command and finally EXCEPT command. This can be achieved using the T-SQL script below:

If you execute the previous script, you will find that the four methods will return the same result, as shown in the below result that contains the number of returned record by each command:

At this step, the SQL Server developer will be happy, as any method he will use, will return the same result for him. But what about the SQL Server database administrator who needs to check the performance of each approach? If we review the IO and TIME statistics that are generated after executing the previous script, you will see that the script that uses NOT IN command performs 10062 logical reads on the Category_B table, takes 228ms to be completed successfully and 63ms from the CPU time as shown below:

On the other hand, the script that uses NOT EXISTS command performs only 29 logical reads on the Category_B table, takes 154ms to be completed successfully and 15ms from the CPU time, which is much better that the previous method that uses NOT IN from all aspects, as shown below:

For the script that uses the LEFT JOIN command, it performs the same number of logical reads as the previous NOT EXISTS method, which is 29 logical reads, takes 151ms to be completed successfully and 16ms from the CPU time, which is somehow similar to the statistics derived from the previous NOT EXISTS method, as shown below:

Finally, the statistics generated after running the method that uses the EXCEPT command show that it performs again 29 logical reads, takes 218ms to be completed successfully, and consumes 15ms from the CPU time, which is worse than NOT EXISTS and LEFT JOIN methods in term of execution time, as shown below:

Until this step, we can derive from the IO and TIME statistics that the methods that use the NOT EXISTS and LEFT JOIN commands act in the best way, with the best overall performance. But will the queries execution plans tell us the same result? Let us check the execution plans that are generated from the previous queries using ApexSQL Plan, a free tool for SQL Server query plan analysis.

The execution plans cost summary window, below, shows that the methods that use the NOT EXISTS and the LEFT JOIN commands has the least execution costs, and the method that uses the NOT IN command has the heaviest query cost, as shown below:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

Let us dive deeply to understand how each method behaves by studying the execution plans for these methods.

The execution plan for the query that uses NOT IN command is a complex plan with number of heavy operators that perform looping and counting operations. What we will concentrate on in here, for performance comparison purposes, are the Nested Loops operators. Under the Nested Loops operators, you can see that these operators are not real join operators, it performs something called Left Anti Semi Join. This partial join operator will return all rows from the first left table with no matching rows in the second right table, skipping all matching rows between the two tables. The heaviest operator in the below execution plan generated by the query using the NOT IN command is the Row Count Spool operator, that performs scans on the unsorted Category_B table, counting how many rows are returned, and returns only the rows count without any data, for rows existence check purposes only. The execution plan will be as follows:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

The below execution plan that is generated by the query using the NOT EXIST command is simpler that the previous plan, with the heaviest operator in that plan is the Hash Match operator that again performs a Left Anti Semi Join partial join operation that checks for unmatched rows existence as described previously. This plan will be as follows:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

Comparing the previous execution plan generated by the query that uses the NOT EXIST command with the below execution plan that is generated by the query using the LEFT JOIN command, the new plan replaces the partial join with a FILTER operator that performs the IS NULL filtering for the data returned from the Right OUTER JOIN operator, that returns the matching rows from the second table which may include duplicates. This plan will be as shown below:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

The last execution plan generated by the query that uses the EXCEPT command contains also the Left Anti Semi Join partial join operation that checks for unmatched rows existence as shown previously. It also performs an Aggregate operation due to the large size of the table and the unsorted records on it. The Hash Aggregate process creates a hash table in the memory, that makes it a heavy operation, and a hash value will be calculated for each processed row and for each calculated hash value. After that, it checks the rows in the resulting hash bucket for the joining rows. The plan will be as shown below:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

We can conclude again from the previous execution plans generated by each used command that the best two methods are the ones using the NOT EXIST and LEFT JOIN commands. Recall that the data in the previous tables is not sorted due to the absence of the indexes. So, let us create an index on the joining column, the Cat_ID, in both tables using the T-SQL script below:

The execution plans cost summary window generated using ApexSQL Plan after running the previous SELECT statements, shows that the method that uses the NOT EXISTS command still the best one and the one using the EXCEPT command enhanced clearly after adding the indexes to the tables, as shown below:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

Checking the execution plan for the NOT EXIST command, the previous partial join operation is eliminated now and replaced by the Merge Join operator, as the data is sorted now in the tables after adding the indexes. The new plan will be as shown below:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

The query that uses the EXCEPT command enhanced clearly after adding the indexes to the tables and become one of the best methods to achieve our goal here. This also appeared in the below query execution plan, in which the previous partial join operation is also replaced by the Merge Join operator, as the data is sorted by adding the indexes. The Hash Aggregate operator is also replaced now with a Stream Aggregate operator as it aggregates a sorted data after adding the indexes.

The new plan will be as follows:

The screenshot is from ApexSQL Plan, a free tool to view and analyze SQL Server query execution plans

Conclusion:

SQL Server provides us with different ways to retrieve the same data, leaving it to the SQL Server developer to follow his own development approach to achieve that. For example, there are different ways that can be used to retrieve data from one table that doe not exist in another table. In this article, we described how to get such data using NOT IN, NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands after providing a brief description of each command and compare the performance of these queries. We conclude, first, that using the NOT EXISTS or the LEFT JOIN commands are the best choice from all performance aspects. We tried also to add an index on the joining column on both tables, where the query that uses the EXCEPT command enhanced clearly and showed a better performance, besides the NOT EXISTS command that still the best choice overall.

Useful links:


Ahmad Yaseen

Ahmad Yaseen

Ahmad Yaseen is a SQL Server database administration leader at Aramex International Company with a bachelor’s degree in computer engineering as well as .NET development experience.

He is a Microsoft Certified Professional with a good experience in SQL server development, administration, performance tuning, monitoring and high availability and disaster recovery technologies. Also, he is contributing with his SQL tips in many blogs

View all posts by Ahmad Yaseen
Ahmad Yaseen
SQL Database development

About Ahmad Yaseen

Ahmad Yaseen is a SQL Server database administration leader at Aramex International Company with a bachelor’s degree in computer engineering as well as .NET development experience. He is a Microsoft Certified Professional with a good experience in SQL server development, administration, performance tuning, monitoring and high availability and disaster recovery technologies. Also, he is contributing with his SQL tips in many blogs View all posts by Ahmad Yaseen

1,178 Views
  • Ahmad Zuhair Yaseen

    Many thanks Jeff for your input here. Will take that into consideration. Using the tools makes the simulation easier for us, also for the indexes, we tried the table without indexes then after adding indexes as you saw.

    Many thanks again.

    Best Regards,
    Ahmad

  • Brian Lockwood

    Jeff – FYI there is a free community edition for ApexSQL Generate that works on Azure, Amazon, SQL express if you use any of those systems.