Tutorials
Practical T-SQL
Practical T-SQL
  • Practical T-SQL Pocket Guide For Beginners
  • Preface
    • Section I. A Note From The Author
    • Section II. Tutorial Overview
    • Section III. Running The Examples
    • Section IV. How To Report An Issue
    • Section V. Join The MSU Community On Discord
    • Section VI. Supplementary Material
  • Language Basics
    • Lesson 1. Obligatory Hello World
    • Lesson 2. Code Comments With T-SQL
    • Lesson 3. Basic Syntax
    • Lesson 4. Your First Query
    • Lesson 5. Filtering Data
    • Lesson 6. Sorting Data
    • Lesson 7. Complex Data Filtering
    • Lesson 8. Aliases
    • Lesson 9. String Functions
    • Lesson 10. Creating New Columns From Existing Data (Calculated Fields)
    • Lesson 11. Displaying Data Based On Conditions (Case Statement)
    • Lesson 12. Aggregate Functions
    • Lesson 13. Grouping And Summarizing Data
    • Lesson 14. Querying More Than One Table
    • Lesson 15. Combining Queries
    • Lesson 16. Subqueries
    • Lesson 17. Creating Data
    • Lesson 18. Updating Data
    • Lesson 19. Deleting Data
    • Lesson 20. Common Table Expressions (CTEs)
    • Lesson 21. Derived Tables
    • Lesson 22. Putting It All Together
  • Advanced Topics
    • Lesson 23. Selecting Unique Values
    • Lesson 24. Updating Data With A Join
    • Lesson 25. Data Types
    • Lesson 26. Casting Data Types
    • Lesson 27. Creating Tables
    • Lesson 28. Altering Tables
    • Lesson 29. Dropping Tables
    • Lesson 30. Variables
    • Lesson 31. Controlling Flow
    • Lesson 32. Looping
    • Lesson 33. Error Processing
    • Lesson 34. Temporary Tables
    • Lesson 35. Views
    • Lesson 36. Indexed Views
    • Lesson 37. User Defined Functions
    • Lesson 38. Stored Procedures
    • Lesson 39. BULK INSERT
    • Lesson 40. Loading Tables With MERGE
    • Lesson 41. Partitioning A Dataset
    • Lesson 42. Pivoting Data
    • Lesson 43. Dynamic SQL
    • Lesson 44. Cursors
  • Solutions To Real World Problems
    • Lesson 45. Listing All Tables In A SQL Server Database
    • Lesson 46. Listing All Columns In A SQL Server Database
    • Lesson 47. Pull Records From A Table At Random
    • Lesson 48. A Better Alternative To WITH (NOLOCK)
    • Lesson 49. Boost Performance When Calling A Stored Proc From SSIS
    • Lesson 50. Setting Up Queries For Ablation Testing
    • Lesson 51. Reduce Code And Save Time With Default Column Values
    • Lesson 52. Finding Duplicate Records In A Table
    • Lesson 53. Why You Cannot Have More Than One Clustered Index On A Table
    • Lesson 54. Converting Dates To YYYYMMDD
    • Lesson 55. Sending Notification Emails With T-SQL Without Using Hardcoded Email Addresses
    • Lesson 56. Troubleshooting Long Running Queries
    • Lesson 57. Loading Large CSVs Into Data Warehouse Staging Tables
    • Lesson 58. The Only Bloody Good Reason To Use Cursors
    • Lesson 59. Loading A Type II Slowly Changing Dimension With SQL Merge
    • Lesson 60. A Clearer Explanation Of The Parameters Of The Numeric Data Type
    • Lesson 61. Why You Cannot Join On Null Values
    • Lesson 62. A Deep Dive On How The Where Clause Functions
    • Lesson 63. Using HASHBYTES() To Compare Character Strings
    • Lesson 64. Using Pipe To Hash Multiple Columns For Matching
    • Lesson 65. Why People That Indent Code Drive Me Nuts
    • Lesson 66. How To Rapidly Stand Up A Data Warehouse From Scratch
    • Lesson 67. How To Pivot Data With T-SQL When Columns Are Not Predefined
    • Lesson 68. Prepopulating A Junk Dimension
Powered by GitBook
On this page
  1. Solutions To Real World Problems

Lesson 62. A Deep Dive On How The Where Clause Functions

Below is the transcript of a fascinating discussion I had on how the WHERE clause actually functions.

Bob

I’ve been out in industry for over 10 years now doing DB work most of that time. Lately I’ve been doing some work that is starting to challenge my long held assumptions about the internal workings of SQL Server, specifically what exactly is happening when I do joins and where clauses.

I was explaining to someone that when doing a join putting in a where clause limits the amount of records that are available to do the join on thereby making the query faster. My thought was if a table has 100MM records but only 50MM of those meet the WHERE criteria then the JOIN clause wouldn’t try matching the 50MM records that don’t meet the WHERE clause.

I did some work that afternoon and the empirical results didn’t match the expected outcome of a query I had written in terms of execution time. I don’t think my understanding of what goes on during a JOIN is correct. Can someone educate me?

Person 1

There is a definition of the order of the process of JOINs and the WHERE clause. Then there is what SQL actually does. These may be different.

But the definition is the FROM clause including any JOINs is done first. In general, if you have multiple JOINs, they are done left to right in the order of the ON clauses. So, if you have

FROM A LEFT JOIN B INNER JOIN C ON B.ID = C.ID ON A.ID + B.ID

then first B is joined to C and then that result is joined to A. Of course, in most cases people write the ON clause immediately after the JOIN, so you will see something like

FROM A LEFT JOIN B ON A.ID = B.ID LEFT JOIN C ON A.ID = C.ID

And, the rules of which join is done first can become complex if you are mixing both OUTER JOINs and CROSS JOINs.

But the WHERE is done after all of the JOINs have been processed.

But that's only the definition of what is theoretically done. When the query is actuality done, SQL can (and does) process the query in any manner it thinks is most efficient as long as the result is the same as if it had followed the above definition. In particular, if your query has only INNER JOINs and CROSS JOINs and a WHERE clause, then the order things are done will make no difference in the result. So in that case, SQL will do the joins and where in any order it thinks is fastest. So that could mean doing the joins first, or doing the where first, or doing part of the where first, followed by some of the joins followed by the rest of the where clause followed by the rest of the joins or whatever SQL thinks will be fastest given the available indexes and statistics.

Person 2

The important concept here is that in SQL you specify the result you want, not the algorithm to create that result. It's helpful to think about the FROM clause executing first, followed by the WHERE clause in order to reason about the results and the correctness of your query. But that's not how SQL Server actually runs the query.

The Query Optimizer looks as the query, the join criteria and the where clauses and considers several different algorithms (query plans) to get the specified result. It assigns a "cost" to each potential plan and then executes whichever plan has the lowest cost.

So here it might start with a where clause, or it might start by joining tables and apply the where clause criteria later.

Person 3

Here is how a SELECT works in SQL ... at least in theory. Real products will optimize things, but the code has to produce the same results.

  1. Effectively materialize the CTEs in the optional WITH clause. CTEs come into existence in the order they are declared so only backward references are allowed. A CTE can be recursive. Think of them as VIEWs that exist only in the scope of the query. In practice, if they are used once, then they are implemented as an in-line macro.

  2. Start in the FROM clause and build a working table from all of the joins, unions, intersections, and whatever other table constructors are there. The AS option allows you give a name to this working table which you then have to use for the rest of the containing query. There are UNION, INTERSECT and EXCEPT set constructors, LATERAL tables, table-valued functions and all kinds of things happening in here.

  3. Go to the WHERE clause and remove rows that do not pass criteria; that is, that do not test to TRUE (i.e. reject UNKNOWN and FALSE). The WHERE clause is applied to the working set in the FROM clause.

  4. Go to the optional GROUP BY clause, partition the original table into groups and reduce each grouping to a single row, replacing the original working table with the new grouped table. The rows of a grouped table must be only group characteristics:

    • a grouping column

    • a statistic about the group (i.e. aggregate functions)

    • a function or constant

    • an expression made up of only those three items

    • The original table no longer exists and you cannot reference anything in it (this was an error in early Sybase products)

  5. Go to the optional HAVING clause and apply it against the grouped working table; if there was no GROUP BY clause, treat the entire table as one group.

  6. Go to the SELECT clause and construct the expressions in the list. This means that the scalar subqueries, function calls and expressions in the SELECT are done after all the other clauses are done. The AS operator can also give names to expressions in the SELECT list. These new names come into existence all at once, but after the WHERE clause, GROUP BY clause and HAVING clause have been executed; you cannot use them in the SELECT list or the WHERE clause for that reason. If there is a SELECT DISTINCT, then redundant duplicate rows are removed. For purposes of defining a duplicate row, NULLs are treated as matching (just like in the GROUP BY).

  7. Nested query expressions follow the usual scoping rules you would expect from a block structured language like C, Pascal, Algol, etc. Namely, the innermost queries can reference columns and tables in the queries in which they are contained.

  8. The ORDER BY clause is part of a cursor, not a query. The result set is passed to the cursor, which can only see the names in the SELECT clause list, and the sorting is done there. The ORDER BY clause cannot have expression in it or references to other columns because the result set has been converted into a sequential file structure and that is what is being sorted.

As you can see, things happen "all at once" in SQL, not "from left to right" as they would in a sequential file/procedural language model. In those languages, these two statements produce different results:

READ (a, b, c) FROM File_X;

READ (c, a, b) FROM File_X;

While these two statements return the same data:

SELECT a, b, c FROM Table_X;

SELECT c, a, b FROM Table_X;

Think about what a confused mess this statement is in the SQL model.

SELECT f(c2) AS c1, f(c1) AS c2 FROM Foobar;

That is why such nonsense is illegal syntax.

Bob

I have a ton of books on T-SQL but none of them go into the real details of how SQL Server does it's thing. If I take all of your comments this is my take away. Let me know if you agree.

  1. SQL Server DOES work like I expect. Using a where clause WILL limit the amount of records it attempts to join on. (Based on comment from Person 1 in another thread.)

  2. It does this by optimization not because of anything that I write. It will take a look at the query and then on it's own come up with the most effective way to execute.

Person 1

  1. I would say it CAN work like you expect and a where clause CAN limit the number of rows it attempts to join. But SQL also might do the join first and only then apply the where clause. Whatever way SQL thinks is more efficient as long as that way generates the correct result.

  2. As previously noted, when you write a SQL statement, you should think of it as you are telling SQL the result you want and then SQL generates that result in the manner it believes to be most efficient.

PreviousLesson 61. Why You Cannot Join On Null ValuesNextLesson 63. Using HASHBYTES() To Compare Character Strings

Last updated 3 years ago