Lesson 61. Why You Cannot Join On Null Values
This is a fairly common and insidious error because you might not catch the problem if you are working with a large amount of data. Basically, NULL means unknown and unknown cannot equal unknown, so you cannot join records on null values. The best thing to do is not have NULLs in candidate key fields.
When I was working this problem and mentioned this was weird, another data engineer had this to say:
No, it is not weird. It actually follows from the definitions in the ANSI/ISO standards. The in fixed notation that we use now, was actually a result of an article published by Chris Date in one of the database newsstand magazines of the time (he and I had columns in DBMS and Database Programming and Design, and they were both owned by the same publisher at one point).
The actual syntax was not proposed by Date, by one of the ANSI X3H2 committee members. The rule is that infixed joins are evaluated strictly left to right, and the ON clause associates with the nearest join. If I have a really complicated set of inner joins, I prefer to use the original set-oriented notation (no infixed operators). People that use infixed inner join have a mindset locked into + versus "big Sigma" from mathematics.
Examples
A Walk Through Of The Problem And Solution
My expectation here would be to get back results such that I get all the genders and only matching records for toys. However, when I apply a filter on toys, the dataset has a null value for gender ID in the toys table. My left outer join now behaves as an inner join.
In [ ]:
Here is the correct approach to the problem.
In [ ]:
Here is an approach that filters first by encapsulating the filter logic in a CTE.
In [ ]:
Last updated