This is a continuation of the previous chapter on INNER JOINs. But I'll repeat the beginning of the previous chapter, on what kind of things we want to find between tables t1980
and t2014
:
These are names that have apparently fallen out of favor in 2014:
babynames_1980
name | sex | babies |
---|---|---|
Jennifer | F | 58385 |
Amanda | F | 35818 |
Elizabeth | F | 19526 |
Emily | F | 9004 |
babynames_2014
name | sex | babies |
---|---|---|
Emma | F | 20799 |
Olivia | F | 19674 |
Emily | F | 12562 |
Elizabeth | F | 9492 |
These are names that apparently gained popularity in the several decades:
babynames_1980
name | sex | babies |
---|---|---|
Jennifer | F | 58385 |
Amanda | F | 35818 |
Elizabeth | F | 19526 |
Emily | F | 9004 |
babynames_2014
name | sex | babies |
---|---|---|
Emma | F | 20799 |
Olivia | F | 19674 |
Emily | F | 12562 |
Elizabeth | F | 9492 |
Everything I've explained in the INNER JOIN
chapter applies the same to LEFT JOIN
type queries, including the structure and syntax. The major change, of course, is that in order to do a LEFT JOIN
query, we write LEFT JOIN
instead of INNER JOIN
.
Take the basic INNER JOIN
that we've done, make it a LEFT JOIN
, and see what happens:
SELECT *
FROM t1980
LEFT JOIN t2014
ON t1980.name = t2014.name;
name | sex | babies | name | sex | babies |
---|---|---|---|---|---|
Jennifer | F | 58385 | NULL | NULL | NULL |
Amanda | F | 35818 | NULL | NULL | NULL |
Elizabeth | F | 19526 | Elizabeth | F | 9492 |
Emily | F | 9004 | Emily | F | 12562 |
Not only did we get all 4 rows of t1980
, we got the corresponding matches for t2014
when there were matches. And then rows full of NULL
.
What do those NULL
values represent? Remembering that we've joined 2 tables and thus, the result is the t1980
and t2014
tables, side-by-side. The names "Jennifer" and "Amanda" do not exist in t2014
. The corresponding NULL
values is how SQL denotes their nonexistence.
So NULL
is a term for specifying an existentially unknown or nonsensical state. It is not the same as a value being empty, in the same way that turning in an empty text file or blank sheet of paper is different than not turning anything at all.
So don't confuse NULL
with these kinds of "nothingness",
0
""
i.e. an empty or blank stringFALSE
- SQLite doesn't really have a FALSE
value, but other databases and languages do. Either way, NULL
is not FALSE
.For the purposes of our typical query work, take note that looking for NULL
– such as in a WHERE
or ON
condition – requires a different comparison syntax:
WHERE some_column IS NULL
But not:
WHERE some_column = NULL
When the result rows are as few as in the previous example, it might be enough to eyeball the rows and conclude, "Jennifer" and "Amanda" do not exist in t2014
. But again, that's not possible with a massive dataset. So let's add a WHERE
clause to our LEFT JOIN
query and look for NULL
:
SELECT *
FROM t1980
LEFT JOIN t2014
ON t1980.name = t2014.name
WHERE t2014.name IS NULL;
name | sex | babies | name | sex | babies |
---|---|---|---|---|---|
Jennifer | F | 58385 | NULL | NULL | NULL |
Amanda | F | 35818 | NULL | NULL | NULL |
In the above situation, there's generally not a reason to include the right-side of the table since, by definition of the LEFT JOIN
, the values will all be NULL
anyway.
So, to keep things aesthetically simple and to the point:
SELECT t1980.*
FROM t1980
LEFT JOIN t2014
ON t1980.name = t2014.name
WHERE t2014.name IS NULL;
name | sex | babies |
---|---|---|
Jennifer | F | 58385 |
Amanda | F | 35818 |
Unlike with INNER JOIN
queries, it does matter which table you start FROM
and then LEFT JOIN
to. The previous example queried for names from t1980
that do not appear in t2014
.
To find the inverse, i.e. names in t2014
that don't appear in t1980
, we just switch the order in which the tables appear in the query:
SELECT *
FROM t2014
LEFT JOIN t1980
ON t1980.name = t2014.name;
name | sex | babies | name | sex | babies |
---|---|---|---|---|---|
Emma | F | 20799 | NULL | NULL | NULL |
Olivia | F | 19674 | NULL | NULL | NULL |
Emily | F | 12562 | Emily | F | 9004 |
Elizabeth | F | 9492 | Elizabeth | F | 19526 |
The rest of the syntax for LEFT JOIN
is the same as with INNER JOIN
-type queries. So there's not much more to say specifically about LEFT JOIN
except that when we need to find differences between two tables, LEFT JOIN
is our go-to tool.
The rest of this tutorial is just going to be examples of INNER JOIN
and LEFT JOIN
in action, using the full babynames_1980
and babynames_2014
tables.
SELECT COUNT(DISTINCT babynames_1980.name)
FROM babynames_1980
INNER JOIN babynames_2014
ON babynames_1980.name = babynames_2014.name;
COUNT(DISTINCT babynames_1980.name) |
---|
9712 |
This just requires adding a GROUP BY
clause. Note that we doesn't matter whether we GROUP BY
the sex
column of babynames_1980
or babynames_2014
. We also don't need to use DISTINCT
because each name only shows up once per sex
value:
SELECT babynames_1980.sex, COUNT(*)
FROM babynames_1980
INNER JOIN babynames_2014
ON babynames_1980.name = babynames_2014.name
AND babynames_1980.sex = babynames_2014.sex
GROUP BY babynames_1980.sex;
sex | COUNT(*) |
---|---|
F | 5589 |
M | 4720 |
We can add a WHERE
clause to filter the joined results to a single name. Note that it doesn't matter whether the condition is babynames_1980.name = 'Jordan'
or babynames_2014.name = 'Jordan'
SELECT babynames_1980.sex,
babynames_1980.babies, babynames_2014.babies
FROM babynames_1980
INNER JOIN babynames_2014
ON babynames_1980.name = babynames_2014.name
AND babynames_1980.sex = babynames_2014.sex
WHERE babynames_1980.name = 'Jordan';
sex | babies | babies |
---|---|---|
F | 372 | 1300 |
M | 1616 | 6810 |
Same as above, except I introduce a new aesthetic technique: aliasing the table names, e.g. babynames_1980 AS b1980
. It just makes up for a lot less clutter when the table names are long. I've also aliased the respective babies
column names to make the result clearer:
SELECT b1980.sex,
b1980.babies AS babies_in_1980,
b2014.babies AS babies_in_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.name = 'Madison';
sex | babies_in_1980 | babies_in_2014 |
---|---|---|
M | 32 | 40 |
Hmmm…this result seems suspicious. Are there really no girls named "Madison" in 1980 or 2014? If either situation is the case, the INNER JOIN
won't show the combination. We'll have to see the answer when doing a LEFT JOIN later…
Keeping in mind that the Social Security Administration tracks names by sex of the baby, we match on both name
and sex
fields:
SELECT babynames_1980.name,
babynames_1980.sex,
babynames_1980.babies,
babynames_2014.babies
FROM babynames_1980
INNER JOIN babynames_2014
ON babynames_1980.name = babynames_2014.name
AND babynames_1980.sex = babynames_2014.sex
ORDER BY babynames_2014.babies DESC
LIMIT 5;
name | sex | babies | babies |
---|---|---|---|
Emma | F | 534 | 20799 |
Olivia | F | 1117 | 19674 |
Noah | M | 929 | 19144 |
Sophia | F | 642 | 18490 |
Liam | M | 112 | 18342 |
One conclusion we can make: names that are trendy in 2014 were not particularly trendy in 1980.
Let's look at the opposite: how popular are 1980 names, 15 years later?
The query is almost exactly the same except the column by which we ORDER BY
.
However, I introduce a new aesthetic technique: aliasing the table names, i.e. babynames_1980 AS b1980
. It just makes up for a lot less clutter when the table names are long. I've also aliased the respective babies
column names to make the result clearer:
SELECT b1980.name,
b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_1980 DESC
LIMIT 5;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Michael | M | 68666 | 15323 |
Jennifer | F | 58385 | 1514 |
Christopher | M | 49086 | 10278 |
Jason | M | 48176 | 5510 |
David | M | 41913 | 12078 |
We can't conclude much from these results, except that it appears boy names (which happened to have the most babies per name in 1980) didn't not suffer as much of a drop in popularity as girl names.
Same as above, except a WHERE
clause to filter for sex = 'F'
. Note that we still have to use sex
as a constraint in the JOIN
clause.
SELECT b1980.name,
b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.sex = 'F'
ORDER BY babies_1980 DESC
LIMIT 5;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Jennifer | F | 58385 | 1514 |
Amanda | F | 35820 | 1042 |
Jessica | F | 33920 | 1779 |
Melissa | F | 31631 | 1496 |
Sarah | F | 25741 | 4647 |
SELECT b1980.sex, COUNT(*)
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b2014.babies >= 10000
AND b1980.babies >= 10000
GROUP BY b1980.sex;
sex | COUNT(*) |
---|---|
M | 13 |
We can surmise that only boy names met the condition of having 10,000 babies in both years.
SELECT b1980.name,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE babies_1980 >= 1000
AND babies_2014 >= 1000
AND b1980.sex = 'F'
ORDER BY babies_2014 DESC
LIMIT 5;
name | babies_1980 | babies_2014 |
---|---|---|
Olivia | 1117 | 19674 |
Emily | 9005 | 12562 |
Abigail | 1576 | 11985 |
Elizabeth | 19528 | 9492 |
Victoria | 2923 | 7955 |
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014,
b2014.babies - b1980.babies AS diffbabies
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY diffbabies DESC
LIMIT 5;
name | sex | babies_1980 | babies_2014 | diffbabies |
---|---|---|---|---|
Emma | F | 534 | 20799 | 20265 |
Olivia | F | 1117 | 19674 | 18557 |
Liam | M | 112 | 18342 | 18230 |
Noah | M | 929 | 19144 | 18215 |
Sophia | F | 642 | 18490 | 17848 |
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014,
b2014.babies - b1980.babies AS diffbabies
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY diffbabies ASC
LIMIT 5;
name | sex | babies_1980 | babies_2014 | diffbabies |
---|---|---|---|---|
Jennifer | F | 58385 | 1514 | -56871 |
Michael | M | 68666 | 15323 | -53343 |
Jason | M | 48176 | 5510 | -42666 |
Christopher | M | 49086 | 10278 | -38808 |
Amanda | F | 35820 | 1042 | -34778 |
Note that it doesn't matter whether we order by babies_2014
or by babies_1980
:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
AND b1980.babies = b2014.babies
ORDER BY babies_2014 DESC
LIMIT 5;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Sunny | F | 201 | 201 |
Sirena | F | 53 | 53 |
Zena | F | 51 | 51 |
Kendal | M | 50 | 50 |
Jewell | F | 49 | 49 |
SELECT b1980.name,
b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014,
ABS(b2014.babies - b1980.babies) AS diffbabies
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE
babies_1980 >= 1000
AND babies_2014 >= 1000
ORDER BY diffbabies ASC
LIMIT 5;
name | sex | babies_1980 | babies_2014 | diffbabies |
---|---|---|---|---|
Summer | F | 1754 | 1759 | 5 |
Rafael | M | 1185 | 1195 | 10 |
Annie | F | 1002 | 1046 | 44 |
Wesley | M | 3164 | 3112 | 52 |
Benjamin | M | 13630 | 13687 | 57 |
TK: Make sure your table is indexed
SELECT b1980.name,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
WHERE babies_2014 >= 500
AND b1980.sex = 'M'
AND b2014.sex = 'F'
ORDER BY babies_1980 DESC
LIMIT 5;
name | babies_1980 | babies_2014 |
---|---|---|
Ryan | 24801 | 603 |
Jamie | 2846 | 633 |
Jordan | 1616 | 1300 |
Angel | 1418 | 1179 |
Kelly | 1184 | 734 |
SELECT COUNT(DISTINCT b1980.name)
FROM babynames_1980 AS b1980
LEFT JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
WHERE b2014.name IS NULL;
COUNT(DISTINCT b1980.name) |
---|
8000 |
SELECT b1980.sex,
COUNT(DISTINCT b1980.name)
FROM babynames_1980 AS b1980
LEFT JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b2014.name IS NULL
GROUP BY b1980.sex;
sex | COUNT(DISTINCT b1980.name) |
---|---|
F | 6568 |
M | 2562 |
SELECT b2014.name, b2014.sex, b2014.babies
FROM babynames_2014 AS b2014
LEFT JOIN babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.name IS NULL
AND b2014.sex = 'F'
ORDER BY b2014.babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
Madison | F | 10247 |
Zoey | F | 7358 |
Nevaeh | F | 4188 |
Khloe | F | 3518 |
Adalynn | F | 2512 |
SELECT b1980.name, b1980.sex, b1980.babies
FROM babynames_1980 AS b1980
LEFT JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b2014.name IS NULL
AND b1980.sex = 'F'
ORDER BY b1980.babies DESC
LIMIT 5;
},%Q{
name | sex | babies |
---|---|---|
Lakisha | F | 912 |
Tameka | F | 843 |
Lakesha | F | 585 |
Lashonda | F | 474 |
Shameka | F | 452 |
})%>
SELECT b1980.name, b1980.sex, b1980.babies
FROM babynames_1980 AS b1980
LEFT JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b2014.name IS NULL
AND b1980.sex = 'M'
ORDER BY b1980.babies DESC
LIMIT 5;
},%Q{
name | sex | babies |
---|---|---|
Brain | M | 177 |
Melissa | M | 138 |
Jessica | M | 125 |
Michelle | M | 100 |
Jarad | M | 90 |
})%>
I'm not sure about the veracity of those numbers, i.e. that naming boys "Jessica" or "Melissa" was a thing in the 1980s and not so much in 2014. It could be that some records were just misclassified. As for 177 boys named "Brain"…are those typos for "Brian"?
Earlier, we tried comparing the popularity of "Madison" as both a boy and girl name using an INNER JOIN
. But no result showed up for girl babies, which means that "Madison" wasn't a popular name in either 1980 or 2014.
That's where a LEFT JOIN
can help; all entries on the "left"-side of the LEFT JOIN
will be in the result, regardless if they have a proper match in the other table:
SELECT b1980.sex,
b1980.babies AS babies_in_1980,
b2014.babies AS babies_in_2014
FROM babynames_1980 AS b1980
LEFT JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.name = 'Madison';
sex | babies_in_1980 | babies_in_2014 |
---|---|---|
M | 32 | 40 |
Hmm…that got us the same answer as with the INNER JOIN
…that must mean that either "Madison" is not a popular girl name in 2014 either or it wasn't a popular girl name in 1980. To find out, we repeat the previous query except switch the order of the tables.
We also have to modify the WHERE
condition and selected column names to get the proper result – basically, choose from the 2014 table instead of the 1980 table:
SELECT b2014.sex,
b1980.babies AS babies_in_1980,
b2014.babies AS babies_in_2014
FROM babynames_2014 AS b2014
LEFT JOIN babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b2014.name = 'Madison';
},%Q{
sex | babies_in_1980 | babies_in_2014 |
---|---|---|
F | NULL | 10247 |
M | 32 | 40 |
})%>
Basically, "Madison" was non-existent as a girl name in 1980, but became quite popular by 2014.