This chapter doesn't introduce a new type of JOIN
, per se. Rather, it shows how we can join tables against the results of another query. The concept of joining remains the same, but the amount of code we write will seem to expand significantly. But there's no new syntax, other than more frequent use of aliasing tables:
SELECT
aliased_table.id, alias_table.things,
FROM some_table
AS aliased_table;
Don't get intimidated, just get into the habit of writing parts of the query, one step at a time. While the code looks convoluted, it's just because we're trying to effectively bypass the steps of manually creating new tables – which we could always go back to doing (but what's the fun in that).
Sometimes you want to join a table to itself.
For example, let's say we're interested in the number of names that were given to boys in 1980 but were also given to girls in 2014. We would use an INNER JOIN
like this:
SELECT COUNT(*)
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
WHERE b1980.sex = 'M'
AND b2014.sex = 'F';
COUNT(*) |
---|
1521 |
But what if we're interested in the number of 1980 boy names that were also girl names in 1980? It's still an INNER JOIN
, but with the data from 1980 instead of 2014. We could of course just make a duplicate of the babynames_1980
table, but that seems wasteful.
In fact, there's an easy way to refer to babynames_1980
twice in the same query: just use a different alias for each table reference:
SELECT COUNT(*)
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
AND b1980_y.sex = 'F';
COUNT(*) |
---|
1727 |
Let's modify that query slightly to get count the number of boy names that were also girl names in 1980, for names that were given to at least 500 babies of both sex:
SELECT COUNT(*)
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
AND b1980_y.sex = 'F'
AND b1980_x.babies >= 500
AND b1980_y.babies >- 500;
COUNT(*) |
---|
334 |
And let's change up the query to show the actual names, i.e. the 10 most popular boy name that were given to at least 1000 boys and 1000 girls in 1980
SELECT b1980_x.name,
b1980_x.babies AS boy_babies,
b1980_y.babies AS girl_babies
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
AND b1980_y.sex = 'F'
AND boy_babies >= 1000
AND girl_babies >= 1000
ORDER BY boy_babies DESC
LIMIT 10;
Only 6 results show up because apparently there are only 6 names that fit this category.
name | boy_babies | girl_babies |
---|---|---|
Casey | 3220 | 2172 |
Jamie | 2846 | 11523 |
Jaime | 1464 | 2645 |
Angel | 1418 | 1691 |
Shannon | 1330 | 9668 |
Kelly | 1184 | 13715 |
Let's try a similar query as above, except for the year 2014, and sorted by popularity of the name among girl babies. In the example below, I shorten the aliases even more, just because I don't want to type as much and because that's how aliases work.
(Warning: we typically spend far more time re-reading code than writing it…so there's a point where brevity can cause many more problems than the amount of work it takes to type things out.)
SELECT x.name,
x.babies AS boy_babies,
y.babies AS girl_babies
FROM babynames_2014 AS x
INNER JOIN babynames_2014 AS y
ON x.name = y.name
WHERE x.sex = 'M'
AND y.sex = 'F'
AND boy_babies >= 1000
AND girl_babies >= 1000
ORDER BY girl_babies DESC
LIMIT 10;
name | boy_babies | girl_babies |
---|---|---|
Avery | 2269 | 9517 |
Riley | 2429 | 4761 |
Peyton | 1828 | 4426 |
Alexis | 1172 | 4188 |
Emerson | 1048 | 1670 |
Hayden | 2851 | 1656 |
Charlie | 1670 | 1432 |
Parker | 5677 | 1386 |
Jordan | 6810 | 1300 |
Angel | 6255 | 1179 |
And that's basically all there is to self-joins: give a table an alias and it can be treated essentially as an entirely different table. This is much easier than creating duplicate tables just to be able to compare one table against itself.
A "nested" query is a query inside of another query.
As you've probably noticed by now, the result of any given query looks like it could be just another table. And so, why not query the result of a query? This is possible by wrapping up that nested query in parentheses and giving it an alias:
Here's a trivial example:
SELECT *
FROM (SELECT * FROM babynames_1980)
AS whatever
ORDER BY babies DESC
LIMIT 5;
How is that different than this:
SELECT *
FROM babynames_1980
AS whatever
ORDER BY babies DESC
LIMIT 5;
There is no difference (except more computational work on the part of the database). I just wanted to show off the syntax.
Here's a slightly more realistic example. First, let me try to describe the query in English:
Of the 20 most popular baby names in 1980 that begin with the letter "D", sort those top 20 names in alphabetical order:
SELECT *
FROM (
SELECT * FROM babynames_1980
WHERE name LIKE 'D%'
ORDER BY babies DESC
LIMIT 20
)
AS top_d_names
ORDER BY name
LIMIT 5;
name | sex | babies |
---|---|---|
Dana | F | 4438 |
Daniel | M | 29889 |
Danielle | F | 8844 |
Danny | M | 2266 |
Darren | M | 1530 |
There must be an easier way, you're thinking. Go ahead and try these simpler non-nested variations and see how their results differ.
The following gets names that begin with 'D'
in order of baby count:
SELECT * FROM babynames_1980
WHERE name LIKE 'D%'
ORDER BY babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
David | M | 41913 |
Daniel | M | 29889 |
Danielle | F | 8844 |
Dustin | M | 8419 |
Derek | M | 7245 |
But that's not quite what we want…we want the top 5 names in alphabetical order of the 20 most popular 'D'
names.
This simpler variation is just wrong:
SELECT * FROM babynames_1980
WHERE name LIKE 'D%'
ORDER BY name ASC, babies DESC
LIMIT 5;
},%Q{
name | sex | babies |
---|---|---|
Daaimah | F | 5 |
Daaiyah | F | 13 |
Daarina | F | 11 |
Dabney | F | 5 |
Dace | M | 6 |
})%>
OK, while the type of query I've proposed isn't super practical or interesting, I hope it illustrates the primary problem: we want to subset the original table (i.e. find top 20 'D'
-letter names from 1980), and then sort that subset. But we can't do that as a simple, single query, because all of the sorting and limiting happens at the very end.
What we want to do fundamentally requires at least 2 different queries.
OK, now that we're at least acquainted with the syntax of a nested query, and the fact that it is possible, let's go on to a more useful use case, albeit one that appears to be much more complicated and convoluted:
I want to see how the top 5 baby names in one year did in another year, e.g. how did the 5 most popular names in 1980 do in 2014?
Start out by thinking of the query to get the most popular names in 1980:
SELECT *
FROM babynames_1980
ORDER BY babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
Michael | M | 68666 |
Jennifer | F | 58385 |
Christopher | M | 49086 |
Jason | M | 48176 |
David | M | 41913 |
That result looks just like a table, right? In fact, we could just create a whole new table and insert the query results into that new table. Here's the series of statements to do that:
DROP TABLE IF EXISTS top1980babies;
CREATE TABLE "top1980babies"(
name TEXT,
sex TEXT,
babies INTEGER
);
INSERT INTO top1980babies
SELECT *
FROM babynames_1980
ORDER BY babies DESC
LIMIT 5;
Now, if we want to see how much the top 5 baby names in 1980 changed in popularity in 2014, we just do an INNER JOIN
from this new table, top1980babies
to babynames_2014
:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM top1980babies AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Michael | M | 68666 | 15323 |
Jennifer | F | 58385 | 1514 |
Christopher | M | 49086 | 10278 |
Jason | M | 48176 | 5510 |
David | M | 41913 | 12078 |
But creating a table – i.e. top1980babies
– that is a subset of babynames_1980
just to compare it to babynames_2014
seems like an extra step too many.
In fact, we can SELECT...FROM
directly from the results of a query, without having to create a new table. The strategy and concept is similar to a self-join: just give the query results an alias.
The query below will look convoluted, but just focus on the FROM
clause, which is where the only change is made between this query and the previous query that refers to top1980babies
:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM
(SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 5)
AS b1980
INNER JOIN babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex;
Yes, it looks much more complicated than the previous query, but the previous query relied on several queries to create a new table and inserting data into that table. The nested query example does all of that and the actual analytical query, i.e. "How popular in 2014 are the most 5 popular 1980 names?"
We can move that nested query into the JOIN
clause for the same effect – well, the same effect for an INNER JOIN
, in which the order of tables being joined doesn't matter:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
INNER JOIN (SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 5)
AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex;
It's worth asking why we need such a convoluted structure.
In other words, what's the difference between using a nested query with an INNER JOIN
to look at the top 5 names, versus the following simpler query:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
INNER JOIN babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_1980 DESC
LIMIT 5;
If we do a regular INNER JOIN
and order the result by the baby count of the 1980 table, and limit it to 5 results…isn't that the same as what we asked for with the nested query? T
Try it out for yourself and see the results:
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Michael | M | 68666 | 15323 |
Jennifer | F | 58385 | 1514 |
Christopher | M | 49086 | 10278 |
Jason | M | 48176 | 5510 |
David | M | 41913 | 12078 |
Hey! Those queries get us the exact same result.
Fair enough. Let's try the same query, except sort by babies_2014
: That is, display the most 5 popular 1980 baby names, ordered by how popular they are in 2014.
Seems like a simple fix, just change the ORDER BY
clause, right?
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
INNER JOIN babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_2014 DESC
LIMIT 5;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Emma | F | 534 | 20799 |
Olivia | F | 1117 | 19674 |
Noah | M | 929 | 19144 |
Sophia | F | 642 | 18490 |
Liam | M | 112 | 18342 |
The answer above doesn't work because it's showing the most popular baby names in 2014, and how they did in 1980.
The key to understanding this nuance is how the ORDER BY
and LIMIT
clauses come at the end of the query. The SELECT...FROM
and INNER JOIN
clauses have done their work: selecting data from one table, joining it to the other. And so the ORDER BY
and LIMIT
clauses do their work on the result of the join.
However, when we want to compare the top 5 names in 1980 against all the names in 2014 – we want to apply the ORDER BY
and LIMIT
to the babynames_1980
table – before it is joined to babynames_2014
. And after the joining is done, then we sort the result by babies_2014
in descending order.
Below is the proper way to get the result for the previous goal, i.e. "the most 5 popular 1980 baby names, ordered by how popular they are in 2014". I'm putting it side-by-side with the previous, incorrect query, so that the differences are easier to see:
The previous, incorrect query:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
INNER JOIN babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_2014 DESC
LIMIT 5;
The correct query, using a nested query in the INNER JOIN
:
SELECT b1980.name, b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
INNER JOIN (
SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 5
)
AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_2014 DESC;
The result of the correct query:
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Michael | M | 68666 | 15323 |
David | M | 41913 | 12078 |
Christopher | M | 49086 | 10278 |
Jason | M | 48176 | 5510 |
Jennifer | F | 58385 | 1514 |
Think of this as selecting not the entirety of babynames_1980
– but just its top 100 names – and then joining that against babynames_2014
:
SELECT b1980.name, b1980.sex,
b2014.babies - b1980.babies AS diffbabies
FROM (SELECT *
FROM babynames_1980
ORDER BY babies DESC
LIMIT 100)
AS b1980
INNER JOIN
babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY diffbabies
LIMIT 5;
name | sex | diffbabies |
---|---|---|
Jennifer | F | -56871 |
Michael | M | -53343 |
Jason | M | -42666 |
Christopher | M | -38808 |
Amanda | F | -34778 |
Same query as above, just a change in the second ORDER BY
statement
SELECT b1980.name, b1980.sex,
b2014.babies - b1980.babies AS diffbabies
FROM (SELECT *
FROM babynames_1980
ORDER BY babies DESC
LIMIT 100)
AS b1980
INNER JOIN
babynames_2014 AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY diffbabies DESC
LIMIT 5;
name | sex | diffbabies |
---|---|---|
Jacob | M | 8190 |
Samuel | M | 4850 |
Emily | F | 3557 |
Benjamin | M | 57 |
Nathan | M | -2237 |
The fact that "Nathan", with a drop in 2,237 babies, is in this result probably means that virtually all of the most popular 1980 names did not maintain their popularity in 2014 – unless there were far more babies total in 1980 versus 2014 (there weren't, you can query the database for yourself).
Nothing different here, just a LEFT JOIN
:
SELECT b2014.name, b2014.sex,
b2014.babies
FROM (SELECT *
FROM babynames_2014
ORDER BY babies DESC
LIMIT 500)
AS b2014
LEFT JOIN
babynames_1980 AS b1980
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.name IS NULL
ORDER BY b2014.babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
Aiden | M | 13296 |
Jayden | M | 12878 |
Madison | F | 10247 |
Zoey | F | 7358 |
Ayden | M | 5586 |
Don't panic here; this looks complicated but all we are doing is doing 2 nested queries: a query for the top 100 in 1980, and a query for the top 100 in 2014.
It looks ugly but should seem logically consistent:
SELECT COUNT(*)
FROM (
SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 100
)
AS b1980
INNER JOIN (
SELECT * FROM babynames_2014
ORDER BY babies DESC
LIMIT 100
)
AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex;
COUNT(*) |
---|
25 |
The same query as above, except with a GROUP BY
to aggregate by sex
:
SELECT b1980.sex, COUNT(*)
FROM (
SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 100
)
AS b1980
INNER JOIN (
SELECT * FROM babynames_2014
ORDER BY babies DESC
LIMIT 100
)
AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
GROUP BY b1980.sex;
sex | COUNT(*) |
---|---|
F | 2 |
M | 23 |
This is similar to the previous queries, but structured to show the actual names, not just an aggregate count.
SELECT b1980.name,
b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM (
SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 100
)
AS b1980
INNER JOIN (
SELECT * FROM babynames_2014
ORDER BY babies DESC
LIMIT 100
)
AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
ORDER BY babies_2014 DESC
LIMIT 10;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Jacob | M | 8522 | 16712 |
William | M | 25659 | 16687 |
Michael | M | 68666 | 15323 |
James | M | 39312 | 14301 |
Daniel | M | 29889 | 13829 |
Benjamin | M | 13630 | 13687 |
Matthew | M | 37857 | 12809 |
Emily | F | 9005 | 12562 |
David | M | 41913 | 12078 |
Joseph | M | 30189 | 11995 |
Let's perform a similar query but limit the result to girl names.
SELECT b1980.name,
b1980.sex,
b1980.babies AS babies_1980,
b2014.babies AS babies_2014
FROM (
SELECT * FROM babynames_1980
ORDER BY babies DESC
LIMIT 100
)
AS b1980
INNER JOIN (
SELECT * FROM babynames_2014
ORDER BY babies DESC
LIMIT 100
)
AS b2014
ON b1980.name = b2014.name
AND b1980.sex = b2014.sex
WHERE b1980.sex = 'F'
ORDER BY babies_2014 DESC
LIMIT 10;
name | sex | babies_1980 | babies_2014 |
---|---|---|---|
Emily | F | 9005 | 12562 |
Elizabeth | F | 19528 | 9492 |
As we saw earlier, only 2 of the 25 names that were in the top 100 for both years were female names, so this result makes sense.