This chapter doesn't introduce a new type of JOIN, per se. Rather, it shows how we can join tables against the results of another query. The concept of joining remains the same, but the amount of code we write will seem to expand significantly. But there's no new syntax, other than more frequent use of aliasing tables:

SELECT 
  aliased_table.id, alias_table.things,
FROM some_table 
    AS aliased_table;

Don't get intimidated, just get into the habit of writing parts of the query, one step at a time. While the code looks convoluted, it's just because we're trying to effectively bypass the steps of manually creating new tables – which we could always go back to doing (but what's the fun in that).

Self-joins

Sometimes you want to join a table to itself.

For example, let's say we're interested in the number of names that were given to boys in 1980 but were also given to girls in 2014. We would use an INNER JOIN like this:

SELECT COUNT(*)
FROM babynames_1980 AS b1980
INNER JOIN babynames_2014 AS b2014
  ON b1980.name = b2014.name
WHERE b1980.sex = 'M'
  AND b2014.sex = 'F';
COUNT(*)
1521

But what if we're interested in the number of 1980 boy names that were also girl names in 1980? It's still an INNER JOIN, but with the data from 1980 instead of 2014. We could of course just make a duplicate of the babynames_1980 table, but that seems wasteful.

In fact, there's an easy way to refer to babynames_1980 twice in the same query: just use a different alias for each table reference:

SELECT COUNT(*)
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
  ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
  AND b1980_y.sex = 'F';
COUNT(*)
1727

Let's modify that query slightly to get count the number of boy names that were also girl names in 1980, for names that were given to at least 500 babies of both sex:

SELECT COUNT(*)
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
  ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
  AND b1980_y.sex = 'F'
  AND b1980_x.babies >= 500 
  AND b1980_y.babies >- 500;
COUNT(*)
334

And let's change up the query to show the actual names, i.e. the 10 most popular boy name that were given to at least 1000 boys and 1000 girls in 1980

SELECT b1980_x.name, 
  b1980_x.babies AS boy_babies,
  b1980_y.babies AS girl_babies
FROM babynames_1980 AS b1980_x
INNER JOIN babynames_1980 AS b1980_y
  ON b1980_x.name = b1980_y.name
WHERE b1980_x.sex = 'M'
  AND b1980_y.sex = 'F'
  AND boy_babies >= 1000 
  AND girl_babies >= 1000
ORDER BY boy_babies DESC
LIMIT 10;

Only 6 results show up because apparently there are only 6 names that fit this category.

name boy_babies girl_babies
Casey 3220 2172
Jamie 2846 11523
Jaime 1464 2645
Angel 1418 1691
Shannon 1330 9668
Kelly 1184 13715

Let's try a similar query as above, except for the year 2014, and sorted by popularity of the name among girl babies. In the example below, I shorten the aliases even more, just because I don't want to type as much and because that's how aliases work.

(Warning: we typically spend far more time re-reading code than writing it…so there's a point where brevity can cause many more problems than the amount of work it takes to type things out.)

SELECT x.name, 
  x.babies AS boy_babies,
  y.babies AS girl_babies
FROM babynames_2014 AS x
INNER JOIN babynames_2014 AS y
  ON x.name = y.name
WHERE x.sex = 'M'
  AND y.sex = 'F'
  AND boy_babies >= 1000 
  AND girl_babies >= 1000
ORDER BY girl_babies DESC
LIMIT 10;
name boy_babies girl_babies
Avery 2269 9517
Riley 2429 4761
Peyton 1828 4426
Alexis 1172 4188
Emerson 1048 1670
Hayden 2851 1656
Charlie 1670 1432
Parker 5677 1386
Jordan 6810 1300
Angel 6255 1179

And that's basically all there is to self-joins: give a table an alias and it can be treated essentially as an entirely different table. This is much easier than creating duplicate tables just to be able to compare one table against itself.

Nested queries

A "nested" query is a query inside of another query.

As you've probably noticed by now, the result of any given query looks like it could be just another table. And so, why not query the result of a query? This is possible by wrapping up that nested query in parentheses and giving it an alias:

Here's a trivial example:

SELECT * 
FROM (SELECT * FROM babynames_1980) 
  AS whatever
ORDER BY babies DESC
LIMIT 5;

How is that different than this:

SELECT * 
FROM babynames_1980 
  AS whatever
ORDER BY babies DESC
LIMIT 5;

There is no difference (except more computational work on the part of the database). I just wanted to show off the syntax.

Here's a slightly more realistic example. First, let me try to describe the query in English:

Of the 20 most popular baby names in 1980 that begin with the letter "D", sort those top 20 names in alphabetical order:

SELECT * 
FROM (
    SELECT * FROM babynames_1980 
    WHERE name LIKE 'D%'
    ORDER BY babies DESC
    LIMIT 20
  ) 
  AS top_d_names
ORDER BY name
LIMIT 5;
name sex babies
Dana F 4438
Daniel M 29889
Danielle F 8844
Danny M 2266
Darren M 1530

There must be an easier way, you're thinking. Go ahead and try these simpler non-nested variations and see how their results differ.

The following gets names that begin with 'D' in order of baby count:

SELECT * FROM babynames_1980 
WHERE name LIKE 'D%'
ORDER BY babies DESC
LIMIT 5;
name sex babies
David M 41913
Daniel M 29889
Danielle F 8844
Dustin M 8419
Derek M 7245

But that's not quite what we want…we want the top 5 names in alphabetical order of the 20 most popular 'D' names.

This simpler variation is just wrong:

SELECT * FROM babynames_1980 
WHERE name LIKE 'D%'
ORDER BY name ASC, babies DESC
LIMIT 5;

},%Q{

name sex babies
Daaimah F 5
Daaiyah F 13
Daarina F 11
Dabney F 5
Dace M 6

})%>

OK, while the type of query I've proposed isn't super practical or interesting, I hope it illustrates the primary problem: we want to subset the original table (i.e. find top 20 'D'-letter names from 1980), and then sort that subset. But we can't do that as a simple, single query, because all of the sorting and limiting happens at the very end.

What we want to do fundamentally requires at least 2 different queries.

Another practical example of a nested query

OK, now that we're at least acquainted with the syntax of a nested query, and the fact that it is possible, let's go on to a more useful use case, albeit one that appears to be much more complicated and convoluted:

I want to see how the top 5 baby names in one year did in another year, e.g. how did the 5 most popular names in 1980 do in 2014?

Start out by thinking of the query to get the most popular names in 1980:

SELECT * 
FROM babynames_1980
ORDER BY babies DESC
LIMIT 5;
name sex babies
Michael M 68666
Jennifer F 58385
Christopher M 49086
Jason M 48176
David M 41913

That result looks just like a table, right? In fact, we could just create a whole new table and insert the query results into that new table. Here's the series of statements to do that:

DROP TABLE IF EXISTS top1980babies;
CREATE TABLE "top1980babies"(
  name TEXT,
  sex TEXT,
  babies INTEGER
);

INSERT INTO top1980babies
SELECT * 
FROM babynames_1980
ORDER BY babies DESC
LIMIT 5;

Now, if we want to see how much the top 5 baby names in 1980 changed in popularity in 2014, we just do an INNER JOIN from this new table, top1980babies to babynames_2014:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
  FROM top1980babies AS b1980
INNER JOIN babynames_2014 AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex;
name sex babies_1980 babies_2014
Michael M 68666 15323
Jennifer F 58385 1514
Christopher M 49086 10278
Jason M 48176 5510
David M 41913 12078

But creating a table – i.e. top1980babies – that is a subset of babynames_1980 just to compare it to babynames_2014 seems like an extra step too many.

In fact, we can SELECT...FROM directly from the results of a query, without having to create a new table. The strategy and concept is similar to a self-join: just give the query results an alias.

The query below will look convoluted, but just focus on the FROM clause, which is where the only change is made between this query and the previous query that refers to top1980babies:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014

  FROM 
    (SELECT * FROM babynames_1980 
          ORDER BY babies DESC
          LIMIT 5) 
     AS b1980

INNER JOIN babynames_2014 AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex;

Yes, it looks much more complicated than the previous query, but the previous query relied on several queries to create a new table and inserting data into that table. The nested query example does all of that and the actual analytical query, i.e. "How popular in 2014 are the most 5 popular 1980 names?"

Joining against a nested query

We can move that nested query into the JOIN clause for the same effect – well, the same effect for an INNER JOIN, in which the order of tables being joined doesn't matter:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
    
INNER JOIN (SELECT * FROM babynames_1980 
    ORDER BY babies DESC
    LIMIT 5) 
  AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex;

Why nested queries?

It's worth asking why we need such a convoluted structure.

In other words, what's the difference between using a nested query with an INNER JOIN to look at the top 5 names, versus the following simpler query:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
    
INNER JOIN babynames_1980 AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex

ORDER BY babies_1980 DESC
LIMIT 5;

If we do a regular INNER JOIN and order the result by the baby count of the 1980 table, and limit it to 5 results…isn't that the same as what we asked for with the nested query? T

Try it out for yourself and see the results:

name sex babies_1980 babies_2014
Michael M 68666 15323
Jennifer F 58385 1514
Christopher M 49086 10278
Jason M 48176 5510
David M 41913 12078

Hey! Those queries get us the exact same result.

Fair enough. Let's try the same query, except sort by babies_2014: That is, display the most 5 popular 1980 baby names, ordered by how popular they are in 2014.

Seems like a simple fix, just change the ORDER BY clause, right?

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
    
INNER JOIN babynames_1980 AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex

ORDER BY babies_2014 DESC
LIMIT 5;
name sex babies_1980 babies_2014
Emma F 534 20799
Olivia F 1117 19674
Noah M 929 19144
Sophia F 642 18490
Liam M 112 18342

The answer above doesn't work because it's showing the most popular baby names in 2014, and how they did in 1980.

The key to understanding this nuance is how the ORDER BY and LIMIT clauses come at the end of the query. The SELECT...FROM and INNER JOIN clauses have done their work: selecting data from one table, joining it to the other. And so the ORDER BY and LIMIT clauses do their work on the result of the join.

However, when we want to compare the top 5 names in 1980 against all the names in 2014 – we want to apply the ORDER BY and LIMIT to the babynames_1980 table – before it is joined to babynames_2014. And after the joining is done, then we sort the result by babies_2014 in descending order.

Below is the proper way to get the result for the previous goal, i.e. "the most 5 popular 1980 baby names, ordered by how popular they are in 2014". I'm putting it side-by-side with the previous, incorrect query, so that the differences are easier to see:

The previous, incorrect query:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
    
INNER JOIN babynames_1980 AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex

ORDER BY babies_2014 DESC
LIMIT 5;

The correct query, using a nested query in the INNER JOIN:

SELECT b1980.name, b1980.sex,
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM babynames_2014 AS b2014
    
INNER JOIN (
    SELECT * FROM babynames_1980
    ORDER BY babies DESC
    LIMIT 5
  ) 
  AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex
ORDER BY babies_2014 DESC;

The result of the correct query:

name sex babies_1980 babies_2014
Michael M 68666 15323
David M 41913 12078
Christopher M 49086 10278
Jason M 48176 5510
Jennifer F 58385 1514

More nested query fun

From the top 100 names in 1980, show the 5 names that dropped the most in popularity in 2014

Think of this as selecting not the entirety of babynames_1980 – but just its top 100 names – and then joining that against babynames_2014:

SELECT b1980.name, b1980.sex, 
  b2014.babies - b1980.babies AS diffbabies
  FROM (SELECT * 
  FROM babynames_1980
  ORDER BY babies DESC
  LIMIT 100) 
  AS b1980
INNER JOIN
  babynames_2014 AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex
ORDER BY diffbabies
LIMIT 5;
name sex diffbabies
Jennifer F -56871
Michael M -53343
Jason M -42666
Christopher M -38808
Amanda F -34778

From the top 100 names in 1980, show the 5 names that increased the most in popularity

Same query as above, just a change in the second ORDER BY statement

SELECT b1980.name, b1980.sex, 
  b2014.babies - b1980.babies AS diffbabies
  FROM (SELECT * 
  FROM babynames_1980
  ORDER BY babies DESC
  LIMIT 100) 
  AS b1980
INNER JOIN
  babynames_2014 AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex
ORDER BY diffbabies DESC
LIMIT 5;
name sex diffbabies
Jacob M 8190
Samuel M 4850
Emily F 3557
Benjamin M 57
Nathan M -2237

The fact that "Nathan", with a drop in 2,237 babies, is in this result probably means that virtually all of the most popular 1980 names did not maintain their popularity in 2014 – unless there were far more babies total in 1980 versus 2014 (there weren't, you can query the database for yourself).

Nothing different here, just a LEFT JOIN:

SELECT b2014.name, b2014.sex, 
  b2014.babies
  FROM (SELECT * 
  FROM babynames_2014
  ORDER BY babies DESC
  LIMIT 500) 
  AS b2014
LEFT JOIN
  babynames_1980 AS b1980
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex
WHERE b1980.name IS NULL
ORDER BY b2014.babies DESC
LIMIT 5;
name sex babies
Aiden M 13296
Jayden M 12878
Madison F 10247
Zoey F 7358
Ayden M 5586

Don't panic here; this looks complicated but all we are doing is doing 2 nested queries: a query for the top 100 in 1980, and a query for the top 100 in 2014.

It looks ugly but should seem logically consistent:

SELECT COUNT(*)
FROM (
    SELECT * FROM babynames_1980
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b1980
INNER JOIN (
    SELECT * FROM babynames_2014
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex;
COUNT(*)
25

The same query as above, except with a GROUP BY to aggregate by sex:

SELECT b1980.sex, COUNT(*)
FROM (
    SELECT * FROM babynames_1980
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b1980
INNER JOIN (
    SELECT * FROM babynames_2014
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex
GROUP BY b1980.sex;
sex COUNT(*)
F 2
M 23

This is similar to the previous queries, but structured to show the actual names, not just an aggregate count.

SELECT b1980.name,
  b1980.sex, 
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM (
    SELECT * FROM babynames_1980
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b1980
INNER JOIN (
    SELECT * FROM babynames_2014
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex

ORDER BY babies_2014 DESC
LIMIT 10;

name sex babies_1980 babies_2014
Jacob M 8522 16712
William M 25659 16687
Michael M 68666 15323
James M 39312 14301
Daniel M 29889 13829
Benjamin M 13630 13687
Matthew M 37857 12809
Emily F 9005 12562
David M 41913 12078
Joseph M 30189 11995

Let's perform a similar query but limit the result to girl names.

SELECT b1980.name,
  b1980.sex, 
  b1980.babies AS babies_1980,
  b2014.babies AS babies_2014
FROM (
    SELECT * FROM babynames_1980
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b1980
INNER JOIN (
    SELECT * FROM babynames_2014
    ORDER BY babies DESC
    LIMIT 100
  ) 
  AS b2014
  ON b1980.name = b2014.name
    AND b1980.sex = b2014.sex

WHERE b1980.sex = 'F'
ORDER BY babies_2014 DESC
LIMIT 10;
name sex babies_1980 babies_2014
Emily F 9005 12562
Elizabeth F 19528 9492

As we saw earlier, only 2 of the 25 names that were in the top 100 for both years were female names, so this result makes sense.