For the data journalist, joins are the feature that completely justify the jump from spreadsheets to databases.

A JOIN is done through yet another clause of a SELECT statement – and so expect things to get even more cluttered. This lesson focuses on the overall syntax of joins and the most common type of join: INNER JOIN. The next lesson looks at LEFT JOINs – which share the same syntax but give us the ability to find when things are missing from tables.

Why Joins?

The purpose of a JOIN query is to bring together two tables. The easiest way to imagine this is taking two separate tables and smashing them together, so that they're side-by-side.

Below are two baby name tables, one from 1980 and the other from 2014. We want to compare them. Putting them side-by-side is much easier than flipping through different tabs/windows:

name sex babies   name sex babies
Jennifer F 58385   Emma F 20799
Amanda F 35818   Olivia F 19674
Elizabeth F 19526   Emily F 12562
Emily F 9004   Elizabeth F 9492

Looking from the left-to-right to make comparisons – this is the old-fashioned, non-computational – but perfectly common sense – way of making comparisons. This is how we would compare two lists on paper, for example. And there's nothing wrong with doing this when we don't have access to a computer.

But while it's easy to pick out the common and different names, it won't be when the lists are thousands or millions of rows each. But even when dealing with lists of just a few dozen rows each, making comparisons can be so difficult as to not just be done.

Learning to use JOIN statements is learning how to describe a way to compare two different lists of things. But on a massive scale.

Setting up a small example

For the sake of demonstration, let's pretend that babynames_1980 and babynames_2014 are much smaller and contain only 4 names each:

babynames_1980
name sex babies
Jennifer F 58385
Amanda F 35818
Elizabeth F 19526
Emily F 9004
babynames_2014
name sex babies
Emma F 20799
Olivia F 19674
Emily F 12562
Elizabeth F 9492

From this small sample, we can see more clearly that there are 3 interesting categories of names.

The entire purpose of this chapter: how do we write a query to identify these 3 categories of names?

1. Names that exist only in 1980

These are names that have apparently fallen out of favor in 2014:

babynames_1980
name sex babies
Jennifer F 58385
Amanda F 35818
Elizabeth F 19526
Emily F 9004
babynames_2014
name sex babies
Emma F 20799
Olivia F 19674
Emily F 12562
Elizabeth F 9492

2. Names that exist only in 2014

These are names that apparently gained popularity in the several decades:

babynames_1980
name sex babies
Jennifer F 58385
Amanda F 35818
Elizabeth F 19526
Emily F 9004
babynames_2014
name sex babies
Emma F 20799
Olivia F 19674
Emily F 12562
Elizabeth F 9492

3. Names that exist both in 1980 and 2014.

These are names that exist in both years, apparently having staying power despite cultural changes and trends over the 3 decades:

babynames_1980
name sex babies
Jennifer F 58385
Amanda F 35818
Elizabeth F 19526
Emily F 9004
babynames_2014
name sex babies
Emma F 20799
Olivia F 19674
Emily F 12562
Elizabeth F 9492

In this chapter, we will learn how to use INNER JOIN to derive the third category of names – Names that exist both in 1980 and 2014. In the next chapter, we'll learn how to use LEFT JOIN to filter for names that exist in only one or the other table.

Creating sample data for yourself

If you want to follow the explanatory examples that use this example data, just run these queries to create 2 tables – t1980 and t2014, which you can delete later – with the data:

DROP TABLE IF EXISTS t1980;
CREATE TABLE t1980(
  "name" TEXT,
  "sex" TEXT,
  "babies" INTEGER
);
INSERT INTO 
  t1980 ("name", "sex", "babies")
  VALUES
  ("Jennifer", "F", 58385),
  ("Amanda", "F", 35818),
  ("Elizabeth", "F", 19526),
  ("Emily", "F", 9004);

DROP TABLE IF EXISTS t2014;
CREATE TABLE t2014(
  "name" TEXT,
  "sex" TEXT,
  "babies" INTEGER
);
INSERT INTO 
  t2014 ("name", "sex", "babies")
  VALUES
  ("Emma", "F", 20799),
  ("Olivia", "F", 19674),
  ("Emily", "F", 12562),
  ("Elizabeth", "F", 9492);

Use INNER JOIN to find matching entries between two tables

An INNER JOIN is used to find entries that match between two tables. This corresponds to the aforementioned scenario, in which we're interested in names that appear in both 1980 and 2014.

In the sequence of a SELECT statement, the JOIN clause occurs after the FROM clause:

SELECT * 
FROM table_a
INNER JOIN table_b
ON table_a.column = table_b.column;

It's easier to walk through the syntax, step-by-step, starting from scratch:

SELECTing the first table

Let’s begin by using SELECT to select the entirety of the t1980 table. This is simple enough:

SELECT *
FROM t1980;
name sex babies
Jennifer F 58385
Amanda F 35818
Elizabeth F 19526
Emily F 9004

In the context of JOIN statements, think of this as telling the SQL interpreter that t1980 is the table that we want to start from.

Adding the INNER JOIN clause

The JOIN clause comes after the FROM clause. By default, a JOIN is an INNER-type of join. I like to be explicit about it so in all my examples, I use INNER JOIN even though just JOIN will do fine.

Trying to JOIN without a constraint

Adding the INNER JOIN clause can be as simple as this:

SELECT *
FROM t1980
INNER JOIN t2014;

However, this simple query makes for an equally simple but nonsensical result:

name sex babies name sex babies
Jennifer F 58385 Emma F 20799
Jennifer F 58385 Olivia F 19674
Jennifer F 58385 Emily F 12562
Jennifer F 58385 Elizabeth F 9492
Amanda F 35818 Emma F 20799
Amanda F 35818 Olivia F 19674
Amanda F 35818 Emily F 12562
Amanda F 35818 Elizabeth F 9492
Elizabeth F 19526 Emma F 20799
Elizabeth F 19526 Olivia F 19674
Elizabeth F 19526 Emily F 12562
Elizabeth F 19526 Elizabeth F 9492
Emily F 9004 Emma F 20799
Emily F 9004 Olivia F 19674
Emily F 9004 Emily F 12562
Emily F 9004 Elizabeth F 9492

…What happened? Well, we didn't tell the interpreter how the two tables should be joined – i.e. which column or columns have the similarities that we want to find – so, in typical computer-programming fashion, it just gave us every possible combination.

There are 16 rows because both tables have 4 rows each, i.e. the 16 rows are the maximum number of combinations between the two tables, i.e. 4 * 4 = 16.

We obviously don't want 16 rows. We know from the beginning of this lesson that there are only two shared names, "Emily" and "Elizabeth".

Adding a constraint to a JOIN clause

With the ON clause, we are able to constrain how the tables are joined. It's pretty much like a WHERE clause, in which we specify a true/false condition that, if true for a given combination of rows, the query will return those rows in the result.

Logical screwups

Before we get to the ostensibly desired answer – let's intentionally screw things up. If we pass in a condition that is always false, such as 1 = 2, then the JOIN clause will always be false for every combination of rows.

Try it out for yourself:

SELECT *
FROM t1980
INNER JOIN t2014
  ON 1 = 2;

Conversely, if we use an always true condition for the ON constraint, then it will be true for every possible combination of rows. Therefore, all possible combination of rows will be returned, i.e. the 16 we saw previously:

SELECT *
FROM t1980
INNER JOIN t2014
  ON 1 = 1;

Choosing a column for a constraint

Let's pick a condition that makes sense: we want the query to return the rows in which t1980 and t2014 have matching names. Think to yourself what the query will look like, and then look at the answer below:

SELECT *
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name sex babies name sex babies
Elizabeth F 19526 Elizabeth F 9492
Emily F 9004 Emily F 12562

And that's it, that's an INNER JOIN. It may not look like much because, after all, we're dealing with a very small dataset, but imagine reducing two lists of 100 names each – with a maximum of 10,000 possible comparisons; a INNER JOIN query will find the matches just as quickly.

So congrats on performing one of the most fundamentally powerful – though, at the same time, somewhat hard to figure out – features of SQL.

Joining on multiple columns

I simplified the examples to include only female names. But pretend both genders were represented in t1980 and t2014 and we wanted to find same-gender rows for gender neutral names – think "Taylor", and pretend "Taylor" exists as an entry in t1980 and t2014, though if you want to do the query below, just substitute babynames_1980 and babynames_2014, respectively:

(also, ignore the use of the WHERE clause for now – we'll get to that later)

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t2014.name
WHERE t1980.name = 'Taylor';
name sex babies name sex babies
Taylor F 237 Taylor F 3782
Taylor F 237 Taylor M 691
Taylor M 617 Taylor F 3782
Taylor M 617 Taylor M 691

There are 4 rows because both 1980 and 2014 have 2 rows each – one for each gender – corresponding to the name of "Taylor".

What if we want to compare "Taylor" between the years but just for same genders? In other words, we don't care that "Taylor"'s popularity among girls versus boys, we just care how it changed in popularity as a name for boys and girls, respectively, between 1980 and 2014.

Then we simply add another constraint: the two sex columns must be equal:

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t2014.name
    AND t1980.sex = t2014.sex
WHERE t1980.name = 'Taylor';
name sex babies name sex babies
Taylor F 237 Taylor F 3782
Taylor M 617 Taylor M 691

Making irrelevant, unnecessary, or illogical comparisons

As always, we can tell the interpreter to look for things that don't make much real-world sense. The following query would, for our small example dataset, return 16 rows because, again, all of the example rows happen to have the same sex:

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.sex = t2014.sex;

The following query would return 0 rows because it's asking for rows in which the babies number in t1980 matches the name in t2014 (if that confuses you, it's because it's not supposed to make any sense):

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t2014.count;

And all of the following queries don't make much logical sense, though they each either return all possible results, or none at all. See if you can derive their bizarre logic from their code:

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t1980.name;
SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name != t2014.name;
SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t2014.name
  OR t1980.name != t2014.name;

Making specific JOINs that may return nothing

As a quick sidenote, remember that an empty result does not mean the query was wrong. In our t1980 and t2014 datasets, the following query would return 0 results:

SELECT * FROM t1980 
INNER JOIN t2014
  ON t1980.name = t2014.name 
    AND t1980.count = t2014.count;

This is a reasonable query: which names between 1980 and 2014 have exactly the same number of babies between the two years? In our data, it is 0. But it may have results when looking at the greater dataset…

Working with joined table results

It's key to remember that a JOIN basically makes a new table. So what worked with a single table – including WHERE, sorts, and limits – will basically work with the joined result.

But before we get to LEFT JOINs, I'll review a few things particular to the results of JOINing tables.

Keeping track of tables and columns in a JOIN

Let's examine the result of our previous, successful INNER JOIN:

SELECT *
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name sex babies name sex babies
Elizabeth F 19526 Elizabeth F 9492
Emily F 9004 Emily F 12562

It makes sense that there are 2 rows - one for each matching name between the two tables.

But why are there 6 columns? Once again, because we didn't give the interpreter any specific directions, it just gave us all the columns when we used SELECT * in our statement. Both tables have 3 columns each, 3 + 3 = 6, and so forth.

How do we show just the 4 columns we need? Specifically: name, sex, and the 2 different babies counts?

Being specific about column names

Let's try something simpler: just show the matching name values.

The following query will return an error:

SELECT name
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
  Error: near line 2: ambiguous column name: name

Remember that we're joining two different tables. Conceptually, to us, there's only one name column. But to the database, there are 2 such columns, because both tables have a name column.

The way to resolve ambiguity is to prefix the table name to each column:

SELECT t1980.name
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name
Elizabeth
Emily

And this is how we get the 4 columns we need:

SELECT t1980.name, 
t1980.sex, t1980.babies, t2014.babies
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name sex babies babies
Elizabeth F 19526 9492
Emily F 9004 12562

Note that it doesn't matter which of the name columns we used – by definition of INNER JOIN, t2014.name and t1980.name will always be the same thing.

But take special note that t2014.sex and t1980.sex are not the same. They just happen to be the same in our female-only example.

Using aliases

In the above result, we see that SQLite insists on labeling both columns as babies, which makes it hard to read for anyone else who didn't actually write the query.

This is where aliases – using the AS keyword – can be handy.

SELECT t1980.name, 
  t1980.sex, 
  t1980.babies AS num1980, 
  t2014.babies AS num2014
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name sex num1980 num2014
Elizabeth F 19526 9492
Emily F 9004 12562

The aliases can be used in subsequent clauses – for example, if we want to ORDER BY the 2014 baby count in reverse order:

SELECT t1980.name, 
  t1980.sex, 
  t1980.babies AS num1980, 
  t2014.babies AS num2014
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name
ORDER BY num2014 DESC;
name sex num1980 num2014
Emily F 9004 12562
Elizabeth F 19526 9492

While we're here, we might as well see where things fit with the WHERE clause. Here's an example of using WHERE to filter the joined table results by a condition on the num2014 and num1980 columns, which, again, are actually just aliases for t1980.babies and t2014.babies, respectively.

The following query returns only the joined results in which the 2014 count is higher than 1980:

SELECT t1980.name, 
  t1980.sex, 
  t1980.babies AS num1980, 
  t2014.babies AS num2014
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name
WHERE num2014 > num1980;
name sex num1980 num2014
Emily F 9004 12562

Creating derived columns from joined columns

Just like with a single table, we can perform calculations on a joined results table. The following calculates the change of babies between 2014 and 1980. Note that I can't refer to the aliases within the SELECT statement, i.e. to do num2014 - num1980, because they haven't really been assigned yet at that point:

SELECT t1980.name, 
  t1980.sex, 
  t1980.babies AS num1980, 
  t2014.babies AS num2014,
  t2014.babies - t1980.babies AS diffbabies
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name;
name sex num1980 num2014 diffbabies
Elizabeth F 19526 9492 -10034
Emily F 9004 12562 3558

Here's a variation of that query, except I don't even bother showing the two babies columns. Instead, I show diffbabies, and I calculate a new value: the percent difference between the years. And I also sort by that derived value in descending order.

SELECT t1980.name, 
  t1980.sex, 
  t2014.babies - t1980.babies AS diffbabies,
  ROUND(100 * (t2014.babies - t1980.babies) / t1980.babies) AS diffpct
FROM t1980
INNER JOIN t2014
  ON t1980.name = t2014.name
ORDER BY diffpct DESC;

Again, just an example and not really a concept, other than slowly realizing that we can pretty much do things exactly we want, depending on the calculations we need:

name sex diffbabies diffpct
Emily F 3558 39.0
Elizabeth F -10034 -51.0

That's enough about INNER JOINs. In the next lesson, we look at LEFT JOINs, which allow us to find the things that are not shared between two tables.