For the data journalist, joins are the feature that completely justify the jump from spreadsheets to databases.
JOIN is done through yet another clause of a
SELECT statement – and so expect things to get even more cluttered. This lesson focuses on the overall syntax of joins and the most common type of join:
INNER JOIN. The next lesson looks at
LEFT JOINs – which share the same syntax but give us the ability to find when things are missing from tables.
The purpose of a
JOIN query is to bring together two tables. The easiest way to imagine this is taking two separate tables and smashing them together, so that they're side-by-side.
Below are two baby name tables, one from 1980 and the other from 2014. We want to compare them. Putting them side-by-side is much easier than flipping through different tabs/windows:
Looking from the left-to-right to make comparisons – this is the old-fashioned, non-computational – but perfectly common sense – way of making comparisons. This is how we would compare two lists on paper, for example. And there's nothing wrong with doing this when we don't have access to a computer.
But while it's easy to pick out the common and different names, it won't be when the lists are thousands or millions of rows each. But even when dealing with lists of just a few dozen rows each, making comparisons can be so difficult as to not just be done.
Learning to use
JOIN statements is learning how to describe a way to compare two different lists of things. But on a massive scale.
For the sake of demonstration, let's pretend that
babynames_2014 are much smaller and contain only 4 names each:
From this small sample, we can see more clearly that there are 3 interesting categories of names.
The entire purpose of this chapter: how do we write a query to identify these 3 categories of names?
These are names that have apparently fallen out of favor in 2014:
These are names that apparently gained popularity in the several decades:
These are names that exist in both years, apparently having staying power despite cultural changes and trends over the 3 decades:
In this chapter, we will learn how to use
INNER JOIN to derive the third category of names – Names that exist both in 1980 and 2014. In the next chapter, we'll learn how to use
LEFT JOIN to filter for names that exist in only one or the other table.
If you want to follow the explanatory examples that use this example data, just run these queries to create 2 tables –
t2014, which you can delete later – with the data:
DROP TABLE IF EXISTS t1980; CREATE TABLE t1980( "name" TEXT, "sex" TEXT, "babies" INTEGER ); INSERT INTO t1980 ("name", "sex", "babies") VALUES ("Jennifer", "F", 58385), ("Amanda", "F", 35818), ("Elizabeth", "F", 19526), ("Emily", "F", 9004); DROP TABLE IF EXISTS t2014; CREATE TABLE t2014( "name" TEXT, "sex" TEXT, "babies" INTEGER ); INSERT INTO t2014 ("name", "sex", "babies") VALUES ("Emma", "F", 20799), ("Olivia", "F", 19674), ("Emily", "F", 12562), ("Elizabeth", "F", 9492);
INNER JOINto find matching entries between two tables
INNER JOIN is used to find entries that match between two tables. This corresponds to the aforementioned scenario, in which we're interested in names that appear in both 1980 and 2014.
In the sequence of a
SELECT statement, the
JOIN clause occurs after the
SELECT * FROM table_a INNER JOIN table_b ON table_a.column = table_b.column;
It's easier to walk through the syntax, step-by-step, starting from scratch:
Let’s begin by using
SELECT to select the entirety of the
t1980 table. This is simple enough:
SELECT * FROM t1980;
In the context of
JOIN statements, think of this as telling the SQL interpreter that
t1980 is the table that we want to start from.
JOIN clause comes after the
FROM clause. By default, a
JOIN is an
INNER-type of join. I like to be explicit about it so in all my examples, I use
INNER JOIN even though just
JOIN will do fine.
JOINwithout a constraint
INNER JOIN clause can be as simple as this:
SELECT * FROM t1980 INNER JOIN t2014;
However, this simple query makes for an equally simple but nonsensical result:
…What happened? Well, we didn't tell the interpreter how the two tables should be joined – i.e. which column or columns have the similarities that we want to find – so, in typical computer-programming fashion, it just gave us every possible combination.
There are 16 rows because both tables have 4 rows each, i.e. the 16 rows are the maximum number of combinations between the two tables, i.e.
4 * 4 = 16.
We obviously don't want 16 rows. We know from the beginning of this lesson that there are only two shared names, "Emily" and "Elizabeth".
ON clause, we are able to constrain how the tables are joined. It's pretty much like a
WHERE clause, in which we specify a true/false condition that, if true for a given combination of rows, the query will return those rows in the result.
Before we get to the ostensibly desired answer – let's intentionally screw things up. If we pass in a condition that is always false, such as
1 = 2, then the
JOIN clause will always be false for every combination of rows.
Try it out for yourself:
SELECT * FROM t1980 INNER JOIN t2014 ON 1 = 2;
Conversely, if we use an always true condition for the
ON constraint, then it will be true for every possible combination of rows. Therefore, all possible combination of rows will be returned, i.e. the 16 we saw previously:
SELECT * FROM t1980 INNER JOIN t2014 ON 1 = 1;
Let's pick a condition that makes sense: we want the query to return the rows in which
t2014 have matching names. Think to yourself what the query will look like, and then look at the answer below:
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
And that's it, that's an
INNER JOIN. It may not look like much because, after all, we're dealing with a very small dataset, but imagine reducing two lists of 100 names each – with a maximum of 10,000 possible comparisons; a
INNER JOIN query will find the matches just as quickly.
So congrats on performing one of the most fundamentally powerful – though, at the same time, somewhat hard to figure out – features of SQL.
I simplified the examples to include only female names. But pretend both genders were represented in
t2014 and we wanted to find same-gender rows for gender neutral names – think "Taylor", and pretend "Taylor" exists as an entry in
t2014, though if you want to do the query below, just substitute
(also, ignore the use of the
WHERE clause for now – we'll get to that later)
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name WHERE t1980.name = 'Taylor';
There are 4 rows because both 1980 and 2014 have 2 rows each – one for each gender – corresponding to the name of "Taylor".
What if we want to compare "Taylor" between the years but just for same genders? In other words, we don't care that "Taylor"'s popularity among girls versus boys, we just care how it changed in popularity as a name for boys and girls, respectively, between 1980 and 2014.
Then we simply add another constraint: the two
sex columns must be equal:
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name AND t1980.sex = t2014.sex WHERE t1980.name = 'Taylor';
As always, we can tell the interpreter to look for things that don't make much real-world sense. The following query would, for our small example dataset, return 16 rows because, again, all of the example rows happen to have the same
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.sex = t2014.sex;
The following query would return 0 rows because it's asking for rows in which the
babies number in
t1980 matches the
t2014 (if that confuses you, it's because it's not supposed to make any sense):
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.count;
And all of the following queries don't make much logical sense, though they each either return all possible results, or none at all. See if you can derive their bizarre logic from their code:
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t1980.name;
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name != t2014.name;
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name OR t1980.name != t2014.name;
As a quick sidenote, remember that an empty result does not mean the query was wrong. In our
t2014 datasets, the following query would return 0 results:
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name AND t1980.count = t2014.count;
This is a reasonable query: which names between 1980 and 2014 have exactly the same number of babies between the two years? In our data, it is 0. But it may have results when looking at the greater dataset…
It's key to remember that a
JOIN basically makes a new table. So what worked with a single table – including
WHERE, sorts, and limits – will basically work with the joined result.
But before we get to
LEFT JOINs, I'll review a few things particular to the results of JOINing tables.
Let's examine the result of our previous, successful
SELECT * FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
It makes sense that there are 2 rows - one for each matching name between the two tables.
But why are there 6 columns? Once again, because we didn't give the interpreter any specific directions, it just gave us all the columns when we used
SELECT * in our statement. Both tables have 3 columns each,
3 + 3 = 6, and so forth.
How do we show just the 4 columns we need? Specifically:
sex, and the 2 different
Let's try something simpler: just show the matching
The following query will return an error:
SELECT name FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
Error: near line 2: ambiguous column name: name
Remember that we're joining two different tables. Conceptually, to us, there's only one
name column. But to the database, there are 2 such columns, because both tables have a
The way to resolve ambiguity is to prefix the table name to each column:
SELECT t1980.name FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
And this is how we get the 4 columns we need:
SELECT t1980.name, t1980.sex, t1980.babies, t2014.babies FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
Note that it doesn't matter which of the
name columns we used – by definition of
t1980.name will always be the same thing.
But take special note that
t1980.sex are not the same. They just happen to be the same in our female-only example.
In the above result, we see that SQLite insists on labeling both columns as
babies, which makes it hard to read for anyone else who didn't actually write the query.
This is where aliases – using the
AS keyword – can be handy.
SELECT t1980.name, t1980.sex, t1980.babies AS num1980, t2014.babies AS num2014 FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
The aliases can be used in subsequent clauses – for example, if we want to
ORDER BY the
2014 baby count in reverse order:
SELECT t1980.name, t1980.sex, t1980.babies AS num1980, t2014.babies AS num2014 FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name ORDER BY num2014 DESC;
While we're here, we might as well see where things fit with the
WHERE clause. Here's an example of using
WHERE to filter the joined table results by a condition on the
num1980 columns, which, again, are actually just aliases for
The following query returns only the joined results in which the 2014 count is higher than 1980:
SELECT t1980.name, t1980.sex, t1980.babies AS num1980, t2014.babies AS num2014 FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name WHERE num2014 > num1980;
Just like with a single table, we can perform calculations on a joined results table. The following calculates the change of
1980. Note that I can't refer to the aliases within the
SELECT statement, i.e. to do
num2014 - num1980, because they haven't really been assigned yet at that point:
SELECT t1980.name, t1980.sex, t1980.babies AS num1980, t2014.babies AS num2014, t2014.babies - t1980.babies AS diffbabies FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name;
Here's a variation of that query, except I don't even bother showing the two
babies columns. Instead, I show
diffbabies, and I calculate a new value: the percent difference between the years. And I also sort by that derived value in descending order.
SELECT t1980.name, t1980.sex, t2014.babies - t1980.babies AS diffbabies, ROUND(100 * (t2014.babies - t1980.babies) / t1980.babies) AS diffpct FROM t1980 INNER JOIN t2014 ON t1980.name = t2014.name ORDER BY diffpct DESC;
Again, just an example and not really a concept, other than slowly realizing that we can pretty much do things exactly we want, depending on the calculations we need:
That's enough about
INNER JOINs. In the next lesson, we look at
LEFT JOINs, which allow us to find the things that are not shared between two tables.