This lesson is effectively a review of the last lesson, in which we practiced INNER JOINS
and LEFT JOINS
on a very, very small version of the babynames
table.
Now, let's go through the motions, but on the real data:
boynames
and girlnames
tablesboynames
and girlnames
tablesDid you create the boynames
and girlnames
tables as shown in the previous lesson? If not, no matter, just run the following code to DROP the tables (if they exist) and then recreate them:
DROP TABLE IF EXISTS boynames;
DROP TABLE IF EXISTS girlnames;
CREATE TABLE boynames(
"name" VARCHAR,
"babies" INTEGER
);
CREATE TABLE girlnames(
"name" VARCHAR,
"babies" INTEGER
);
Now let's populate those tables, starting with boynames
. The query to SELECT all names given to boys from the babynames
table is simply this (remember, we don't need to keep/display the sex
column, as it's redundant):
SELECT name, babies
FROM babynames
WHERE sex = 'M';
How do we get that into the boynames
table? We could rely on the GUI: DB Browser for SQLite has an Export button to save results as a CSV, which we could then re-import (again via the GUI) into boynames
.
But that's cumbersome compared to executing this in SQL – basically, you're inserting into a table the results of the given query:
INSERT INTO boynames
SELECT name, babies
FROM babynames
WHERE sex = 'M';
In fact, it's easy enough to do everything in a series of queries, including re-creating the tables and inserting the SELECT
query as data.
DROP TABLE IF EXISTS boynames;
DROP TABLE IF EXISTS girlnames;
CREATE TABLE boynames(
"name" VARCHAR,
"babies" INTEGER
);
CREATE TABLE girlnames(
"name" VARCHAR,
"babies" INTEGER
);
INSERT INTO boynames
SELECT name, babies
FROM babynames
WHERE sex = 'M';
INSERT INTO girlnames
SELECT name, babies
FROM babynames
WHERE sex = 'F';
Now we can re-do the queries from the last lesson, but with all of the baby names.
Let's start with LEFT JOINS
, which we learned can be used to find when values in one table are not in another. This will be useful in finding all boy-only and girl-only names.
If you've already forgotten the JOIN
syntax, just start with a simple SELECT
. This query works without having to know about girlnames
, but it simply returns the data from boynames
.
SELECT *
FROM boynames;
Now we add the LEFT JOIN
clause. Remember that it comes after the FROM
clause, i.e. you've already specified which table will be on the "left" side:
SELECT *
FROM boynames
LEFT JOIN girlnames
ON boynames.name = girlnames.name;
The result will look very similar to an INNER JOIN
; this is because the most popular boy names (the boynames
table happens to contain names in order of popularity) were also given to girl babies.
Scroll down the results of the previous query to see where the gaps exist – since we didn't bother to specify column names, just remember that the two columns on the right side belong to girlnames
:
name | babies | name | babies |
---|---|---|---|
Jack | 1831 | NULL | NULL |
Albert | 1826 | Albert | 8 |
Louis | 1818 | Louis | 15 |
Bryce | 1816 | Bryce | 31 |
Edwin | 1814 | Edwin | 15 |
Max | 1791 | NULL | NULL |
Jimmy | 1786 | Jimmy | 17 |
Apparently, there were a few girls named "Edwin" and "Jimmy", but no girls named "Max" or "Jack".
OK, now let's use the WHERE
clause to show only rows in which girlnames.name
– or girlnames.babies
– have a value of NULL
:
SELECT *
FROM boynames
LEFT JOIN girlnames
ON boynames.name = girlnames.name
WHERE girlnames.name IS NULL;
name | babies | name | babies |
---|---|---|---|
Russell | 2213 | NULL | NULL |
Jack | 1831 | NULL | NULL |
Max | 1791 | NULL | NULL |
Clinton | 1499 | NULL | NULL |
Gerald | 1451 | NULL | NULL |
And that's basically how we find the list of boy-only names.
Let's rewrite the query slightly so that we get just a count of boy-only names:
SELECT COUNT(*)
FROM boynames
LEFT JOIN girlnames
ON boynames.name = girlnames.name
WHERE girlnames.name IS NULL;
The answer:
COUNT(*) |
---|
7443 |
See if you can re-write the above query to filter for girl-only names – remember that the order of tables in LEFT JOIN
is absolutely important.
And when you've figured out how to alter that query, add a SUM
function on the girlnames.babies
column – this effectively counts how many baby girls have girl-only names.
Here are the queries for boy-only and girl-only names, respectively:
SELECT COUNT(*), SUM(boynames.babies)
FROM boynames
LEFT JOIN girlnames
ON boynames.name = girlnames.name
WHERE girlnames.name IS NULL;
COUNT(*) | SUM(boynames.babies) |
---|---|
7443 | 257039 |
SELECT COUNT(*), SUM(girlnames.babies)
FROM girlnames
LEFT JOIN boynames
ON boynames.name = girlnames.name
WHERE boynames.name IS NULL;
COUNT(*) | SUM(girlnames.babies) |
---|---|
13192 | 494921 |
That was a quick review of LEFT JOINS
. There's not much more to it than that. Let's move on to INNER JOINS
to find the names that were given to both boys and girls. The syntax is the exact same, but we do a little more work to determine how boy- and girl-oriented each name is.
Before we forget about the babynames
dataset, let's count how many unique names are in babynames
:
SELECT SUM(babies)
FROM babynames;
SUM(babies) |
---|
3950252 |
And let's count how many babies total are in the babynames
data:
SELECT COUNT(DISTINCT name)
FROM babynames;
COUNT(DISTINCT name) |
---|
22674 |
So how many of those 22,674 names were given to both boys and girls?
We can use the COUNT
function in combination with an INNER JOIN
" (we don't need DISTINCT
because the matched names will be unique – it's just how it works out in this situation, with a single year of SSA data…)
SELECT COUNT(*)
FROM girlnames
INNER JOIN boynames
ON girlnames.name = boynames.name;
COUNT(*) |
---|
2039 |
A little less than 10% of the total number of unique names are given to babies of both sex.
But how many total babies (i.e. boys + girls) have these names?
SELECT
SUM(girlnames.babies + boynames.babies) AS total_babies
FROM girlnames
INNER JOIN boynames
ON girlnames.name = boynames.name;
}, %Q{ |—————-| | total_babies | |—————-| | 3198292 |
}) %>
Divide 3,198,292
by the total number of babies, 3,950,252
, and we see that 81% of all babies were given these boy/girl names.
Another way to phrase this is that it's relatively rare for a baby to have a boy- or girl-only name.
Of these 2,000+ boy/girl names, let's see which ones are the most popular among boys and girls, respectively:
SELECT girlnames.name,
girlnames.babies AS girl_babies,
boynames.babies AS boy_babies
FROM girlnames
INNER JOIN boynames ON
girlnames.name = boynames.name
ORDER BY girl_babies DESC
LIMIT 5;
name | girl_babies | boy_babies |
---|---|---|
Jessica | 46466 | 140 |
Ashley | 45549 | 239 |
Brittany | 36535 | 112 |
Amanda | 34406 | 96 |
Samantha | 25864 | 58 |
To do the same for boys, we can use the same query, but change the ORDER BY
clause to sort by boy_babies
:
SELECT girlnames.name,
girlnames.babies AS girl_babies,
boynames.babies AS boy_babies
FROM girlnames
INNER JOIN boynames ON
girlnames.name = boynames.name
ORDER BY boy_babies DESC
LIMIT 5;
name | girl_babies | boy_babies |
---|---|---|
Michael | 277 | 65274 |
Christopher | 188 | 52323 |
Matthew | 125 | 44794 |
Joshua | 166 | 43214 |
Daniel | 148 | 33809 |
A quick sidenote: it seems a bit strange that, according to this 1990 data from the Social Security Administration, there are nearly 500 boys who were given the names of either Jessica, Ashley, or Brittany. I mean, according to the following query:
SELECT *
FROM babynames
WHERE name = 'Sue';
There are only 11 boys named 'Sue', despite there being a popular Shel Silverstein and famous Johnny Cash song about a boy named Sue. I know exploring data often challenges one's assumptions, but I kind of feel that we would be hearing a lot more This American Life episodes on the childhoods of boys named "Jessica" and girls named "Joshua"…But let's move on for now.
In the previous queries, we found the most popular names that happened to also be given to babies of both sexes. But what about names that have the most purported gender ambivalence, i.e. names given to roughly the same number of boys as girls.
There's a few approaches to this:
If 100 boys and 100 girls were given a particular name, then the difference between boy_babies
and girl_babies
would be small, i.e. 0. So one way to find most gender-flexible name is to subtract one from the other, and sort by the smallest absolute number of the difference:
SELECT girlnames.name,
ABS(girlnames.babies - boynames.babies) AS diff_count,
girlnames.babies + boynames.babies AS total_count
FROM girlnames
INNER JOIN boynames ON
girlnames.name = boynames.name
ORDER BY diff_count ASC
LIMIT 5;
name | diff_count | total_count |
---|---|---|
Michal | 0 | 138 |
Layne | 0 | 136 |
Charley | 0 | 112 |
Adrean | 0 | 32 |
Yee | 0 | 32 |
This is only mildly interesting, as all of those names are pretty rare. Let's use a WHERE
clause to only include names in which total_count
is at least 1,000:
SELECT girlnames.name,
ABS(girlnames.babies - boynames.babies) AS diff_count,
girlnames.babies + boynames.babies AS total_count
FROM girlnames
INNER JOIN boynames ON
girlnames.name = boynames.name
WHERE total_count >= 1000
ORDER BY diff_count ASC
LIMIT 5;
name | diff_count | total_count |
---|---|---|
Infant | 16 | 1176 |
Jessie | 139 | 2371 |
Ashton | 195 | 1469 |
Kendall | 224 | 1508 |
Dominique | 460 | 3430 |
Mmmkay…nearly 1,200 mothers named their baby, "Infant"? Either American parents are even stranger than we thought or we have more reason to think that the SSA data isn't 100% what we expect it to be…
Another way to do the calculation is to divide either boy_babies
or girl_babies
by total_babies
, multiply by 100 (to get a round number), then sort by the absolute difference between the ratio and 50, as 50% is what the ratio would be if a name was split evenly between boys and girls.
This is not particularly different in strategy than the previous strategy using ABS
. But note how the ABS
function is used in the ORDER BY
clause – that's always an option.
Again, we use WHERE
to limit it to names that have been given to more than 1,000 babies:
SELECT girlnames.name,
girlnames.babies + boynames.babies AS total_count,
girlnames.babies * 100.0 /
(girlnames.babies + boynames.babies) AS girl_ratio
FROM girlnames
INNER JOIN boynames ON
girlnames.name = boynames.name
WHERE total_count >= 1000
ORDER BY ABS(50 - girl_ratio) ASC
LIMIT 5;
name | total_count | girl_ratio |
---|---|---|
Infant | 1176 | 49.3197278911565 |
Taylor | 13830 | 52.4656543745481 |
Jessie | 2371 | 47.0687473639814 |
Casey | 7391 | 44.1212285211744 |
Ashton | 1469 | 56.6371681415929 |
The resulting names are slightly different because of how this ratio is perhaps a more fine-grained tool for measuring gender-neutrality of a name.
I bring up this convoluted approach because calculating this ratio
– i.e the ratio of male/female babies that have a given name – might lead into creating a simplified way to predict gender based on name.
Imagine a table in which this query returned a single, simple answer:
SELECT gender
FROM imagined_table
WHERE name = 'Dakota';
I'll tentatively name this table, genderizer
. Just kidding, "genderizer" is the most creative thing I've thought of for this entire tutorial, so nevermind the "tentatively" part.
Let's try to [build it in the next tutorial]((/tutorials/babynames-and-college-salaries/creating-genderizer-table-with-joins).