I wrapped up the last lesson by pondering how nice it would be to have a SQL table – built from the babynames data – that made it much easier to predict a person's gender based on their first name. I also said we should call it the genderizer.

In this lesson, we build the genderizer, one SQL query at a time.

Table of contents

Why we need to build the genderizer table

Using the babynames table that we started out with, this is how I would use it to find out if "Dakota" was more likely to belong to a boy or a girl (at least when the 1990 data came out):

SELECT * 
FROM babynames
WHERE name = 'Dakota'
ORDER BY babies DESC;
name sex babies
Dakota M 1549
Dakota F 380

We can eyeball these results and do the math: 100 * 1549 / (1549 + 380) results in an answer of roughly 80%, i.e. 80 percent of babies named Dakota are male, at least in 1990.

OK, seems easy enough. But what if I have to look up hundreds of names and determine which are probably male and which are probably female? That simple SQL query and mental math is going to get rough. Certainly, it's not going to scale across tens of thousands of names.

Instead, we want a table as easy to use as this:

SELECT * 
FROM genderizer
WHERE name = 'Dakota';
name gender ratio total_babies
Dakota M 80 1929

If you can't see how this can easily be joined against any other kind of table, that's fine…but at least you can see how genderizer returns an easy-to-digest single row.

A few things to note:

  1. The gender the name indicates the most likely gender, e.g. M or F
  2. Just how likely? If a user really wants to know, genderizer has a ratio column, which represents what percentage of babies belong to the corresponding most-likely gender.
  3. But for very rare names, the fact that all 5 of them were boy or girls is not very comforting. OK, so the total_babies column is there for the user to figure out the statistical significance any way they like.

And to keep things simple and consistent with the florida_positions table – remember that table? We're about to return to it – we should also uppercase the names.

Sound good? Let's build it out of pure SQL and (simple) math.

Creating the genderizer table

You know how to create a table in SQL by now, so try to think what the statement looks like before seeing the scheme that I propose below:

CREATE TABLE "genderizer"(
  "name" VARCHAR,
  "gender" CHAR,
  "ratio" INTEGER,
  "total_babies" INTEGER
);

(Again, specifying VARCHAR and CHAR are totally optional for SQLite…you could just as well use TEXT – I do it out of habit from using MySQL for so long…)

Filling the genderizer table

Obviously, we're going to re-use that INSERT INTO statement that we learned in the last lesson – there's no point in further explaining it…it just works, when it comes to transferring data between exiting SQLite tables.

But from what SELECT statements do we use? In my mind, it's simplest to not do one statement, but four separate INSERT INTO and SELECT queries to fill out the genderizer table:

  1. Do a LEFT JOIN from girlnames to get girl-only names
  2. Do a LEFT JOIN from boynames to get boy-only names
  3. Do a INNER JOIN with boynames and girlnames, then filter for names in which there were equal or more girl than boy babies.
  4. Do a INNER JOIN with boynames and girlnames, then filter for names in which there were more boy than girl babies.

Let's go through each of these queries.

Selecting girl-only names

We did this in the previous lesson. See if you can recall it without looking at the previous example.

Here's a hint: we need to start with selecting FROM the girlnames table.

And here's the query:

SELECT 
  girlnames.name,
  girlnames.babies
FROM girlnames
LEFT JOIN boynames
  ON boynames.name = girlnames.name
WHERE boynames.name IS NULL;   
name babies
Breanna 2952
Katrina 2479
Alisha 2343
Grace 2186
Krista 2143

How do we INSERT this thing?

Remember that the schema for genderizer contains four columns:

  "name" VARCHAR,
  "gender" CHAR,
  "ratio" INTEGER,
  "total_babies" INTEGER

These columns are easy to calculate because of the nature of our LEFT JOIN, in which we know that all of the names returned are exclusively female:

This means that gender can just be set to 'F'. If you remember back to an early lesson about SQL, we can arbitrarily include constant and literal values in a query results:

SELECT 
  boynames.name,
  'hello world',
  'whatever' AS gender,
  42 AS "ANSWER TO LIFE",
  '?' AS "༼◕_◕༽"
FROM boynames
LIMIT 3;
name ‘hello world’ gender ANSWER TO LIFE ༼◕_◕༽
Michael hello world whatever 42 ?
Christopher hello world whatever 42 ?
Matthew hello world whatever 42 ?

In terms of ratio, if we know these names were given only to females, then the female ratio of that name is 100%.

And of course, this means that total_babies is just equal to girlnames.babies, as there are no boy babies with these names.

Here's the query, basically unchanged except for the additional, aliased columns:

SELECT 
  girlnames.name AS name,
  'F' AS gender,
  100 AS ratio,
  girlnames.babies AS total_babies
FROM girlnames
LEFT JOIN boynames
  ON boynames.name = girlnames.name
WHERE boynames.name IS NULL;   
name gender ratio total_babies
Breanna F 100 2952
Katrina F 100 2479
Alisha F 100 2343
Grace F 100 2186
Krista F 100 2143

Selecting boy-only names

See if you can figure this out on your own. It requires just a few changes to the LEFT JOIN we did for girl-only names.

(If you don't get it, I provide the answer later in this lesson).

Selecting names in which girl babies equal or outnumber boy babies

Selecting for common values in two different tables requires the INNER JOIN.

The syntax is largely the same as the LEFT JOIN, except use INNER instead of LEFT, of course. The following inner-join query returns the names common to both boynames and girlnames:

SELECT 
  girlnames.name,
  girlnames.babies
FROM girlnames
INNER JOIN boynames
  ON boynames.name = girlnames.name;
name babies
Jessica 46466
Ashley 45549
Brittany 36535
Amanda 34406
Samantha 25864

Filtering for names given mostly to girls

I've decided arbitrarily to designate names that were split evenly between boys and girls as "female". So we use the greater-than-or-equal operator in the WHERE condition, i.e. >=:

SELECT 
  girlnames.name,
  girlnames.babies
FROM girlnames
INNER JOIN boynames
  ON boynames.name = girlnames.name
WHERE girlnames.babies > boynames.babies;

Calculating the extra columns for genderizer

Again, the literal value, 'F', can be passed into the gender column.

The total_babies value is the sum of boynames.babies and girlnames.babies.

As far as ratio, this is only slightly more complicated: it's the result of multiplying 100 by girlnames.babies and dividing the result by total_babies, though we have to explicitly restate that calculation, i.e. we can't refer to total_babies and must re-add boynames.babies with girlnames.babies.

Nothing else about the query changes:

SELECT 
  girlnames.name,
  'F' AS gender,
  100 * girlnames.babies / 
          (girlnames.babies + boynames.babies) AS ratio,
  girlnames.babies + boynames.babies AS total_babies
FROM girlnames
INNER JOIN boynames
  ON boynames.name = girlnames.name
WHERE girlnames.babies >= boynames.babies;
name gender ratio total_babies
Jessica F 99 46606
Ashley F 99 45788
Brittany F 99 36647
Amanda F 99 34502
Samantha F 99 25922

Now we're ready to add these "mostly-female" names to the genderizer table.

Filtering for names given mostly to boys

Again, see if you can figure this out on your own, it's very similar to deriving the mostly-female names. If you don't get it, I provide the answer below

All together!

OK, let's add the data to genderizer. Since it's so long since we last looked at the table schema, we might as well DROP it (if it still exists), re-create it, and then run the queries described above.

Create the genderizer table

DROP TABLE IF EXISTS "genderizer";
CREATE TABLE "genderizer"(
  "name" VARCHAR,
  "gender" CHAR,
  "ratio" INTEGER,
  "total_babies" INTEGER
);

Add the boy- and girl-only names

First, the boy-only names:

INSERT INTO "genderizer"
  SELECT 
    boynames.name AS name,
    'M' AS gender,
    100 AS ratio,
    boynames.babies AS total_babies
  FROM boynames
  LEFT JOIN girlnames
    ON boynames.name = girlnames.name
  WHERE girlnames.name IS NULL;

Then the girl-only names:

INSERT INTO "genderizer"
  SELECT 
    girlnames.name AS name,
    'F' AS gender,
    100 AS ratio,
    girlnames.babies AS total_babies
  FROM girlnames
  LEFT JOIN boynames
    ON boynames.name = girlnames.name
  WHERE boynames.name IS NULL;

Add the mixed-gender names

First, the mostly-boy names:

INSERT INTO "genderizer"
  SELECT 
    boynames.name,
    'M' AS gender,
    100 * boynames.babies / 
            (girlnames.babies + boynames.babies) AS ratio,
    girlnames.babies + boynames.babies AS total_babies
  FROM girlnames
  INNER JOIN boynames
    ON boynames.name = girlnames.name
  WHERE boynames.babies > girlnames.babies;

Then the mostly (or at least 50%)-girl names:

INSERT INTO "genderizer"
  SELECT 
    girlnames.name,
    'F' AS gender,
    100 * girlnames.babies / 
            (girlnames.babies + boynames.babies) AS ratio,
    girlnames.babies + boynames.babies AS total_babies
  FROM girlnames
  INNER JOIN boynames
    ON boynames.name = girlnames.name
  WHERE girlnames.babies >= boynames.babies;

Update the names with uppercased versions

Remember that the florida_positions table has a FirstName column, and that all the names are uppercased. To simplify our JOIN statements – i.e. to avoid having to repeatedly write UPPER(genderizer.name) = ... – let's use an UPDATE statement to uppercase all of the values in name.

Sure, I could create a new column just to be safe…but we know that every name in the Social Security Administration's database is capitalized in the same way, i.e. it's easy to reverse our changes:

UPDATE genderizer
SET name = UPPER(name);

Did it work?

That was a lot of code we just copy-pasted/retyped, which means there is a lot of possibility for human error – think fatigue or clumsy fingers – and for the kind of errors that don't get caught by the SQLite interpreter.

The very least we can do is confirm the counts that we got in the previous lesson

  name count total babies
All 22,674 3,950,252
Boy & girl names 2,039 3,198,292
Boy-only names 7,443 257,039
Girl-only names 13,192 494,921

Here are the queries to do so; you can run them yourself:

SELECT 
  COUNT(*) AS all_name_count, 
  SUM(total_babies) AS all_total_babies
FROM genderizer;
SELECT 
  COUNT(*) AS mixed_sex_name_count, 
  SUM(total_babies) AS mixed_sex_total_babies
FROM genderizer
WHERE ratio BETWEEN 1 AND 99;
SELECT 
  COUNT(*) AS boy_only_name_count, 
  SUM(total_babies) AS boy_only_total_babies
FROM genderizer
WHERE ratio = 100 
  AND gender = 'M';
SELECT 
  COUNT(*) AS girl_only_name_count, 
  SUM(total_babies) AS girl_only_total_babies
FROM genderizer
WHERE ratio = 100 
  AND gender = 'F';

The results of the queries should equal the values in the table above. At least, they did for me…

Putting the genderizer table to use

Whew, that was a lot of work. Let's try it out!

Which boy- and girl-only names – i.e. have a ratio of 100 – are the most popular?

SELECT *
FROM genderizer
WHERE ratio = 100
ORDER BY total_babies DESC
LIMIT 5;
name gender ratio total_babies
BREANNA F 100 2952
KATRINA F 100 2479
ALISHA F 100 2343
RUSSELL M 100 2213
GRACE F 100 2186

This is an important query, as the names it returns are the ones that have the highest likelihood of misclassifying names. Let's define gender-neutral as a ratio of 55% or less:

SELECT *
FROM genderizer
WHERE ratio <= 55
ORDER BY total_babies DESC
LIMIT 10;
name gender ratio total_babies
TAYLOR F 52 13830
CASEY M 55 7391
JESSIE M 52 2371
INFANT M 50 1176
SKYLAR M 51 866
DEVAN M 54 708
KRISTIAN F 53 671
BABY M 54 498
SHEA F 50 450
DOMINQUE F 51 409

There are some pretty common names – Taylor especially – and of course, good 'ol Infant.

Let's expand the definition of gender-neutral to include a ratio as high as 75%:

SELECT *
FROM genderizer
WHERE ratio <= 75
ORDER BY total_babies DESC
LIMIT 10;
name gender ratio total_babies
JORDAN M 73 22082
TAYLOR F 52 13830
CASEY M 55 7391
ANGEL M 59 4067
DOMINIQUE F 56 3430
DEVON M 67 3053
JESSIE M 52 2371
JAIME M 61 2318
KASEY F 74 1797
KENDALL M 57 1508

Let's see what names fall within a ratio of 85%:

SELECT *
FROM genderizer
WHERE ratio <= 85
ORDER BY total_babies DESC
LIMIT 10;
name gender ratio total_babies
JORDAN M 73 22082
TAYLOR F 52 13830
JAMIE F 83 7828
CASEY M 55 7391
DEVIN M 81 4723
ANGEL M 59 4067
DOMINIQUE F 56 3430
DEVON M 67 3053
JESSIE M 52 2371
JAIME M 61 2318

It's worth noting that the origin of the original babynames table will have a major impact. If you followed these tutorials from the beginning, you might remember that babynames consists of baby name data from 1990. Is that representative of the kind of people whose first names we'll attempt to mass-classify?

Here's a query for some gender-neutral names that I didn't see in our previous queries, perhaps because they weren't as popular in 1990:

SELECT *
FROM genderizer
WHERE name IN('JACKIE', 'KIM', 'KELLY', 
'LESLIE', 'LOREN', 'MICAH', 'SHAWN');
name gender ratio total_babies
MICAH M 84 1389
SHAWN M 97 5890
KELLY F 92 8821
LESLIE F 90 3123
LOREN F 63 724
JACKIE F 63 717
KIM F 80 321

Surely, a few readers will be worried that a name such as Leslie can have a ratio as high as 90%

Breakdown of names by popularity

So just how many names – and babies – have a skewed ratio such that we can be nearly certain that it predicts a person's gender based on first name alone?

Let's run a query that returns a result that can be easily turned into a histogram, i.e. number of names and total babies per value of ratio:

SELECT
  ratio,
  COUNT(*) AS name_count,
  SUM(total_babies) AS total_babies
FROM genderizer
GROUP BY ratio;
ratio name_count total_babies
50 48 3186
51 11 1766
52 36 17261
53 31 2257
95 46 40946
96 62 55065
97 73 87515
98 190 303397
99 427 2429751
100 20635 751960

We don't even have to graph this to show how many names and babies have a ratio of 97% – more than 85% of the baby count. But let's graph it anyway:

image

In lieu of more sophisticated statistical analysis, we're going to have a few mistaken classifications – but our genderizer table looks like it'll be mostly right if the names we analyze are not too unusual…