Get the data

For this lesson, download the following file, which is a SQLite database built from the U.S. Social Security Administration file of popular baby names:

Unzip it, and open the sqlite file using the SQLite client of your choice (e.g. DB Browser for SQLite)

Table of contents

The inconvenience of aggregate results

In the previous lesson, we learned about aggregate functions, such as COUNT and SUM. We also learned that aggregate functions reduce a result set to a single value. The query below counts the number of rows and sums the count column (reminder: the count column has nothing to do with the COUNT function), effectively giving us the number of boy and girl names, and the total number of boy and girl babies nationwide for the year 2014:

SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA';

The result:

COUNT(*) SUM(count)
33044 3670151

If we wanted to calculate this aggregate for each sex, we would have to do two queries:

For boys
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA'
  AND sex = 'M';

The result:

COUNT(*) SUM(count)
13977 1901376
For girls
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA'
  AND sex = 'F';

The result:

COUNT(*) SUM(count)
19067 1768775

It works, but it's a bit cumbersome. Besides running two different queries, we don't get the convenience of having the answers of these queries in a single table for easy comparison.

It'd be nice to be able to do something similar to a pivot table:

sex COUNT of rows SUM of count
M 13977 1901376
F 19067 1768775

Using GROUP BY to aggregate across multiple groups

This is where the GROUP BY clause comes in. As we saw in the above examples, without a GROUP BY, an aggregate function treats the entire result set as a single group and returns a single value.

But with a GROUP BY, we tell the SQLite interpreter, "Hey, perform those aggregate calculations on these groups".

To get a pivot-table-like answer, in which we perform an aggregate calculation for each unique value in given column, we simply GROUP BY that column. Here it is for sex:

SELECT sex, COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA'
GROUP BY sex;
sex COUNT(*) SUM(count)
F 19067 1768775
M 13977 1901376

Note that I include sex in the SELECT clause. In the previous lesson, I strongly advised that you not include columns that aren't being aggregated. However, we can think of any column in a GROUP BY clause as being aggregated. Thus, it's fine to include a GROUP BY column in the SELECT clause as well.

If we hadn't done so in the previous example, the result would be a bit ambiguous:

SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA'
GROUP BY sex;
COUNT(*) SUM(count)
19067 1768775
13977 1901376

Let's try doing this same aggregate, but by year:

SELECT year, COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
  AND state = 'USA'
GROUP BY year;

The answer:

year COUNT(*) SUM(count)
2014 33044 3670151

The above answer is not very useful because the WHERE year = 2014 restricts the result to only rows with a year of 2014. Simply remove that part of the conditional expression to get all 5 years:

SELECT year, COUNT(*), SUM(count)
FROM baby_names
WHERE state = 'USA'
GROUP BY year;

The result:

year COUNT(*) SUM(count)
2010 34041 3686589
2011 33869 3646730
2012 33684 3643336
2013 33203 3626802
2014 33044 3670151

And to get a break down of boy versus girl names, by year, we simply GROUP BY both columns. In the example below, I also alias the aggregate functions to produce nicer output:

SELECT year, sex,
  COUNT(*) AS name_count, 
  SUM(count) AS baby_count
FROM baby_names
WHERE state = 'USA'
GROUP BY year, sex;

The result:

year sex name_count baby_count
2010 F 19800 1772738
2010 M 14241 1913851
2011 F 19540 1753500
2011 M 14329 1893230
2012 F 19468 1753922
2012 M 14216 1889414
2013 F 19191 1745339
2013 M 14012 1881463
2014 F 19067 1768775
2014 M 13977 1901376

I'm not sure why there are so many girl names versus boy names: 19,067 vs 14,977 in 2014, a 21% difference. Or why there are more boy babies than girl babies – 1.90 million versus 1.77 million, i.e. 51.7% boys versus 48.3% girls.

But if we account for the fact that the Social Security Administration doesn't include names that had 5 or fewer babies, then it's possible that because girls seem to receive a wider variety of names, more girls receive rare names, and thus, more girl babies are likely to be uncounted.

Let's do a query for babies nationwide: a group COUNT and SUM, but for just the year 2014 and for names with fewer than 10 babies per sex. This will give us some insight to how many girl babies have rare names:

SELECT sex,
  COUNT(*) AS name_count, 
  SUM(count) AS baby_count
FROM baby_names
WHERE state = 'USA'
  AND count < 10
  AND year = 2014
GROUP BY sex;
sex name_count baby_count
F 8222 53370
M 5914 38259

There are a lot of rare girl names. But are there enough to make up the 130,000 gap between the total count gap between boys and girls? Even if we assume that there are 20,000 girl names that were each given to 4 babies in 2014, that only adds up to 80,000 more girls. So it's unlikely that the fact that girls have a wider variety of names is enough to account for the gap between the SUM of count for each sex.

As it turns out, the ratio for babies in the United States has long been 51 to 49 percent in favor of boys, according to this March 30, 2015 NPR report, "Why Are More Baby Boys Born Than Girls?". The summary of the report is: the imbalance may be due to the fact that more female fetuses are lost during pregnancy.

Note: because of how the Social Security Administration tallies these baby numbers, it may just be total coincidence that we observe an approximation of the 51% to 49% boy vs. girl gap in the baby names data. If you want to pursue that topic further, I'd go with the Centers for Disease Control and Prevention's birth data.

Reproducing the totals table

SELECT state, sex, year,
  SUM(count) AS baby_count,
  COUNT(*) AS name_count
FROM baby_names
GROUP BY state, sex, year

And to get the rows for 'All', which is simply a group count/sum without splitting it by sex:

SELECT state, 'All', year,
  SUM(count) AS baby_count,
  COUNT(*) AS name_count
FROM baby_names
GROUP BY state, year

The above two queries is what was used to generate the totals table – you can assume I just exported their results into one CSV, then created the totals table via import.

The fun and danger of GROUP BY COUNT

(show how easy it is to mess up) (show this after the main examples?) (show a chart)

Counting by groups

In a previous example, we found the number of unique boy names by executing this query:

SELECT name FROM baby_names
WHERE sex = 'M'
GROUP BY name;

If we wanted to find the number of girls, we could easily alter the WHERE condition:

SELECT name FROM baby_names
WHERE sex = 'F'
GROUP BY name;

But we could get both bits of information in a single query:

SELECT sex, COUNT(*)
FROM baby_names
GROUP BY sex;

Note that we don't have to rely on the count.

sex COUNT(*)
F 350993
M 282615

Count unique names

SELECT COUNT(DISTINCT name)
FROM baby_names;

The result:

COUNT(DISTINCT name)
46359

To count unique names per sex:

SELECT sex, COUNT(DISTINCT name)
FROM baby_names
GROUP BY sex;
sex COUNT(DISTINCT name)
F 29229
M 21193

To do it by year:

SELECT year, COUNT(DISTINCT name)
FROM baby_names
GROUP BY year;

It seems that the total number of names has decreased over the years:

year COUNT(DISTINCT name)
2010 31603
2011 31418
2012 31236
2013 30774
2014 30579

And we can do a count of unique names by year and sex:

SELECT sex, year, COUNT(DISTINCT name)
FROM baby_names
GROUP BY sex, year;

The downward trend of unique names appears to exist for both boys and girls:

sex year COUNT(DISTINCT name)
F 2010 19800
F 2011 19540
F 2012 19468
F 2013 19191
F 2014 19067
M 2010 14241
M 2011 14329
M 2012 14216
M 2013 14012
M 2014 13977

Let's do one more: a count of unique names, by state and year, but just for the states of California and Texas:

SELECT state, year, COUNT(DISTINCT name)
FROM baby_names
WHERE state IN('CA', 'TX')
GROUP BY state, year;

It appears that Texas, unlike California and the rest of the nation, has a slight upward bump in unique names from 2010 to 2014.


MAX/MIN and GROUP BY

Top 10 girl names, nationwide, ordered by descending order of their maximum count over the 2010 to 2014 period. Note that we don't have to GROUP BY the sex or state columns because we know those are restricted to F and USA:

SELECT name, MAX(count) AS max_count, 
  MIN(count) AS min_count,
  AVG(count) AS avg_count
FROM baby_names
WHERE state = 'USA' AND sex = 'F'
GROUP BY name
ORDER BY max_count DESC
LIMIT 5;
name max_count min_count avg_count
Isabella 22883 16950 19266.8
Sophia 22267 18490 20866.4
Emma 20902 17322 19735.2
Olivia 19674 17012 17924.6
Ava 15586 15211 15441.4

Ordering it by the biggest difference between the MAX and MIN of their count values.

SELECT name, 
  MAX(count) - MIN(count) AS max_change,
  MAX(count) AS max_count, 
  MIN(count) AS min_count,
  AVG(count) AS avg_count
FROM baby_names
WHERE state = 'USA' AND sex = 'F'
GROUP BY name
ORDER BY max_change DESC
LIMIT 5;

The result:

name max_change max_count min_count avg_count
Harper 6935 9564 2629 6461.6
Isabella 5933 22883 16950 19266.8
Aria 4990 5893 903 3421.0
Charlotte 4694 10048 5354 7712.4
Alexis 4061 8249 4188 5808.0

Note that this calculation doesn't tell us if the name is becoming more or less popular, since we don't know when the MAX happened with respect to the MIN value. We'd have to do a separate (non-aggregate) query based on the names given above:

SELECT name, year, count
FROM baby_names
WHERE name IN ('Harper', 'Isabella', 'Aria', 'Charlotte', 'Alexis')
  AND state = 'USA'
  AND sex = 'F'
ORDER BY name, year;

Rather than print out the table, I'll just show you what it looks like as a visualized pivot table on Google Sheets:

  2010 2011 2012 2013 2014
Alexis 8249 6510 5337 4756 4188
Aria 903 1979 3226 5104 5893
Charlotte 5354 6418 7468 9274 10048
Harper 2629 4668 7179 8268 9564
Isabella 22883 19870 19058 17573 16950

This ranking of names – by biggest difference between MAX and MIN – seems to contain two types of names:

AVG and GROUP BY

The boy baby names, nationwide, with the highest average count of babies over the years 2010 to 2014:

SELECT name, 
  AVG(count) AS average_count
FROM baby_names
WHERE sex = 'M'
  AND state = 'USA'
GROUP BY name
ORDER BY average_count DESC
LIMIT 5;
name average_count
Jacob 19245.4
Mason 17601.4
Noah 17578.2
William 16889.0
Ethan 16820.4

Let's do it by MAX:

SELECT name, 
  MAX(count) AS max_count
FROM baby_names
WHERE sex = 'M'
  AND state = 'USA'
GROUP BY name
ORDER BY max_count DESC
LIMIT 5;

We see that the name 'Liam' shows up in this ranking, which suggests that maybe either the name has just recently become popular, or is on a downturn, for 'Liam' not to appear in the top 5 by average:

name max_count
Jacob 22082
Mason 19488
Noah 19144
Liam 18342
Ethan 17985

Mistakes

Common mistake:

SELECT name, MAX(count) AS max_count, 
  MIN(count) AS min_count
FROM baby_names
WHERE state = 'USA' ## this needs  to have sex, as well
GROUP BY name
ORDER BY max_count DESC
LIMIT 10;