Table of contents
The inconvenience of aggregate results
In the previous lesson, we learned about aggregate functions, such as COUNT
and SUM
. We also learned that aggregate functions reduce a result set to a single value. The query below counts the number of rows and sums the count
column (reminder: the count
column has nothing to do with the COUNT
function), effectively giving us the number of boy and girl names, and the total number of boy and girl babies nationwide for the year 2014:
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA';
The result:
COUNT(*) |
SUM(count) |
33044 |
3670151 |
If we wanted to calculate this aggregate for each sex
, we would have to do two queries:
For boys
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA'
AND sex = 'M';
The result:
COUNT(*) |
SUM(count) |
13977 |
1901376 |
For girls
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA'
AND sex = 'F';
The result:
COUNT(*) |
SUM(count) |
19067 |
1768775 |
It works, but it's a bit cumbersome. Besides running two different queries, we don't get the convenience of having the answers of these queries in a single table for easy comparison.
It'd be nice to be able to do something similar to a pivot table:
sex |
COUNT of rows |
SUM of count |
M |
13977 |
1901376 |
F |
19067 |
1768775 |
Using GROUP BY to aggregate across multiple groups
This is where the GROUP BY
clause comes in. As we saw in the above examples, without a GROUP BY
, an aggregate function treats the entire result set as a single group and returns a single value.
But with a GROUP BY
, we tell the SQLite interpreter, "Hey, perform those aggregate calculations on these groups".
To get a pivot-table-like answer, in which we perform an aggregate calculation for each unique value in given column, we simply GROUP BY
that column. Here it is for sex
:
SELECT sex, COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA'
GROUP BY sex;
sex |
COUNT(*) |
SUM(count) |
F |
19067 |
1768775 |
M |
13977 |
1901376 |
Note that I include sex
in the SELECT
clause. In the previous lesson, I strongly advised that you not include columns that aren't being aggregated. However, we can think of any column in a GROUP BY
clause as being aggregated. Thus, it's fine to include a GROUP BY
column in the SELECT
clause as well.
If we hadn't done so in the previous example, the result would be a bit ambiguous:
SELECT COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA'
GROUP BY sex;
COUNT(*) |
SUM(count) |
19067 |
1768775 |
13977 |
1901376 |
Let's try doing this same aggregate, but by year
:
SELECT year, COUNT(*), SUM(count)
FROM baby_names
WHERE year = 2014
AND state = 'USA'
GROUP BY year;
The answer:
year |
COUNT(*) |
SUM(count) |
2014 |
33044 |
3670151 |
The above answer is not very useful because the WHERE year = 2014
restricts the result to only rows with a year
of 2014. Simply remove that part of the conditional expression to get all 5 years:
SELECT year, COUNT(*), SUM(count)
FROM baby_names
WHERE state = 'USA'
GROUP BY year;
The result:
year |
COUNT(*) |
SUM(count) |
2010 |
34041 |
3686589 |
2011 |
33869 |
3646730 |
2012 |
33684 |
3643336 |
2013 |
33203 |
3626802 |
2014 |
33044 |
3670151 |
And to get a break down of boy versus girl names, by year, we simply GROUP BY
both columns. In the example below, I also alias the aggregate functions to produce nicer output:
SELECT year, sex,
COUNT(*) AS name_count,
SUM(count) AS baby_count
FROM baby_names
WHERE state = 'USA'
GROUP BY year, sex;
The result:
year |
sex |
name_count |
baby_count |
2010 |
F |
19800 |
1772738 |
2010 |
M |
14241 |
1913851 |
2011 |
F |
19540 |
1753500 |
2011 |
M |
14329 |
1893230 |
2012 |
F |
19468 |
1753922 |
2012 |
M |
14216 |
1889414 |
2013 |
F |
19191 |
1745339 |
2013 |
M |
14012 |
1881463 |
2014 |
F |
19067 |
1768775 |
2014 |
M |
13977 |
1901376 |
I'm not sure why there are so many girl names versus boy names: 19,067 vs 14,977 in 2014, a 21% difference. Or why there are more boy babies than girl babies – 1.90 million versus 1.77 million, i.e. 51.7% boys versus 48.3% girls.
But if we account for the fact that the Social Security Administration doesn't include names that had 5 or fewer babies, then it's possible that because girls seem to receive a wider variety of names, more girls receive rare names, and thus, more girl babies are likely to be uncounted.
Let's do a query for babies nationwide: a group COUNT
and SUM
, but for just the year 2014 and for names with fewer than 10 babies per sex. This will give us some insight to how many girl babies have rare names:
SELECT sex,
COUNT(*) AS name_count,
SUM(count) AS baby_count
FROM baby_names
WHERE state = 'USA'
AND count < 10
AND year = 2014
GROUP BY sex;
sex |
name_count |
baby_count |
F |
8222 |
53370 |
M |
5914 |
38259 |
There are a lot of rare girl names. But are there enough to make up the 130,000 gap between the total count
gap between boys and girls? Even if we assume that there are 20,000 girl names that were each given to 4 babies in 2014, that only adds up to 80,000 more girls. So it's unlikely that the fact that girls have a wider variety of names is enough to account for the gap between the SUM
of count
for each sex
.
As it turns out, the ratio for babies in the United States has long been 51 to 49 percent in favor of boys, according to this March 30, 2015 NPR report, "Why Are More Baby Boys Born Than Girls?". The summary of the report is: the imbalance may be due to the fact that more
female fetuses are lost during pregnancy.
Note: because of how the Social Security Administration tallies these baby numbers, it may just be total coincidence that we observe an approximation of the 51% to 49% boy vs. girl gap in the baby names data. If you want to pursue that topic further, I'd go with the Centers for Disease Control and Prevention's birth data.
Reproducing the totals
table
SELECT state, sex, year,
SUM(count) AS baby_count,
COUNT(*) AS name_count
FROM baby_names
GROUP BY state, sex, year
And to get the rows for 'All'
, which is simply a group count/sum without splitting it by sex
:
SELECT state, 'All', year,
SUM(count) AS baby_count,
COUNT(*) AS name_count
FROM baby_names
GROUP BY state, year
The above two queries is what was used to generate the totals
table – you can assume I just exported their results into one CSV, then created the totals
table via import.
The fun and danger of GROUP BY COUNT
(show how easy it is to mess up)
(show this after the main examples?)
(show a chart)
Counting by groups
In a previous example, we found the number of unique boy names by executing this query:
SELECT name FROM baby_names
WHERE sex = 'M'
GROUP BY name;
If we wanted to find the number of girls, we could easily alter the WHERE
condition:
SELECT name FROM baby_names
WHERE sex = 'F'
GROUP BY name;
But we could get both bits of information in a single query:
SELECT sex, COUNT(*)
FROM baby_names
GROUP BY sex;
Note that we don't have to rely on the count.
sex |
COUNT(*) |
F |
350993 |
M |
282615 |
Count unique names
SELECT COUNT(DISTINCT name)
FROM baby_names;
The result:
COUNT(DISTINCT name) |
46359 |
To count unique names per sex
:
SELECT sex, COUNT(DISTINCT name)
FROM baby_names
GROUP BY sex;
sex |
COUNT(DISTINCT name) |
F |
29229 |
M |
21193 |
To do it by year
:
SELECT year, COUNT(DISTINCT name)
FROM baby_names
GROUP BY year;
It seems that the total number of names has decreased over the years:
year |
COUNT(DISTINCT name) |
2010 |
31603 |
2011 |
31418 |
2012 |
31236 |
2013 |
30774 |
2014 |
30579 |
And we can do a count of unique names by year
and sex
:
SELECT sex, year, COUNT(DISTINCT name)
FROM baby_names
GROUP BY sex, year;
The downward trend of unique names appears to exist for both boys and girls:
sex |
year |
COUNT(DISTINCT name) |
F |
2010 |
19800 |
F |
2011 |
19540 |
F |
2012 |
19468 |
F |
2013 |
19191 |
F |
2014 |
19067 |
M |
2010 |
14241 |
M |
2011 |
14329 |
M |
2012 |
14216 |
M |
2013 |
14012 |
M |
2014 |
13977 |
Let's do one more: a count of unique names, by state
and year
, but just for the states of California and Texas:
SELECT state, year, COUNT(DISTINCT name)
FROM baby_names
WHERE state IN('CA', 'TX')
GROUP BY state, year;
It appears that Texas, unlike California and the rest of the nation, has a slight upward bump in unique names from 2010 to 2014.
MAX/MIN and GROUP BY
Top 10 girl names, nationwide, ordered by descending order of their maximum count over the 2010 to 2014 period. Note that we don't have to GROUP BY
the sex
or state
columns because we know those are restricted to F
and USA
:
SELECT name, MAX(count) AS max_count,
MIN(count) AS min_count,
AVG(count) AS avg_count
FROM baby_names
WHERE state = 'USA' AND sex = 'F'
GROUP BY name
ORDER BY max_count DESC
LIMIT 5;
name |
max_count |
min_count |
avg_count |
Isabella |
22883 |
16950 |
19266.8 |
Sophia |
22267 |
18490 |
20866.4 |
Emma |
20902 |
17322 |
19735.2 |
Olivia |
19674 |
17012 |
17924.6 |
Ava |
15586 |
15211 |
15441.4 |
Ordering it by the biggest difference between the MAX
and MIN
of their count
values.
SELECT name,
MAX(count) - MIN(count) AS max_change,
MAX(count) AS max_count,
MIN(count) AS min_count,
AVG(count) AS avg_count
FROM baby_names
WHERE state = 'USA' AND sex = 'F'
GROUP BY name
ORDER BY max_change DESC
LIMIT 5;
The result:
name |
max_change |
max_count |
min_count |
avg_count |
Harper |
6935 |
9564 |
2629 |
6461.6 |
Isabella |
5933 |
22883 |
16950 |
19266.8 |
Aria |
4990 |
5893 |
903 |
3421.0 |
Charlotte |
4694 |
10048 |
5354 |
7712.4 |
Alexis |
4061 |
8249 |
4188 |
5808.0 |
Note that this calculation doesn't tell us if the name is becoming more or less popular, since we don't know when the MAX
happened with respect to the MIN
value. We'd have to do a separate (non-aggregate) query based on the names given above:
SELECT name, year, count
FROM baby_names
WHERE name IN ('Harper', 'Isabella', 'Aria', 'Charlotte', 'Alexis')
AND state = 'USA'
AND sex = 'F'
ORDER BY name, year;
Rather than print out the table, I'll just show you what it looks like as a visualized pivot table on Google Sheets:
|
2010 |
2011 |
2012 |
2013 |
2014 |
Alexis |
8249 |
6510 |
5337 |
4756 |
4188 |
Aria |
903 |
1979 |
3226 |
5104 |
5893 |
Charlotte |
5354 |
6418 |
7468 |
9274 |
10048 |
Harper |
2629 |
4668 |
7179 |
8268 |
9564 |
Isabella |
22883 |
19870 |
19058 |
17573 |
16950 |
This ranking of names – by biggest difference between MAX
and MIN
– seems to contain two types of names:
- Very popular names in 2010, e.g.
'Isabella'
and 'Alexis'
, which steeply decline in popularity by 2015.
- Relatively uncommon names in 2010, e.g.
'Harper'
, 'Aria'
, and 'Charlotte'
, which become much more popular in 2015. The name 'Aria'
, in particular, sees a 6.5x increase.
AVG and GROUP BY
The boy baby names, nationwide, with the highest average count of babies over the years 2010 to 2014:
SELECT name,
AVG(count) AS average_count
FROM baby_names
WHERE sex = 'M'
AND state = 'USA'
GROUP BY name
ORDER BY average_count DESC
LIMIT 5;
name |
average_count |
Jacob |
19245.4 |
Mason |
17601.4 |
Noah |
17578.2 |
William |
16889.0 |
Ethan |
16820.4 |
Let's do it by MAX
:
SELECT name,
MAX(count) AS max_count
FROM baby_names
WHERE sex = 'M'
AND state = 'USA'
GROUP BY name
ORDER BY max_count DESC
LIMIT 5;
We see that the name 'Liam'
shows up in this ranking, which suggests that maybe either the name has just recently become popular, or is on a downturn, for 'Liam'
not to appear in the top 5 by average:
name |
max_count |
Jacob |
22082 |
Mason |
19488 |
Noah |
19144 |
Liam |
18342 |
Ethan |
17985 |
Mistakes
Common mistake:
SELECT name, MAX(count) AS max_count,
MIN(count) AS min_count
FROM baby_names
WHERE state = 'USA' ## this needs to have sex, as well
GROUP BY name
ORDER BY max_count DESC
LIMIT 10;