Table of contents
The problem with granular data
The SQL examples with the baby names data have, so far, been a little cumbersome. With just the 2010 to 2014 data, a single name
, because the baby_names
table enumerates every combination of name
, state
, year
, and sex
.
So for a name such as Taylor
, which is common for both sexes:
- 51 states (50 states plus
USA
)
- 2 sexes (
'M'
and 'F'
)
- 5 years (2010 to 2014)
– we can expect to see 510 different rows when doing this query:
SELECT * FROM baby_names
WHERE name = 'Taylor';
That query actually produces 460 rows, which means that for some combination of state, sex, and year, there were fewer than 5 babies named 'Taylor'
. Finding which scenario yielded fewer than five Taylor's is an interesting question on its own, to be tackled much later.
For now, we have a much more basic problem: How do we total the number of 'Taylor'
named babies, i.e. the total of the count
column? Actually, we'll save that one for the next lesson.
So let's step back even further and ask an even more basic question: Just how many different baby names _are there?
Remember to pivot
First, it's important to remember that even if you stopped reading, right now, and learned nothing else about SQL, you still have a way to find number of different baby names: just export the baby_names
table into a spreadsheet and create a pivot table. Then list the name
field as rows. That would pull out all the unique name
values.
But exporting and importing is a pain. Also, the data might just be too big to fit into some spreadsheets. It'd be nice if we could do that aggregation of the data within SQLite.
At it's most basic usage, this filtering of unique values is what the GROUP BY
clause can do.
How to GROUP BY a column
Here's the basic syntax for a GROUP BY
, without any other clauses except SELECT
and FROM
:
SELECT name FROM baby_names
GROUP BY name;
The example output:
name |
Aaban |
Aabha |
Aabriella |
Aadam |
Aadan |
If you look at the results log, you should see this:
46359 Rows returned from: SELECT name FROM baby_names
GROUP BY name (took 188ms)
In other words, there are exactly 46,359 unique baby names nationwide, among both boys and girls, as recorded by the Social Security Administration for the years 2010 to 2014.
But is that right? I mean, did we actually use GROUP BY
correctly, and are we interpreting the results log correctly? We don't really know, because we didn't previously know exactly how many unique baby names there are, and thus have no frame of reference.
So let's group by a column of which we do know the number of unique values. Well, we do know that the dataset spans from 2010
to 2014
. A grouping by year
should yield 5 rows:
SELECT year FROM baby_names
GROUP BY year;
The results log:
5 Rows returned from: SELECT year FROM baby_names
GROUP BY year; (took 123ms)
The output:
year |
2010 |
2011 |
2012 |
2013 |
2014 |
If you went through the American school system, or looked at the American flag recently, you probably remember the fact that we have 50 states. Let's confirm that 50 different states exist in the data:
SELECT state FROM baby_names
GROUP BY state;
The results log:
52 Rows returned from: SELECT state FROM baby_names
GROUP BY state; (took 158ms)
As it turns out, there are 52 different state
values: DC
for District of Columbia, and USA
for the rows representing nationwide data.
Let's do one more query to confirm an assumption most of us may have automatically made: does the SSA only categorize babies as either boy or girl, i.e. M
or F
?
SELECT sex FROM baby_names
GROUP BY sex;
The results log:
2 Rows returned from: SELECT sex FROM baby_names
GROUP BY sex; (took 899ms)
Where GROUP BY fits into the SQL statement
Before learning of the other nuances and features of GROUP BY
, let's see where it fits in with all the other SQL clauses we've learned so far.
GROUP BY and WHERE
The WHERE
clause goes before the GROUP BY
clause:
How do we find out the number of unique baby names just for boys? Filter upon sex = 'M'
:
(Note: in case this isn't already clear from previous examples, I don't need to display the sex
field via SELECT
in order to refer to it in the WHERE
clause):
SELECT name FROM baby_names
WHERE sex = 'M'
GROUP BY name;
The results log:
21193 Rows returned from: SELECT name FROM baby_names
WHERE sex = 'M'
GROUP BY name; (took 171ms)
We'll count the number of girl names in a later lesson. For now, let's filter by state
and count number of unique names where state = 'USA'
:
SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name;
The results log:
46359 Rows returned from: SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name; (took 683ms)
Note that the number of rows – 46,359 – is the exact same as it was when in our very first GROUP BY
query. This is not just a coincidence. This is a consequence of our data set. Rows with USA
refer to a nationwide count. Remember that the Social Security Administration omits rows in which the count
of babies is less than 5
. So if, for a relatively rare name, such as 'Obama'
, and given a relatively small state, such as Wyoming, if there were fewer than 5
babies, boy or girl, for all the years 2010 to 2014 – then that name would not show up for the state of Wyoming.
However, for every baby name that garnered 5 babies or more in Wyoming, by definition, it would be included in the USA
data – even if that name only appeared in Wyoming. In other words, every baby name that is recorded for every state will, by definition, be included in the USA
data. And that is why these two queries result in the exact number of rows for unique values in name
:
SELECT name FROM baby_names
GROUP BY name;
SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name;
One more WHERE
example: let's restrict the count of unique names to those that begin with the letter A
and end either in the letter x
or z
:
SELECT name FROM baby_names
WHERE
name LIKE 'A%'
AND (name LIKE '%x' OR name LIKE '%z')
GROUP BY name;
The total number of rows is 51. Here's the sample output:
name |
Aaraiz |
Aariz |
Abdelaziz |
Abdiaziz |
Abdoulaziz |
Abdulaziz |
GROUP BY and ORDER BY
The GROUP BY
clause needs to appear before the ORDER BY
– which makes a little intuitive sense, in that the ordering operation should be done after the possible result rows are filtered and pruned.
To reuse the previous query and sort it in reverse alphabetical order, and then limit the results to 5
(we already know that LIMIT
comes after the ORDER BY
:
SELECT name FROM baby_names
WHERE
name LIKE 'A%'
AND (name LIKE '%x' OR name LIKE '%z')
GROUP BY name
ORDER BY name DESC
LIMIT 5;
Here's the output:
name |
Aziz |
Azeez |
Ayaz |
Ashaz |
Artez |
Grouping by an alias
This is just a refresher of the previous lesson on aliases, in which we learned that aliases can be used wherever column identifiers are used.
Ergo, to enumerate all the possible lengths of names in baby_names
and then list them in ascending order:
SELECT LENGTH(name) as z
FROM baby_names
GROUP BY z
ORDER BY z ASC;
Sample output:
Grouping by multiple columns
Finding unique values within a single column is interesting. Oftentimes, we want to find all unique combinations between the values of two or more fields. The syntax for this is to pass multiple field names into the GROUP BY clause, separated by commas.
Let's do a trivial example; we know that there are 2 and 5 unique values for sex
and year
, respectively. Doing a GROUP BY
on both columns should yield 10 rows:
SELECT sex, year
FROM baby_names
GROUP BY sex, year;
The results log:
10 Rows returned from: SELECT sex, year
FROM baby_names
GROUP BY sex, year; (took 1206ms)
The results:
sex |
year |
F |
2010 |
F |
2011 |
F |
2012 |
F |
2013 |
F |
2014 |
M |
2010 |
M |
2011 |
M |
2012 |
M |
2013 |
M |
2014 |
A quick note about GROUP BY and column order
A very good question to ask is: does it matter what order I list the columns to group by? In other words, does the previous query return the same results as the following query?
SELECT sex, year
FROM baby_names
GROUP BY year, sex;
Run it yourself to find out.
Here's my results log:
10 Rows returned from: SELECT sex, year
FROM baby_names
GROUP BY year, sex; (took 2318ms)
Here's the output – note that the order of the columns in the results will not change unless we change the SELECT
clause as well:
sex |
year |
F |
2010 |
M |
2010 |
F |
2011 |
M |
2011 |
F |
2012 |
M |
2012 |
F |
2013 |
M |
2013 |
F |
2014 |
M |
2014 |
First of all, the important thing to know is that the number of rows and columns remains the same, i.e. 10 rows and 2 columns. This should logically follow from the commutative rule of multiplication:
5 (years) * 2 (sexes) = 2 (sexes) * 5 (years)
However, the ordering of the rows has changed. When we prioritized grouping by sex
, the first 5 rows looked like this:
sex |
year |
F |
2010 |
F |
2011 |
F |
2012 |
F |
2013 |
F |
2014 |
When prioritizing by year
, the year values are essentially collected and ordered, and then for each year
value, the sex
value is shown. I've reordered the columns to make the difference more obvious:
year |
sex |
2010 |
F |
2010 |
M |
2011 |
F |
2011 |
M |
There's actually a performance difference too, but that requires discussing the topic of indexes…which we don't need to worry about at the moment. However, it's enough to appreciate how the SQLite software behaves based on how we prioritize things – in this case, the columns for GROUP BY
. And yet even though the calculation process is different, the content of the result remains the same.
Being selective while grouping
In the previous section, we've already seen that SELECT
can operate independently of GROUP BY
, in that the list of columns for SELECT
can be different in order than the one GROUP BY
.
However, it is also possible in SQLite for the SELECT
clause to list columns not referred to in the GROUP BY
:
SELECT sex, year
FROM baby_names
GROUP BY sex;
The result:
Or even simpler:
SELECT year
FROM baby_names
GROUP BY sex;
None of these make sense.
(more to come)