The Takeaways

By the end of this tutorial, make sure you can answer these questions:

  • How to find all unique values in a given column.

  • Where GROUP BY fits in with SELECT, WHERE, and ORDER BY.

Get the data

For this lesson, download the following file, which is a SQLite database built from the U.S. Social Security Administration file of popular baby names:

Unzip it, and open the sqlite file using the SQLite client of your choice (e.g. DB Browser for SQLite)

Table of contents

The problem with granular data

The SQL examples with the baby names data have, so far, been a little cumbersome. With just the 2010 to 2014 data, a single name, because the baby_names table enumerates every combination of name, state, year, and sex.

So for a name such as Taylor, which is common for both sexes:

– we can expect to see 510 different rows when doing this query:

SELECT * FROM baby_names
WHERE name = 'Taylor';

That query actually produces 460 rows, which means that for some combination of state, sex, and year, there were fewer than 5 babies named 'Taylor'. Finding which scenario yielded fewer than five Taylor's is an interesting question on its own, to be tackled much later.

For now, we have a much more basic problem: How do we total the number of 'Taylor' named babies, i.e. the total of the count column? Actually, we'll save that one for the next lesson.

So let's step back even further and ask an even more basic question: Just how many different baby names _are there?

Remember to pivot

First, it's important to remember that even if you stopped reading, right now, and learned nothing else about SQL, you still have a way to find number of different baby names: just export the baby_names table into a spreadsheet and create a pivot table. Then list the name field as rows. That would pull out all the unique name values.

But exporting and importing is a pain. Also, the data might just be too big to fit into some spreadsheets. It'd be nice if we could do that aggregation of the data within SQLite.

At it's most basic usage, this filtering of unique values is what the GROUP BY clause can do.

How to GROUP BY a column

Here's the basic syntax for a GROUP BY, without any other clauses except SELECT and FROM:

SELECT name FROM baby_names
GROUP BY name;

The example output:

name
Aaban
Aabha
Aabriella
Aadam
Aadan

If you look at the results log, you should see this:

46359 Rows returned from: SELECT name FROM baby_names
GROUP BY name (took 188ms)

In other words, there are exactly 46,359 unique baby names nationwide, among both boys and girls, as recorded by the Social Security Administration for the years 2010 to 2014.

But is that right? I mean, did we actually use GROUP BY correctly, and are we interpreting the results log correctly? We don't really know, because we didn't previously know exactly how many unique baby names there are, and thus have no frame of reference.

So let's group by a column of which we do know the number of unique values. Well, we do know that the dataset spans from 2010 to 2014. A grouping by year should yield 5 rows:

SELECT year FROM baby_names
GROUP BY year;

The results log:

5 Rows returned from: SELECT year FROM baby_names
GROUP BY year; (took 123ms)

The output:

year
2010
2011
2012
2013
2014

If you went through the American school system, or looked at the American flag recently, you probably remember the fact that we have 50 states. Let's confirm that 50 different states exist in the data:

SELECT state FROM baby_names
GROUP BY state;

The results log:

52 Rows returned from: SELECT state FROM baby_names
GROUP BY state; (took 158ms)

As it turns out, there are 52 different state values: DC for District of Columbia, and USA for the rows representing nationwide data.

Let's do one more query to confirm an assumption most of us may have automatically made: does the SSA only categorize babies as either boy or girl, i.e. M or F?

SELECT sex FROM baby_names
GROUP BY sex;

The results log:

2 Rows returned from: SELECT sex FROM baby_names
GROUP BY sex; (took 899ms)

Where GROUP BY fits into the SQL statement

Before learning of the other nuances and features of GROUP BY, let's see where it fits in with all the other SQL clauses we've learned so far.

GROUP BY and WHERE

The WHERE clause goes before the GROUP BY clause:

How do we find out the number of unique baby names just for boys? Filter upon sex = 'M':

(Note: in case this isn't already clear from previous examples, I don't need to display the sex field via SELECT in order to refer to it in the WHERE clause):

SELECT name FROM baby_names
WHERE sex = 'M'
GROUP BY name;

The results log:

  21193 Rows returned from: SELECT name FROM baby_names
  WHERE sex = 'M'
  GROUP BY name; (took 171ms)

We'll count the number of girl names in a later lesson. For now, let's filter by state and count number of unique names where state = 'USA':

SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name;

The results log:

46359 Rows returned from: SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name; (took 683ms)

Note that the number of rows – 46,359 – is the exact same as it was when in our very first GROUP BY query. This is not just a coincidence. This is a consequence of our data set. Rows with USA refer to a nationwide count. Remember that the Social Security Administration omits rows in which the count of babies is less than 5. So if, for a relatively rare name, such as 'Obama', and given a relatively small state, such as Wyoming, if there were fewer than 5 babies, boy or girl, for all the years 2010 to 2014 – then that name would not show up for the state of Wyoming.

However, for every baby name that garnered 5 babies or more in Wyoming, by definition, it would be included in the USA data – even if that name only appeared in Wyoming. In other words, every baby name that is recorded for every state will, by definition, be included in the USA data. And that is why these two queries result in the exact number of rows for unique values in name:

SELECT name FROM baby_names
GROUP BY name;
SELECT name FROM baby_names
WHERE state = 'USA'
GROUP BY name;

One more WHERE example: let's restrict the count of unique names to those that begin with the letter A and end either in the letter x or z:

SELECT name FROM baby_names
WHERE 
  name LIKE 'A%' 
  AND (name LIKE '%x' OR name LIKE '%z')
GROUP BY name;

The total number of rows is 51. Here's the sample output:

name
Aaraiz
Aariz
Abdelaziz
Abdiaziz
Abdoulaziz
Abdulaziz

GROUP BY and ORDER BY

The GROUP BY clause needs to appear before the ORDER BY – which makes a little intuitive sense, in that the ordering operation should be done after the possible result rows are filtered and pruned.

To reuse the previous query and sort it in reverse alphabetical order, and then limit the results to 5 (we already know that LIMIT comes after the ORDER BY:

SELECT name FROM baby_names
WHERE 
  name LIKE 'A%' 
  AND (name LIKE '%x' OR name LIKE '%z')
GROUP BY name
ORDER BY name DESC
LIMIT 5;

Here's the output:

name
Aziz
Azeez
Ayaz
Ashaz
Artez

Grouping by an alias

This is just a refresher of the previous lesson on aliases, in which we learned that aliases can be used wherever column identifiers are used.

Ergo, to enumerate all the possible lengths of names in baby_names and then list them in ascending order:

SELECT LENGTH(name) as z 
FROM baby_names
GROUP BY z
ORDER BY z ASC;

Sample output:

z
2
3
4
5

Grouping by multiple columns

Finding unique values within a single column is interesting. Oftentimes, we want to find all unique combinations between the values of two or more fields. The syntax for this is to pass multiple field names into the GROUP BY clause, separated by commas.

Let's do a trivial example; we know that there are 2 and 5 unique values for sex and year, respectively. Doing a GROUP BY on both columns should yield 10 rows:

SELECT sex, year
FROM baby_names
GROUP BY sex, year;

The results log:

10 Rows returned from: SELECT sex, year
FROM baby_names
GROUP BY sex, year; (took 1206ms)

The results:

sex year
F 2010
F 2011
F 2012
F 2013
F 2014
M 2010
M 2011
M 2012
M 2013
M 2014

A quick note about GROUP BY and column order

A very good question to ask is: does it matter what order I list the columns to group by? In other words, does the previous query return the same results as the following query?

SELECT sex, year
FROM baby_names
GROUP BY year, sex;

Run it yourself to find out.

Here's my results log:

10 Rows returned from: SELECT sex, year
FROM baby_names
GROUP BY year, sex; (took 2318ms)

Here's the output – note that the order of the columns in the results will not change unless we change the SELECT clause as well:

sex year
F 2010
M 2010
F 2011
M 2011
F 2012
M 2012
F 2013
M 2013
F 2014
M 2014

First of all, the important thing to know is that the number of rows and columns remains the same, i.e. 10 rows and 2 columns. This should logically follow from the commutative rule of multiplication:

      5 (years) * 2 (sexes) = 2 (sexes) * 5 (years)

However, the ordering of the rows has changed. When we prioritized grouping by sex, the first 5 rows looked like this:

sex year
F 2010
F 2011
F 2012
F 2013
F 2014

When prioritizing by year, the year values are essentially collected and ordered, and then for each year value, the sex value is shown. I've reordered the columns to make the difference more obvious:

year sex
2010 F
2010 M
2011 F
2011 M

There's actually a performance difference too, but that requires discussing the topic of indexes…which we don't need to worry about at the moment. However, it's enough to appreciate how the SQLite software behaves based on how we prioritize things – in this case, the columns for GROUP BY. And yet even though the calculation process is different, the content of the result remains the same.

Being selective while grouping

In the previous section, we've already seen that SELECT can operate independently of GROUP BY, in that the list of columns for SELECT can be different in order than the one GROUP BY.

However, it is also possible in SQLite for the SELECT clause to list columns not referred to in the GROUP BY:

SELECT sex, year
FROM baby_names
GROUP BY sex;

The result:

sex year
F 2014
M 2014

Or even simpler:

SELECT year
FROM baby_names
GROUP BY sex;
year
2014
2014

None of these make sense.

(more to come)