The Takeaways

By the end of this tutorial, make sure you can answer these questions:

  • How to count the number of rows in a table using COUNT.

  • How to count unique values in a column.

  • Where aggregate functions can and can be used in a SQL statement.

  • How to find the biggest or smallest value in a column.

  • How to find the sum and average of a column.

  • Why it is bad practice, when doing an aggregate query, to include columns. that are not being aggregated.

Get the data

For this lesson, download the following file, which is a SQLite database built from the U.S. Social Security Administration file of popular baby names:

Unzip it, and open the sqlite file using the SQLite client of your choice (e.g. DB Browser for SQLite)

Table of contents

Basic syntax of aggregate functions

Aggregate functions are used to perform calculations such as the sum or average of a group of values. They all take at least one argument – typically the name of the column of which we want to sum or average the values.

SELECT SUM(some_column)
FROM some_table;

How aggregate functions change the result

Aggregate functions will return only the aggregate value of the resulting rows. In the above example, even if some_table has 20,000 rows, the query result be a single value, i.e. something like this:

SUM(some_column)
92999

In other words, aggregate functions fundamentally change the form of the result.

Where can aggregate functions be used?

Aggregate functions, unlike other core functions, can only be called as part of the SELECT clause.

So this query will not work (nor does it make any sense):

SELECT some_column
FROM some_table
where SUM(some_column) > 100;

To put it more simply: just keep your aggregate functions at the beginning of the SELECT statement.

The COUNT function

Note: this section reiterates the previous introduction, except using the SSA baby name data, as well as describing the COUNT function.

The most basic kind of aggregation is to simply count the number of rows. Previously, we were able to get a count of result rows by looking at the results log after doing a query:

/image

However, that was only a side effect of doing a standard query, and a convenience provided by the GUI client. As we've noticed so far when learning SQL, we should be explicit in asking for exactly what we want.

How to use COUNT in a query

Using the COUNT function, we can query the database to return the total number of rows in a table. Aggregate functions are used only after the SELECT clause.

The COUNT function takes a single argument. The most common is to simply use the star – this tells COUNT to provide a count of all the rows in the result:

SELECT COUNT(*)
FROM baby_names;

The result:

COUNT(*)
633608

Of course, we can count more specific things, such as: how many girl baby names were there in California in the year 2012?

SELECT COUNT(*)
FROM baby_names
WHERE state = 'CA'
  AND sex = 'F'
  AND year = 2012;

The result:

COUNT(*)
4088

The results of aggregate functions

Notice how COUNT and other aggregate functions collapse the result rows into a single value.

Without an aggregate function, SELECT * FROM baby_names; should have returned all 633,608 result rows. However, an aggregate function is applied across all the rows in the result set and returns a single aggregate value. In other words, counting up 633,608 individual rows returns just a single answer: "there are 633,608 rows."

Note: However, using GROUP BY, we can specify how the results should be grouped and aggregated, if we want, for example, to do a COUNT of result rows per state. This is where things get tricky.

We'll learn how to use GROUP BY in concert with aggregate functions in the next tutorial. For now, let's just learn about the other aggregate functions.

The problem with including columns that aren't being aggregated

Let's repeat the above query, except include the other columns to provide some context. And while we're at it, we'll alias the COUNT function call:

SELECT state, sex, year, 
  COUNT(*) as total_count
FROM baby_names
WHERE state = 'CA'
  AND sex = 'F'
  AND year = 2012;

The result:

state sex year total_count
CA F 2012 4088

That makes sense, as state and sex and year show the only possible values that were allowed by the WHERE clause.

But this is a bit deceptive of how the query actually works. It's better to illustrate with an example: what happens when we want to do a count for all years? In which case, we just remove the AND year = 2012 from the clause:

SELECT state, sex, year, 
  COUNT(*) as total_count
FROM baby_names
WHERE state = 'CA'
  AND sex = 'F';
state sex year total_count
CA F 2014 20109

Notice how the year column is listed as 2014, which makes the result seem as if it is specifically for only 2014. Remember that aggregate functions reduce the result set to a single value. We know that year contains 5 different possible values (i.e. years 2010 through 2014), but the query is designed to return only a single row. So why is it 2014? It may just be random, for all we know.

The upshot is: do not include columns that aren't being aggregated.

I'll reiterate this at the end of the lesson, so moving on…

COUNT(*) vs COUNT(column)

The COUNT function takes in a single argument. The use of the star tells COUNT to simply count all rows in the result set. However, we can pass in the name of a column to get a count of non-NULL values:

SELECT COUNT(state)
FROM baby_names;

This actually results in the same number (i.e. 633608) as before. This is because there are no non-NULL values in state, i.e. non-existent values. We actually haven't covered NULL yet and this particular database includes no NULL values…so, we'll revisit this issue later.

The difference between the count column and the COUNT function

As a side-note, it's worth wondering how the interpreter tells the difference between count, the column, and the COUNT function. Recall that column identifiers and SQLite keywords are interpreted case-insensitively. So the following three queries are equivalent and valid:

SELECT count(count) FROM baby_names;

SELECT COUNT(COUNT) FROM baby_names;

SELECT COUNT("COUNT") FROM baby_names;

The short answer: because that's just part of the SQL syntax and grammar. The interpreter knows that count, by itself, must refer to some kind of identifier, i.e. a column name. And invoking count(whatever), count is interpreted as a function call.

Another takeaway: try not to name your columns something like count, or sum, or even name, because it gets confusing. Obviously, I didn't follow that rule with the SSA baby names database.

Using DISTINCT to count unique values

Sometimes we don't care about the total number of rows, but the number of unique values within a particular column. The keyword DISTINCT – followed by a column name – can be used as an argument in any aggregate function call (though it mostly makes sense for the COUNT function).

Let's practice it on a column of which we know the unique number of values:

SELECT COUNT(DISTINCT sex)
FROM baby_names;

Result:

COUNT(DISTINCT sex)
2

To get the number of unique names:

SELECT COUNT(DISTINCT name)
FROM baby_names;

The result:

COUNT(DISTINCT name)
46359

The result of COUNT and DISTINCT is similar to the count delivered by the results log when doing a GROUP BY the name column:

SELECT name
FROM baby_names
GROUP BY name;
46359 Rows returned from: SELECT name
FROM baby_names
GROUP BY name; (took 178ms)

Of course, the difference is that COUNT(DISTINCT name) returns a single value: the count of rows returned by the GROUP BY query.

In the next lesson, we'll find out how useful it can be to do COUNT(DISTINCT) and a GROUP BY clause. But first, let's get acquainted with the rest of the aggregate functions.

Other aggregate functions

There aren't many built-in aggregate functions in SQLite. Here's the official list; I believe all of them are generally useful for data work, though I won't cover GROUP_CONCAT because it's a bit specialized (and can be confusing).

The SUM function

The SUM function returns the sum of a group's values. It takes in one argument, typically the name of the column of which we want to sum the values. Here is the summation of all the count values in baby_names:

SELECT SUM(count) 
FROM baby_names;

The result:

SUM(count)
33723213

This is a non-nonsensical number. Remember that the baby_names table contains a row for every state and for the U.S. (i.e. state = 'USA') as a whole. So we're in effect, double-counting the counts.

So let's use a WHERE clause to remove the USA rows:

SELECT SUM(count)
FROM baby_names
WHERE state != 'USA';

The answer:

SUM(count)
15449605

Or, to just sum the USA rows:

SELECT SUM(count)
FROM baby_names
WHERE state = 'USA';

The answer:

SUM(count)
18273608

Why the huge disparity? Again, it has something to do with how the Social Security Administration removes rows in which the count is less than 5 – the individual states have more names/year/sex records that would fall under that exemption.

What happens when you try to sum values that are not numbers?

SELECT SUM(name)
FROM baby_names;

SQLite treats non-numbers as 0.0. Since the name column is completely full of non-numbers, the answer should be predictable:

SUM(name)
0.0

The AVG function

The AVG function returns an average from a group of values. The following query will find the average count (again, the count column, not the COUNT function) per name, for the USA in the year 2014:

SELECT AVG(count)
FROM baby_names
WHERE state = 'USA'
  AND year = 2014;
AVG(count)
53.2240959710105

With AVG, it's common to use ROUND if you don't need the decimal precision:

SELECT ROUND(AVG(count))
FROM baby_names
WHERE state = 'USA'
  AND year = 2014;
ROUND(AVG(count))
53

In other words, there are about 53 babies for each boy name, nationwide, in the year 2014.

The MAX and MIN functions

The MAX and MIN functions return the biggest and smallest value, respectively, of a given group of values.

We already know the minimum number of babies – i.e. count – for any given name, state, year, and sex, is 5. But we can confirm that using the MIN function on count:

SELECT MIN(count)
FROM baby_names;

The result:

MIN(count)
5

To find the highest number of babies that has ever been named 'Daniel', at least between 2010 and 2014, for any single U.S. state:

SELECT MAX(count)
FROM baby_names
WHERE state != 'USA'
  AND name = 'Daniel';

The answer:

MAX(count)
3131

The difference between MAX/MIN and LIMIT & ORDER

If you remember the lesson on the LIMIT and ORDER clauses, you could argue that we had a perfectly fine way to find the biggest and smallest value for any given column.

For example, to find the smallest quantity of count:

SELECT count
FROM baby_names
ORDER BY count ASC
LIMIT 1;

And the biggest number of babies named 'Daniel' in a single U.S. state:

SELECT count
FROM baby_names
WHERE state != 'USA'
  AND name = 'Daniel'
ORDER BY count DESC
LIMIT 1;

Both the MIN/MAX and the LIMIT & ORDER BY queries return the same answer. So why use one over the other? Well, the LIMIT and ORDER BY queries have the advantage of actually returning the entire row, which we can do by doing SELECT *, and thus giving us needed context, such as: what year and state the 3131 Daniels were born into:

SELECT *
FROM baby_names
WHERE state != 'USA'
  AND name = 'Daniel'
ORDER BY count DESC
LIMIT 1;
state sex year name count
CA M 2010 Daniel 3131

That seems significantly more useful than what MAX can provide. So read on for one example of where MAX and MIN have their uses.

Performing multiple aggregate functions

Just like how more than one of the non-aggregate functions – e.g. LENGTH, LTRIM, UPPER, etc. – can be used in a SELECT clause, the same applies to aggregate functions.

For example, to find the most and the least number of boy babies named 'Daniel', nationwide, as well as the average and the sum across all the years in the baby_names table:

SELECT SUM(count), ROUND(AVG(count)),
  MIN(count), MAX(count)
FROM baby_names
WHERE state = 'USA'
  AND name = 'Daniel'
  AND sex = 'M';
SUM(count) ROUND(AVG(count)) MIN(count) MAX(count)
73346 14669.0 13829 15815

So this is one example of how MAX/MIN is different than using ORDER BY & LIMIT 1 – the latter is limited to either showing the biggest or smallest of a value. So while both approaches can theoretically result in the same actual answer, what we use simply depends on the kind of query we want to make.

Or, to repeat something I've said before: there's always more than one way to get an answer.

Do not include columns that aren't being aggregated

I realize I covered this topic earlier, but it is very easy to mess up as we get into more complicated queries. This is why I believe it is worth reiterating before we finish this tutorial.

Recall the query needed to count all the rows in a given table:

SELECT COUNT(*)
FROM baby_names;

And the answer:

COUNT(*)
633608

What happens if we include all the columns along with the COUNT function?

SELECT *, COUNT(*)
FROM baby_names;

The result:

state sex year name count COUNT(*)
USA M 2014 Zyrin 5 633608

How is one to interpret this result? That the boy name of Zyrin had 633,608 rows? Of course not. Why did the database choose to return that specific name, year, sex, and state? Who knows?

And that's why it is best practice to only include the columns that you are counting/averaging/summing/etc.

Note: For those of you who are curious about SQL in general to an unhealthy degree, SQLite is one of the few variations of SQL that allow the mixing of aggregate and non-aggregate columns. However, many other flavors of SQL do not. Or, if they do, the results are even more arbitrary than in SQLite. This is why I strongly suggest, out of habit, that you follow the rule of not mixing non-aggregate and aggregate columns. Because other SQL variations – notably PostgreSQL – will not let you be as sloppy.


References

Aggregate Functions | SQLite documentation