Table of contents
Basic syntax of aggregate functions
Aggregate functions are used to perform calculations such as the sum or average of a group of values. They all take at least one argument – typically the name of the column of which we want to sum or average the values.
SELECT SUM(some_column)
FROM some_table;
How aggregate functions change the result
Aggregate functions will return only the aggregate value of the resulting rows. In the above example, even if some_table
has 20,000 rows, the query result be a single value, i.e. something like this:
In other words, aggregate functions fundamentally change the form of the result.
Where can aggregate functions be used?
Aggregate functions, unlike other core functions, can only be called as part of the SELECT
clause.
So this query will not work (nor does it make any sense):
SELECT some_column
FROM some_table
where SUM(some_column) > 100;
To put it more simply: just keep your aggregate functions at the beginning of the SELECT
statement.
The COUNT function
Note: this section reiterates the previous introduction, except using the SSA baby name data, as well as describing the COUNT
function.
The most basic kind of aggregation is to simply count the number of rows. Previously, we were able to get a count of result rows by looking at the results log after doing a query:
However, that was only a side effect of doing a standard query, and a convenience provided by the GUI client. As we've noticed so far when learning SQL, we should be explicit in asking for exactly what we want.
How to use COUNT
in a query
Using the COUNT
function, we can query the database to return the total number of rows in a table. Aggregate functions are used only after the SELECT clause.
The COUNT function takes a single argument. The most common is to simply use the star – this tells COUNT
to provide a count of all the rows in the result:
SELECT COUNT(*)
FROM baby_names;
The result:
Of course, we can count more specific things, such as: how many girl baby names were there in California in the year 2012?
SELECT COUNT(*)
FROM baby_names
WHERE state = 'CA'
AND sex = 'F'
AND year = 2012;
The result:
The results of aggregate functions
Notice how COUNT
and other aggregate functions collapse the result rows into a single value.
Without an aggregate function, SELECT * FROM baby_names;
should have returned all 633,608 result rows. However, an aggregate function is applied across all the rows in the result set and returns a single aggregate value. In other words, counting up 633,608 individual rows returns just a single answer: "there are 633,608 rows."
Note: However, using GROUP BY
, we can specify how the results should be grouped and aggregated, if we want, for example, to do a COUNT
of result rows per state
. This is where things get tricky.
We'll learn how to use GROUP BY
in concert with aggregate functions in the next tutorial. For now, let's just learn about the other aggregate functions.
The problem with including columns that aren't being aggregated
Let's repeat the above query, except include the other columns to provide some context. And while we're at it, we'll alias the COUNT
function call:
SELECT state, sex, year,
COUNT(*) as total_count
FROM baby_names
WHERE state = 'CA'
AND sex = 'F'
AND year = 2012;
The result:
state |
sex |
year |
total_count |
CA |
F |
2012 |
4088 |
That makes sense, as state
and sex
and year
show the only possible values that were allowed by the WHERE
clause.
But this is a bit deceptive of how the query actually works. It's better to illustrate with an example: what happens when we want to do a count for all years? In which case, we just remove the AND year = 2012
from the clause:
SELECT state, sex, year,
COUNT(*) as total_count
FROM baby_names
WHERE state = 'CA'
AND sex = 'F';
state |
sex |
year |
total_count |
CA |
F |
2014 |
20109 |
Notice how the year
column is listed as 2014
, which makes the result seem as if it is specifically for only 2014
. Remember that aggregate functions reduce the result set to a single value. We know that year
contains 5 different possible values (i.e. years 2010 through 2014), but the query is designed to return only a single row. So why is it 2014
? It may just be random, for all we know.
The upshot is: do not include columns that aren't being aggregated.
I'll reiterate this at the end of the lesson, so moving on…
COUNT(*) vs COUNT(column)
The COUNT
function takes in a single argument. The use of the star tells COUNT
to simply count all rows in the result set. However, we can pass in the name of a column to get a count of non-NULL
values:
SELECT COUNT(state)
FROM baby_names;
This actually results in the same number (i.e. 633608
) as before. This is because there are no non-NULL
values in state
, i.e. non-existent values. We actually haven't covered NULL
yet and this particular database includes no NULL
values…so, we'll revisit this issue later.
The difference between the count column and the COUNT function
As a side-note, it's worth wondering how the interpreter tells the difference between count
, the column, and the COUNT
function. Recall that column identifiers and SQLite keywords are interpreted case-insensitively. So the following three queries are equivalent and valid:
SELECT count(count) FROM baby_names;
SELECT COUNT(COUNT) FROM baby_names;
SELECT COUNT("COUNT") FROM baby_names;
The short answer: because that's just part of the SQL syntax and grammar. The interpreter knows that count
, by itself, must refer to some kind of identifier, i.e. a column name. And invoking count(whatever)
, count
is interpreted as a function call.
Another takeaway: try not to name your columns something like count
, or sum
, or even name
, because it gets confusing. Obviously, I didn't follow that rule with the SSA baby names database.
Using DISTINCT to count unique values
Sometimes we don't care about the total number of rows, but the number of unique values within a particular column. The keyword DISTINCT
– followed by a column name – can be used as an argument in any aggregate function call (though it mostly makes sense for the COUNT
function).
Let's practice it on a column of which we know the unique number of values:
SELECT COUNT(DISTINCT sex)
FROM baby_names;
Result:
To get the number of unique names:
SELECT COUNT(DISTINCT name)
FROM baby_names;
The result:
COUNT(DISTINCT name) |
46359 |
The result of COUNT
and DISTINCT
is similar to the count delivered by the results log when doing a GROUP BY
the name
column:
SELECT name
FROM baby_names
GROUP BY name;
46359 Rows returned from: SELECT name
FROM baby_names
GROUP BY name; (took 178ms)
Of course, the difference is that COUNT(DISTINCT name)
returns a single value: the count of rows returned by the GROUP BY
query.
In the next lesson, we'll find out how useful it can be to do COUNT(DISTINCT)
and a GROUP BY
clause. But first, let's get acquainted with the rest of the aggregate functions.
Other aggregate functions
There aren't many built-in aggregate functions in SQLite. Here's the official list; I believe all of them are generally useful for data work, though I won't cover GROUP_CONCAT
because it's a bit specialized (and can be confusing).
The SUM function
The SUM
function returns the sum of a group's values. It takes in one argument, typically the name of the column of which we want to sum the values. Here is the summation of all the count
values in baby_names
:
SELECT SUM(count)
FROM baby_names;
The result:
This is a non-nonsensical number. Remember that the baby_names
table contains a row for every state and for the U.S. (i.e. state = 'USA'
) as a whole. So we're in effect, double-counting the counts.
So let's use a WHERE
clause to remove the USA
rows:
SELECT SUM(count)
FROM baby_names
WHERE state != 'USA';
The answer:
Or, to just sum the USA
rows:
SELECT SUM(count)
FROM baby_names
WHERE state = 'USA';
The answer:
Why the huge disparity? Again, it has something to do with how the Social Security Administration removes rows in which the count
is less than 5 – the individual states have more names/year/sex records that would fall under that exemption.
What happens when you try to sum values that are not numbers?
SELECT SUM(name)
FROM baby_names;
SQLite treats non-numbers as 0.0. Since the name
column is completely full of non-numbers, the answer should be predictable:
The AVG function
The AVG
function returns an average from a group of values. The following query will find the average count
(again, the count
column, not the COUNT
function) per name, for the USA
in the year
2014
:
SELECT AVG(count)
FROM baby_names
WHERE state = 'USA'
AND year = 2014;
AVG(count) |
53.2240959710105 |
With AVG
, it's common to use ROUND
if you don't need the decimal precision:
SELECT ROUND(AVG(count))
FROM baby_names
WHERE state = 'USA'
AND year = 2014;
In other words, there are about 53 babies for each boy name, nationwide, in the year 2014.
The MAX and MIN functions
The MAX
and MIN
functions return the biggest and smallest value, respectively, of a given group of values.
We already know the minimum number of babies – i.e. count
– for any given name, state, year, and sex, is 5. But we can confirm that using the MIN
function on count
:
SELECT MIN(count)
FROM baby_names;
The result:
To find the highest number of babies that has ever been named 'Daniel'
, at least between 2010 and 2014, for any single U.S. state:
SELECT MAX(count)
FROM baby_names
WHERE state != 'USA'
AND name = 'Daniel';
The answer:
The difference between MAX/MIN and LIMIT & ORDER
If you remember the lesson on the LIMIT
and ORDER
clauses, you could argue that we had a perfectly fine way to find the biggest and smallest value for any given column.
For example, to find the smallest quantity of count
:
SELECT count
FROM baby_names
ORDER BY count ASC
LIMIT 1;
And the biggest number of babies named 'Daniel'
in a single U.S. state:
SELECT count
FROM baby_names
WHERE state != 'USA'
AND name = 'Daniel'
ORDER BY count DESC
LIMIT 1;
Both the MIN/MAX
and the LIMIT
& ORDER BY
queries return the same answer. So why use one over the other? Well, the LIMIT
and ORDER BY
queries have the advantage of actually returning the entire row, which we can do by doing SELECT *
, and thus giving us needed context, such as: what year
and state
the 3131
Daniels were born into:
SELECT *
FROM baby_names
WHERE state != 'USA'
AND name = 'Daniel'
ORDER BY count DESC
LIMIT 1;
state |
sex |
year |
name |
count |
CA |
M |
2010 |
Daniel |
3131 |
That seems significantly more useful than what MAX
can provide. So read on for one example of where MAX
and MIN
have their uses.
Just like how more than one of the non-aggregate functions – e.g. LENGTH
, LTRIM
, UPPER
, etc. – can be used in a SELECT
clause, the same applies to aggregate functions.
For example, to find the most and the least number of boy babies named 'Daniel'
, nationwide, as well as the average and the sum across all the years in the baby_names
table:
SELECT SUM(count), ROUND(AVG(count)),
MIN(count), MAX(count)
FROM baby_names
WHERE state = 'USA'
AND name = 'Daniel'
AND sex = 'M';
SUM(count) |
ROUND(AVG(count)) |
MIN(count) |
MAX(count) |
73346 |
14669.0 |
13829 |
15815 |
So this is one example of how MAX/MIN
is different than using ORDER BY
& LIMIT 1
– the latter is limited to either showing the biggest or smallest of a value. So while both approaches can theoretically result in the same actual answer, what we use simply depends on the kind of query we want to make.
Or, to repeat something I've said before: there's always more than one way to get an answer.
Do not include columns that aren't being aggregated
I realize I covered this topic earlier, but it is very easy to mess up as we get into more complicated queries. This is why I believe it is worth reiterating before we finish this tutorial.
Recall the query needed to count all the rows in a given table:
SELECT COUNT(*)
FROM baby_names;
And the answer:
What happens if we include all the columns along with the COUNT
function?
SELECT *, COUNT(*)
FROM baby_names;
The result:
state |
sex |
year |
name |
count |
COUNT(*) |
USA |
M |
2014 |
Zyrin |
5 |
633608 |
How is one to interpret this result? That the boy name of Zyrin
had 633,608 rows? Of course not. Why did the database choose to return that specific name, year, sex, and state? Who knows?
And that's why it is best practice to only include the columns that you are counting/averaging/summing/etc.
Note: For those of you who are curious about SQL in general to an unhealthy degree, SQLite is one of the few variations of SQL that allow the mixing of aggregate and non-aggregate columns. However, many other flavors of SQL do not. Or, if they do, the results are even more arbitrary than in SQLite. This is why I strongly suggest, out of habit, that you follow the rule of not mixing non-aggregate and aggregate columns. Because other SQL variations – notably PostgreSQL – will not let you be as sloppy.
Previous tutorials
Using GROUP BY to create aggregates in SQL – With the use of the GROUP BY clause, we gain the ability to aggregate our data based on values in a given column or columns. At the very least, this let's us count the number of unique values in that column.