The syntax for retrieving and displaying data from a SQLite tables
Data to use: San Francisco restaurants inspection data
If you're struggling with learning SQL, it helps to take a step back and think about why we're even learning it in the first place. This guide is meant to be a walkthrough of how I think when exploring a dataset and how that translate into actual SQL commands.
So treat this as an iterative exploration of the data, starting with relatively simple syntax to find the simple things we first want to know about.
Throughout this guide, I'll try to link to relevant tutorials. Try to follow this walkthrough and think about how to solve some of the questions I pose, but feel free to check the tutorials for more detailed explanations.
The San Francisco Department of Health is responsible for food facility inspections. Here is the landing page for the inspections program, which includes a chart describing their scoring system: 90 to 100 points is "Good". Getting below a 85 means there were probably "several high-risk violations".
If you look closely, you'll see a link to a public-facing searchable database:
Yeah, that URL isn't particularly human-friendly. But the search is straightforward. Here's the detail page for a particular restaurant:
To download the raw data, they host a ZIP file containing 3 files.
If you want to skip the trouble of importing the CSV data into SQLite, feel free to use my database file here:
The SQLite database contains 3 tables from the 3 CSV files:
businesses- unique identifiers, names, addresses, geospatial coordinates, and other data related to each individual business
inspections- contains a row for each inspection, including the business_id of the inspected business, the score it received (if applicable), the date of the inspection, and the reason for the inspection
violations- contains a row for every violation, including the severity and a short description of the violation. It contains business_id and an inspection date which can presumably be used to join against the inspections data.
Here's a PDF user manual of what's in the data.
So what should we first ask about a new database? No matter what table we start with, it's always worth asking: how many rows are in the table?.
If we start with the
businesses table, the answer to this query is ostensibly a rough estimate of the restaurants, eateries, and other food-serving establishments in all of San Francisco. And just how many is that, exactly? 5,000? 25,000? Knowing even a ballpark estimate will quickly lead to other questions.
So let's start with a
If you're using a client such as DB Browser for SQLite, then you're used to the convenience of its data navigating features. For you, finding out how many rows in a given table is just a matter of pressing the Browse Data function and looking at the big number in the lower left of the window:
So why bother writing a query to count rows for a table?
Don't get me wrong; DB Browser is great (and generously given away via open source). But it's just one tool and one approach to data work –specifically, to allow spreadsheet-users to interact with SQL without having to learn complicated SQL commands.
We also crave simplicity, but because we aspire to do complicated and intensive data analysis, we should gravitate towards being able to directly command our computers (or, in this case, a database).
What does directly commanding our SQLite database look like? Well, you can always work straight from the command-line. It's just you, and the interpreter waiting for you to ask it a question:
Actually, this is probably not the time to discover the CLI – at least when you're trying to learn SQL. But just keep in mind that it is possible to move past the GUI, and that comes both with the familiarity of typing things out and being able to think about how to explore the data, regardless of interface.
So, back to getting a count of rows:
SELECT COUNT(*) FROM businesses;
How to specify the quantity and arrangement of data rows returned by the SQL database.
A few keywords to prettify and organize the results of SQL queries.
Let's look at a few of the rows:
SELECT * FROM businesses LIMIT 5;
The output is too wide to comfortably show on this page so I'll do a
PRAGMA table_info(businesses); call to get the schema layout:
Most of the columns aren't that interesting, other than perhaps
longitude, if we want to later map things. The
owner_address is potentially interesting. But for now, let's just stick to the
SELECT name FROM businesses LIMIT 5;
I'm not sure why the first entry contains a blank name…but it's a sign for things to come, in terms of discovering just how messy this dataset can be…
|100% Dessert Cafe|
|1001 Castro St|
|101 Super Mart, Inc.|
How to retrieve rows based on whether they match specified values.
The WHERE clause is where SQL starts to become very interesting for searching large datasets. With WHERE, we can now filter the results according to conditions we explicitly set.
How many places serve up pizza in San Francisco? One way to get a lower-bound estimate is to filter for all businesses with
'PIZZA' in the name:
SELECT name FROM businesses WHERE name = 'PIZZA' LIMIT 10;
The above query returns 0 results because the query is limited to finding all places in which the name is literally
'PIZZA', which doesn't include "Pizza Hut" or even "Pizza" (i.e. non-uppercase).
Real-world data is often messy, so we need messy ways of matching values, because matching only on exact values can unintentionally filter out relevant data.
This is where the
LIKE clause and wildcards come in handy. We can specify all names that begin with
'PIZZA' and not worry about the capitalization:
SELECT name FROM businesses WHERE name LIKE 'PIZZA%' LIMIT 10;
|PIZZA HUT #758280|
|Pizza Joint & Grill|
|Pizza Joint and Grill|
|Pizza Orgasmica & Brewing Co.|
We get everything from
'PIZZA HUT' to
'Pizza Orgasmica & Brewing Co.'. But our wildcard will leave out places in which
'pizza' is at the middle or end of a name, such as
'Dominos Pizza'. That's easy to fix:
SELECT name FROM businesses WHERE name LIKE '%PIZZA%' LIMIT 5;
That change in the query nets us places as exotic as
'AT&T -Room 2130 Derby Grill/Port Walk Pizza '.
To answer our original question - how many pizza places are in San Francisco? - let's use a
COUNT function just to get the number of results:
SELECT COUNT(*) FROM businesses WHERE name LIKE '%PIZZA%';
OK, now that we've seen a bit of the
businesses table, let's look at what the
inspections table has. Let's repeat some of our previous exploration technique and first find out how many rows the
inspections table has:
SELECT COUNT(*) FROM inspections;
More than 30,000, which is definitely more than the 7,000+ businesses. What does this mean exactly? Intuitively, we can guess that each business can have one or more inspections, i.e. a one-to-many relationship. Let's check out the first few rows of
SELECT * FROM inspections LIMIT 5;
|10||94||2014-07-29||Routine - Unscheduled|
|10||92||2014-01-14||Routine - Unscheduled|
|10||98||2012-11-14||Routine - Unscheduled|
business_id column should look familiar – it shows up in the
businesses table. Here, it appears to repeat itself, as if all of these inspections were on a single business. Another key takeaway is that not all inspections result in a
Score being given.
However, the main takeaway is that there appears to be no
name column in
inspections, so we can't do a simple query to find the inspections for a given
SELECT * FROM inspections WHERE name = 'PIZZA HUT';
Let's practice our aggregate queries, beyond
First, we don't even know how old or how freshly updated this dataset is. So let's use
MAX on the
SELECT MIN(date), MAX(date) FROM inspections;
Seeing how I'm writing this on October 20, 2015, i.e.
2015-10-20, I'm not sure how there's already an inspection with the date of
2015-12-30. So right away, we've learned that the data is not in perfect shape. Or that our assumptions – i.e. every record in
inspections corresponds to an inspection that has actually occurred.
I'm curious to what that is, so let's just do a plain
WHERE using the result from the previous query:
SELECT * FROM inspections WHERE date = '2015-12-30';
|34181||2015-12-30||Routine - Scheduled|
Uh OK, looks like events that are "Scheduled" are in this table. Let's do one more query and see how many future events there are:
SELECT * FROM inspections WHERE date > '2015-10-20';
|34181||2015-12-30||Routine - Scheduled|
OK, only two events are actually in the future. You'd think there'd be more than that in all of the city of San Francisco…so this is the kind of thing we would ask a public information officer about it. My bet is that it's a typo, but who knows…
Let's focus on the score of these inspections, notably, their range and average. We can start by using the average, minimum, and maximum functions:
SELECT AVG(Score), MAX(Score), MIN(Score) FROM inspections;
Let's throw in aliases and a rounding function to make the headers more attractive:
SELECT ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections;
We saw that there were multiple kinds of reports. Do they all have scores? An easy way to find out is to include the
type column in the output and
GROUP BY it. Let's throw in a
COUNT function to get a count of report types, too:
With the use of the GROUP BY clause, we gain the ability to aggregate our data based on values in a given column or columns. At the very least, this let’s us count the number of unique values in that column.
SELECT type, COUNT(*) as total_count, ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections GROUP BY type;
|Administrative or Document Review||6|
|Emergency Response Investigation||1|
|Foodborne Illness Investigation||104|
|Non-inspection site visit||1045|
|Routine - Scheduled||87||86.0||86||86|
|Routine - Unscheduled||17678||92.0||100||42|
As it turns out, there are only 3 types of reports that are scored:
Routine - Scheduled
Routine - Unscheduled
We can re-jigger our previous query to only include reports that have an actual
This tutorial is from last year’s class. Even though it uses MySQL, the syntax is pretty much the same.
SELECT type, COUNT(*) as total_count, ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections WHERE Score IS NOT NULL GROUP BY type;
|Routine - Scheduled||1||86.0||86||86|
|Routine - Unscheduled||17375||92.0||100||42|
1 report of the non-routine-unscheduled type has a score. That could be a data entry error. Either way, grouping by
type is not particularly interesting.
So let's group by time period. Let's get the aggregate tally in the previous query, but for each different year. There is no year column so we have to make our own from the
date column. I'll just go with the easy
Let's test it out by itself:
SELECT SUBSTR(date, 1, 4) AS year FROM inspections LIMIT 5;
That works. Let's do a group count to get a breakdown of inspections by year, because we don't even know how far back the records go:
SELECT SUBSTR(date, 1, 4) AS year, COUNT(*) FROM inspections GROUP BY year;
We know from a previous query that the earliest
2012-10-01, so it makes sense that
2012 is short. And
2015 has two months yet to go (as of publication date). So it looks like the SF health department averages roughly 10,000 inspection activities a year.
Now let's see if the average or range of scores has changed per given year:
SELECT SUBSTR(date, 1, 4) AS year, COUNT(*) as total_count, ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections WHERE Score IS NOT NULL GROUP BY year;
Looks like things are pretty boring with a year-by-year analysis. We could try a grouping by year and month to see if there are any interesting peaks or valleys (i.e. holiday seasons):
SELECT SUBSTR(date, 1, 7) AS year_month, COUNT(*) as total_count, ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections WHERE Score IS NOT NULL GROUP BY year_month;
Without graphing the data, we can't tell if there are any noticeable cyclical trends. Though it does seem that the months of December have significantly fewer inspections.
How about the day of the week? We might suspect that inspections on weekends, or maybe even on a Monday, can differ based on how on top of things a restaurant is.
This requires use of SQLite's
STRFTIME function, which, if you've never used before…it's one that you gradually learn, after much effort. I'll just show the result with
STRFTIME and passing in
'%w' as the first argument:
SELECT STRFTIME('%w', date) as week_day, COUNT(*) as total_count, ROUND(AVG(Score)) AS avg_score, MAX(Score) AS max_score, MIN(Score) AS min_score FROM inspections WHERE Score IS NOT NULL GROUP BY week_day;
0 stand for? That is, does the beginning of the week start on the Monday or Sunday? (the definition differs between software and languages, which yes, is very annoying).
We could look it up, but I think it's safe to say that
0 stands for Sunday, given how few inspections happen then. Likewise, a weekday of
6 seems to correspond to Saturday.
Let's go back to the
businesses table. It contains human-readable identifiers for places, such as the restaurant's
inspections table contains important information about how those restaurants have performed in health inspections.
Let's pretend we only care about In-N-Out Burger (BTW, you can see its inspection history on the SF health department's webpage). I pick In-N-Out Burger because I know there is only one in all of San Francisco and I love In-N-Out Burger, as do all great people.
This is the query I would use to get the relevant fields from
businesses: name, address, latitude, longitude, and business_id:
SELECT business_id, name, address, latitude, longitude FROM businesses WHERE name LIKE 'In-N-Out Burger';
|5104||In-N-Out Burger||333 Jefferson St||37.807923||-122.417887|
We can use the result (i.e.
business_id = 5104) to retrieve all of the inspections associated with In-N-Out. When it comes to the
inspections table, we care about the
date of the inspection, the
type of inspection, and when it happened:
SELECT date, type, Score FROM inspections WHERE business_id = '5104';
|2012-12-24||Routine - Unscheduled||100|
|2013-10-10||Routine - Unscheduled||100|
|2014-04-03||Routine - Unscheduled||98|
|2015-05-20||Routine - Unscheduled||100|
Looks like In-N-Out did pretty good, though apparently someone had something to complain about…
But one of the inspections yielded a score of less than a perfect
100, which implies that some problem was found.
But how do we find those specific problems found in a given inspection? That's where the
violations table comes in.
OK, we know the drill. First, how many rows are there?
SELECT COUNT(*) FROM violations;
Roughly 44,000, which means on average, there are a few more than 1 violation per inspection.
What are the data columns in the
SELECT * FROM violations LIMIT 3;
|10||2014-01-14||103154||Low Risk||Unclean or degraded floors walls or ceilings|
|10||2014-01-14||103145||Low Risk||Improper storage of equipment utensils or linens|
|10||2014-01-14||103119||Moderate Risk||Inadequate and inaccessible handwashing facilities|
I'll skip the exploration via aggregates – let's pretend I need to know, in a hurry, what sin In-N-Out has committed. How do we match up violation to business? Looks like
violations also keeps track of
business_id; modifying our previous query using
SELECT date, risk_category, description FROM violations WHERE business_id = '5104';
|2014-04-03||Low Risk||Unapproved or unmaintained equipment or utensils|
…Meh…? Well good to know it's not vermin. Again, you can check out In-N-Out's record on the SF health data site.
All this manual querying makes it difficult to do anything more than just a one-by-one lookup. That's where joins come in.
This tutorial is from last year’s class. Even though it uses MySQL, the syntax is pretty much the same.
SELECT businesses.name AS biz_name, businesses.address AS biz_address, inspections.Score AS score, CAST(inspections.date AS CHAR) AS inspection_date, violations.description AS violation_description FROM businesses INNER JOIN inspections ON businesses.business_id = inspections.business_id INNER JOIN violations ON inspections.business_id = violations.business_id AND inspections.date = violations.date WHERE inspection_date > "2014" AND biz_name LIKE '%STARBUCKS%' AND score < 90 AND violations.risk_category = 'High Risk' ORDER BY score;
|Starbucks Coffee Co||0264 Kearny St||85||2014-01-03||Improper cooling methods|
|Starbucks Coffee||201 Powell St||87||2015-03-17||Unclean hands or improper use of gloves|
|STARBUCKS COFFEE #9219||333 MARKET St||89||2014-10-03||High risk food holding temperature|
|STARBUCKS COFFEE CO||1231 MARKET St||89||2014-12-30||Unclean hands or improper use of gloves|
|STARBUCKS||350 RHODE ISLAND||89||2014-03-19||No hot water or running water|
SELECT (CASE WHEN name LIKE '%STARBUCKS%' THEN 'Starbucks' ELSE 'Peets' END) AS "company", address, business_id FROM businesses WHERE name LIKE '%STARBUCKS%' OR name LIKE '%PEET%'