Data to use: San Francisco restaurants inspection data

What's in the data?
- Getting the data
- What's in the data
What's in the table of businesses?
- A sidenote about GUIs and conveniences
- Let's just SELECT and COUNT
  - Pizza, pizza?
Inspecting the inspections table
The violations table
- Manually querying violations for a given restaurant
Joins
Ignore all this stuff below for now!
- Dirtiest Starbucks?
- Applying a categorization to group by

If you're struggling with learning SQL, it helps to take a step back and think about why we're even learning it in the first place. This guide is meant to be a walkthrough of how I think when exploring a dataset and how that translate into actual SQL commands.

So treat this as an iterative exploration of the data, starting with relatively simple syntax to find the simple things we first want to know about.

Throughout this guide, I'll try to link to relevant tutorials. Try to follow this walkthrough and think about how to solve some of the questions I pose, but feel free to check the tutorials for more detailed explanations.

SELECTing rows FROM data tables

The syntax for retrieving and displaying data from a SQLite tables

Aggregate functions in SQL

How to calculate sum, average, and other aggregates with the GROUP BY clause.

What's in the data?

The San Francisco Department of Health is responsible for food facility inspections. Here is the landing page for the inspections program, which includes a chart describing their scoring system: 90 to 100 points is "Good". Getting below a 85 means there were probably "several high-risk violations".

If you look closely, you'll see a link to a public-facing searchable database:

https://101g-xnet.sfdph.org:8443/ords/f?p=132:1

Yeah, that URL isn't particularly human-friendly. But the search is straightforward. Here's the detail page for a particular restaurant:

imgur

Getting the data

To download the raw data, they host a ZIP file containing 3 files.

If you want to skip the trouble of importing the CSV data into SQLite, feel free to use my database file here:

sf_food_program_db.sqlite

What's in the data

The SQLite database contains 3 tables from the 3 CSV files:

businesses - unique identifiers, names, addresses, geospatial coordinates, and other data related to each individual business
inspections - contains a row for each inspection, including the business_id of the inspected business, the score it received (if applicable), the date of the inspection, and the reason for the inspection
violations - contains a row for every violation, including the severity and a short description of the violation. It contains business_id and an inspection date which can presumably be used to join against the inspections data.

Here's a PDF user manual of what's in the data.

What's in the table of businesses?

So what should we first ask about a new database? No matter what table we start with, it's always worth asking: how many rows are in the table?.

If we start with the businesses table, the answer to this query is ostensibly a rough estimate of the restaurants, eateries, and other food-serving establishments in all of San Francisco. And just how many is that, exactly? 5,000? 25,000? Knowing even a ballpark estimate will quickly lead to other questions.

So let's start with a SELECT and COUNT.

A sidenote about GUIs and conveniences

If you're using a client such as DB Browser for SQLite, then you're used to the convenience of its data navigating features. For you, finding out how many rows in a given table is just a matter of pressing the Browse Data function and looking at the big number in the lower left of the window:

So why bother writing a query to count rows for a table?

Don't get me wrong; DB Browser is great (and generously given away via open source). But it's just one tool and one approach to data work –specifically, to allow spreadsheet-users to interact with SQL without having to learn complicated SQL commands.

We also crave simplicity, but because we aspire to do complicated and intensive data analysis, we should gravitate towards being able to directly command our computers (or, in this case, a database).

What does directly commanding our SQLite database look like? Well, you can always work straight from the command-line. It's just you, and the interpreter waiting for you to ask it a question:

blue cli of fun

Actually, this is probably not the time to discover the CLI – at least when you're trying to learn SQL. But just keep in mind that it is possible to move past the GUI, and that comes both with the familiarity of typing things out and being able to think about how to explore the data, regardless of interface.

Let's just SELECT and COUNT

So, back to getting a count of rows:

SELECT COUNT(*)
FROM businesses;

LIMIT and ORDER BY in SQL Queries

How to specify the quantity and arrangement of data rows returned by the SQL database.

Limiting and ordering SQL results [Fall 2014] | fall2014.padjo.org

A few keywords to prettify and organize the results of SQL queries.

Let's look at a few of the rows:

SELECT *
FROM businesses
LIMIT 5;

The output is too wide to comfortably show on this page so I'll do a PRAGMA table_info(businesses); call to get the schema layout:

cid	name	type	notnull
0	business_id	INTEGER	1
1	name	VARCHAR	0
2	address	VARCHAR	0
3	city	VARCHAR	0
4	postal_code	INTEGER	0
5	latitude	FLOAT	0
6	longitude	FLOAT	0
7	phone_number	BIGINT	0
8	TaxCode	VARCHAR	0
9	business_certificate	INTEGER	0
10	application_date	DATE	0
11	owner_name	VARCHAR	0
12	owner_address	VARCHAR	0
13	owner_city	VARCHAR	0
14	owner_state	VARCHAR	0
15	owner_zip	VARCHAR	0

Most of the columns aren't that interesting, other than perhaps address, latitude, and longitude, if we want to later map things. The owner_address is potentially interesting. But for now, let's just stick to the name column:

SELECT name
FROM businesses
LIMIT 5;

I'm not sure why the first entry contains a blank name…but it's a sign for things to come, in terms of discovering just how messy this dataset can be…

name

100% Dessert Cafe
1001 Castro St
101 Super Mart, Inc.
1058 Hoagie

Using the WHERE clause to filter data in SQL

How to retrieve rows based on whether they match specified values.

Conditional filtering with WHERE [Fall 2014] | fall2014.padjo.org

The WHERE clause is where SQL starts to become very interesting for searching large datasets. With WHERE, we can now filter the results according to conditions we explicitly set.

Pizza, pizza?

How many places serve up pizza in San Francisco? One way to get a lower-bound estimate is to filter for all businesses with 'PIZZA' in the name:

SELECT name
FROM businesses
WHERE name = 'PIZZA'
LIMIT 10;

The above query returns 0 results because the query is limited to finding all places in which the name is literally'PIZZA', which doesn't include "Pizza Hut" or even "Pizza" (i.e. non-uppercase).

Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL

Real-world data is often messy, so we need messy ways of matching values, because matching only on exact values can unintentionally filter out relevant data.

This is where the LIKE clause and wildcards come in handy. We can specify all names that begin with 'PIZZA' and not worry about the capitalization:

SELECT name
FROM businesses
WHERE name LIKE 'PIZZA%'
LIMIT 10;

name
PIZZA HUT
PIZZA HUT #758280
PIZZA INFERNO
Pizza Express
Pizza Joint
Pizza Joint
Pizza Joint & Grill
Pizza Joint and Grill
Pizza Orgasmica
Pizza Orgasmica & Brewing Co.

We get everything from 'PIZZA HUT' to 'Pizza Orgasmica & Brewing Co.'. But our wildcard will leave out places in which 'pizza' is at the middle or end of a name, such as 'Dominos Pizza'. That's easy to fix:

SELECT name
FROM businesses
WHERE name LIKE '%PIZZA%'
LIMIT 5;

That change in the query nets us places as exotic as 'AT&T -Room 2130 Derby Grill/Port Walk Pizza [145178]'.

To answer our original question - how many pizza places are in San Francisco? - let's use a COUNT function just to get the number of results:

SELECT COUNT(*)
FROM businesses
WHERE name LIKE '%PIZZA%';

Inspecting the inspections table

OK, now that we've seen a bit of the businesses table, let's look at what the inspections table has. Let's repeat some of our previous exploration technique and first find out how many rows the inspections table has:

SELECT COUNT(*)
FROM inspections;

More than 30,000, which is definitely more than the 7,000+ businesses. What does this mean exactly? Intuitively, we can guess that each business can have one or more inspections, i.e. a one-to-many relationship. Let's check out the first few rows of inspections:

SELECT *
FROM inspections
LIMIT 5;

business_id	Score	date	type
10		2014-08-07	Reinspection/Followup
10	94	2014-07-29	Routine - Unscheduled
10		2014-01-24	Reinspection/Followup
10	92	2014-01-14	Routine - Unscheduled
10	98	2012-11-14	Routine - Unscheduled

OK, that business_id column should look familiar – it shows up in the businesses table. Here, it appears to repeat itself, as if all of these inspections were on a single business. Another key takeaway is that not all inspections result in a Score being given.

However, the main takeaway is that there appears to be no name column in inspections, so we can't do a simple query to find the inspections for a given 'PIZZA HUT':

SELECT *
FROM inspections
WHERE name = 'PIZZA HUT';

Practicing aggregates

Let's practice our aggregate queries, beyond COUNT.

Aggregate functions in SQL

How to calculate sum, average, and other aggregates with the GROUP BY clause.

First, we don't even know how old or how freshly updated this dataset is. So let's use MIN and MAX on the date field

SELECT MIN(date), MAX(date)
FROM inspections;

MIN(date)	MAX(date)
2012-10-01	2015-12-30

Seeing how I'm writing this on October 20, 2015, i.e. 2015-10-20, I'm not sure how there's already an inspection with the date of 2015-12-30. So right away, we've learned that the data is not in perfect shape. Or that our assumptions – i.e. every record in inspections corresponds to an inspection that has actually occurred.

I'm curious to what that is, so let's just do a plain SELECT and WHERE using the result from the previous query:

SELECT *
FROM inspections
WHERE date = '2015-12-30';

business_id	Score	date	type
34181		2015-12-30	Routine - Scheduled

Uh OK, looks like events that are "Scheduled" are in this table. Let's do one more query and see how many future events there are:

SELECT *
FROM inspections
WHERE date > '2015-10-20';

business_id	Score	date	type
76245		2015-11-20	New Ownership
34181		2015-12-30	Routine - Scheduled

OK, only two events are actually in the future. You'd think there'd be more than that in all of the city of San Francisco…so this is the kind of thing we would ask a public information officer about it. My bet is that it's a typo, but who knows…

Calculating the characteristics of a dataset

Let's focus on the score of these inspections, notably, their range and average. We can start by using the average, minimum, and maximum functions:

SELECT AVG(Score),
  MAX(Score),
  MIN(Score)
FROM inspections;

AVG(Score)	MAX(Score)	MIN(Score)
91.5912412959659	100	42

Let's throw in aliases and a rounding function to make the headers more attractive:

SELECT ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections;

avg_score	max_score	min_score
92.0	100	42

We saw that there were multiple kinds of reports. Do they all have scores? An easy way to find out is to include the type column in the output and GROUP BY it. Let's throw in a COUNT function to get a count of report types, too:

Using GROUP BY to create aggregates in SQL

With the use of the GROUP BY clause, we gain the ability to aggregate our data based on values in a given column or columns. At the very least, this let’s us count the number of unique values in that column.

SELECT 
  type,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
GROUP BY type;

type	total_count	avg_score	max_score	min_score
Administrative or Document Review	6
Complaint	1894
Complaint Reinspection/Followup	200
Emergency Response Investigation	1
Foodborne Illness Investigation	104
Multi-agency Investigation	4
New Construction	1290
New Ownership	1917
Non-inspection site visit	1045
Reinspection/Followup	6722	85.0	85	85
Routine - Scheduled	87	86.0	86	86
Routine - Unscheduled	17678	92.0	100	42
Special Event	2

As it turns out, there are only 3 types of reports that are scored:

Reinspection/Followup
Routine - Scheduled
Routine - Unscheduled

We can re-jigger our previous query to only include reports that have an actual Score:

The null state in databases | fall2014.padjo.org

This tutorial is from last year’s class. Even though it uses MySQL, the syntax is pretty much the same.

SELECT 
  type,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY type;

type	total_count	avg_score	max_score	min_score
Reinspection/Followup	1	85.0	85	85
Routine - Scheduled	1	86.0	86	86
Routine - Unscheduled	17375	92.0	100	42

Strangely, only 1 report of the non-routine-unscheduled type has a score. That could be a data entry error. Either way, grouping by type is not particularly interesting.

Create a substring to group by

So let's group by time period. Let's get the aggregate tally in the previous query, but for each different year. There is no year column so we have to make our own from the date column. I'll just go with the easy SUBSTR function:

Functions for transforming text and numbers in SQL

Think of these as spreadsheet functions.

Let's test it out by itself:

SELECT SUBSTR(date, 1, 4) AS year
FROM inspections
LIMIT 5;

That works. Let's do a group count to get a breakdown of inspections by year, because we don't even know how far back the records go:

SELECT SUBSTR(date, 1, 4) AS year,
  COUNT(*)
FROM inspections
GROUP BY year;

year	COUNT(*)
2012	2929
2013	10789
2014	10820
2015	6412

We know from a previous query that the earliest date is 2012-10-01, so it makes sense that 2012 is short. And 2015 has two months yet to go (as of publication date). So it looks like the SF health department averages roughly 10,000 inspection activities a year.

Now let's see if the average or range of scores has changed per given year:

SELECT
  SUBSTR(date, 1, 4) AS year,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY year;

year	total_count	avg_score	max_score	min_score
2012	1477	92.0	100	52
2013	6077	91.0	100	42
2014	6309	92.0	100	46
2015	3514	92.0	100	54

Looks like things are pretty boring with a year-by-year analysis. We could try a grouping by year and month to see if there are any interesting peaks or valleys (i.e. holiday seasons):

SELECT
  SUBSTR(date, 1, 7) AS year_month,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY year_month;

year_month	total_count	avg_score	max_score	min_score
2012-10	636	93.0	100	55
2012-11	451	91.0	100	55
2012-12	390	91.0	100	52
2013-01	539	91.0	100	58
2013-02	464	92.0	100	60
2013-03	533	92.0	100	54
2013-04	596	92.0	100	56
2013-05	503	91.0	100	54
2013-06	478	91.0	100	54
2013-07	416	90.0	100	49
2013-08	549	93.0	100	55
2013-09	553	91.0	100	51
2013-10	584	92.0	100	42
2013-11	478	91.0	100	58
2013-12	384	91.0	100	55
2014-01	456	91.0	100	62
2014-02	681	92.0	100	59
2014-03	722	92.0	100	67
2014-04	827	93.0	100	60
2014-05	640	91.0	100	46
2014-06	738	92.0	100	53
2014-07	281	90.0	100	63
2014-08	399	91.0	100	63
2014-09	500	92.0	100	55
2014-10	457	91.0	100	47
2014-11	361	91.0	100	56
2014-12	247	89.0	100	54
2015-01	380	91.0	100	60
2015-02	385	92.0	100	57
2015-03	533	92.0	100	55
2015-04	520	92.0	100	54
2015-05	376	91.0	100	62
2015-06	388	92.0	100	64
2015-07	267	91.0	100	54
2015-08	380	92.0	100	60
2015-09	285	92.0	100	59

Without graphing the data, we can't tell if there are any noticeable cyclical trends. Though it does seem that the months of December have significantly fewer inspections.

How about the day of the week? We might suspect that inspections on weekends, or maybe even on a Monday, can differ based on how on top of things a restaurant is.

This requires use of SQLite's STRFTIME function, which, if you've never used before…it's one that you gradually learn, after much effort. I'll just show the result with STRFTIME and passing in '%w' as the first argument:

SELECT
  STRFTIME('%w', date) as week_day,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY week_day;

week_day	total_count	avg_score	max_score	min_score
0	351	97.0	100	77
1	2982	91.0	100	46
2	3868	91.0	100	53
3	3955	91.0	100	50
4	3285	92.0	100	42
5	2691	93.0	100	51
6	245	97.0	100	79

What does 0 stand for? That is, does the beginning of the week start on the Monday or Sunday? (the definition differs between software and languages, which yes, is very annoying).

We could look it up, but I think it's safe to say that 0 stands for Sunday, given how few inspections happen then. Likewise, a weekday of 6 seems to correspond to Saturday.

Manually querying across tables

Let's go back to the businesses table. It contains human-readable identifiers for places, such as the restaurant's name and address. The inspections table contains important information about how those restaurants have performed in health inspections.

Let's pretend we only care about In-N-Out Burger (BTW, you can see its inspection history on the SF health department's webpage). I pick In-N-Out Burger because I know there is only one in all of San Francisco and I love In-N-Out Burger, as do all great people.

This is the query I would use to get the relevant fields from businesses: name, address, latitude, longitude, and business_id:

SELECT business_id, name, address, latitude, longitude
FROM businesses
WHERE name LIKE 'In-N-Out Burger';

business_id	name	address	latitude	longitude
5104	In-N-Out Burger	333 Jefferson St	37.807923	-122.417887

We can use the result (i.e. business_id = 5104) to retrieve all of the inspections associated with In-N-Out. When it comes to the inspections table, we care about the date of the inspection, the type of inspection, and when it happened:

SELECT date, type, Score
FROM inspections
WHERE business_id = '5104';

date	type	Score
2012-12-24	Routine - Unscheduled	100
2013-10-10	Routine - Unscheduled	100
2014-04-03	Routine - Unscheduled	98
2014-04-03	Complaint
2015-05-20	Routine - Unscheduled	100

Looks like In-N-Out did pretty good, though apparently someone had something to complain about…

But one of the inspections yielded a score of less than a perfect 100, which implies that some problem was found.

But how do we find those specific problems found in a given inspection? That's where the violations table comes in.

The violations table

OK, we know the drill. First, how many rows are there?

SELECT COUNT(*)
FROM violations;

Roughly 44,000, which means on average, there are a few more than 1 violation per inspection.

What are the data columns in the violations table?

SELECT *
FROM violations
LIMIT 3;

business_id	date	ViolationTypeID	risk_category	description
10	2014-01-14	103154	Low Risk	Unclean or degraded floors walls or ceilings
10	2014-01-14	103145	Low Risk	Improper storage of equipment utensils or linens
10	2014-01-14	103119	Moderate Risk	Inadequate and inaccessible handwashing facilities

Manually querying violations for a given restaurant

I'll skip the exploration via aggregates – let's pretend I need to know, in a hurry, what sin In-N-Out has committed. How do we match up violation to business? Looks like violations also keeps track of business_id; modifying our previous query using inspections:

SELECT date, risk_category, description
FROM violations
WHERE business_id = '5104';

date	risk_category	description
2014-04-03	Low Risk	Unapproved or unmaintained equipment or utensils

…Meh…? Well good to know it's not vermin. Again, you can check out In-N-Out's record on the SF health data site.

Joins

All this manual querying makes it difficult to do anything more than just a one-by-one lookup. That's where joins come in.

Inner Join in SQL | fall2014.padjo.org