The SF health dataset

Table of contents

If you're struggling with learning SQL, it helps to take a step back and think about why we're even learning it in the first place. This guide is meant to be a walkthrough of how I think when exploring a dataset and how that translate into actual SQL commands.

So treat this as an iterative exploration of the data, starting with relatively simple syntax to find the simple things we first want to know about.

Throughout this guide, I'll try to link to relevant tutorials. Try to follow this walkthrough and think about how to solve some of the questions I pose, but feel free to check the tutorials for more detailed explanations.

The syntax for retrieving and displaying data from a SQLite tables

How to calculate sum, average, and other aggregates with the GROUP BY clause.

What's in the data?

The San Francisco Department of Health is responsible for food facility inspections. Here is the landing page for the inspections program, which includes a chart describing their scoring system: 90 to 100 points is "Good". Getting below a 85 means there were probably "several high-risk violations".

If you look closely, you'll see a link to a public-facing searchable database:

https://101g-xnet.sfdph.org:8443/ords/f?p=132:1

Yeah, that URL isn't particularly human-friendly. But the search is straightforward. Here's the detail page for a particular restaurant:

imgur

Getting the data

To download the raw data, they host a ZIP file containing 3 files.

If you want to skip the trouble of importing the CSV data into SQLite, feel free to use my database file here:

What's in the data

The SQLite database contains 3 tables from the 3 CSV files:

Here's a PDF user manual of what's in the data.

What's in the table of businesses?

So what should we first ask about a new database? No matter what table we start with, it's always worth asking: how many rows are in the table?.

If we start with the businesses table, the answer to this query is ostensibly a rough estimate of the restaurants, eateries, and other food-serving establishments in all of San Francisco. And just how many is that, exactly? 5,000? 25,000? Knowing even a ballpark estimate will quickly lead to other questions.

So let's start with a SELECT and COUNT.

A sidenote about GUIs and conveniences

If you're using a client such as DB Browser for SQLite, then you're used to the convenience of its data navigating features. For you, finding out how many rows in a given table is just a matter of pressing the Browse Data function and looking at the big number in the lower left of the window:

image

So why bother writing a query to count rows for a table?

Don't get me wrong; DB Browser is great (and generously given away via open source). But it's just one tool and one approach to data work –specifically, to allow spreadsheet-users to interact with SQL without having to learn complicated SQL commands.

We also crave simplicity, but because we aspire to do complicated and intensive data analysis, we should gravitate towards being able to directly command our computers (or, in this case, a database).

What does directly commanding our SQLite database look like? Well, you can always work straight from the command-line. It's just you, and the interpreter waiting for you to ask it a question:

blue cli of fun

Actually, this is probably not the time to discover the CLI – at least when you're trying to learn SQL. But just keep in mind that it is possible to move past the GUI, and that comes both with the familiarity of typing things out and being able to think about how to explore the data, regardless of interface.

Let's just SELECT and COUNT

So, back to getting a count of rows:

SELECT COUNT(*)
FROM businesses;

How to specify the quantity and arrangement of data rows returned by the SQL database.

A few keywords to prettify and organize the results of SQL queries.

Let's look at a few of the rows:

SELECT *
FROM businesses
LIMIT 5;

The output is too wide to comfortably show on this page so I'll do a PRAGMA table_info(businesses); call to get the schema layout:

cid name type notnull dflt_value pk
0 business_id INTEGER 1   0
1 name VARCHAR 0   0
2 address VARCHAR 0   0
3 city VARCHAR 0   0
4 postal_code INTEGER 0   0
5 latitude FLOAT 0   0
6 longitude FLOAT 0   0
7 phone_number BIGINT 0   0
8 TaxCode VARCHAR 0   0
9 business_certificate INTEGER 0   0
10 application_date DATE 0   0
11 owner_name VARCHAR 0   0
12 owner_address VARCHAR 0   0
13 owner_city VARCHAR 0   0
14 owner_state VARCHAR 0   0
15 owner_zip VARCHAR 0   0

Most of the columns aren't that interesting, other than perhaps address, latitude, and longitude, if we want to later map things. The owner_address is potentially interesting. But for now, let's just stick to the name column:

SELECT name
FROM businesses
LIMIT 5;

I'm not sure why the first entry contains a blank name…but it's a sign for things to come, in terms of discovering just how messy this dataset can be…

name
 
100% Dessert Cafe
1001 Castro St
101 Super Mart, Inc.
1058 Hoagie

How to retrieve rows based on whether they match specified values.

The WHERE clause is where SQL starts to become very interesting for searching large datasets. With WHERE, we can now filter the results according to conditions we explicitly set.

Pizza, pizza?

How many places serve up pizza in San Francisco? One way to get a lower-bound estimate is to filter for all businesses with 'PIZZA' in the name:

SELECT name
FROM businesses
WHERE name = 'PIZZA'
LIMIT 10;

The above query returns 0 results because the query is limited to finding all places in which the name is literally'PIZZA', which doesn't include "Pizza Hut" or even "Pizza" (i.e. non-uppercase).

Real-world data is often messy, so we need messy ways of matching values, because matching only on exact values can unintentionally filter out relevant data.

This is where the LIKE clause and wildcards come in handy. We can specify all names that begin with 'PIZZA' and not worry about the capitalization:

SELECT name
FROM businesses
WHERE name LIKE 'PIZZA%'
LIMIT 10;
name
PIZZA HUT
PIZZA HUT #758280
PIZZA INFERNO
Pizza Express
Pizza Joint
Pizza Joint
Pizza Joint & Grill
Pizza Joint and Grill
Pizza Orgasmica
Pizza Orgasmica & Brewing Co.

We get everything from 'PIZZA HUT' to 'Pizza Orgasmica & Brewing Co.'. But our wildcard will leave out places in which 'pizza' is at the middle or end of a name, such as 'Dominos Pizza'. That's easy to fix:

SELECT name
FROM businesses
WHERE name LIKE '%PIZZA%'
LIMIT 5;

That change in the query nets us places as exotic as 'AT&T -Room 2130 Derby Grill/Port Walk Pizza [145178]'.

To answer our original question - how many pizza places are in San Francisco? - let's use a COUNT function just to get the number of results:

SELECT COUNT(*)
FROM businesses
WHERE name LIKE '%PIZZA%';

Inspecting the inspections table

OK, now that we've seen a bit of the businesses table, let's look at what the inspections table has. Let's repeat some of our previous exploration technique and first find out how many rows the inspections table has:

SELECT COUNT(*)
FROM inspections;

More than 30,000, which is definitely more than the 7,000+ businesses. What does this mean exactly? Intuitively, we can guess that each business can have one or more inspections, i.e. a one-to-many relationship. Let's check out the first few rows of inspections:

SELECT *
FROM inspections
LIMIT 5;
business_id Score date type
10   2014-08-07 Reinspection/Followup
10 94 2014-07-29 Routine - Unscheduled
10   2014-01-24 Reinspection/Followup
10 92 2014-01-14 Routine - Unscheduled
10 98 2012-11-14 Routine - Unscheduled

OK, that business_id column should look familiar – it shows up in the businesses table. Here, it appears to repeat itself, as if all of these inspections were on a single business. Another key takeaway is that not all inspections result in a Score being given.

However, the main takeaway is that there appears to be no name column in inspections, so we can't do a simple query to find the inspections for a given 'PIZZA HUT':

SELECT *
FROM inspections
WHERE name = 'PIZZA HUT';

Practicing aggregates

Let's practice our aggregate queries, beyond COUNT.

How to calculate sum, average, and other aggregates with the GROUP BY clause.

First, we don't even know how old or how freshly updated this dataset is. So let's use MIN and MAX on the date field

SELECT MIN(date), MAX(date)
FROM inspections;
MIN(date) MAX(date)
2012-10-01 2015-12-30

Seeing how I'm writing this on October 20, 2015, i.e. 2015-10-20, I'm not sure how there's already an inspection with the date of 2015-12-30. So right away, we've learned that the data is not in perfect shape. Or that our assumptions – i.e. every record in inspections corresponds to an inspection that has actually occurred.

I'm curious to what that is, so let's just do a plain SELECT and WHERE using the result from the previous query:

SELECT *
FROM inspections
WHERE date = '2015-12-30';
business_id Score date type
34181   2015-12-30 Routine - Scheduled
       

Uh OK, looks like events that are "Scheduled" are in this table. Let's do one more query and see how many future events there are:

SELECT *
FROM inspections
WHERE date > '2015-10-20';
business_id Score date type
76245   2015-11-20 New Ownership
34181   2015-12-30 Routine - Scheduled

OK, only two events are actually in the future. You'd think there'd be more than that in all of the city of San Francisco…so this is the kind of thing we would ask a public information officer about it. My bet is that it's a typo, but who knows…

Calculating the characteristics of a dataset

Let's focus on the score of these inspections, notably, their range and average. We can start by using the average, minimum, and maximum functions:

SELECT AVG(Score),
  MAX(Score),
  MIN(Score)
FROM inspections;
AVG(Score) MAX(Score) MIN(Score)
91.5912412959659 100 42

Let's throw in aliases and a rounding function to make the headers more attractive:

SELECT ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections;
avg_score max_score min_score
92.0 100 42

We saw that there were multiple kinds of reports. Do they all have scores? An easy way to find out is to include the type column in the output and GROUP BY it. Let's throw in a COUNT function to get a count of report types, too:

With the use of the GROUP BY clause, we gain the ability to aggregate our data based on values in a given column or columns. At the very least, this let’s us count the number of unique values in that column.

SELECT 
  type,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
GROUP BY type;
type total_count avg_score max_score min_score
Administrative or Document Review 6      
Complaint 1894      
Complaint Reinspection/Followup 200      
Emergency Response Investigation 1      
Foodborne Illness Investigation 104      
Multi-agency Investigation 4      
New Construction 1290      
New Ownership 1917      
Non-inspection site visit 1045      
Reinspection/Followup 6722 85.0 85 85
Routine - Scheduled 87 86.0 86 86
Routine - Unscheduled 17678 92.0 100 42
Special Event 2      

As it turns out, there are only 3 types of reports that are scored:

We can re-jigger our previous query to only include reports that have an actual Score:

The null state in databases | fall2014.padjo.org

This tutorial is from last year’s class. Even though it uses MySQL, the syntax is pretty much the same.

SELECT 
  type,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY type;
type total_count avg_score max_score min_score
Reinspection/Followup 1 85.0 85 85
Routine - Scheduled 1 86.0 86 86
Routine - Unscheduled 17375 92.0 100 42

Strangely, only 1 report of the non-routine-unscheduled type has a score. That could be a data entry error. Either way, grouping by type is not particularly interesting.

Create a substring to group by

So let's group by time period. Let's get the aggregate tally in the previous query, but for each different year. There is no year column so we have to make our own from the date column. I'll just go with the easy SUBSTR function:

Think of these as spreadsheet functions.

Let's test it out by itself:

SELECT SUBSTR(date, 1, 4) AS year
FROM inspections
LIMIT 5;

That works. Let's do a group count to get a breakdown of inspections by year, because we don't even know how far back the records go:

SELECT SUBSTR(date, 1, 4) AS year,
  COUNT(*)
FROM inspections
GROUP BY year;
year COUNT(*)
2012 2929
2013 10789
2014 10820
2015 6412

We know from a previous query that the earliest date is 2012-10-01, so it makes sense that 2012 is short. And 2015 has two months yet to go (as of publication date). So it looks like the SF health department averages roughly 10,000 inspection activities a year.

Now let's see if the average or range of scores has changed per given year:

SELECT
  SUBSTR(date, 1, 4) AS year,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY year;
year total_count avg_score max_score min_score
2012 1477 92.0 100 52
2013 6077 91.0 100 42
2014 6309 92.0 100 46
2015 3514 92.0 100 54

Looks like things are pretty boring with a year-by-year analysis. We could try a grouping by year and month to see if there are any interesting peaks or valleys (i.e. holiday seasons):

SELECT
  SUBSTR(date, 1, 7) AS year_month,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY year_month;
year_month total_count avg_score max_score min_score
2012-10 636 93.0 100 55
2012-11 451 91.0 100 55
2012-12 390 91.0 100 52
2013-01 539 91.0 100 58
2013-02 464 92.0 100 60
2013-03 533 92.0 100 54
2013-04 596 92.0 100 56
2013-05 503 91.0 100 54
2013-06 478 91.0 100 54
2013-07 416 90.0 100 49
2013-08 549 93.0 100 55
2013-09 553 91.0 100 51
2013-10 584 92.0 100 42
2013-11 478 91.0 100 58
2013-12 384 91.0 100 55
2014-01 456 91.0 100 62
2014-02 681 92.0 100 59
2014-03 722 92.0 100 67
2014-04 827 93.0 100 60
2014-05 640 91.0 100 46
2014-06 738 92.0 100 53
2014-07 281 90.0 100 63
2014-08 399 91.0 100 63
2014-09 500 92.0 100 55
2014-10 457 91.0 100 47
2014-11 361 91.0 100 56
2014-12 247 89.0 100 54
2015-01 380 91.0 100 60
2015-02 385 92.0 100 57
2015-03 533 92.0 100 55
2015-04 520 92.0 100 54
2015-05 376 91.0 100 62
2015-06 388 92.0 100 64
2015-07 267 91.0 100 54
2015-08 380 92.0 100 60
2015-09 285 92.0 100 59

Without graphing the data, we can't tell if there are any noticeable cyclical trends. Though it does seem that the months of December have significantly fewer inspections.

How about the day of the week? We might suspect that inspections on weekends, or maybe even on a Monday, can differ based on how on top of things a restaurant is.

This requires use of SQLite's STRFTIME function, which, if you've never used before…it's one that you gradually learn, after much effort. I'll just show the result with STRFTIME and passing in '%w' as the first argument:

SELECT
  STRFTIME('%w', date) as week_day,
  COUNT(*) as total_count,
  ROUND(AVG(Score)) AS avg_score,
  MAX(Score) AS max_score,
  MIN(Score) AS min_score
FROM inspections
WHERE Score IS NOT NULL
GROUP BY week_day;
week_day total_count avg_score max_score min_score
0 351 97.0 100 77
1 2982 91.0 100 46
2 3868 91.0 100 53
3 3955 91.0 100 50
4 3285 92.0 100 42
5 2691 93.0 100 51
6 245 97.0 100 79

What does 0 stand for? That is, does the beginning of the week start on the Monday or Sunday? (the definition differs between software and languages, which yes, is very annoying).

We could look it up, but I think it's safe to say that 0 stands for Sunday, given how few inspections happen then. Likewise, a weekday of 6 seems to correspond to Saturday.

Manually querying across tables

Let's go back to the businesses table. It contains human-readable identifiers for places, such as the restaurant's name and address. The inspections table contains important information about how those restaurants have performed in health inspections.

Let's pretend we only care about In-N-Out Burger (BTW, you can see its inspection history on the SF health department's webpage). I pick In-N-Out Burger because I know there is only one in all of San Francisco and I love In-N-Out Burger, as do all great people.

This is the query I would use to get the relevant fields from businesses: name, address, latitude, longitude, and business_id:

SELECT business_id, name, address, latitude, longitude
FROM businesses
WHERE name LIKE 'In-N-Out Burger';
business_id name address latitude longitude
5104 In-N-Out Burger 333 Jefferson St 37.807923 -122.417887

We can use the result (i.e. business_id = 5104) to retrieve all of the inspections associated with In-N-Out. When it comes to the inspections table, we care about the date of the inspection, the type of inspection, and when it happened:

SELECT date, type, Score
FROM inspections
WHERE business_id = '5104';
date type Score
2012-12-24 Routine - Unscheduled 100
2013-10-10 Routine - Unscheduled 100
2014-04-03 Routine - Unscheduled 98
2014-04-03 Complaint  
2015-05-20 Routine - Unscheduled 100

Looks like In-N-Out did pretty good, though apparently someone had something to complain about…

But one of the inspections yielded a score of less than a perfect 100, which implies that some problem was found.

But how do we find those specific problems found in a given inspection? That's where the violations table comes in.

The violations table

OK, we know the drill. First, how many rows are there?

SELECT COUNT(*)
FROM violations;

Roughly 44,000, which means on average, there are a few more than 1 violation per inspection.

What are the data columns in the violations table?

SELECT *
FROM violations
LIMIT 3;
business_id date ViolationTypeID risk_category description
10 2014-01-14 103154 Low Risk Unclean or degraded floors walls or ceilings
10 2014-01-14 103145 Low Risk Improper storage of equipment utensils or linens
10 2014-01-14 103119 Moderate Risk Inadequate and inaccessible handwashing facilities
         

Manually querying violations for a given restaurant

I'll skip the exploration via aggregates – let's pretend I need to know, in a hurry, what sin In-N-Out has committed. How do we match up violation to business? Looks like violations also keeps track of business_id; modifying our previous query using inspections:

SELECT date, risk_category, description
FROM violations
WHERE business_id = '5104';
date risk_category description
2014-04-03 Low Risk Unapproved or unmaintained equipment or utensils

Meh…? Well good to know it's not vermin. Again, you can check out In-N-Out's record on the SF health data site.

Joins

All this manual querying makes it difficult to do anything more than just a one-by-one lookup. That's where joins come in.

Inner Join in SQL | fall2014.padjo.org

This tutorial is from last year’s class. Even though it uses MySQL, the syntax is pretty much the same.

Ignore all this stuff below for now!

Dirtiest Starbucks?

SELECT businesses.name AS biz_name, 
  businesses.address AS biz_address,
  inspections.Score AS score,
  CAST(inspections.date AS CHAR)  AS inspection_date,
  violations.description AS violation_description
FROM businesses
INNER JOIN inspections
    ON businesses.business_id = inspections.business_id
INNER JOIN violations
    ON inspections.business_id = violations.business_id 
       AND inspections.date = violations.date 

WHERE inspection_date > "2014" 
  AND biz_name LIKE '%STARBUCKS%'
  AND score < 90
  AND violations.risk_category = 'High Risk'
ORDER BY score;
biz_name biz_address score inspection_date violation_description
Starbucks Coffee Co 0264 Kearny St 85 2014-01-03 Improper cooling methods
Starbucks Coffee 201 Powell St 87 2015-03-17 Unclean hands or improper use of gloves
STARBUCKS COFFEE #9219 333 MARKET St 89 2014-10-03 High risk food holding temperature
STARBUCKS COFFEE CO 1231 MARKET St 89 2014-12-30 Unclean hands or improper use of gloves
STARBUCKS 350 RHODE ISLAND 89 2014-03-19 No hot water or running water

Applying a categorization to group by

SELECT 
  (CASE WHEN  
    name LIKE '%STARBUCKS%'
    THEN 'Starbucks'
    ELSE 'Peets' END) AS  "company",
  address, business_id

FROM businesses 
  
WHERE 
  name  LIKE '%STARBUCKS%' 
  OR name LIKE '%PEET%'