These tutorials cover the same ground as this year’s SQL tutorials, just with a different dataset.
Practice SQL and find out more about your name using the Social Security Administration’s baby name data
This week we'll be covering a lot of SQL syntax. It won't be much fun. To get a preview of what I want you to be able to do (soon): the SQL midterm from last year.
Relevant tutorials for this week:
SELECTing rows FROM data tables – The syntax for retrieving and displaying data from a SQLite tables
LIMIT and ORDER BY in SQL Queries – How to specify the quantity and arrangement of data rows returned by the SQL database.
Using the WHERE clause to filter data in SQL – How to retrieve rows based on whether they match specified values.
More Boolean Expressions to Filter SQL Queries – How to filter SQL data using comparison operators, such as "greater than" and "not equal to". Mostly, this is a review of how tricky logical expressions can be.
New for Thursday:
Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL – Real-world data is often messy, so we need messy ways of matching values, because matching only on exact values can unintentionally filter out relevant data.
Functions for transforming text and numbers in SQL – Think of these as spreadsheet functions.
Aliasing Columns and Tables in SQL Databases – A short lesson on how to give human-readable names for otherwise messily-named values and identifiers.
Using GROUP BY to create aggregates in SQL – With the use of the GROUP BY clause, we gain the ability to aggregate our data based on values in a given column or columns. At the very least, this let's us count the number of unique values in that column.
Phillip Reese is the computer-assisted reporting genius at the Sacramento Bee and one of my former colleagues. He and fellow Bee reporter Cynthia Hubert were finalists for the 2014 Pulitzer in Investigative Reporting for their reporting on a Las Vegas mental hospital that bused more than 1,500 psychiatric patients out to 48 states in 5 years.
Please read the following stories from that Pulitzer finalist series, as well as a couple of others from Reese's past work:
TODO: Move this to its own tutorial
…and then get the h*ll out and into a more usable data environment, such as a spreadsheet.
In the section below, I describe how to extract all the reported robberies in San Francisco from 2011
to 2014
, starting with the database of all the reports:
With the San Francisco Police incident data, we were restricted to looking at it in bite-sized chunks because it doesn't all fit into a spreadsheet.
However, with SQL, we can now just download ALL the San Francisco Police incident data as one big CSV (this ends up being 300+MB)
Then import as CSV into your SQL client (this can admittedly be difficult depending on the client and its interface).
At this point, we don't know enough how to work with date and time values properly in SQLite. So I recommend treating Date
as merely a text field, excerpting it to get the text that represents a year (i.e. '2014'
). And then filtering on that substring.
First, we figure out how to even select the year from the Date
field. Notice how I use LIMIT
so that we don't run the query on every row.
SELECT SUBSTR(Date, 7, 4) FROM incidents LIMIT 5
Test it out on a condition of being equal to '2015'
SELECT * FROM incidents
WHERE SUBSTR(Date, 7, 4) = '2014'
LIMIT 5
Then get all the '2014'
rows by removing the LIMIT
clause:
SELECT * FROM incidents
WHERE SUBSTR(Date, 7, 4) = '2014'
Use BETWEEN
to find incidents within a range of years:
SELECT * FROM incidents
WHERE SUBSTR(Date, 7, 4) BETWEEN '2011' AND '2014'
Filter by a Category
SELECT * FROM incidents
WHERE
SUBSTR(Date, 7, 4) BETWEEN '2011' AND '2014'
AND Category = "ROBBERY"
With as much SQL as we know now, it's just easier to export this and import into a proper spreadsheet.
As an aside, it's fun to explore Google Trends, which gives insight to the kinds of things people across the world are searching for on Google.
The following chart purports to show the relative frequency of searches for "news" and "sex", respectively, in the United States, according to Eastern Time (the filtering by world region and time zone are important):
I also mention a great talk by Peter Norvig, Google's research director, the "Unreasonable Effectiveness of Data"
Here it is in paper form. His chapter on n-grams in the Beautiful Data book is also a great, accessible read, if you're curious to how Google can figure out what people actually mean even when they make typos.
These tutorials cover the same ground as this year’s SQL tutorials, just with a different dataset.