The week of Tuesday, October 13

Assignments

Baby steps with SQLite and baby names data

Due: Thursday, October 15
Points: 5

Practice SQL and find out more about your name using the Social Security Administration’s baby name data

SQL Basic Boot Camp

This week we'll be covering a lot of SQL syntax. It won't be much fun. To get a preview of what I want you to be able to do (soon): the SQL midterm from last year.

Relevant tutorials for this week:

New for Thursday:

Guest speaker for October 15: Phillip Reese

sac bee cover story

Phillip Reese is the computer-assisted reporting genius at the Sacramento Bee and one of my former colleagues. He and fellow Bee reporter Cynthia Hubert were finalists for the 2014 Pulitzer in Investigative Reporting for their reporting on a Las Vegas mental hospital that bused more than 1,500 psychiatric patients out to 48 states in 5 years.

Please read the following stories from that Pulitzer finalist series, as well as a couple of others from Reese's past work:

Basic use of SQL to just get the data…

TODO: Move this to its own tutorial

…and then get the h*ll out and into a more usable data environment, such as a spreadsheet.

In the section below, I describe how to extract all the reported robberies in San Francisco from 2011 to 2014, starting with the database of all the reports:

With the San Francisco Police incident data, we were restricted to looking at it in bite-sized chunks because it doesn't all fit into a spreadsheet.

However, with SQL, we can now just download ALL the San Francisco Police incident data as one big CSV (this ends up being 300+MB)

Then import as CSV into your SQL client (this can admittedly be difficult depending on the client and its interface).

At this point, we don't know enough how to work with date and time values properly in SQLite. So I recommend treating Date as merely a text field, excerpting it to get the text that represents a year (i.e. '2014'). And then filtering on that substring.

First, we figure out how to even select the year from the Date field. Notice how I use LIMIT so that we don't run the query on every row.

SELECT SUBSTR(Date, 7, 4) FROM incidents LIMIT 5

Test it out on a condition of being equal to '2015'

SELECT * FROM incidents 
WHERE SUBSTR(Date, 7, 4) = '2014'
LIMIT 5

Then get all the '2014' rows by removing the LIMIT clause:

SELECT * FROM incidents 
WHERE SUBSTR(Date, 7, 4) = '2014'

Use BETWEEN to find incidents within a range of years:

SELECT * FROM incidents 
WHERE SUBSTR(Date, 7, 4) BETWEEN  '2011' AND '2014'

Filter by a Category

SELECT * FROM incidents 
WHERE 
  SUBSTR(Date, 7, 4) BETWEEN  '2011' AND '2014'
  AND Category = "ROBBERY"

With as much SQL as we know now, it's just easier to export this and import into a proper spreadsheet.

As an aside, it's fun to explore Google Trends, which gives insight to the kinds of things people across the world are searching for on Google.

The following chart purports to show the relative frequency of searches for "news" and "sex", respectively, in the United States, according to Eastern Time (the filtering by world region and time zone are important):

I also mention a great talk by Peter Norvig, Google's research director, the "Unreasonable Effectiveness of Data"

Here it is in paper form. His chapter on n-grams in the Beautiful Data book is also a great, accessible read, if you're curious to how Google can figure out what people actually mean even when they make typos.

References

SQL tutorials from Fall 2014 | fall2014.padjo.org

These tutorials cover the same ground as this year’s SQL tutorials, just with a different dataset.