Assignment

Due: Thursday, October 15

Points: 5

Deliverables

Make a Google Document in your PADJO2015 folder named: PADJO2015-BABY-KANYE.

Then answer the questions in the problem set below

Requirements

The answers to each problem should include the SQL queries you used to get the data.
If the problem requires a visualization, embed a screenshot of the visualization and include the link to it.
You will have to create a separate spreadsheet to do the necessary pivot tables and charting for this exercise. But the document named PADJO2015-BABY-KANYE is the one that gets graded.

Problem set

(Note: this section just contains the problems to do. Keep reading until the end to see more instructions as well as an example in action)

Create a Google Document in your PADJO2015 folder. For each of these problems, paste the SQL query (or queries) needed to get the data. Then post a link and screenshot to a visualization, as required.

Find the year, and how many babies, for which your name was the most popular across the United States (i.e. with a state of USA). Write the query to show only the columns for year and the number of babies.
Using the totals table (this is the only problem that requires it), select all of the records for the state of California, for both boys and girls. Create a visualization in which we see the change-over-time in boy versus girl babies, according to the SSA data. Select only the rows pertaining to either boy or girl babies (there's a third value for sex) in the totals table.
For the U.S., select all rows with your name for all available years, export to CSV, import into a spreadsheet, and make a line chart.
Find the 5 states for which your name was most popular in the year 2014. Then, use that list of states in a second query to get all of the records for your name/sex, export it, create a pivot chart. Hint: this requires two separate queries.
Think of some other name (or names), tied to someone (or something) famous, that you hypothesize might have noticeably gone up or down over the years, either as a nationwide or just a statewide trend. Query the database for the data behind that name, export it to a spreadsheet, create a visualization (i.e. a histogram or line chart). The name "Elvis" during Elvis Presley's rise to fame would be interesting, for example (but do something else besides Elvis).

The tutorials

You'll probably need to at least read these tutorials on SQL:

SELECTing rows FROM data tables – The syntax for retrieving and displaying data from a SQLite tables
LIMIT and ORDER BY in SQL Queries – How to specify the quantity and arrangement of data rows returned by the SQL database.
Using the WHERE clause to filter data in SQL – How to retrieve rows based on whether they match specified values.
More Boolean Expressions to Filter SQL Queries – How to filter SQL data using comparison operators, such as "greater than" and "not equal to". Mostly, this is a review of how tricky logical expressions can be.

Instructions and Background Information

Download the data:

I've taken the baby name data from the U.S. Social Security Administration, filtered it to the years 1950 to 2014, and compiled it into one SQLite database:

ssa_baby_names--1950_2014--all.sqlite.zip

This SQLite database contains two tables, totals and baby_names:

The `totals` table

This is actually an aggregate of the baby_names records so that we can see how many total babies and names were recorded per state and year and sex.

Here's a sample query:

SELECT * from totals 
WHERE year > 2011 AND year < 2014 
AND state = 'IA'

And the result:

state	sex	year	baby_count	name_count
IA	All	2012	27458	1261
IA	F	2012	12440	664
IA	M	2012	15018	597
IA	All	2013	27733	1282
IA	F	2013	12628	668
IA	M	2013	15105	614

The `baby_names` table

The baby_names table contains one row for each baby name – by sex, year of birth, and U.S. state – in which at least 5 babies were given that name.

Examples of what the data contains, in English:

A record for the 27 boy babies with the name of "Zeus" born in California in the year 2010.
A record for the 13 boy babies with the name of "Zeus" born in Texas in the year 2010.
A record for the 68 boy babies with the name of "Zeus" born in nationwide (i.e. state is equal to USA) in the year 2010.

Here's a SQL query to get the above records:

SELECT * 
FROM baby_names 
WHERE year = 2010 
  AND name = "Zeus"

And here's what the table looks like:

state	sex	year	name	count
CA	M	2010	Zeus	27
TX	M	2010	Zeus	13
USA	M	2010	Zeus	68

Inconsistencies in the data

First of all, why are there 68 boys recorded with the name Zeus nationwide, but only two state records for California and Texas, totaling to 40 boys named Zeus? Because the SSA does not include records for names for a given year, region, and sex, in which the total number of babies is fewer than 5.

This is for privacy reasons: if you had a friend named "ZeusTheMagicDragon". And SSA records showed that only one such boy was born in the entire history of the U.S., you'd be able to know what state and year your friend was born in.

OK, the above information isn't particularly relevant to doing this exercise, but it may explain a few holes in the data…

Example: Charting Kanye, Beyonce, and Taylor

Besides reading the tutorials, here's an example of querying and charting the baby data that might help you.

Let's hypothesize that the names "Kanye" and "Beyonce" were very rare baby names until the rise of Kanye West and Beyoncé Knowles, respectively. Let's also hypothesize that "Taylor", as a female name, became more popular with the rise of Taylor Swift.

To keep the analysis simple, we'll look at the numbers nationwide. And let's just at years starting from 1980.

How would we show that as a chart?

We would need to query the baby_names table for the name, year, and count values.

And we would want a combination of these conditions:

The state is USA and
The year is greater than or equal to 1980
and then any of the following three conditions:
- The name is Kanye and the sex is 'M'
- The name is Beyonce and the sex is 'F'
- The name is Taylor and the sex is 'F'

The query would look something like this (note that ordering the query's results is not necessary):

SELECT name, year, count
FROM baby_names
WHERE 
  state = 'USA'
  AND year >= 1980
  AND (
    (name = 'Kanye' AND sex = 'M') 
    OR (name = 'Beyonce' AND sex = 'F')
    OR (name = 'Taylor' AND sex = 'F')
  )
ORDER BY year, name

Exporting the data in DB Browser for SQLite

When the query successfully executes and you get the results, in the lower right, there's an icon to click (it's the only one). It should show an option to Export to CSV:

Creating a pivot table

This you should already know how to do:

Basic Aggregation with Pivot Tables

Pivot tables are an effective tool for quickly summarizing the facts from a mass of records

Here is what I produced for this particular exercise if you need a refresher:

This is the imported CSV data.
This is the pivot table, with the names listed as rows and the years listed as columns.
This is the line chart (I've customized the axes to have a logarithmic scale as Taylor is a much more popular name, generally, than Kanye or Beyonce):

Baby steps with SQLite and baby names data