Assignment

Due: Thursday, October 15
Points: 5

Deliverables

Make a Google Document in your PADJO2015 folder named: PADJO2015-BABY-KANYE.

Then answer the questions in the problem set below

Requirements

  • The answers to each problem should include the SQL queries you used to get the data.

  • If the problem requires a visualization, embed a screenshot of the visualization and include the link to it.

  • You will have to create a separate spreadsheet to do the necessary pivot tables and charting for this exercise. But the document named PADJO2015-BABY-KANYE is the one that gets graded.

Problem set

(Note: this section just contains the problems to do. Keep reading until the end to see more instructions as well as an example in action)

Create a Google Document in your PADJO2015 folder. For each of these problems, paste the SQL query (or queries) needed to get the data. Then post a link and screenshot to a visualization, as required.

  1. Find the year, and how many babies, for which your name was the most popular across the United States (i.e. with a state of USA). Write the query to show only the columns for year and the number of babies.
  2. Using the totals table (this is the only problem that requires it), select all of the records for the state of California, for both boys and girls. Create a visualization in which we see the change-over-time in boy versus girl babies, according to the SSA data. Select only the rows pertaining to either boy or girl babies (there's a third value for sex) in the totals table.
  3. For the U.S., select all rows with your name for all available years, export to CSV, import into a spreadsheet, and make a line chart.
  4. Find the 5 states for which your name was most popular in the year 2014. Then, use that list of states in a second query to get all of the records for your name/sex, export it, create a pivot chart. Hint: this requires two separate queries.
  5. Think of some other name (or names), tied to someone (or something) famous, that you hypothesize might have noticeably gone up or down over the years, either as a nationwide or just a statewide trend. Query the database for the data behind that name, export it to a spreadsheet, create a visualization (i.e. a histogram or line chart). The name "Elvis" during Elvis Presley's rise to fame would be interesting, for example (but do something else besides Elvis).

The tutorials

You'll probably need to at least read these tutorials on SQL:

Instructions and Background Information

Download the data:

I've taken the baby name data from the U.S. Social Security Administration, filtered it to the years 1950 to 2014, and compiled it into one SQLite database:

This SQLite database contains two tables, totals and baby_names:

The totals table

This is actually an aggregate of the baby_names records so that we can see how many total babies and names were recorded per state and year and sex.

Here's a sample query:

SELECT * from totals 
WHERE year > 2011 AND year < 2014 
AND state = 'IA'

And the result:

state sex year baby_count name_count
IA All 2012 27458 1261
IA F 2012 12440 664
IA M 2012 15018 597
IA All 2013 27733 1282
IA F 2013 12628 668
IA M 2013 15105 614

The baby_names table

The baby_names table contains one row for each baby name – by sex, year of birth, and U.S. state – in which at least 5 babies were given that name.

Examples of what the data contains, in English:

Here's a SQL query to get the above records:

SELECT * 
FROM baby_names 
WHERE year = 2010 
  AND name = "Zeus"

And here's what the table looks like:

state sex year name count
CA M 2010 Zeus 27
TX M 2010 Zeus 13
USA M 2010 Zeus 68

Inconsistencies in the data

First of all, why are there 68 boys recorded with the name Zeus nationwide, but only two state records for California and Texas, totaling to 40 boys named Zeus? Because the SSA does not include records for names for a given year, region, and sex, in which the total number of babies is fewer than 5.

This is for privacy reasons: if you had a friend named "ZeusTheMagicDragon". And SSA records showed that only one such boy was born in the entire history of the U.S., you'd be able to know what state and year your friend was born in.

OK, the above information isn't particularly relevant to doing this exercise, but it may explain a few holes in the data…

Example: Charting Kanye, Beyonce, and Taylor

Besides reading the tutorials, here's an example of querying and charting the baby data that might help you.

Let's hypothesize that the names "Kanye" and "Beyonce" were very rare baby names until the rise of Kanye West and Beyoncé Knowles, respectively. Let's also hypothesize that "Taylor", as a female name, became more popular with the rise of Taylor Swift.

To keep the analysis simple, we'll look at the numbers nationwide. And let's just at years starting from 1980.

How would we show that as a chart?

We would need to query the baby_names table for the name, year, and count values.

And we would want a combination of these conditions:

The query would look something like this (note that ordering the query's results is not necessary):

SELECT name, year, count
FROM baby_names
WHERE 
  state = 'USA'
  AND year >= 1980
  AND (
    (name = 'Kanye' AND sex = 'M') 
    OR (name = 'Beyonce' AND sex = 'F')
    OR (name = 'Taylor' AND sex = 'F')
  )
ORDER BY year, name

Exporting the data in DB Browser for SQLite

When the query successfully executes and you get the results, in the lower right, there's an icon to click (it's the only one). It should show an option to Export to CSV:

image

Creating a pivot table

This you should already know how to do:

Basic Aggregation with Pivot Tables

Pivot tables are an effective tool for quickly summarizing the facts from a mass of records

Here is what I produced for this particular exercise if you need a refresher: