Pivot tables are an effective tool for quickly summarizing the facts from a mass of records
Make a Google Document in your PADJO2015 folder named: PADJO2015-BABY-KANYE.
Then answer the questions in the problem set below
The answers to each problem should include the SQL queries you used to get the data.
If the problem requires a visualization, embed a screenshot of the visualization and include the link to it.
You will have to create a separate spreadsheet to do the necessary pivot tables and charting for this exercise. But the document named PADJO2015-BABY-KANYE is the one that gets graded.
(Note: this section just contains the problems to do. Keep reading until the end to see more instructions as well as an example in action)
Create a Google Document in your PADJO2015 folder. For each of these problems, paste the SQL query (or queries) needed to get the data. Then post a link and screenshot to a visualization, as required.
state
of USA
). Write the query to show only the columns for year and the number of babies.totals
table (this is the only problem that requires it), select all of the records for the state of California, for both boys and girls. Create a visualization in which we see the change-over-time in boy versus girl babies, according to the SSA data. Select only the rows pertaining to either boy or girl babies (there's a third value for sex
) in the totals
table.You'll probably need to at least read these tutorials on SQL:
SELECTing rows FROM data tables – The syntax for retrieving and displaying data from a SQLite tables
LIMIT and ORDER BY in SQL Queries – How to specify the quantity and arrangement of data rows returned by the SQL database.
Using the WHERE clause to filter data in SQL – How to retrieve rows based on whether they match specified values.
More Boolean Expressions to Filter SQL Queries – How to filter SQL data using comparison operators, such as "greater than" and "not equal to". Mostly, this is a review of how tricky logical expressions can be.
Download the data:
I've taken the baby name data from the U.S. Social Security Administration, filtered it to the years 1950 to 2014, and compiled it into one SQLite database:
This SQLite database contains two tables, totals
and baby_names
:
totals
tableThis is actually an aggregate of the baby_names
records so that we can see how many total babies and names were recorded per state and year and sex.
Here's a sample query:
SELECT * from totals
WHERE year > 2011 AND year < 2014
AND state = 'IA'
And the result:
state | sex | year | baby_count | name_count |
---|---|---|---|---|
IA | All | 2012 | 27458 | 1261 |
IA | F | 2012 | 12440 | 664 |
IA | M | 2012 | 15018 | 597 |
IA | All | 2013 | 27733 | 1282 |
IA | F | 2013 | 12628 | 668 |
IA | M | 2013 | 15105 | 614 |
baby_names
tableThe baby_names
table contains one row for each baby name – by sex, year of birth, and U.S. state – in which at least 5 babies were given that name.
Examples of what the data contains, in English:
state
is equal to USA
) in the year 2010.Here's a SQL query to get the above records:
SELECT *
FROM baby_names
WHERE year = 2010
AND name = "Zeus"
And here's what the table looks like:
state | sex | year | name | count |
---|---|---|---|---|
CA | M | 2010 | Zeus | 27 |
TX | M | 2010 | Zeus | 13 |
USA | M | 2010 | Zeus | 68 |
First of all, why are there 68 boys recorded with the name Zeus nationwide, but only two state records for California and Texas, totaling to 40 boys named Zeus? Because the SSA does not include records for names for a given year, region, and sex, in which the total number of babies is fewer than 5.
This is for privacy reasons: if you had a friend named "ZeusTheMagicDragon". And SSA records showed that only one such boy was born in the entire history of the U.S., you'd be able to know what state and year your friend was born in.
OK, the above information isn't particularly relevant to doing this exercise, but it may explain a few holes in the data…
Besides reading the tutorials, here's an example of querying and charting the baby data that might help you.
Let's hypothesize that the names "Kanye" and "Beyonce" were very rare baby names until the rise of Kanye West and Beyoncé Knowles, respectively. Let's also hypothesize that "Taylor", as a female name, became more popular with the rise of Taylor Swift.
To keep the analysis simple, we'll look at the numbers nationwide. And let's just at years starting from 1980
.
How would we show that as a chart?
We would need to query the baby_names
table for the name
, year
, and count
values.
And we would want a combination of these conditions:
state
is USA
andyear
is greater than or equal to 1980
name
is Kanye
and the sex
is 'M'
name
is Beyonce
and the sex
is 'F'
name
is Taylor
and the sex
is 'F'
The query would look something like this (note that ordering the query's results is not necessary):
SELECT name, year, count
FROM baby_names
WHERE
state = 'USA'
AND year >= 1980
AND (
(name = 'Kanye' AND sex = 'M')
OR (name = 'Beyonce' AND sex = 'F')
OR (name = 'Taylor' AND sex = 'F')
)
ORDER BY year, name
When the query successfully executes and you get the results, in the lower right, there's an icon to click (it's the only one). It should show an option to Export to CSV:
This you should already know how to do:
Pivot tables are an effective tool for quickly summarizing the facts from a mass of records
Here is what I produced for this particular exercise if you need a refresher: