Pivot tables are an effective tool for quickly summarizing the facts from a mass of records
Download the data here: usgs-us-with-fips-quakes.csv
Then import it into a Google Spreadsheet. You shouldn't have to change any of the default options. The delimiter, which of course is a comma for our data set, should be automatically detectable by Google's import process.
The result of the import should look something like this, i.e. a typical spreadsheet:
Creating a chart from a spreadsheet is a matter of selecting the data-to-be visualized. Then on the menubar, select Insert, then select the Chart… menu item.
Selecting the data is the part that requires thinking. For now, let's just highlight the mag column. In modern spreadsheets, the easiest way to highlight an entire column is to click the column's header.
In the animated GIF below, I demonstrate highlighting the data column and choosing the Insert > Chart… menu item:
As soon as we try to create a chart, the first thing Google Sheets offers us is a selection of "Recommended" charts. How does it know what to recommend? Well, we don't know, which is why in general, we need to learn how to choose chart types for ourselves. But a good guess is that the Google Sheets program sees that our data selection is a single column of numbers and knows, by convention, that there are a few likely choices for visualization options.
That a computer can predict what chart type we might use should be a hint that the very structure of the data selection is critical in what type of visualizations can be created:
The histogram is actually the best choice, but for now, I choose the one that seems to be a straightforward bar chart, in which each earthquake is represented by a very thin line. The height of the line (i.e. the y-axis) depicts the mag value of each earthquake:
It's actually not too bad. We can see with a quick glance that other than very few earthquakes break the 5.0 mark. So this is definitely a visual refinement of what we started out with: a bunch of text, in which reading all the mag values is very cumbersome. But that's about the limit of our insights.
Before moving on to the histogram-type chart, let's try charting a different part of the data. Highlight both the longitude and latitude columns, and then selecting Insert > Chart…
The very first recommended chart should be a scatter plot. As I mentioned in the previous lesson, a map can be thought of as a scatter plot with longitude and latitude on the x- and y-axis, respectively:
For whatever reason, though, I can't get Google to not switch those two things around. Oh well, you can mentally transpose the variables yourself, then imagine drawing the outlines of the United States onto the scatter plot. And you basically have a geospatial map. Moving back to magnitudes and histograms…
Think of a histogram as a special kind of bar chart that shows how the data points are distributed.
This is similar in purpose to the binned maps we saw in the previous lesson:
Earthquakes similar in geospatial location were represented as a single bubble/square/hexagon, and the color of this clustered shape indicated how many earthquakes were in that cluster. In the example above, the redder the square, the more earthquakes in that area. This density map depicts the distribution of earthquakes along two dimensions – latitude and longitude – and color is used to show the amount (i.e. the density) of the distribution.
In contrast, a histogram depicts only one dimension and uses the height of the bar (i.e. the y-axis) to show the amount of each distribution.
Remember our first attempt at charting the mag column? We can see the outliers, but we can't easily see the distribution of the earthquakes by magnitude, e.g. how many earthquakes are M4.0+?, etc.
So highlight the mag column, select the menu action to Insert > Chart…, and then from the list of recommended chart types, select Histogram.
You should get something similar to this:
It's definitely a lot fewer bars to look at, and it tells the general story of the distribution: there are way more M4.0-or-weaker earthquakes. But it's a little awkward because of the bins, which you can see in the x-axis where the ticks are spaced un-neatly at intervals of 0.26.
It's is a bid weird how every third bar is a major drop-off from the first two. I can't tell if that's because the data itself is that weird, or just a quirk in the graphing process, in which some values are not rounded the way they should be.
Either way, the default bins that Google chose aren't great. So let's make our own.
I didn't go demonstrate this in the previous lesson, but it's not just the shape of the bins (e.g. hexagons versus squares), but the size of the bins that can be varied, and which can make a big difference in how the data is interpreted.
Below are three variations of sizes of the hexbins:
If the bins – often referred to as buckets – are too big, then we lose valuable geographic distinctions: the bins for California earthquakes spill across the border and make the entire West coast a block of red. But make the buckets too small, and we're back right at the original problem: having too much data to visually interpret.
This same problem applies to histograms. In our first attempt, the default bucket size was too small…or just too weird. Either way, the visual story was jumbled.
So let's go for "big" bucket sizes for our histogram of magnitudes. Round all magnitude values to the nearest integer. This is better done explicitly, by making a new column and using a formula. But we get to formulas later, let's see what magic the Google charting software can do for us.
Repeat the part where you select the mag columns and perform an Insert Chart. When the recommended visualizations pop up, select the Histogram option again, but then, click on the Customization tab.
Then scroll down until you see the Buckets sub-menu. In the input field for Bucket size, enter a value of
This is the resulting chart:
As we guessed earlier, earthquakes M4.0-and-weaker make up the vast majority of earthquakes. And earthquakes M5.0+ and above were very rare.
Let's customize the bucket size again and see what a smaller size looks like. This is especially appropriate with our given data set because, if you recall from geology class, the power of an earthquake is logarithmic, i.e. a M4.0 earthquake is 10 times the amplitude of a M3.0 earthquake.
So let's choose a fraction of a number for the bucket. Here's what a bucket size of 0.2 looks like:
Is this particularly profound? Maybe, if you care that there's a large drop-off after M3.4+. Otherwise, the story is not much different from the integer-bucketed histogram, at least to us laypersons.
We can use histograms using non-numerical columns, such as the STUSPS columns, in which case we don't care about "bucket size". This kind of histogram shows a distribution by category.
In the case of STUSPS, the category is the name of the U.S. state the quake was centered in. Highlight the STUSPS column and repeat the steps to create a histogram:
This sorting is not optimal. And also, not all 50 states are fit into the chart or even labeled. This is a limitation of the charting software. Stupid Google you might think. But actually, stupid us for trying to graph all 50 states. Some are clearly not worth showing as they have so little earthquakes throughout their history.
One way to get around the limitations of the histogram is to use a pie chart. Except Edward Tufte thinks pie charts are bad. But maybe because he hasn't seen them in faux interactive 3-D:
Nope, still bad.
If only we could limit the data to only the top 10-or-so states by total earthquake activity, we'd have a nice chart, pie or histogram. Unfortunately, I don't think there's an easy way to do kind of filtering with vanilla spreadsheets. But it's easy with Pivot Tables, which we will get to soon enough in the next lesson.
In the previous lesson, I came up with a lame attempt to show earthquakes-by-ear using the CartoDB map:
So many of the earthquake points overlap that it's hard to tell if 2011 was a dominant year in earthquakes, or if the charting software just happened to place 2011 earthquakes on top of the later quakes by default.
So right now, we just want to figure out: how many earthquakes were there per year?
Unfortunately, while our dataset does have a time column, there is no column for just year. So let's make one.
There's several approaches we could take in order to derive the year from each time value. Just to warn you: working with time – or rather, the datatypes and values that purport to represent time – is one of those things that continues to be very frustrating for me, even as an experienced programmer.
Rather than jump into that pain, let's tackle the problem in the most literal, straightforward way.
Here is a sample value from the time column:
What part of that text string represents the year? The first 4 characters, e.g.
2015. So to create a year column, we just need to extract the first 4 characters of every value in the time column.
A2 is the reference to a cell, and
2 refers to number of characters to extract, starting from the left side.
At this point, I'm going to assume you kind of know what a spreadsheet formula is and how they work, generally. If not, check out Google's help page here.
Below is an animated GIF in which I create a year column. Note these things:
Any way, here's the GIF:
Let's get back to chart-making. Highlight our new year column and create a time series – which is basically a histogram except having some kind of time value in the x-axis.
It should look something like this:
Note: I've adjusted a few of the chart editor options, such as removing the Legend and title for the y-axis. I'm going to be doing that for all the charts in this lesson. But I will not describe this customization in detail. Figuring out how to twiddle the various chart editor buttons and options, and obsessing over it to an unhealthy degree, is something you can do on your own time.
So, contrary to what we could see on the map, the years 2014 and 2015 (remember that 2015 is only 3/4 done at this point) have significantly more earthquakes than the previous years. This simple bar chart ended up being pretty valuable.
Is it worth looking at earthquakes by year and month? Maybe. Since earthquakes aren't seasonal (we think), it's interesting to see if the number of earthquakes per month has increased over the years. For all we know, with the time-series-by-year, 2014 and 2015 each had some kind of exceptional earthquake, in which many aftershocks were triggered, but were not significantly different on a typical month-to-month basis.
How do we extract the year and month values? Take a look again at the time column:
09 represents the month (i.e. the 9th month, i.e. September), I think you can figure out the formula using
LEFT for yourself. Here's the resulting time series:
At a quick glance, it does look like the overall frequency of earthquakes has markedly increased from 2014 onwards.
Let's now go from smaller picture (i.e. earthquakes per year-month) to a broader picture: how many M3.0+ earthquakes occurred in the period of 2011 through 2013, i.e. 3 years, compared to the period of 2014 through September 2015, i.e. less than 2 years? The previous histogram makes it fairly obvious that 2014 is a new era in earthquake activity. But now we want to show, in a more black-and-white fashion, how much more dramatic the change is.
In terms of spreadsheet-steps, we basically want to create a new column that designates whether an earthquake is before 2014 or 2014 and later. We can derive this from the existing year column but we need to use a different formula.
Enter the IF function. In the example below, the formula tests cell K2 (which, in my example, is where the year column is located) to see if it is less than 2014.
If that is the case, the resulting value will be the text string,
"2011-2013". Else, the value will be the text string,
IF(K2 < "2014", "2011-2013", "2014-2015")
Again, not to get into too much detail about formula syntax here, but the double-quotation marks are important. If you have a little knowledge of the pickiness of programming languages, you might wonder why we use the quoted-character string
"2014" instead of the number
2014 – it's because that year column was derived from a text formula, i.e.
Entering this formula and repeating it is the same as before. Your spreadsheet should look something like this (note the new era column):
What can we do with this new era column? We could create a histogram; since there are only two values –
2014-2015 – the result is a two-bar bar chart:
If that's a little tame for you, go for the pie chart version:
That's still not very dramatic. But think about the insight that it obscures: the blue piece of the pie represents far less time than the red piece. So the fact that it is
54.3% of the pie is very significant. But our choice of bucket size somewhat obscures it.
Let's keep the era column for now and move on to one more kind of categorization:
It's pretty well known that California is a big state for earthquakes. Repeat the steps we did for era, but this time, derive a column named
is_CA, which is either
"Not California" based on the value of the STUSPS column.
Assuming the STUSPS column is in column I, here's a hint to the formula:
Then make a pie chart that looks like this:
Again, not a terribly interesting insight. But it's good to get some practice with formulas.
IF are only a few of the possible functions we can use to transform text. But keep in mind why we need to transform text in the first place: sometimes we want to filter and sort data by categories that didn't exist in the original data, such as year.
Here is Google's searchable list of spreadsheet functions, many of which are the same in Excel. It's not worth memorizing, per se, but it's definitely worth browsing so you get an idea of what is possible.
The spreadsheet techniques we've learned were fairly basic, but pretty powerful. Other than getting more familiarized with the variety of spreadsheet functions available, there's not much more to learn (at least for this class) in terms of spreadsheets, specifically.
However, there were a few things we weren't able to easily do. One of them was to just filter the earthquake data so that we could compare states with a non-trivial number of earthquakes, rather than getting something that results in this mess of a pie chart:
I use the phrase "just filter" a bit facetiously. Filtering is one of the most fundamental things we do not just as journalists, but as humans – it is our constant challenge to make sense of the world's flood of information.
Making charts from data – that is, simplifying data for visual interpretation – is one example of why and how we filter. But being able to filter data even before we visualize data is itself an essential skill.
In the next lesson, we learn about pivot tables, which are a built-in function of modern spreadsheets. In a nutshell, pivot tables allow us to easily summarize and simplify data, which we then create charts from. It doesn't sound very sexy but it is all we need for finding the most interesting insights in this earthquake data.
Pivot tables are an effective tool for quickly summarizing the facts from a mass of records
A short overview of the importance of spreadsheets
A walkthrough of basic spreadsheet and pivot table usage (Part 1 of 3) We have a lot of ways to easily create beautiful, elaborate visualizations. But let’s see what we can do when we prioritize the story of the data over its visual presentation.
A walkthrough of using Pivot Tables to summarize earthquake data across two dimensions, allowing for even more insightful histograms (Part 3 of 3)
A searchable list of functions that can be used in Google Sheets.
Google’s somewhat-brief documentation on how to work with formulas in Google Sheets.
Documentation and sample usage for
LEFT, which returns a substring from the beginning of a specified string.
Documentation and sample usage for
IF, which returns one value if a logical expression is
TRUE and another if it is