The majority of the walkthroughs on this mini-site deal with processing walls of text. If you're new and excited about data, processing text was probably not what you signed up for. Maybe you thought you'd be making fun maps like this:
Fortunately, learning about text goes hand-in-hand with making interesting data visualizations. This walkthrough shows how to quickly get from a dense text file to building an interactive time-lapse map (via CartoDB) in an easy 5 minutes.
How can a data visualization be built so easily? The first step is just understanding that data is text.
For this exercise, we will use the real-time earthquake feed from the United States Geological Survey's Earthquake Hazards Program.
The USGS has a several formats for their real-time data feeds; we want the __Spreadsheet__format. You can choose from the various slices. For this demonstration, I'm using "Significant Earthquakes in the Past 30 Days." Clicking the link should automatically download the file to your computer.
Open the file in your text editor to see what the file contains:
CSV, which stands for comma-separated values, is a popular text format for storing data. The commas act as delimiters and denote the "columns" in the data. But for humans, it's very hard to "see" as data, which is why we use spreadsheets.
A spreadsheet program such as Excel knows how to interpret those commas: when you open the earthquakes CSV with Excel, the column-delimiting commas that were present in the text file don't appear. Things get a little column when a value in a data column contains a comma, but the general upshot is that most of the data we will ever encounter is stored as plain text.
Most data-handling programs – from Excel to fancy online visualization tools – live with this reality. So the other upshot is that no matter what you want to do with your data: analyze it, map it, chart it, stuff it into a database, make a web app from it – you just need to get it into this spreadsheet-friendly text format.
So if this earthquakes CSV works in Excel, it should just work in Google Sheets. Inside a new Google spreadsheet, click File in the menubar, then Import…, then Upload. It will give you some options, such as specifying the Separator character. You can choose Comma, but the Detect automatically should work just fine.
(You can also try choosing another option, such as Tab, to see how misinterpreting the separator character is the difference between a spreadsheet and a blob of text)
For the most part, Google Sheets and Excel are interchangeable. I usually go with Google Sheets because of its cleaner interface with the online collaboration conveniently built-in. Either way, there's no obstacle in exporting my work into CSV so that any Excel user can use it, and vice versa.
With structured data, visualizing it is a matter of knowing which columns are relevant. When it comes to earthquakes, an obvious visualization is where the earthquakes took place. The relevant columns are latitude and longitude.
There are a few hacks that can be done to make a map via Google Sheets, but let's just move to an application in which mapping is a built-in feature.
Google Fusion Tables is a hybrid between a spreadsheet and database; it's not as easy-to-use or edit as the former, but it has a few more "power" features, including the ability to store and efficiently manage far more data than a Google spreadsheet. For whatever reason, Google has included a built-in mapping tool in Fusion Tables but not Sheets. But again, our data is just plain text CSV and is portable across the two different apps.
The easiest way to create a Fusion Table is to go to your Google Drive at drive.google.com (note: if you are trying this from a Stanford account, you may have to do it from your personal Google account). The Fusion Tables option is a bit buried: click the New button, and then More.
Creating a new Fusion Table will prompt you to upload a file; go ahead and upload your earthquakes CSV. Then you can click through the Next button and accept all the default options. The result should look very similar to a spreadsheet:
Fusion Tables will try to guess if your data contains geospatial data – i.e. the "latitude" and "longitude" columns in our earthquakes CSV – and then automatically create a map based on the data.
You should see a tab titled "Map of latitude". Click it to see the map:
Fusion Tables has a lot of functionality and options; you can check out Google's tutorial for more information.
Let's use another of Google's tools: My Maps. Go back to your Google Drive and click the New button, then More to see the My Maps option. Click on the option to import your own data, and it will prompt you to select a file.
Unlike Fusion Tables, My Maps won't automatically create a map based off of the latitude and longitude columns and will prompt you to confirm which columns to use:
The result is a mapping interface more geared toward the average consumer who needs to make a custom map. There's more tools for styling the map but (I'm assuming) less support for the kinds of huge datasets that Fusion Tables can manage:
If CSV is truly a portable data format, then there's no reason to stick with Google's family of tools. Let's try CartoDB, a popular geodata visualization tool.
In the CartoDB dashboard, select the Connect dataset tab.
You can upload the earthquakes data you've downloaded. Or you could use the option to paste in a URL; here's the URL for All Earthquakes, Past 30 Days:
After the dataset has been uploaded, CartoDB should take you to a map view:
CartoDB is geared specifically toward making interactive, animated data maps. I'm not familiar with all of its options and tools so I won't go into detail as you can look at their tutorials yourself. But since the earthquake data has a time column, it makes sense to try out CartoDB's "torque" map, which you can try out by clicking the "wizards" button in the sidebar:
Here's the CartoDB page for that map.
If you're not a front-end programmer or designer, the good news is that there's a variety of sophisticated, user-friendly software that can produce great visualizations. But they need you, the user, to provide the data.
Finding that data, understanding its structure, and wrangling it into spreadsheet form is ultimately the hard and painful work of the data journalist. But it's important to remember, during the grind and suffering of data wrangling, that when the data wrangling is done, you have a vast array of applications that can then build something interesting and valuable.