TK creating a database from scratch
This is the Social Security Administration baby names data that we've used in previous tutorials, except now you'll see how to create a database from scratch, including the process of downloading the file and importing the relevant text file.
Unlike the previous tutorials, we will only create a single table consisting of a single year's worth of data. You can assume that getting all of the data is just repeatedly following the import process as described below.
From the Social Security Administration homepage, click on the National data file: https://www.ssa.gov/oact/babynames/names.zip
Unzip the file, it should contain text files for every year, prefixed with yob
(i.e. "year of birth"):
Open any one of the text files to see what they contain. For example, this is what the first five lines of yob1990.txt looks like:
Jessica,F,46466
Ashley,F,45549
Brittany,F,36535
Amanda,F,34406
Samantha,F,25864
The main takeaway is that there are no column headers. But we can imagine what they might be: name
, sex
, and babies
.
CREATE TABLE
statementCreating a new database table requires a whole different syntax. For starters, we are no longer SELECT
-ing data. Instead, we use the CREATE TABLE
statement:
CREATE TABLE "the_table_name"(
"column_1" DATATYPE,
"column_2" ANOTHER_DATATYPE
);
The SQLite documentation on creating tables is here, but it may be a bit dense. Basically, the CREATE TABLE
statement consists of the name of the table and its schema, i.e. its columns – including the name and the type of data contained in each column.
A typical CREATE TABLE
statement consists of these elements:
CREATE TABLE
.INTEGER
. Data types, such as INTEGER
, TEXT
, VARCHAR
, and so forth, are semi-special SQL keywords, so you don't have to capitalize them, but you should for easier readability.So to create
CREATE TABLE babynames(
"name" VARCHAR,
"sex" CHAR,
"babies" INTEGER
);
What's VARCHAR
and CHAR
? In other flavors of SQL, they are the keywords for columns that contain text strings of varied or mostly-constant length, which is an important detail for massive databases but doesn't matter for what we're doing. But I do it out of habit in the above snippet: you can see that I've made the sex
column CHAR
datatype because it's either 'M'
or 'F'
, i.e. a single character. Whereas the text strings name
column can vary with length.
Technically, SQLite doesn't care. According to the SQLite documentation, there are only three It only have a few It doesn't really matter, you could even do this:
CREATE TABLE babynames(
"name" TEXT,
"sex" TEXT,
"babies" INT
);
After executing the CREATE TABLE
statement and then going to the tab to browse the database structure, you'll see something like this:
Note: In a previous example, I named the 'babies'
column as 'count'
, but have not bothered to fix the screenshot. You get the idea.
Before we actually import data into the table, let's pretend we messed up the CREATE TABLE
statement and we have a table with misnamed or ill-defined columns. As it turns out in SQLite, there's a limited subset of commands we can do to alter/fix an existing table: rename the table or add columns. We cover adding columns in the next chapter – and apparently we can't rename columns.
It's frequently easier just to delete the table and start over. So that's what we'll do with the DROP TABLE
statement.
The syntax to drop a table is much simpler than creating one:
DROP TABLE babynames;
Note that in a SQLite GUI client, such as SQLite Browser, you can usually delete a table by right-clicking its name and selecting from a pop-up menu. But if you want to drop a table because you want to immediately re-create it, it's almost always faster to execute the DROP TABLE
and CREATE TABLE
commands via copy-and-paste:
DROP TABLE babynames;
CREATE TABLE babynames(
"name" VARCHAR,
"sex" CHAR,
"babies" INTEGER
);
Note: If you run a CREATE TABLE
query when the table already exists, you'll get an error message. Same thing if you try to drop a non-existent table.
For the remainder of these tutorials, the only time we'll be dependent on a GUI client is for the importing of data. This is because of how complicated it is for programs to parse comma-delimited (or any kind of delimited) text files. So SQLite, the language and database, do not have built-in, robust parsers that we can easily invoke with just SQL code as we can when creating, dropping, and querying tables.
Luckily, DB Browser for SQLite has point-and-click importing – as do other free SQLite clients, such as the Firefox SQLite Manager plugin
The following instructions are for DB Browser for SQLite on Mac, though DB Browser is cross-platform so the steps should hopefully be the same.
In the menubar, click File > Import > Table from CSV…
Select the file to import. For the remainder of this tutorial, I'll be using the yob1990.txt
file, i.e. names for babies in 1990. If for some reason you can't get this from the official Social Security Administration site, you can download it from my mirror:
stash.padjo.org/dumps/2015/yob1990.txt
After selecting the text file to import, the GUI client brings up a new dialog box with a couple of important settings to mind:
babynames
The DB Browser client will ask you if you want to really insert into an existing table. Click Yes.
After the import is finished, you can select the Browse Data view to see the data inside the table:
Note: There's no year column for this babynames
table. I've left it out for now because we don't need it for this exercise. But generally, that kind of context should be kept. I guarantee you if you re-open this database next week, you'll have no idea what year that set of baby names data came from.
The DB Browser client, in an attempt to share some familiar user-facing features with Excel, has a feature in which changes we make to the data are not automatically "saved" to file. We have to manually do this by going to the menubar and selecting File > Write Changes.
Or just use the Cmd-S keyboard shortcut.
We'll do review of basic SQL in another lesson. For now, let's just make sure the babynames
table contains what we think it does:
SELECT *
FROM babynames
WHERE sex = 'F'
ORDER BY babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
Jessica | F | 46466 |
Ashley | F | 45549 |
Brittany | F | 36535 |
Amanda | F | 34406 |
Samantha | F | 25864 |
SELECT *
FROM babynames
WHERE name LIKE 'Z%'
ORDER BY babies DESC
LIMIT 5;
name | sex | babies |
---|---|---|
Zachary | M | 20424 |
Zachery | M | 1228 |
Zachariah | M | 757 |
Zackary | M | 743 |
Zackery | M | 543 |
OK, now on to the Florida state salaries data. We'll go through the exact same steps as described above: create a table, then import into it.
The state of Florida has set up a landing page for its [state payroll data]http://floridahasarighttoknow.myflorida.com/search_state_payroll). For now, we are going to focus on the State University System – as of November 2015, this consists of "Payroll Information provided by Florida's 12 State Universities"
Here's a screenshot of the public-facing database:
If you look at the right-side of the state university payroll webpage, you'll see a link to Export All.
But don't click that. Instead, download my snapshot of the data:
florida-salaries-2015-11–cleaned.csv
It's roughly the same thing, except you don't have to deal with some funky formatting (line endings/file encoding) of the original file. Plus, I've renamed the headers so that multiple-word column headers, such as 'Last Name'
, are now single words, e.g. 'LastName
. This makes it slightly easier to reference them in SQL commands.
Here's what the data looks like, with the columns transposed to show all of the titles and a couple of sample rows.
In particular, the ClassTitle
, FirstName
, and AnnualSalary
columns are most interesting to us.
Column name | Row 9 | Row 100 | Row 25000 |
---|---|---|---|
University | FAMU | FAMU | FSU |
BudgetEntity | Educational & General | Educational & General | Auxiliaries |
PositionNumber | 19257000 | 0 | 60518 |
LastName | ABRAMS | BARE | HILL |
FirstName | ROBERT | CEIL | WILLIAM |
MI | H | P | . |
EmployeeType | SALARIED | OPS | SALARIED |
FTE | .77 | 0 | 1 |
ClassCode | 9001 | 9001 | 6526 |
ClassTitle | PROFESSOR | PROFESSOR | CUSTODIAL WORKER |
AnnualSalary | 158980 | 21279 | |
OPSTermAmount | 4500 |
Other columns of interest:
University
- the abbreviation for the university, e.g. 'FSU'
for Florida State University.BudgetEntity
FTE
- presumably this stands for full-time-equivalent – a value of 1.0
represents a full-time employee. Less than 1.0
might indicate a part-time employee.OPSTermAmount
- Payment for Other Personal Services Employment, i.e. temporary/short-term positions. That's why the professor in row 100 above has no annual salary, but is shown receiving $4,500 for the OPSTermAmount
.Can you write the SQL command to create the empty table? Here's a hint to start it off:
CREATE TABLE "florida_positions" (
"University" VARCHAR,
"some other column" SOMETYPE
);
Note that the FTE
column can contain a decimal number (i.e. to indicate a part-time position); we can use the FLOAT
data type to describe it (i.e. floating-point number).
Here's what I ended up with:
CREATE TABLE "florida_positions" (
"University" VARCHAR,
"BudgetEntity" VARCHAR,
"PositionNumber" VARCHAR,
"LastName" VARCHAR,
"FirstName" VARCHAR,
"MI" VARCHAR,
"EmployeeType" VARCHAR,
"FTE" FLOAT,
"ClassCode" VARCHAR,
"ClassTitle" VARCHAR,
"AnnualSalary" INTEGER,
"OPSTermAmount" INTEGER
);
Again, make sure you spell the name of the table correctly.
Unlike the baby names data, this file does come with file headers. So also make sure the Column names in first line checkbox is checked:
After the import is finished, remember to select File > Write Changes, i.e. save the database, i.e. keyboard shortcut Cmd-S.
Let's run some queries to see if it works. When dealing with salaries, you can't go wrong by just seeing who is paid the most:
SELECT *
FROM florida_positions
ORDER BY AnnualSalary DESC
LIMIT 5;
Oops, this doesn't look right: all rows in which AnnualSalary
was blank show up at top, due to the way SQLite orders things:
University | BudgetEntity | PositionNumber | LastName | FirstName | MI | EmployeeType | FTE | ClassCode | ClassTitle | AnnualSalary | OPSTermAmount |
---|---|---|---|---|---|---|---|---|---|---|---|
FAMU | Educational & General | 0 | ADAMS | ANGELA | D | OPS | 0.0 | 9003 | ASSISTANT PROFESSOR | 2550 | |
FAMU | Educational & General | 0 | ADAMS | JAMES | A | OPS | 0.0 | 9001 | PROFESSOR | 6000 | |
FAMU | Educational & General | 0 | ADAMS | SHAWANNA | D | OPS | 0.0 | 9004 | INSTRUCTOR | 2400 | |
FAMU | Educational & General | 0 | ADETU | COLLINS | . | OPS | 0.0 | 9004 | INSTRUCTOR | 4001 | |
FAMU | Educational & General | 0 | AKINS | MONICA | Q | OPS | 0.0 | 9004 | INSTRUCTOR | 600 |
For now, we can use a WHERE
clause to filter the records to include only valid AnnualSalary
values:
SELECT *
FROM florida_positions
WHERE AnnualSalary != ''
ORDER BY AnnualSalary DESC
LIMIT 5;
University | BudgetEntity | PositionNumber | LastName | FirstName | MI | EmployeeType | FTE | ClassCode | ClassTitle | AnnualSalary | OPSTermAmount |
---|---|---|---|---|---|---|---|---|---|---|---|
UF | Contracts & Grants | 00000000 | FRIEDMAN | WILLIAM | A | SALARIED | 0.97 | 9001 | PROFESSOR | 961281 | |
UF | Contracts & Grants | 00000805 | GUZICK | DAVID | S | SALARIED | 1.0 | 9001 | PROFESSOR | 905000 | |
UF | Contracts & Grants | 00000000 | KESHAVARZI | SASSAN | . | SALARIED | 0.97 | 9003 | ASSISTANT PROFESSOR | 730685 | |
UF | Contracts & Grants | 00000000 | CHALAM | KAKARLA | V | SALARIED | 0.96 | 9001 | PROFESSOR | 680685 | |
UF | Contracts & Grants | 00005361 | FUCHS | WESLEY | K | SALARIED | 0.77 | 9001 | PROFESSOR | 661340 |
Well, I was surprised. My assumption was that head football coaches would dominate the top-paid employees, especially in a football-loving state such as Florida.
Let's do a search for WILLIAM FRIEDMAN
using the state of Florida's official university salaries database to confirm our results:
Doing a Google search for professor salaries in Florida brings up this 2011 Orlando Sentinel article, "Here's what Florida professors earn". Friedman is mentioned as the second-highest-paid employee at the time. But this comes with some caveats about the data:
But faculty leaders such as Tom Auxter, president of United Faculty of Florida, pointed out that the database is misleading in that it doesn't attempt to explain why about 20 professors make more than $500,000 a year…Some are paid large salaries because they bring in significant amounts of money through grants and other revenue sources for research. These other sources often provide at least part of their income, a fact that can easily be overlooked when reading the governor's online database.
…UF professor William Friedman is the second-highest-paid university employee. He earns about $808,000 as UF's director of the Preston Wells Center for Brain Tumor Therapy.
So the posted salary amount isn't necessarily state money, though there's no breakdown. I don't think that'll affect our gender gap analysis, though.
Here's a summary view that groups all the records by University
and counts/totals various interesting columns:
SELECT
University,
COUNT(*) AS num_positions,
SUM(FTE) AS total_ftes,
SUM(AnnualSalary) AS total_salary,
SUM(OpsTermAmount) AS total_ops_money
FROM florida_positions
GROUP BY University
ORDER BY total_salary DESC;
Looks like University of Florida is the biggest, followed by University of South Florida and Florida State:
University | num_positions | total_ftes | total_salary | total_ops_money |
---|---|---|---|---|
UF | 28230 | 13029.1000000002 | 1107624366.0 | 59872807.0 |
USF | 14051 | 6507.12000000004 | 455211836.0 | 33948786.0 |
FSU | 11872 | 5977.07 | 390502925.0 | 22904955.0 |
FIU | 8292 | 4694.62999999999 | 335808488.0 | 11897985.0 |
UCF | 7059 | 5489.98999999996 | 322528069.0 | 10654698.0 |
FAU | 5453 | 2508.53 | 178094535.0 | 8632644.0 |
FAMU | 2403 | 1659.93 | 105909139.0 | 3221786.0 |
UNF | 2283 | 1730.29 | 98778082.0 | 1711024.0 |
FGCU | 1916 | 1177.94 | 77051199.0 | 1784315.0 |
UWF | 1707 | 1228.77 | 70879208.0 | 1222309.0 |
NCF | 345 | 251.64 | 14697859.0 | 105542.0 |
FPU | 173 | 130.0 | 9470242.0 | 164200.0 |