This is a SQL-based introduction to the data and analysis behind the Wall Street Journal's Pulitzer-winning "Medicare Unmasked" investigative project. It also doubles as a helpful guide if you're attempting the midterm based on the WSJ Medicare's investigation.
Below is a screenshot of the PDF version of the WSJ story, Taxpayers face big tab for unusual doctor billings.
The sections highlighted in purple contain the facts, numbers, and assertions that can be directly derived from the 2012 Medicare billing database using SQL.
To follow along to the walkthrough (which is somewhere midway in this gigantic page of text), you can download my version of the 2012 dataset as a SQLite database, which is 700MB zipped and 2GB+ unzipped:
Read on for more background about the WSJ investigation and the Medicare data.
Or if you want, just jump right to the SQL.
G0166
]"Even if the The Wall Street Journal didn't win a 2015 Pulitzer for Investigative Reporting, its Medicare Unmasked project is a pinnacle in investigative, computer-assisted reporting. Not only does it illuminate a previously opaque yet vital and costly part of our bureaucracy, it decisively and empirically rooted out bad players and abusive practices.
The project's impact will be felt for years. The WSJ's parent company, Dow Jones & Co., successfully sued the government to make public Medicare billing data for the first time since 1979. The Centers for Medicare and Medicaid Services announced that billing data will now be released on an annual basis.
Dr. Bob Kocher, who advised the Obama administration and served on the National Economic Council, told the New York Times that “This is actually the most useful data set that Medicare has ever released.” Like any important public dataset, there has been concerned pushback. The American Medical Association has opposed releasing the data because of "it does not provide a complete picture" and individual doctors have complained about being unfairly targeted. The New Yorker's (albeit poorly-titled) piece, What Big Data Can’t Tell Us About Health Care is important reading.
This walkthrough of the 2012 Medicare dataset and is a guided learning tour through a fantastic investigation. And it is a primer on how anyone – not just an investigative newsroom – can effectively interpret and analyze the data and further the effectiveness of transparency in the civic sphere.
This tutorial assumes you've been able to finish the Baby Steps Midterm, which included far more complicated and convoluted SQL joins. In comparison, the SQL needed to derive many of the convincing assertions used by the WSJ is straightforward.
In a technical sense.
However, the complicated and messy content of the Medicare data more than makes up for the simpler SQL queries. You'll generally find that it takes queries followed by more queries to even figure out if whatever query you wanted to do is not stymied by incomplete or inaccurate data.
This kind of uncertainty and frustration is common with important, real-world datasets. And it's not something that can be easily overcome with technical skill. However, the WSJ investigation provides a wealth of demonstrative examples that not only show the limitations and pitfalls of the data, but how to effectively refine your journalistic inquiry to find important insights in the Medicare dataset.
The main intent of this tutorial is to push you into reading and fully appreciating the scope and complexity of the WSJ's investigation. By no means is the theme, "10 SQL Queries That Will Win You a Pulitzer!" – but that said, a significant amount of the investigative insights can be found in the data and by closely reading the WSJ articles.
However, a substantial amount of the WSJ investigation comes from external reporting and research. In understanding data-driven investigations, it is important to know your way around a database. But it is perhaps even more vital to notice the things that can't be found in a database.=
To follow along in this walkthrough, you can download my SQLite database here:
It contains both conveniences and contrivances from what you can find on the CMS site. For starters, it's only the 2012 data. CMS has recently posted the 2013 file, as well as updates to the 2012 data that weren't available at the time of the WSJ Medicare Unmasked project. I haven't actually downloaded it but the new fields seem quite promising for new avenues of inquiry:
10/05/15 UPDATE: Original CY 2012 Medicare Physician and Other Supplier Aggregate table has been updated to include demographic and health information associated with the provider’s beneficiary panel. This provider-level summary now includes aggregated information on beneficiary age, sex, race, Medicare and Medicaid entitlement, sixteen (16) chronic conditions and risk scores.
In June 2015, there was an update that added more granular detail to the database. As reported by the WSJ:
In a June 2015 data release, Medicare revised its methodology to include additional payments made to providers in 2012 and 2013. An earlier version of the 2012 data, released in April 2014, suppressed payments for services rendered to small numbers of patients. For this reason, the total payment figures for 2012 shown here may differ from total payments shown in the original release of this interactive.
Again, you can download it here:
What I provide for this walkthrough is the data shortly after it was made available to the public. Is it exactly the same data as the WSJ used in its investigation? I have no way of knowing (short of giving them a phone call, I guess). But virtually all of the relevant numbers and assertions made in the WSJ investigation can be reproduced to the exact value…so…it's exact enough for our purposes.
However, I've made some important changes, partly out of convenience for non-data-wrangling experts, and partly to make the queries deliberately more complicated than if you work with the raw data straight from CMS – for the educational value, of course.
Before I talk about the changes I've made to the source data, it's easiest to start with the actual schema and sample rows of the downloadable database:
providers
tableThis table is derived from the spreadsheet titled Medicare Physician and Other Supplier Aggregate table, CY2012, Microsoft Excel (.xlsx), which can be found in the Summary Tables heading of the CMS site.
The spreadsheet contains one row for every provider (i.e. unique npi
) with boilerplate information (name, address, etc.) and a grand tally of number of patients, services rendered, and Medicare payments received:
Column name | Row 1000 |
---|---|
npi | 1003074006 |
nppes_provider_last_org_name | DALY |
nppes_provider_first_name | TIMOTHY |
nppes_provider_mi | M |
nppes_credentials | |
nppes_provider_gender | M |
nppes_entity_code | I |
nppes_provider_street1 | 1730 CHEW ST |
nppes_provider_street2 | |
nppes_provider_city | ALLENTOWN |
nppes_provider_zip | 181045549 |
nppes_provider_state | PA |
nppes_provider_country | US |
provider_type | Family Practice |
medicare_participation_indicator | Y |
provider_number_of_hcpcs | 8.0 |
total_unique_benes | 117.0 |
total_provider_services | 295.0 |
total_submitted_chrg_amt | 39755.0 |
total_medicare_allowed_amt | 19294.6999999999 |
total_medicare_payment_amt | 13890.5299999999 |
payments
tableThis comes from the massive tab-delimited text file titled, Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format. Here's CMS's description of it, which can be summed up as: "Too big for Excel"
[Note: This Compressed ZIP package contains the tab delimited data file (Medicare_Provider_Util_Payment_PUF_CY2012.txt) which is 1.7GB uncompressed and contains more than 9 million records, thus importing this file into Microsoft Excel will result in an incomplete loading of data. Use of database or statistical software is required; a SAS® read-in statement is supplied. Additionally, this ZIP package contains the following supporting documents: CMS_AMA_CPT_license_agreement.pdf, and Medicare-Provider-Util-Payment-PUF-SAS-Infile.sas]
Whereas the providers
table consists of one row per provider, payments
contains as many rows per individual provider as needed to list each individual service billed for.
In other words, looking at the massive payments
file is the only way to know specifically the kind of treatments doctors billed Medicare for.
Column name | Row 100 |
---|---|
npi | 1003000936 |
nppes_provider_last_org_name | STELLINGWORTH |
nppes_provider_first_name | MARK |
nppes_provider_mi | A |
nppes_credentials | MD |
nppes_provider_gender | M |
nppes_entity_code | I |
nppes_provider_street1 | 2390 W CONGRESS ST |
nppes_provider_street2 | |
nppes_provider_city | LAFAYETTE |
nppes_provider_zip | 705064205 |
nppes_provider_state | LA |
nppes_provider_country | US |
provider_type | Cardiology |
medicare_participation_indicator | Y |
place_of_service | F |
hcpcs_code | 93351 |
hcpcs_description | Stress tte complete |
line_srvc_cnt | 38 |
bene_unique_cnt | 38 |
bene_day_srvc_cnt | 38 |
average_Medicare_allowed_amt | 82.98 |
stdev_Medicare_allowed_amt | 0 |
average_submitted_chrg_amt | 230 |
stdev_submitted_chrg_amt | 0 |
average_Medicare_payment_amt | 65.702631579 |
stdev_Medicare_payment_amt" | 2.9417750497 |
The main change is, well, instead of trying to import these massive data files into SQL yourself, my downloadable database is a SQLite database. I've also done the work of indexing the most relevant fields, because even the act of indexing the data can be computationally-intensive.
The main content change I've made is that payments
includes significantly less data than what you can get from the raw Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format.
However, what's been left out are redundant fields that can be derived from joining against providers
via npi
. In my opinion, the CMS data file is a bit wasteful for repeating a provider's name, address, etc. for each row in payments
, since, as far as I can tell, these values always remain constant.
On the other hand, having this denormalized structure makes it easy for people who don't know how to do SQL joins to do aggregations on the payments
data alone. But this being a SQL walkthrough, I wanted to make it have more SQL join action.
OK, back to the journalism side of things. If you looked at the data schema above and had no clue what any of those fields meant – welcome to the world of just about anybody who is not involved professionally in the healthcare industry or is already an experienced healthcare reporter.
One obvious course of action is to read the copious documentation on the CMS website. However, the Wall Street Journal's investigation is itself a kind of documentation.
So the following walkthrough is an attempt to have our cake (i.e. domain knowledge) and eat it too:
So let's examine the data behind the facts of the first story in the WSJ series: Taxpayers Face Big Tab For Unusual Doctor Billings.
The walkthrough consists of excerpts of the stories that I believe to rely directly on the Medicare 2012 database. Those excerpts are broken up into smaller assertions that can be fact-checked individually.
The very first and third paragraphs can be verified via SQL:
Ronald S. Weaver isn’t a cardiologist. Yet 98% of the $2.3 million that the Los Angeles doctor’s practice received from Medicare in 2012 was for a cardiac procedure, according to recently released government data
…The government data show that out of the thousands of cardiology providers who treated Medicare patients in 2012, just 239 billed for the procedure, and they used it on fewer than 5% of their patients on average. The 141 cardiologists at the Cleveland Clinic, renowned for heart care, performed it on just six patients last year. Dr. Weaver’s clinic administered it to 99.5% of his Medicare patients — 615 in all — billing the federal health-insurance program for the elderly and disabled 16,619 times, according to the data.
The providers
table has information about a doctor's name and specialty, the latter represented in the provider_type
field. We want to do a query for a "Ronald Weaver" who practices in the state of California:
(I'm using aliases so that the incredibly long column names don't mess up my web page layout)
SELECT
nppes_provider_first_name AS first_name,
nppes_provider_mi AS mid_name,
nppes_provider_last_org_name AS last_name,
nppes_provider_gender AS gender,
nppes_provider_state AS state,
provider_type
FROM providers
WHERE
last_name = 'WEAVER'
AND first_name = 'RONALD'
AND state = 'CA';
Lucky for us, it looks like there's only one doctor named Ronald Weaver in California and he is indeed not a cardiologist:
first_name | mid_name | last_name | gender | state | provider_type |
---|---|---|---|---|---|
RONALD | S | WEAVER | M | CA | Internal Medicine |
But is there really only one Dr. "Ronald Weaver" in California? It seems like a common name, so it's prudent to be skeptical.
Just to be safe, let's make sure there are no other Ronald Weaver, or even a Ron Weaver that the WSJ article could be referring to. It's a seemingly common name, and for all we know, the WSJ is referring to a Ronald Weaver who, in a newer/older version of the database, is not listed in California.
It doesn't take too long with SQL to check for variations on the name and expanding the search nationwide. We can just use a simple COUNT
to confirm Dr. Ronald Weaver's unique name (among doctors participating in Medicare):
SELECT COUNT(*) AS number_of_ron_weavers
FROM providers
WHERE
nppes_provider_last_org_name LIKE 'WEAVER'
AND nppes_provider_first_name LIKE 'RON%';
number_of_ron_weavers |
---|
1 |
Works for me. Moving on.
I've re-arranged the order of assertions in the original sentence, because it's easier to establish the "$2.3 million" part and "Los Angeles" part first.
The total received by any individual provider can be found in the providers
table's total_medicare_payment_amt
field? Or is it the total_medicare_allowed_amt
field? What's the difference?
Let's just write a query that includes both amount fields and see which comes closest to "$2.3 million". We'll also include the nppes_provider_city
field just to make sure Dr. Weaver is based out of Los Angeles. And we omit all the other identifying fields for now:
SELECT
nppes_provider_city AS city,
total_medicare_allowed_amt,
total_medicare_payment_amt
FROM providers
WHERE
nppes_provider_last_org_name = 'WEAVER'
AND nppes_provider_first_name = 'RONALD'
AND nppes_provider_state = 'CA';
The result:
city | total_medicare_allowed_amt | total_medicare_payment_amt |
---|---|---|
INGLEWOOD | 2916171.30999897 | 2311513.10999804 |
Inglewood is indeed part of the Los Angeles County area. And it looks like total_medicare_payment_amt
, with a value of 2311513
, is what the WSJ thinks it's important. Is the WSJ right on that, or does that particular metric fit better with their agenda? We don't know, but at least we have a better understanding of the possible nuance and differences.
OK, back to that 98% assertion. So now we see where the granularity of the payments
table is relevant. The WSJ isn't featuring Dr. Weaver because of the total Medicare payments he received – it's the procedure he billed for that matters.
The first question in writing our query is: is there a column that specifies that a procedure is a "cardiac"-type procedure, in the same way that provider_type
specifies a doctor's specialty?
(note: provider_type
exists in payments
but only because I forgot to cut it out of the source raw data – it contains the same value as what's found in the providers
table)
Here's something to think about: the assumption that a given medical procedure or treatment can be categorized as just a "cardiac"-type of procedure evinces a somewhat simplified view of how medicine actually works. If, hypothetically, a procedure or treatment could be relevant for "cardiac"-categorical reasons and for "oncology"-categorical reasons, how would that hypothetical procedure-category column? Probably not very well, or cleanly.
And here's a spoiler alert: The provider_type
for each doctor isn't consistent. Want to look up all gynecologists? Those doctors could be listed as having a provider_type
of 'Obstetrics/Gynecology'
or 'Gynecology/Obstetrics'
. And unlike other fields, such as 'Urology'
, there are no doctors for which provider_type
is just 'Obstetrics'
or 'Gynecology'
.
In fact, there's several different ways that the term 'MD'
is spelled, just incase you were hoping to filter a list just to include providers who are certified doctors (because physician assistants and nurses are also in the database).
In fact, if you browse some of the address fields, you'll see that words like 'SUITE'
are misspelled on occasion.
All of this doesn't even scratch the surface of what kind of errors could quite easily lead to inadvertently but seriously flawed investigations…but let's get back on track with our query.
We don't have to know the descriptive name of this procedure – which would be found in the hcpcs_description
column – nor its special HCPCS code, i.e. the column named hcpcs_code
. We already know that it makes up for 98% of Dr. Weaver's Medicare billing, so it seems reasonable that listing all of Dr. Weaver's procedures in the order of Medicare payment amount will get us the answer.
OK, but do any of the "payment-type" fields in the payments
table correspond to a total-amount-per-procedure? Because average_Medicare_payment_amt
would seem at a first glance to describe, well, the average payment per individual procedure.
Let's not think too hard about it anymore…Let's just find all of the types of procedures that Dr. Weaver is listed as doing and figure it out by eye-balling the results.
But remember that in creating the dataset for this exercise, I removed all human-readable identifying information from payments
, i.e. there's no way to look for "RONALD WEAVER" with payments
alone.
OK, so we just have to join payments
and providers
together. And it seems that npi
, whatever that stands for, is not only common to both tables, but looks like some kind of identifier number.
In fact, it is an acronym for the National Provider Identifier, "a 10-position, intelligence-free numeric identifier (10-digit number)". Good enough for me.
To save time, I'm going to re-use our initial "Is Ronald Weaver not a cardiologist" query, alter it to include only the npi
field from the payments
table.
Then I use that as my key to INNER JOIN
with the payments
table. And I'm going to pick a few promising fields from payments
:
hcpcs_code
- I guess this is like a NPI number, except for medical procedureshcpcs_description
- a human-readable name for the corresponding hcpcs_code
line_srvc_cnt
- No idea…but I'm assuming cnt
is short for count
bene_unique_cnt
- OK, another count-type field…but what is a 'bene'
?average_Medicare_payment_amt
- Seems to be the payment-type field to go with, as Medicare_payment_amt
is the same kind of payment field that the WSJ used in the providers
table. But it really is just a guess…(This is all old hat if you survived the Midterm Baby Steps exercises…I mean, it's not even a nested query or a LEFT JOIN
…)
(For the sake of printability, I alias some of the columns and I also ROUND
the payment-column, which is inexplicably includes minuscule fractions of a cent. I'm assuming whole dollars is good enough for our analysis)
SELECT
providers.npi AS npi,
hcpcs_code,
hcpcs_description,
line_srvc_cnt,
bene_unique_cnt,
ROUND(average_Medicare_payment_amt) AS payment_amt
FROM providers
INNER JOIN payments ON
providers.npi = payments.npi
WHERE
nppes_provider_last_org_name = 'WEAVER'
AND nppes_provider_first_name = 'RONALD'
AND nppes_provider_state = 'CA';
The results return 4 procedures associated with Dr. Weaver's npi
number:
npi | hcpcs_code | hcpcs_description | line_srvc_cnt | bene_unique_cnt | payment_amt |
---|---|---|---|---|---|
1669465928 | 82947 | Assay glucose blood quant | 8613.0 | 265 | 4.0 |
1669465928 | 85610 | Prothrombin time | 78.0 | 25 | 6.0 |
1669465928 | 93000 | Electrocardiogram complete | 12.0 | 12 | 17.0 |
1669465928 | 99214 | Office/outpatient visit est | 16.0 | 16 | 90.0 |
1669465928 | G0166 | Extrnl counterpulse, per tx | 16619.0 | 615 | 137.0 |
It's been so long since we looked at the actual WSJ assertion that it's worth reviewing:
…Yet 98% of the $2.3 million…was for a cardiac procedure
Do any of those descriptions seem obviously a cardiac procedure? Not to me. But OK, remember that our plan was to focus on the "98% of $2.3 million" number. One of those procedures, when you multiply the payment_amt
by either line_srvc_cnt
or bene_unique_cnt
, gets us close to $2.3 million.
Riffing off our previous query and hardcoding Dr. Weaver's npi
number – i.e. 1669465928
– so that we don't have to join against providers
, here's a useful query:
SELECT
hcpcs_code,
hcpcs_description,
ROUND(line_srvc_cnt * average_Medicare_payment_amt)
AS total_srvc_payment_amt,
ROUND(bene_unique_cnt * average_Medicare_payment_amt)
AS total_bene_payment_amt
FROM payments
WHERE npi = 1669465928
ORDER BY total_srvc_payment_amt DESC;
Looks like we have a clear winner:
hcpcs_code | hcpcs_description | total_srvc_payment_amt | total_bene_payment_amt |
---|---|---|---|
G0166 | Extrnl counterpulse, per tx | 2274250.0 | 84161.0 |
82947 | Assay glucose blood quant | 35189.0 | 1083.0 |
99214 | Office/outpatient visit est | 1439.0 | 1439.0 |
85610 | Prothrombin time | 434.0 | 139.0 |
93000 | Electrocardiogram complete | 201.0 | 201.0 |
"Extrnl counterpulse, per tx"
"G0166"
line_srvc_cnt
by average_Medicare_payment_amt
seems to get us the kind of payment metric that the WSJ is using as evidence against Dr. Weaver.Congrats to us…we just wrote all the SQL needed to verify the facts stated in the first paragraph of the WSJ's story. That might make you feel proud, or cause your inner self to scream in despair. Or both.
Let's move on to the next paragraph of assertions.
…The government data show that out of the thousands of cardiology providers who treated Medicare patients in 2012, just 239 billed for the procedure, and they used it on fewer than 5% of their patients on average. The 141 cardiologists at the Cleveland Clinic, renowned for heart care, performed it on just six patients last year. Dr. Weaver’s clinic administered it to 99.5% of his Medicare patients — 615 in all — billing the federal health-insurance program for the elderly and disabled 16,619 times, according to the data.
This isn't really an exact number. But let's make sure there are "thousands" of cardiology providers in the providers
table:
SELECT COUNT(*)
FROM providers
WHERE provider_type = 'Cardiology';
COUNT(*) |
---|
22243 |
Uh…OK…22,000 is most definitely "thousands of cardiology providers." But why didn't the WSJ writer/editor say "tens of thousands", which would boost by an order of magnitude how rare Dr. Weaver is in his work? Maybe the person who made the decision is unusually anal-retentive about the number of words used to describe numerical quantities (unlike me, for example). Or maybe our query is just too simple?
There's no reason why we have to stick to SQL. This seems like a question Google could solve if we ask, "how many cardiologists in the us"
:
Who could argue against a site named CardioSolution – though we would expect our count to be less than whatever the official estimate is, as our count only includes cardiologists who participate in Medicare…but maybe that's most of them?
Before we move on, let's do a GROUP BY
aggregation that takes into account that providers
has fields that differentiate between individual doctors and hospitals, and doctors and non-doctors. These fields, respectively, are nppes_entity_code
and nppes_credentials
:
SELECT provider_type,
nppes_credentials,
nppes_entity_code,
COUNT(*) AS ct
FROM providers
WHERE provider_type LIKE '%Cardiology%'
GROUP BY provider_type, nppes_credentials, nppes_entity_code
ORDER BY ct DESC
LIMIT 5;
provider_type | nppes_credentials | nppes_entity_code | ct |
---|---|---|---|
Cardiology | MD | I | 10448 |
Cardiology | M.D. | I | 9442 |
Cardiology | DO | I | 462 |
Cardiology | D.O. | I | 436 |
Cardiology | NULL | I | 363 |
Well, the good news is that there is only one type of 'Cardiology'
descriptor. But the nppes_credentials
is pretty messy, just in case you wanted to do something with that…
Oh, it's also possible that "cardiology providers", as the WSJ phrases it, is something distinct from a doctor specializing in Cardiology…Anyway, moving on – I think we've established the Medicare data has "thousands" of cardiologists, who fall under the category of "cardiology providers".
G0166
]"We actually don't need to join against providers
since payments
has its own provider_type
column. Also, we assume that all hcpcs_code
and npi
combinations happen exactly once, so no need to do some kind of GROUP BY
:
SELECT COUNT(*)
FROM payments
WHERE hcpcs_code = 'G0166'
AND provider_type = 'Cardiology';
COUNT(*) |
---|
200 |
200 does not equal 239. What are we missing here? Let's remove the WHERE
condition to get a count of all doctors:
SELECT COUNT(*)
FROM payments
WHERE hcpcs_code = 'G0166';
COUNT(*) |
---|
239 |
There's our answer – the phrase "cardiology providers" includes anyone who is happening to provide this particular procedure.
Out of curiosity, what is the breakdown by provider_type
for the "counterpulse" procedure, i.e. HCPCS G0166
?
In other words, is Dr. Weaver the lone internal medicine doctor who bills for this procedure?
Now we need a GROUP BY
:
SELECT provider_type,
COUNT(*) AS provider_count
FROM payments
WHERE hcpcs_code = 'G0166'
GROUP BY provider_type
ORDER BY provider_count DESC;
provider_type | provider_count |
---|---|
Cardiology | 200 |
Internal Medicine | 17 |
Cardiac Electrophysiology | 4 |
Family Practice | 4 |
Nurse Practitioner | 4 |
Hematology/Oncology | 2 |
Physical Medicine and Rehabilitation | 2 |
Pulmonary Disease | 2 |
Anesthesiology | 1 |
General Practice | 1 |
General Surgery | 1 |
Orthopedic Surgery | 1 |
Besides cardiology, internal medicine is the specialty most associated with this procedure. So Dr. Weaver's practice isn't completely out of left field…
Skipping a bit out of order, trying to verify this assertion will confirm which of the payments
and providers
columns correspond to the number of individual patients and the number of times the service was rendered.
(Again, Dr. Weaver's npi
is 1669465928
)
Best to split this assertion into 2 queries:
The previous assertion doesn't tell us how many patients Dr. Weaver has – though we could obviously find out by looking at what he has in the total_unique_benes
column. The number 615
is what we get if we were to take 99.5 percent of total_unique_benes
.
So, return a single number, compare it to 615:
SELECT total_unique_benes * 0.995
FROM providers WHERE
npi = 1669465928;
Close enough:
total_unique_benes * 0.995 |
---|
614.91 |
This can be confirmed in the payments
table. Again, the relevant HCPCS code is G0166
. I think bene_unique_cnt
is the important column, but let's include the other counts as a comparison:
SELECT bene_unique_cnt, bene_day_srvc_cnt, line_srvc_cnt
FROM payments
WHERE hcpcs_code = 'G0166'
AND npi = 1669465928;
bene_unique_cnt | bene_day_srvc_cnt | line_srvc_cnt |
---|---|---|
615 | 15792 | 16619.0 |
So bene_unique_cnt
is the column that describes how many people the procedure was administered to.
We already know that line_srvc_cnt
is the total number of times the procedure was administered.
Since bene_day_srvc_cnt
is less than line_srvc_cnt
, we might assume that it is refers to the number of days that the procedure was administered. Since it is nearly equal to line_srvc_cnt
, perhaps it indicates that this procedure is typically a one-a-day thing?
We already covered this above, but to reiterate:
SELECT line_srvc_cnt
FROM payments
WHERE hcpcs_code = 'G0166'
AND npi = 1669465928;
line_srvc_cnt |
---|
16619.0 |
OK, this requires taking a step back in the text. Here's the assertion in its context:
The government data show that out of the thousands of cardiology providers who treated Medicare patients in 2012, just 239 billed for the procedure, and they used it on fewer than 5% of their patients on average.
This is easier to calculate now that we know how to query for total Medicare patients per provider and number of patients a procedure was administered to.
The query is a little complicated in that it requires an INNER JOIN
to get the two numerical fields – payments.bene_unique_cnt
and providers.total_unique_benes
– which we then perform arithmetic on to verify the "5% of their patients on average"
SELECT
AVG(100.0 * payments.bene_unique_cnt / providers.total_unique_benes)
AS avg_pct_patients_treated_w_counterpulse
FROM payments
INNER JOIN providers
ON payments.npi = providers.npi
WHERE hcpcs_code = 'G0166';
Woohoo:
avg_pct_patients_treated_w_counterpulse |
---|
4.55289829873767 |
This is mostly a trick question (I think, though I could be missing something blindingly obvious…). The affiliation with the Cleveland Clinic is not something that can be derived from the Medicare data – as you've seen, identifying information is limited to doctors' names and locations.
So where does this fact about the 141 Cleveland Clinic cardiologists come from? It's not hard to imagine the WSJ team either manually looking this up or finding the Cleveland Clinic affiliations in some other datafile that makes it easy to filter the providers
table.
Or maybe they just called the Cleveland Clinic and asked for the number. The reporting doesn't stop at the database's limits.
So after what seems like countless SQL queries, how far did we get into the first story in the WSJ "Medicare Unmasked" series?
Three paragraphs.
Well, actually two paragraphs, but the second one didn't contain anything from the Medicare database. From the PDF version of the WSJ Pulitzer entry, I've highlighted the paragraphs in pink that we've covered so far:
But consider how much data-based evidence was crammed into those 2 paragraphs. Or, to put it another way, look at how many ways the WSJ managed to not just introduce Dr. Weaver, but impugn his professional practice:
And, if you've read the full story by now, you know the WSJ reporters haven't even begun to scrutinize Dr. Weaver's practice.
If you knew almost nothing about the details of the Medicare data, this is what you could learn from the dataset in just those first 3 paragraphs:
We have 6 more pages to go in the WSJ's article…but from this first page, we've encountered many of the angles used by the WSJ team to analyze the dataset and apply it to their reporting. If some of this was intimidating to you, I would have to say that the first page is probably the most difficult, even in just writing the SQL queries.
But things get much easier now that the dataset isn't so familar.
Note: I've written out the SQL queries for the rest of the "Taxpayers Face Big Tab For Unusual Doctor Billings" article. But I've posted it as a midterm for my data journalism class. I'll update this page with all of the queries once the midterm is done.
Done with all the exercises. It's probably the right time to check out the CMS's official documentation