How Bad is the Opioid Crisis?

Let’s fact-check The New Yorker on the opioid crisis and learn how to get answers from data. It’s easier than most people think!

They recently posted a tweet in which they make a pretty shocking claim.


The link goes to a visually striking, very dramatic presentation of a very politicized issue. The piece, entitled Faces of an Epidemic, is dated October 30, 2017, and is largely about the photographs. There is some text, and the piece makes this claim a few pages in:

Opioids now kill more than fifty thousand Americans a year, ten thousand more than AIDS did at the peak of that epidemic—more, too, than gun homicides and motor-vehicle accidents. Opioid overdoses are now the leading cause of death for Americans under the age of fifty.

Those are a couple of very interesting claims! It’s pretty believable that opioid overdoses claim more lives than homicides of any sort, but motor-vehicle accidents? The leading cause of death for Americans under fifty years old? More than fifty thousand a year?

We have some incredible claims, a politically charged issue, and no citations, all of which are red flags. If it seems a bit difficult to believe, how do you find out whether it’s true? It seemed dubious to me, so I had a look. Let me show you how to figure this sort of thing out from the comfort of your own terminal.

First We Need the Data

This is light data science, but don’t let that put you off it. Luckily, the data should be readily available. There are vast troves of open data, and the US government is often pretty good about releasing data. The Census Bureau, the EPA, and even the Social Security Administration have a few big chunks of open data. You can download the source code for the Apollo Guidance Computer.

In our case, the CDC is the relevant Three-Letter Agency. Among other things, they’re charged with containing epidemics, so hospitals in the US feed them records. Every time you get sick, most especially if you die, they hear about it one way or another. A quick search gets you to their data, and the piece that’s relevant here is a dataset called the Mortality Multiple Cause Files. You can get data back to 1968!

We’ll get the 2016 report, as it’s the most recent available. (The data goes through an unbelievably arcane pipeline, so there’s quite a bit of lag.) It’s about a hundred megs, but uncompressing it gets you a 1.3GB file.

Here’s the downside of all this open data: it’s still the government producing it. I was worried it would be in some proprietary format, but it’s just plain text. It’s not at all readable, though: it’s all fixed-width fields and encoded. I’ve written more than my share of ETL code, and you might be surprised (or horrified) at how much data still resides in fixed-width files.

The CDC helpfully provides a guide to decoding the data. If you want to play with the data, you’ll want to keep that around.

How to Get Answers

First, we have to decide the questions we want to ask. To verify the claims, we want to know how many Americans under 50 die of opioid overdoses, and whether that’s the leading cause.

There are a lot of approaches. R is an excellent tool for this. I often load and parse things in irb and then play with the data in memory. PostgreSQL is designed to interactively query large datasets, and it has facilities for this sort of thing. For one-off programs, as long as it gets you an accurate answer in a timely manner, there aren’t any objectively bad approaches. I’ll sometimes change approaches when one tool seems preferable to another. This time, I started and ended with AWK. It’s been a standard tool in Unix for decades, and it’s very easy to write quick awk scripts to process big chunks of text.

Getting Our Hands Dirty

For serious matters like “How bad is the opioid crisis?”, I think it’s important to show your work, even if you’re a robot. If it’s not your own work, give a source. This keeps you honest, and it helps people learn about the process. Since the point here is to show a process by example, we’ll go into detail, and the source code is embedded below.

The data we have is a fixed-width file: each line is a fixed size (490 characters for this one), and each field is a fixed number of characters from the beginning of the line. As the CDC’s guide indicates, age is a two-part field. We’ll start by parsing that: character 70 represents the units, and 71-74 the number. “Year” is the largest unit provided, so as long as the unit is years and the age is less than 50, or the age is recorded in a unit besides years, we have a record of a death of an American under 50 in 2016.

AWK is great at handling delimited text, but for fixed-width files we have to slice each record apart with `substr()`. One wrinkle: the age field is zero-padded, and since `substr()` returns a string, AWK compares a value like "0049" as text, which upsets numeric comparisons. I incremented and then decremented the age by 1 to force it to a number.
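Concretely, the parse looks something like this, a minimal sketch with the column positions and the years-unit code taken from the CDC guide described above:

```awk
#!/usr/bin/awk -f
# Print the unit code and age of each record, demonstrating the
# fixed-width parse. Character 70 is the age's unit code (1 means
# years, per the guide); characters 71-74 are the zero-padded count.
{
    unit = substr($0, 70, 1) + 0
    age  = substr($0, 71, 4) + 0   # "+ 0" coerces "0049" to 49
                                   # (age + 1 - 1 does the same thing)
    print unit, age
}
```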

We can already get some useful information, namely how many of the 2.7 million records apply to people under 50: 267,647.

The cause of death is encoded as an ICD-10 abbreviation in columns 146-150, and contributing factors are listed in columns 344-444. Accidental overdoses are marked X40-X44, intentional overdoses X60-X64, and unknown intent Y10-Y14. The various substances are listed in the second field; opioids are T40.0 through T40.4, and to be generous we’ll add T40.6, which signifies “unknown narcotics”. So to select records where the cause of death matches an overdose and there were any opioids involved, we’ll have to look through both fields.
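In code, that’s two more `substr()` calls and a pair of regular expressions. One handy detail: the file stores ICD-10 codes without the decimal point, so T40.3 appears as `T403`, and a character class can do the work. A sketch:

```awk
#!/usr/bin/awk -f
# Count records where the cause of death is an overdose and any
# opioid appears among the contributing factors.
{
    cause   = substr($0, 146, 5)     # underlying cause of death
    factors = substr($0, 344, 101)   # contributing-factor codes

    # X40-X44 accidental, X60-X64 intentional, Y10-Y14 unknown intent;
    # opioids are T40.0-T40.4 plus T40.6 ("unknown narcotics").
    if (cause ~ /^(X4[0-4]|X6[0-4]|Y1[0-4])/ && factors ~ /T40[0-46]/)
        overdoses++
}
END { print overdoses }
```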

The script loops through each record and increments a total. (This isn’t strictly necessary in AWK; we could use the built-in NR variable, which counts records.) Then it parses out the age and the ICD-10 classification for the cause of death. When it finds someone under 50, it increments another counter for that total. Next, it checks whether the death matches the causes we’re looking for and increments a counter for that, as well as one of two more counters to record whether the overdose was deemed accidental or intentional. Finally, we pull off the first character of the ICD-10 classification to track major classifications, which lets us see how other causes rank.
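Putting the pieces together, the core of the script looks roughly like this. This is a simplified sketch of the real thing (which is linked below), with the column positions and code ranges from the description above:

```awk
#!/usr/bin/awk -f
# Count deaths of Americans under 50, and opioid overdoses among them,
# per the CDC record layout described above.
{
    total++                             # or just read NR in the END block

    unit = substr($0, 70, 1) + 0
    age  = substr($0, 71, 4) + 0
    if (unit == 1 && age >= 50) next    # unit 1 is years; others are smaller

    under50++

    cause   = substr($0, 146, 5)
    factors = substr($0, 344, 101)

    # First character of the ICD-10 code tracks the major classification.
    class[substr(cause, 1, 1)]++

    if (cause ~ /^(X4[0-4]|X6[0-4]|Y1[0-4])/ && factors ~ /T40[0-46]/) {
        opioid++
        if      (cause ~ /^X4/) accidental++
        else if (cause ~ /^X6/) intentional++
    }
}

END {
    printf "%d records, %d under 50, %d opioid overdoses (%d accidental, %d intentional)\n",
           total, under50, opioid, accidental, intentional
    for (c in class) printf "%s: %d\n", c, class[c]
}
```

Run it with `awk -f mortality.awk` on the uncompressed data file.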

The Verdict: The New Yorker Exaggerated

When we run it, we get a different result than the New Yorker: among Americans under fifty, opioid overdoses account for 29,995 deaths, not quite the “more than fifty thousand” the New Yorker claimed. The leading causes of death for Americans under fifty years old are heart disease (35,888) and cancer (31,289). They were correct that opioid overdoses cause more deaths than transportation accidents, which totaled 24,887.

I don’t know where the New Yorker got the statistics they used or what the basis for their claims was. Maybe they used a projection, maybe there’s a flaw in their data or methodology, or maybe there’s a flaw in mine. (It could be all three!) If you’re cynical enough, you could probably come up with several other theories, but we’re left to speculate because they didn’t show their work.

The Source Code

If you’re curious or you want to try to spot a bug (I spotted one while writing this), you can see the code here:

Postscript

If you find any bugs in the code or flaws in the methodology, please do let us know. Don’t believe everything you read.

I’d also like, for the sake of clarity, to point out that this article is intended to show how anyone with enough interest can find answers, using publicly available data and simple tools. If you’re willing to roll up your sleeves, you can fact-check the New Yorker, or satisfy some other curiosity you might have about the world. I strongly encourage you to do so! (But do feel free to hire us if it warrants calling in the professionals.) It took about half an hour because my work involves this sort of thing, but it’s possible for nearly anyone to do.

To further clarify, I don’t intend to trivialize the problem of opioid abuse, but I do think accuracy is important, especially in matters of public policy, and it’s prudent to be suspicious of any numbers cited in the vicinity of a political issue.


Charles Hoy Forth Moore

Yesterday, we went to Vol 0x03 (Language, Computation, Art) of Software Developers Dialectic Galactic.

I had the privilege of speaking at the event last night, and wrote a talk entitled “Charles Hoy Forth Moore”. The first two events were tough acts to follow, so I consider myself very lucky to have gotten any reaction better than an immediate mass exodus by the participants/attendees. I had promised in the slides to put them onto the internet, so here are the slides that I was able to use during the talk, plus a collection of links and some explanatory comments, for the benefit of those that were present. It may be of interest to some that weren’t in attendance, but like any slide deck, it does not quite get everything across.

The topic of the talk was mostly about FORTH, specifically jonesFORTH, and was intended to convey at least a few things about language and computation in a more general sense. If you haven’t read jonesFORTH, I certainly recommend it!

SDDG

It’s a really interesting and unique meetup, and fits very well on this side of LA. Pasadena, where our offices are, is also home to NASA’s Jet Propulsion Laboratory and Caltech. Overall, tech here is nerdier than in Silicon Beach, which is probably why Supplyframe and Hackaday live here.

There aren’t any cameras at the meetup, the presentations are jumping-off points for open discussions, and although there are certainly plenty of engineers present, there are also plenty of linguists, artists, students, musicians, journalists, and others. The discussion tends to be fascinating.

Slides

slide 001

slide 002

slide 003

The Giant, the Insect, and the Philanthropic-looking Old Gentleman is an excellent short story by Charles Hoy Fort, a very fun author and noted collector of strange accounts and crank science. This story in particular is about an application of a universal theory of everything to the snake oil industry.

This quote should suffice to give an impression of his writing:

I believe nothing of my own that I have ever written.

slide 004

Well, that explains the name of the talk!

slide 005
slide 006

Chuck Moore’s site is down; he has decided to retire from the internet. However, he gave permission to Lars Brinkhoff (the apparently tireless creator of the ForthHub community) to re-publish Programming in a Problem-Oriented Language.

slide 007

These are not the only FORTH-like languages, but they are two fairly mature ones. Joy leans more toward the functional style, while Factor is performant and very convenient.

Because I did not think that it would supplement the conversation much and because it sits in the uncanny valley between a conventional FORTH and some of the more outré stack-based languages, I didn’t bring up Pez, but digressions are easier to forgive in a supplement/addendum to a talk. Pez is based on ATLAST, though it’s gone off the rails from there.

slide 008
slide 009

I’m not kidding.

slide 010

Erratum: HERE behaves differently in Pez, so there is a bug here. See the jonesFORTH implementation of IF and THEN.

slide 011

I think there is something really cool about a language that can implement control flow from tiny primitives.

We didn’t get to go into some of the more interesting parts of the runtime, but there was a pretty interesting conversation going on when this slide came up.

slide 012

This is Paul Graham’s version, ported to Common LISP, of John McCarthy’s specification for LISP.

slide 013

This code is really fascinating.

Urbit is another language with a very concise specification. I very highly recommend reading the author’s thought experiment about the language. I’d be hard-pressed to come up with a stranger programming language.

slide 014
slide 021
slide 022
slide 015

These slides are presented here in the order they were presented last night. This part of the slideshow ended up a bit chaotic because it spun off a lengthy essay I have started writing for my personal blog.

slide 016

The most relevant one here is The Rise of Worse is Better, a part of Lisp: Good News, Bad News, How to Win Big. It was very insightful and brilliant, and as a result it was very unpopular at the time.

slide 017
slide 018
slide 020
slide 019

We talked a little about brevity, though I didn’t get to go into these examples in depth. Consider these four slides a teaser for a potential future presentation.

I didn’t get to the rest of the slides. Hopefully I’ll get to them soon!

Links

slide 023

Reflections on Trusting Trust by Ken Thompson. There are several places to find this talk. This is one of the bits that I didn’t get to cover during the meetup, but the connection between my talk and the part of Ken’s lecture about “teaching” the compiler should probably be easy to see.

Bytecode Interpreters for Tiny Computers by Kragen Sitaker. Very, very highly recommended reading. Another part that we didn’t get to.

This page intentionally left nonblank

slide 024

This slide is a bit redundant on this page, except for the ASCII-art rendition of our logo.


FeelsBot, an Experimental Slack Bot

We’ve released a Slack Bot! It’s called (somewhat lightheartedly) “FeelsBot”.

What is FeelsBot for?

FeelsBot is kind of a social experiment. Sometimes a channel can turn rough, and this impacts morale. Usually a manager, a lead, or someone else will step in and impose rules or chastise users. This is the top-down approach.

But we’re all adults, aren’t we? Does a room full of adults need someone to tell them to be nice?

FeelsBot is an attempt to see if we can get by without that. Instead of requiring a real human’s attention, we outsource the task to a bot, and instead of top-down (and often subjective) rules, the bot helps people stay mindful of what they’re saying and applies the same standards to everyone. Humans are fundamentally cooperative, and FeelsBot can help us prove that: we think teams can self-regulate better if they get a reminder than if the boss has to step in to calm people down.

What does FeelsBot do?

Simply put, FeelsBot sits in your Slack channels and does sentiment analysis on the messages to gauge whether people are being nice, neutral, or hostile. When the mood in the channel starts to change, FeelsBot responds with an emoji to indicate what’s happening, growing progressively happier or more upset depending on what it sees in the room.

FeelsBot is tuned to follow conversations, including how much weight the passage of time should carry, but FeelsBot also learns! After a few rounds of training on a large corpus of conversation data and subsequent use by our team, it has learned a few thousand new words and a few dozen new emoji, and it now classifies new words more accurately than we could by hand.
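To give a flavor of the mechanism without publishing FeelsBot’s internals, here’s a toy sketch of the idea: score each message against a word list, let the running mood decay over time, and react when it crosses a threshold. (The word lists here are placeholder stand-ins, and the real bot uses a trained model rather than a hand-written list.)

```awk
#!/usr/bin/awk -f
# Toy mood tracker in the spirit of FeelsBot. Reads chat messages on
# stdin, one per line; the word lists are placeholder stand-ins.
BEGIN {
    n_pos = split("thanks great nice love good", pos, " ")
    n_neg = split("hate awful terrible broken wrong", neg, " ")
}
{
    line = tolower($0)
    score = 0
    for (i = 1; i <= n_pos; i++) if (line ~ pos[i]) score++
    for (i = 1; i <= n_neg; i++) if (line ~ neg[i]) score--

    mood = mood * 0.8 + score          # older messages count for less

    if      (mood >=  2) print ":smile:"
    else if (mood <= -2) print ":grimacing:"
}
```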

Try it out!

If it sounds like fun and you’d like to participate, then there’s good news: you can add FeelsBot to your own Slack channels!


(We believe in privacy, and we don’t spy; it’s in the Terms.)

We chose Slack because we have been using it lately, and it had a convenient API for doing what we wanted. If there’s any interest, we’ll likely end up porting FeelsBot to other platforms. The code is pretty modular, so new APIs should be trivial. (We have tested out an IRC version!)

Of course, if you have a more specialized need for NLP or machine-learning, we can do that.


What is Bayes' Theorem?

Let’s talk some about Bayes’ Theorem, and why you ought to know it. It isn’t just good to know for programmers: it helps you straighten out probabilities. This is especially useful for, say, understanding statistics in the news, where statistics are often abused to sensationalize stories.

To put it very simply, Bayes’ Theorem deals with the detection of events, and helps you update your estimate of how likely one event is once you’ve observed another.

The Theorem

`P(A|B) = (P(B|A) * P(A))/(P(B))`

A very quick rundown of the notation: `P(X)` is the probability (a number between 0 and 1) that we will detect an event `X`, and `P(X|Y)` is the probability that we will detect an event `X` given that we have already detected another event, `Y`.

That’s the small, beautiful form. For real data, you usually have no measurement of `P(B)` independent of `P(A)`: what you want is the posterior, `P(Prediction|Data)`, but it’s not often the case that you have all of the possible data in hand. Luckily, there’s a handy relation, the law of total probability, that will get you `P(B)`:

`P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)`

Here the notation `¬` is the logical negation, so `¬A` means `A` was not detected; numerically, `P(¬A)` has the same value as `1 - P(A)`. This of course yields the longer form:

`P(A|B) = (P(B|A) * P(A))/(P(B|A) * P(A) + P(B|¬A) * P(¬A))`

A Simple Example

One of the very cool things about Bayes is that it articulates something that “everybody knows”: essentially, this is how your brain works. (More on this in a later post.)

Let’s say you live in Pasadena, where the Rekka Labs are, and you have a dog. The dog barks. According to the City of Pasadena, there were 36 residential burglaries in June of 2017. According to Wikipedia, the Census reports there are 55,270 households in Pasadena, meaning that you have a `36/55270` probability of your house getting burgled that month, which yields a daily probability of about 0.00002172. (We’re going to gloss over Bernoulli distributions and a couple complicating factors for the sake of simplicity.) Let’s suppose the dog barks in the middle of the night a couple of times a week (a nightly probability of approximately 0.2427), and if an unfamiliar person comes near your house, the dog barks 9 times out of 10. What are the odds that you should get out of bed to check?


What we’re looking at is `P(A|B)`, where `A` is the event that you are being burgled and `B` the event that the dog barks. The probability that the dog barks on a given night (`P(B)`) is 0.2427, and the probability that the dog barks given that someone is breaking into your house (`P(B|A)`) is 0.9. This makes the math pretty easy to work out:

`P(Burglary|Bark) = (P(Bark|Burglary) * P(Burglary)) / (P(Bark))`

This gives us the following:

`(0.9 * 0.00002172) / 0.2427 ≈ 0.00008054`

So you have about a 0.008% chance of being burgled when your dog barks. That solves the curious incident of the dog in the night-time. (Again, sorry Bernoulli, you’re beyond the scope of the current blog post.)
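Since we like doing this sort of thing in a terminal, here’s the whole calculation as a tiny AWK program, handy for playing with the numbers:

```awk
# bayes.awk -- run with: awk -f bayes.awk
BEGIN {
    p_burglary = 0.00002172    # 36 / 55270 / 30, the daily probability above
    p_bark_if_burglary = 0.9   # the dog barks at a prowler 9 times out of 10
    p_bark = 0.2427            # the dog barks on any given night
    printf "P(Burglary|Bark) = %.8f\n", p_bark_if_burglary * p_burglary / p_bark
}
```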

You’d never get out of bed to check the window if that’s the only calculation you were doing, but you also apply a cost-benefit analysis. The cost of getting out of bed is low, but the cost of being burgled is high, so unless you’re very tired or it’s a very cold night (which increases the cost of getting out of bed), you probably check sometimes. If you play with the numbers a bit, you can see that we have formal mathematical proof of why a yappy dog gets ignored (much higher `P(B)`) or why a dog that doesn’t bark as often is not as useful for this purpose (lower `P(B|A)`).

(The canonical explanations usually involve cancer screenings. Concrete, quotidian events that don’t involve chronic illness seemed like a better example, and easier to grasp without the cognitive overhead of a memento mori.)

Thomas Bayes

Bayes was born in London around 1701 and died in Kent in 1761, and during that time never actually published the theorem for which he is famous. He was a Presbyterian minister, a philosopher, and a mathematician. He wasn’t exactly a tragic genius, though: he was inducted into the Royal Society in 1742. The actual situation is stranger: he wasn’t considered very noteworthy. Obviously you need some notoriety to become a Fellow of the Royal Society, but there aren’t any biographies, he wasn’t knighted, and we don’t even have any known contemporary portraits. We’re not even sure why he was inducted into the Royal Society! Working backwards, we kind of assume it was for a defense of Newton’s Calculus, but that work would probably not be very well-known today had it not been written by Bayes.

So it’s kind of lucky that we ended up with Bayes’ Theorem to begin with: one of his friends (Richard Price, who was a somewhat important figure in the history of the American Revolution) went through his unpublished essays after his death, and ended up publishing An Essay towards solving a Problem in the Doctrine of Chances in 1763, two years after Bayes died of…some kind of illness. Nobody bothered to write that down, either, apparently.

So, ironically, the odds that we’d have Bayes’ Theorem were not great. What if the papers had been discarded after his death? What if Price hadn’t recognized the importance of the work? (After all, Bayes himself hadn’t published it, and we don’t know if he was going to.) We’d lose an entire field of statistical analysis!

Worse, no one would be able to understand this XKCD.

Next Time: A Classifier, Real Data

We’ll follow up soon by constructing a Bayesian classifier that is simple to use and understand, testing it out on some real data, and talking about some real-world applications, including A/B testing, spam filters, and FeelsBot.

Incidentally, there appears to be a popular belief that “Bayesian vs. Frequentist” represents an ideological split in statistics; this isn’t actually the case.


Acknowledgements

This post was a little math-heavy, so we used asciimathml, which was fun and convenient!

Reverse acknowledgement of moderate shame: according to MDN, although Firefox and Safari can display MathML, Chrome and IE don’t at time of writing, which is what necessitated adding the JavaScript to begin with.

We’ve updated this post after getting some excellent feedback from Rob.


Domain Knowledge is Key for Creating Great Tools

One of the most fun and interesting parts of consulting is the diversity of business types that you get to interact with. I believe that this is unique to building software and hardware solutions: not only is tech ubiquitous across industries, but good software has to be created with a solid understanding of the rules of the problem it solves.

People tend to think of pure software businesses when they think about the tech industry, but it’s not all apps and e-commerce. Lots of companies need internal tools to streamline their own businesses, so a large amount of software (perhaps most software) is not written to offer to the public.

Two of the companies I worked for before we started this company did consulting, and of course, Rekka Labs is a tech consulting business. I’ve had the opportunity to write code that powered tools for education tech, animation, energy production, warehousing, and retail, among others, in addition to more typical software offerings like e-commerce, advertising, and social media applications. This may not be a universal taste, but I absolutely love learning what problems are presented by the sometimes chaotic atmosphere in a warehouse, by the different types of soil and their impact on oil pipeline coating, or by the complex distribution of funds behind large-venue ticket sales.

Useful Software Represents Domain Knowledge

This is certainly universal. If you take a look at any piece of software that has stood up over the years, you will find that the code is a representation of knowledge about a process and the domain in which the process is carried out.

That may not sound quite so obvious, but it is a necessary, direct consequence of an obvious fact: the tool fails to be useful unless its creators understand the way the tool is used and why the tool would be desirable. If the creator understands that and is able to apply the necessary skill to create the tool, the tool embodies knowledge about the process. A tool that does not embody this knowledge does not turn out to be useful: there is a mismatch between the process people want to carry out and what the tool can help them accomplish.

In simpler and more concrete terms, a screwdriver is the manifestation of knowledge about the process of driving a screw. You want to apply torque to turn the screw so that it is embedded in the wood. The design decisions that apply to a screwdriver are representations of several pieces of knowledge about this process. The head has to match the screws, the tip may or may not be magnetized, the handle can be round or angular, with deep grooves or shallow, and the handle can be smooth, textured, or covered with rubber. All these decisions rely on knowing a lot about the screwdriver’s use:

  • What type of screw are you driving?
  • Should the tip be magnetized? Does it need to be used near equipment sensitive to magnets, or are the screws too heavy for magnetized tips to be useful?
  • What are you driving the screw into? Does it have to pierce wood or drywall, or are the screws driven into holes that match their threading?
  • How confined is the space where the screw is driven? Is there plenty of space to work with, or is even the user’s hand going to be hard to get in?
  • What sort of grip is convenient? Is the user going to wrap their hand around it, or hold the handle in their fingertips?
  • How long is the screwdriver going to be used? Does it have to be comfortable even after being used all day, or does it have to be convenient for intermittent use?

That’s just a small number of factors that differentiate the type of screwdriver I keep by my desk for computers from the type that lives in a construction worker’s toolbelt or the type that comes with a repair kit for eyeglasses.

The creator of a tool has to understand very clearly how the tool is used. Have you ever used a bad screwdriver? Do you have any appliances in your kitchen or garage that you never make use of because they aren’t matched to their purpose?

Have you ever had to fight with useless software?

Make Sure the Creator has the Domain Knowledge

The first thing you have to do when creating a tool, whether you are a developer, you are managing developers, or you are hiring us, is to make sure that the creators of the tool understand the problem the tool is supposed to solve and how it will be used to solve this problem. This is one of the benefits of user stories: each one is phrased as an expectation a tool’s user has of the tool, capturing what the user is going to try, what they expect to happen, and why they expect it.

But you can’t stop there. The creator of the tool has to understand the entire context of the process to create a really great tool. In some cases, the tool might be able to bridge the gap between two processes, or a different tool might solve the problem even more effectively.

For example, when we built DailyRead, we were chatting with our client about the process of writing articles for the site, and he mentioned that there was an external tool they had used to get various metrics about the text, like the Flesch-Kincaid reading ease or the SMOG index. They were copying the article and pasting it into this tool. When I found this out, I almost fell out of my chair. Manual processes like that are always a red flag, but it never occurred to the client that it would be possible for us to integrate that functionality directly into the editor. We built the feature, and it turned out to be fast enough that it could be done in real-time, while the article’s author was typing. (This was distracting to authors, though, so we turned it into a button.)
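For the curious, the scoring math itself is tiny. Here’s a rough sketch of the standard Flesch reading-ease formula in AWK, with a deliberately naive vowel-group syllable counter (real implementations are smarter about syllables):

```awk
#!/usr/bin/awk -f
# Rough Flesch reading-ease score for text on stdin.
{
    for (i = 1; i <= NF; i++) {
        words++
        w = tolower($i)
        n = gsub(/[aeiouy]+/, "", w)    # vowel groups approximate syllables
        syllables += (n > 0 ? n : 1)
    }
    sentences += gsub(/[.!?]+/, " ")    # count sentence-ending punctuation
}
END {
    if (words == 0 || sentences == 0) exit
    printf "%.1f\n", 206.835 - 1.015 * (words / sentences) \
                             - 84.6 * (syllables / words)
}
```

Run it as `awk -f flesch.awk article.txt`; higher scores mean easier reading.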

How to Make Sure It Works

There’s absolutely no substitute for close communication between developers and users. Full stop.

Implementing that optimally is the only real question. This brings me to another great property of software: it’s easy to iterate and improve designs, certainly much easier than with physical goods like screwdrivers. Steve Wozniak remarked that playing with the color scheme for Breakout was much easier in software than hardware:

I called Steve Jobs over to my apartment to see what I’d done. I demonstrated to him how easily and instantly you could change things like the color of the bricks. Most importantly, in one-half hour I had tried more variations of this game than I could have done in hardware over 10 years.

He ended up with a color scheme he loved because he only had to change a number in the source code to try it out rather than replacing or rearranging physical chips on a board, meaning that feedback was nearly instantaneous. That’s the theoretical limit for a team: the user, the designer, and the developer are the same person. The closer we can get to that limit, the better the product is. In this case, the Apple ][’s BASIC is still fondly remembered, and Woz’s version of Breakout had a very strong influence on the hardware and software of the Apple ][, which was the machine that started Apple on the path to being the tech giant it is today.

Our process was designed with domain knowledge at its center. We start by talking about the problem, then talking about the business to understand the domain, and then talking about potential solutions before we get to the task of producing a spec for the solution. Throughout development, we keep communication and feedback at the forefront, which ensures that the developers’ understanding of the problem and the problem domain is constantly refined. That is the best way to ensure that the product is great.

The only way to have a great tool or a great product of any sort is to ensure that everyone building it really understands it. So the best team you can hire is a team that works hard not just on your product, but also on understanding your product and how it fits into your business and your users’ lives. Get started with that team.