Predicting Prescriber Induced Overdoses Using Machine Learning Algorithms.


For my capstone project, while taking the data science immersive program at General Assembly, I’ve decided to tackle the major public health issue that is the opioid crisis in the united states of America. How do the opioids start to emerge as a public health problem? It’s worth looking at the history. I put them in a very concise way and further researching may be necessary for details.

Screen Shot 2018-07-24 at 23.05.58.png


Screen Shot 2018-07-24 at 23.06.14.png
How opioids become a catastrophe

Right now, our country faces an average of 115 opioid overdose deaths each day according to the data from wonder/CDC. Over 280 million prescriptions were written in 2012 alone and are a big contributing factor to this issue. Prescription opioid abuse, misuse, and dependence as a public health hazard is a daily phenomenon now; according to Centers for Disease Control and Prevention (CDC), over 1,000 people are treated in the emergency department every day for misusing prescription opioid drugs. The number of overdose deaths from substance abuse in the US increased from around 9,000 in 1990 to more than 64,000 in 2016, up by 23% increase from the previous year. According to the data source, 75% of drug OD deaths involve opioids (prescribed or illegal). Prescription opiate drugs, especially methadone, oxycontin, and hydrocodone, are believed to have played a significant role in this public health crisis sweeping the US. In addition to the health issues arising from an overdose, opioid epidemic requires significant economic resources from cities and state governments for emergency call response and policing. The estimated total costs of US opioid epidemic reaches over 78 billion dollars.

Screen Shot 2018-07-25 at 08.38.41.png
Opioids: Addictive painkillers

The major source of diverted opioids is physician prescription. However, opioids prescription to patients with acute pain and patients with chronic pain requires a careful distinction. As Opioid is regarded as one of the most effective drugs for the acute pain management, limiting its use for patients who are in urgent need of pain control, post-surgical status, cancer patients, and other health crisis would not only be inhumane but also defeat its intended purpose. On the other hand, use of opioids for chronic non-malignant pain control has remained controversial for decades and requires a closer look in regards to the current opioids health crisis.

Screen Shot 2018-07-24 at 22.37.17.png
The severity of chronic pain as compared to Coronary artery disease(CAD)

As data scientists, we can’t really question whether doctors are pushing prescriptions from sponsorship from pharmaceutical companies, or if they are constantly barraged by patients claiming they are in severe pain and actually need the medication, or really, most other factors for that matter. So, what I wanted to do was try and find the likelihood that a provider would prescribe an opioid and identify any patterns, if any, from that data. Based on my personal research from the currently published materials by CDC and the National Institute of health the number one cause of death from age 15–50 years in the US is drug opioids death. I have shown the comparison with other significant events from our history as follows:-

  • Drug OD deaths in 2016: 64,000
  • Vietnam total deaths (1965–1975) 58,000
  • Motor vehicle Accident deaths in 2016 40,000
  • HIV deaths in peak year (1995) 47,000
  • Gun deaths in peak year (1993) 39,000

This study attempts to build a predictive model of a likelihood of a health care provider prescribing opioids drugs to patients with chronic pain. More specifically, we will identify the correlated features of non-opioid drugs prescription history with an opioid prescription. In addition, we will distinguish gender, specialty, and location that are more highly correlated to the prolonged use of a long-term (more than 12 weeks) supply of opioids.

Datasets and Inputs

A total of eight datasets, detailed data, provider summary, national drug summary table, national health expenditure(NHE), property and violent crime by state, mental health deaths, and state drug summary table , were obtained from the web page of the Centers for Medicare and Medicaid Services (CMS),Kaiser Family Foundation (KFF), CDC/wonder as well as Department of Justice . The detailed data contains 5 the information on prescription drugs prescribed by individual health care providers and paid for under the Medicare Part D Prescription Drug Program in the year of 2016. It includes the detailed prescription information such as the brand drug name, the number of patients who filled the drug more than ten times, the aggregate number of day’s supply for which the prescription drug was dispensed.

Screen Shot 2018-07-25 at 08.41.19.png
Sample data on Prescriber’s information 

The dataset provider summary table contains the demographics of the individual prescribers (n=1,131,550). Briefly, it includes the National provider identification(NPI) number, name, gender, address, medical credential, specialty, and Medicare enrolment status. It also includes the summary of the abstracted clinical data such as the number of total claims count from the prescriber, total opioid claim count based on the supply of all prescription drugs, total claims of antibiotic drugs, total claims of high-risk medication (HRM) drugs, total claims of antipsychotic drugs, and more importantly, the number of patients treated with opioid prescriptions, and the ratio of opioid prescription to non-opioid prescription.

The rest two datasets are national drug summary table and state drug summary table. It lists the prescription drug names, whether the drug is categorized as antibiotics (n=78), opioids (n=29), antipsychotic (n=28), HRM (n=68), or others (n=951) and the number of prescribers of that drug grouped by nation and state, respectively. For more detailed information on how the dataset was collected, please refer to the CMS’s webpage here. The other data sets the NHE shows the percentage share of the prescription drugs all inclusive from the national health budget and the data set includes from 1960–2016. The KFF dataset includes all the mental health deaths by a state that has some level of connection with opioids. The crime data set was collected from multiple sources and agencies using the query method in order to see crimes linked to the opioids or not and shows by state, either it happened in an urban area or rural areas of the country.

The box plot was chosen to show the top prescriber-specialists of the opioid drugs abused on the professional category and it turns out that the cardiology (diseases and abnormalities of the heart.) specialists have the highest variability, followed by Endocrinology(the branch of physiology and medicine concerned with endocrine glands and hormones), psychiatry(the study and treatment of mental illness, emotional disturbance, and abnormal behavior.), Nephrology(the branch of medicine that deals with the physiology and diseases of the kidneys.) and pulmonary disease(chronic obstructive pulmonary disease. : pulmonary disease (such as emphysema or chronic bronchitis) that is characterised by chronic typically irreversible airway obstruction resulting in a slowed rate of exhalation —abbreviation COPD).

My Solution Statement

I start by merging two datasets, detailed data and provider summary, to get combined features of providers’ personal information and prescription history of drugs (sorted by its generic names, n=1154). Two new attributes, op_avg_supply and op_longer, are added. Op_avg_supply is the aggregate number of days supply divided by a total of patients for which opioids drugs were dispensed. Op_longer is labeled 1 if the provider has Op_avg_supply greater than 84 days (12 weeks), and 0 if less than 84 days. Then I train supervised classification models according to the op_longer labels to solve this large-scale nonlinear problem. This projects will consist of three parts: exploratory data analysis, train/test split, building pipeline to fine-tune models via Scikit learn, and extension to the neural networks in TensorFlow.

As the detailed data contains more than 25 million instances and its volume is larger than 3.2 GB, it easily takes up the memory of the local machine, especially when we combine it with prescriber summary dataset to make a wide table. As such, we will utilize’s pandas’ TextFileReader module and read in the instances by small chunks (size=100,000). In order to make sure of seamless and reproducible inputs, I automate the data input pipeline via data cleaning, feature selection, and feature scaling.

Exploratory data analysis In order to gain insights, I explore the data by visualizing each attribute on some randomly selected features as shown below. Let’s see the visuals of the data sets as follows

According to the datasets, the number of opioid prescriptions dispensed by doctors steadily increased from 112 million prescriptions in 1992 to a peak of 282 million in 2012, since 2012 declined to 236 million in 2016. In 2016, 6.2 billion hydrocodone pills were distributed nationwide. In 2016, according to the CDC, around 5 billion Oxycodone (ox i koe’ done) tablets were distributed in the United States, which makes Oxycodone (Percocet) the second most prevalent opioid. Based on these I come to the point that it the vicious circle problem, in our country that we are 5% of the world’s population but 80% of the world’s opioids are prescribed here and it’s easy to see the severity of over-prescription of opiates. Based on the dataset the state of California is leading with the highest number of deaths due to overdose. Kentucky has been hit particularly hard with 1,419 reported overdose deaths in 2016, which is 33.5 per 100,000 people according to the CDC wonder data and of those deaths 989 which is 23.6 per 100,000 people involved some type of opioids. it was followed by WestVirginia, which is 884 reported overdose deaths (52 deaths per 100,000 people).


Opioid-related deaths count by state
Screen Shot 2018-07-24 at 23.31.59.png
State and Speciality by top prescribed opioids
opiods release.png
Opioids prescription rate versus the Extended release rate by state
Screen Shot 2018-07-25 at 01.25.03.png
Interactive visualization of specialty variability per opioid prescription
speciality wise.png
Specialty variability of prescribers average  opioid supply
Screen Shot 2018-07-24 at 23.43.33.png
State visuals: as the thickness of the color enhances the death amount increases due to opioids. 


What do I learn from the datasets comparing states and understanding the above opioid prescription rate and the extended release rate? I will summarise in the following sentence. Opioids bind to receptors in the brain and spinal cord which in turn disrupt the pain signal by activating the reward areas of the brain so that the dopamine hormone to be released and to create the feeling of euphoria, which means being high or happy. The golden standard for any pain medication is morphine and others are converted to morphine milligram equivalent(MME) and we all know morphine and codeine are naturally derived from opium poppy plan more commonly grown in Asia, central and southern America. The illegal drugs according to CDC like heroin are synthesized from morphine. on the other hand, Hydrocodone and Oxycodone are semi-synthetic opioids manufactured in laboratories with natural synthetic ingredients. According to the dataset and above graph between 2007 and 2016, the most widely prescribed opioid almost by all specialties and across all the 50 states was hydrocodone(Vicodin). please see the above graph state and Speciality by the top 11 opioids in my public tableau account using the link to see the interactive graph. methadone is another fully synthetic opioid, which is dispensed for patients recovering from heroin addiction to help them relieve the symptoms of withdrawal. opioid use disorder is one of the clinical terminologies used in the medical practice referring for opioid addiction or abuse. It’s generally known that people who become dependent on any of these painkillers or opioids may experience withdrawal symptoms when they start to stop taking the pills. For any person who worked or practiced in any medical practice, its clear that dependence is often coupled with tolerance, meaning that opioid users need to take increasingly larger doses of the medication for the same effect. The economic terminology for such effect is called marginal utility. What it means is that the additional satisfaction or euphoria an opioid user gets from consuming one more unit of a pill. According to Substance Abuse and Mental Health Services Administration about 11.5 Million American’s age 12 and older misused prescription pain medications in 2016 and around 948,000 or 0.3 % of the united states population age 12 and above used heroin in 2016. Its very significant number and needs more strategic planning and action points, I guess that’s why the Trump Administration on February 9,2018 allocated around 6 billion dollars for the opioid program, with $ 3 billion allocated for 2018 and $ 3billion allocated for 2019 to tackle the epidemics.


The baseline of the data set to overcome is 53.7%. The problem with machine learning is that building an effective model can require a ton of human input. While working my capstone time was so demanding as i was so engaged working the Washington, DC crime prediction and the data cleaning took’s almost my entire time. I never took it seriously that most data scientists spent 80% of their time cleaning and wrangling the data. Humans have to figure out the right way to transform the data before feeding it to the machine learning model. Then they have to pick the right machine learning model that will learn from the data best, and then there’s a whole bunch of model parameters to tweak that can make the difference between a dud and a Nostradamus-like model. Building these pipelines — i.e., sequences of steps that turn the raw data into a predictive model — can easily take weeks of tinkering depending on the difficulty of the problem. This is obviously a huge issue when machine learning is supposed to allow machines to learn on their own. When almost all my developed models, some of them they work better on the training data and underperformed on test data I used the Tree-based Pipeline Optimisation Tool (TPOT). TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. Think of TPOT as your “Data Science Assistant”: TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines, then recommending the pipelines that work best for your data.


Machine learning pipeline and what parts of the pipeline TPOT automates.
Then, based onTPOT recommends me to use decision tree classifier and when i run my model i end up getting 89% on the test data and 99.8% on my training data set which is different from the TPOT recommendation and after tuning for hyperparameters.


Screen Shot 2018-07-25 at 00.59.48.png
decision tree classifier fitted on training data and scored on a test data.
Then i tried other classifiers like random forest Classifier, Logistic Regression, Decision tree, Adaboost classifier and Neural Network using keras). Finally, I chose to apply Neural networks and performed better than the TPOT recommended optimizer. The neural network uses Gender, specialty as my target variable was binary classification problem, opioid prescriber or not opioid prescriber (0 to 1). 

Screen Shot 2018-07-25 at 00.37.45.png

The model trained on 21813 samples, tested on 2424 samples. The neural network model has the number of inputs which I used 354 features as hidden layers with relu activation function. The output layer uses sigmoid since it is a classification problem. The model compiler uses binary cross-entropy as a loss function, Adam optimizer, and accuracy metrics with an early stop to avoid exhaustive training.

Screen Shot 2018-07-25 at 01.04.50.png

The accuracy score of the model was 91.4% with 21.7% loss.

Screen Shot 2018-07-25 at 01.07.08.png
Model accuracy score and loss for NN
Many features came up for all the models. Carisoprodol, diazepam, alprazolam, gabapentin, pregabalin, hydroxychloroquine sulfate, cyclobenzaprine HCl, and tizanidine HCl were highly associated with the label based on all the model. As we expected from the EDA section, of medical specialties, pain management, interventional pain management, rheumatology, hematology, cardiology and family practice and hospice and palliative care were estimated that the label is true with higher probability.


A limitation of the study is that the dataset only includes a snapshot of the total population as beneficiaries of Medicare are an elderly population of age greater than 65. The original materials can be accessed here
This article is a living and breathing matter. Your feedback matters to me. Please leave comments on what resources I should add and what you found the most helpful when you were learning machine learning. Thanks a lot for reading!