Forecasting-HIV-Infections Case Study

Richard Bellamy | Cindy Wong | Maureen Petterson | Joseph Shanks

Introduction

Due to the development of anti-retroviral therapies, the HIV/AIDS epidemic is generally considered to be under control in the US. However, as of 2015 there were 971,524 people living with diagnosed HIV in the US, with an estimated 37,600 new HIV diagnoses in 2014. HIV infection rates continue to be particularly problematic in communities of color, among men who have sex with men (MSM), in the transgender community, and in other vulnerable populations in the US. Socioeconomic factors are significant risk factors for HIV infection and likely contribute to the elevated risk in these communities. The current US opioid crisis has further complicated efforts to combat HIV, with outbreaks now hitting regions that weren’t previously thought to be vulnerable.

A model that can accurately forecast regional HIV infection rates would be beneficial to local public health officials. Provided with this information, these officials will be able to better marshal the resources necessary to combat HIV and prevent outbreaks from occurring. Accurate modeling will also identify risk factors for communities with high HIV infection rates and provide clues as to how officials may better combat HIV in their respective communities.

Our Goals:

  1. To accurately model HIV incidence (new infections per 100,000) in US counties by building a linear regression model that uses HIV infection data, census data, data on the opioid crisis, and data on sexual orientation.

  2. To identify the features that are the most significant drivers of HIV infection rates and learn how these drivers differ between regions.

Data Preparation and Exploratory Data Analysis

The dataset contained HIV prevalence information from 3139 counties.

Figure: Data_head (first rows of the dataset)

We expected that several of the columns would not be important features for our regression model. To eliminate features, we ran a quick ordinary least squares (OLS) regression with statsmodels and looked at the p-values of the individual features. We also wanted to find heavily correlated features and eliminate the redundant ones.

We also looked at the correlation matrix for all of the data and for the facilities data alone. Within the facilities matrix, the columns are clearly not independent of each other. We examined the 'Med_MH_fac' and 'MH_fac' columns because their correlation is 0.99. Their values turned out to be extremely similar, if not identical, so we excluded the redundant columns from our analysis.
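One way to automate this kind of redundancy check is sketched below with pandas. The helper `drop_redundant`, the 0.95 threshold, and the toy data are assumptions for illustration; only the column names 'MH_fac' and 'Med_MH_fac' come from the dataset described above.

```python
import numpy as np
import pandas as pd

def drop_redundant(df, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    cols = list(df.columns)
    to_drop = []
    for i, a in enumerate(cols):
        if a in to_drop:
            continue  # already removed, so do not use it as a reference column
        for b in cols[i + 1:]:
            if b not in to_drop and corr.loc[a, b] > threshold:
                to_drop.append(b)
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): two nearly identical facility counts plus an unrelated column.
rng = np.random.default_rng(1)
mh = rng.poisson(5, size=200).astype(float)
toy = pd.DataFrame({
    "MH_fac": mh,
    "Med_MH_fac": mh + rng.normal(scale=0.1, size=200),
    "other": rng.normal(size=200),
})

reduced, dropped = drop_redundant(toy)
print("dropped:", dropped)
print("remaining:", list(reduced.columns))
```

Keeping the first column of each correlated pair is an arbitrary choice; in practice the column that is easier to interpret or has fewer missing values may be the better one to keep.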

Figure: Histogram of HIV Incidence
Figure: p-values from OLS
Figure: Correlation Heatmap
Figure: Facilities Correlation Heatmap

Models

We used the following regression models:

  • Linear Regression (both statsmodels and sklearn)
  • K-Fold Linear Regression
  • Regularization with K-Fold

The residuals from the regression model are shown below. Note the outlier due to one of the counties having an extremely high incidence of HIV.

We evaluated the models with root mean squared error (RMSE), although this value fluctuated depending on the train/test split. The outlier, a county with an incidence of 717, added greatly to the test errors. The values we obtained for the different methods are listed below.

  • statsmodels Linear Regression: 9.35
  • sklearn Linear Regression: 9.35
  • Regularization + KFold: 1.75
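To illustrate why a single extreme county can dominate this metric, here is a small standard-library sketch of RMSE. The incidence of 717 comes from the text above; every other number, including the prediction of 40 for the outlier, is made up for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical incidences and predictions for four ordinary counties.
actual = [5.0, 12.0, 8.0, 20.0]
predicted = [6.0, 10.0, 9.0, 18.0]
base = rmse(actual, predicted)

# Add one county with an incidence of 717 and an assumed prediction of 40.
with_outlier = rmse(actual + [717.0], predicted + [40.0])

print(f"RMSE without outlier: {base:.2f}")
print(f"RMSE with outlier:    {with_outlier:.2f}")
```

Because the errors are squared before averaging, the one large miss outweighs all the small ones combined, which is consistent with the RMSE shifting noticeably depending on whether the outlier lands in the test split.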

The regularization parameter alpha was chosen as the value that gave the smallest test error. The plot below shows how the train and test errors vary with alpha; alpha = 2 corresponds to the minimum test error.

Figure: Residuals
Figure: Linear Regularization Train/Test Errors
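The alpha selection described above can be sketched with scikit-learn as follows. The text does not say which regularizer was used, so Ridge is an assumption here, as are the helper `cv_rmse`, the candidate alpha grid, the five folds, and the synthetic data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_rmse(X, y, alpha, n_splits=5, seed=0):
    """Mean test RMSE of a Ridge model over K shuffled folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_rmse = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(fold_rmse))

# Synthetic data (hypothetical): one informative feature among ten.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 10))
y = 2.0 * X[:, 0] + rng.normal(size=120)

# Score each candidate alpha and keep the one with the lowest mean test RMSE.
alphas = [0.01, 0.1, 1.0, 2.0, 5.0, 10.0, 100.0]
scores = {alpha: cv_rmse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha, score in scores.items():
    print(f"alpha={alpha:<6} mean test RMSE={score:.3f}")
print("best alpha:", best_alpha)
```

The synthetic features here are already on a common scale. With real county data, standardizing the features before fitting is usually advisable, since the Ridge penalty is sensitive to feature scale.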

Summary

