Building a Binary Classifier using the USDA's Food Atlas

Introduction
Obtaining Data
Scrubbing Data
Exploratory Data Analysis
- Feature Ranking
- Further Exploration
Building a Classifier
Wrap-Up
- Recommendations
- Possible Next Steps

Introduction

In this project, I used data on food access from the USDA's Food Environment Atlas to predict whether each US county would qualify as a part of the Diabetes Belt, which is a area in the southwest of United States where the rate of adult type II diabetes is especially high. I conceived of this project as an opportunity to practice creating a machine learning classifier on a topic of social relevance.

In order to qualify as a part of the Diabetes Belt, a US County must have an adult diabetes rate of 11% or higher. The CDC made this designation based off of data from 2008. I will be attempting to predict the counties that would be a part of the belt in 2013, which is the most recent publicly downloadable data on adult diabetes.

Obtaining Data

The data is fortunately available in a single Excel file on the USDA's website:

https://www.ers.usda.gov/data-products/food-environment-atlas/data-access-and-documentation-downloads/

Different categories of data are distributed across six different sheets, but as each row is a US county with an unique FIPS code, it was simple to combine into a single pandas DataFrame.

def read(file):
    excel = pd.ExcelFile(file)
    sheets = excel.sheet_names[4:]

    dataframes = [excel.parse(sheet_name=sheet) for sheet in sheets]

    df = reduce(lambda left,right: pd.merge(left,right,on='FIPS'), dataframes)

    return df

data = read('food_atlas.xls')
data.shape

(3143, 296)

data.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	FIPS	State_x	County_x	LACCESS_POP10	LACCESS_POP15	PCH_LACCESS_POP_10_15	PCT_LACCESS_POP10	PCT_LACCESS_POP15	LACCESS_LOWI10	LACCESS_LOWI15	...	PCT_NHPI10	PCT_65OLDER10	PCT_18YOUNGER10	MEDHHINC15	POVRATE15	PERPOV10	CHILDPOVRATE15	PERCHLDPOV10	METRO13
0	1001	AL	Autauga	18428.439685	17496.693038	-5.056026	33.769657	32.062255	5344.427472	6543.676824	...	0.040314	11.995382	26.777959	56580.0	12.7	0	18.8	0	1
1	1003	AL	Baldwin	35210.814078	30561.264430	-13.204891	19.318473	16.767489	9952.144027	9886.831137	...	0.043343	16.771185	22.987408	52387.0	12.9	0	19.6	0	1
2	1005	AL	Barbour	5722.305602	6069.523628	6.067799	20.840972	22.105560	3135.676086	2948.790251	...	0.087409	14.236807	21.906982	31433.0	32.0	1	45.2	1	0
3	1007	AL	Bibb	1044.867327	969.378841	-7.224696	4.559753	4.230324	491.449066	596.162829	...	0.030548	12.681650	22.696923	40767.0	22.2	0	29.3	1	1
4	1009	AL	Blount	1548.175559	3724.428242	140.568857	2.700840	6.497380	609.027708	1650.959482	...	0.031402	14.722096	24.608353	50487.0	14.7	0	22.2	0	1

5 rows × 296 columns

Scrubbing Data

The combined DataFrame had 296 features, many of which duplicated information or had missing entries. I performed the following operations to scrub the data clean (dropped measurements in bold):

The State and County features were not appropriate for use in the classifier, so these I dropped. I stored the FIPS codes in a separate list for graphing purposes.
For every feature recorded as a raw count, the USDA conveniently provided either a normalized or scaled version. This left me free to drop all of the raw counts.
Wherever there were multiple years of measurements, I kept only the most recent year, 2013 or prior.
While features showing the change from one year's measurement to the next could have been interesting transformed into a categorical feature (i.e. positive change, negative change, no change), for the scope of the project I elected to drop these features. It could be a fruitful area of exploration, however.
Finally, I dropped all columns containing null values. Because each row is a county, any leniency here would rob us of crucial data.

After the scrubbing step, I was left with a much more manageable selection of 37 features, including the target.

df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PCT_DIABETES_ADULTS13	PCT_65OLDER10	GROCPTH09	PCT_LACCESS_POP10	PCT_WIC09	PCT_SFSP09	PCT_LACCESS_HHNV10	PCT_LACCESS_LOWI10	SNAPSPTH12	...	PCT_NSLP09	PCT_NHWHITE10	PCT_NHPI10	CONVSPTH09	SNAP_PART_RATE13	PCT_NHBLACK10	PCT_SBP09	SNAP_REPORTSIMPLE09	PCT_NHNA10
0	13.0	11.995382	0.110834	33.769657	2.990417	0.56489	3.284786	9.793530	0.674004	...	12.315055	77.246156	0.040314	0.535698	89.184	17.582599	4.509008	1	0.397647
1	10.4	16.771185	0.133775	19.318473	2.990417	0.56489	2.147827	5.460261	0.725055	...	12.315055	83.504787	0.043343	0.663300	89.184	9.308425	4.509008	1	0.628755
2	18.4	14.236807	0.180786	20.840972	2.990417	0.56489	4.135869	11.420316	1.280590	...	12.315055	46.753105	0.087409	0.506201	89.184	46.691190	4.509008	1	0.218524
3	14.8	12.681650	0.261540	4.559753	2.990417	0.56489	3.458580	2.144661	0.719122	...	12.315055	75.020729	0.030548	0.828211	89.184	21.924504	4.509008	1	0.279293
4	14.1	14.722096	0.104637	2.700840	2.990417	0.56489	3.269380	1.062468	0.657144	...	12.315055	88.887338	0.031402	0.540625	89.184	1.263040	4.509008	1	0.497191

5 rows × 37 columns

Exploratory Data Analysis

To begin my exploration, I wanted to identify any highly correlated features. I started by graphing a heatmap including only those features that had a correlation of .8 or greater with one or more features.

Feature Code	Feature Description
PCT_LACCESS_CHILD10	Children, low access to store (%), 2010
PCT_LACCESS_SENIORS10	Seniors, low access to store (%), 2010
PCT_LACCESS_LOWI10	Low income & low access to store (%), 2010
PCT_LACCESS_POP10	Population, low access to store (%), 2010

Because all three of the other features were highly correlated with PCT_LACCESS_POP10, I dropped the other three. After that, I felt confident that I would be able to avoid the worst interferences of collinearity.

Next, I took some time to explore our target variable before transformation into a binary variable. How indeed had the diabetes rates by county changed from 2008 to 2013?

In this chart, the red line represents the 11% cutoff for inclusion in the Diabetes Belt. We can see at a glance that many more counties qualify in 2013 than in 2008.

So, where are these counties concentrated? I chose to approach this question first using a set of parallel boxplots.

Again, the red line represents the 11% Diabetes Belt cutoff. The US south and southeast are well represented above the line, while New England and the west are by and large in better shape.

Finally, I transformed the target feature into binary categories for use in the classifier, and used them to create choropleth maps of the Diabetes Belt in 2008 and 2013 using geopandas and plotly.

The Diabetes Belt in 2008

The Diabetes Belt in 2013

It is clear that the belt had expanded considerably from 2008 to 2013.

Feature Ranking

In order to further narrow down the feature set and select only the most crucial factors, I used Random Forest to tank the features by gini index. This gave me a list of features sorted from the best contributing features to the worst.

# Sort DataFrame by Gini Index, descending
results = results.sort_values(by='Gini', ascending=False)
print(results)

               Features      Gini
27   PCT_OBESE_ADULTS13  0.102188
16           PCT_SNAP12  0.093906
23            PCT_SBP09  0.063383
8    PCT_LACCESS_HHNV10  0.059547
17           SNAPSPTH12  0.055432
31           PCT_HISP10  0.054436
0         PCT_65OLDER10  0.053096
28        PCT_NHASIAN10  0.050246
20        PCT_NHBLACK10  0.043950
7            PCT_SFSP09  0.033113
18      PCT_18YOUNGER10  0.031913
15           CONVSPTH09  0.027032
9           PCT_CACFP09  0.026704
13        PCT_NHWHITE10  0.025728
14           PCT_NHPI10  0.024787
29           PCT_NHNA10  0.024062
5             WICSPTH12  0.022374
3     PCT_LACCESS_POP10  0.022209
6             PCT_WIC09  0.022059
1              FFRPTH09  0.021769
2             GROCPTH09  0.021536
4           RECFACPTH09  0.020892
19     SNAP_PART_RATE13  0.019416
10           SPECSPTH09  0.018830
12           PCT_NSLP09  0.018326
24          SUPERCPTH09  0.013437
22         PERCHLDPOV10  0.010694
11          SNAP_OAPP09  0.005829
30           SNAP_CAP09  0.004070
21              METRO13  0.003267
32             PERPOV10  0.002916
25          SNAP_BBCE09  0.002632
26  SNAP_REPORTSIMPLE09  0.000219

Next, in order to clear out features that weren't contributing much, I used a backward selection algorithm to remove the features at the bottom of the list one by one until the accuracy of the model started to drop.

summary = backward(X, y, titles)
summary

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	# Features	Dropped Feature	Test Accuracy
0	33	NONE	0.864865
1	32	SNAP_REPORTSIMPLE09	0.864865
2	31	SNAP_BBCE09	0.861685
3	30	PERPOV10	0.864865
4	29	METRO13	0.866455
5	28	SNAP_CAP09	0.863275
6	27	SNAP_OAPP09	0.864865
7	26	PERCHLDPOV10	0.864865
8	25	SUPERCPTH09	0.860095
9	24	PCT_NSLP09	0.863275
10	23	SPECSPTH09	0.863275
11	22	SNAP_PART_RATE13	0.863275
12	21	RECFACPTH09	0.861685
13	20	GROCPTH09	0.864865
14	19	FFRPTH09	0.864865
15	18	PCT_WIC09	0.868045
16	17	PCT_LACCESS_POP10	0.863275
17	16	WICSPTH12	0.868045
18	15	PCT_NHNA10	0.868045
19	14	PCT_NHPI10	0.871224
20	13	PCT_NHWHITE10	0.868045
21	12	PCT_CACFP09	0.864865
22	11	CONVSPTH09	0.861685
23	10	PCT_18YOUNGER10	0.861685
24	9	PCT_SFSP09	0.855326
25	8	PCT_NHBLACK10	0.858506
26	7	PCT_NHASIAN10	0.864865
27	6	PCT_65OLDER10	0.844197
28	5	PCT_HISP10	0.844197
29	4	SNAPSPTH12	0.801272
30	3	PCT_LACCESS_HHNV10	0.763116
31	2	PCT_SBP09	0.777424
32	1	PCT_SNAP12	0.737679

plt.plot(summary['# Features'], summary['Test Accuracy'])
plt.title("# of Features vs. Accuracy")
plt.xlabel("Features")
plt.ylabel("Accuracy")

Text(0, 0.5, 'Accuracy')

Having determined that the top seven features were likely to have the largest impact on the quality of the classifier that I wanted to build, I set out to isolate these features and explore their relationship to the target in greater depth.

Futher Exploration

So, what were these features that gave the most insight into diabetes risk?

final = list(results['Features'])[:7]
final

['PCT_OBESE_ADULTS13',
 'PCT_SNAP12',
 'PCT_SBP09',
 'PCT_LACCESS_HHNV10',
 'SNAPSPTH12',
 'PCT_HISP10',
 'PCT_65OLDER10']

decode(final)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Variable Code	Variable Name	Category Name	Subcategory Name	Units
13	PCT_LACCESS_HHNV10	Households, no car & low access to store (%), ...	Access and Proximity to Grocery Store	Household Resources	Percent
68	SNAPSPTH12	SNAP-authorized stores/1,000 pop, 2012	Store Availability	SNAP-authorized	# per 1,000 pop
96	PCT_SNAP12	SNAP participants (% pop), 2012*	Food Assistance	SNAP	Percent
119	PCT_SBP09	School Breakfast Program participants (% pop),...	Food Assistance	School Breakfast Program	Percent
254	PCT_OBESE_ADULTS13	Adult obesity rate, 2013	Health and Physical Activity	Health	Percent
264	PCT_HISP10	% Hispanic, 2010	Socioeconomic Characteristics	Race & Age	Percent
268	PCT_65OLDER10	% Population 65 years or older, 2010	Socioeconomic Characteristics	Race & Age	Percent

Examining these features, I elected to group them into four categories for understanding:

Obesity
Food Access
Food Assistance
Race & Age

How then were these factors related to the diabetes belt? I set out to explore with the aid of boxplots. The function I used to make these plots may be found in my GitHub repository.

boxplotter(obesity, 1, 1, 'Obesity - No Belt vs. Belt')

Here, the relationship is clear. The median county diabetes rate in the Belt is at least 5% greater than outside of it.

boxplotter(access, 1, 2, 'Food Access - No Belt vs. Belt')

Clearly, household access to store is an important factor, with lower access overall in the belt. SNAP-authorized stores may be thought of here as a proxy for poverty, as it is likely that there are more such stores in impoverished areas.

boxplotter(food_assist, 1, 2, 'Food Assistance - No Belt vs. Belt')

Again, these measures are probably best thought of as proxies for poverty, though it might take more domain knowledge to really know what is going on. Still, rates seem to be generally higher on the belt than off of it.

boxplotter(race_age, 1, 2, "Race & Age - No Belt vs. Belt")

Here, larger Hispanic populations were better represented off of the belt. Meanwhile, older populations were slightly better represented on the belt.

Building a Classifier

At this point, I felt comfortable taking a crack at developing a model. In order to do this, I built a pipeline to test four popular algorithms for classification:

Logistic Regression
Naive Bayes
Support Vector Machines
Random Forest

# instantiate our models
models = [
    ('Logistic Regression', LogisticRegression(solver='liblinear')),
    ('Naive Bayes', GaussianNB()),
    ('SVM', SVC(gamma='scale')),
    ('Random Forest', RandomForestClassifier(n_estimators=1000, n_jobs=-1))
]

results_accuracy = []
results_recall = []
names = []

# perform kfold validation on our data, applying each model 10 times and returning the means of the accuracy and recall scores
for name, model in models:
    names.append(name)
    
    kfold = model_selection.KFold(n_splits=10, random_state=1)
    cv_results_acc = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    results_accuracy.append(cv_results_acc)

    cv_results_rec = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='recall')
    results_recall.append(cv_results_rec)
    
    msg = "{}: Mean Accuracy: {}; Mean Recall: {}".format(name, np.mean(results_accuracy[-1]), np.mean(results_recall[-1]))
    print(msg)

Logistic Regression: Mean Accuracy: 0.7965726417955719; Mean Recall: 0.8002437773643095
Naive Bayes: Mean Accuracy: 0.7736639369123445; Mean Recall: 0.8481707175206757
SVM: Mean Accuracy: 0.800387220705692; Mean Recall: 0.7647862197476227
Random Forest: Mean Accuracy: 0.7914750783540592; Mean Recall: 0.7639746311963183

All four models did remarkably well. In order to get a clearer sense of what was going on, I produced boxplots of the individual scores from the kfold validation step.

Surprisingly (to me), logistic regression performed the best overall of the four! Because it is a simple, fast algorithm that is robust against overfitting, and because the fact that it performed well probably means that the data is well-suited to what is essentially a linear approach, I felt comfortable selecting this as the final model with a mean accuracy of 80%.

Exploring the confusion matrix confirmed that the model was well-tuned for accuracy, with similar numbers of false positives and false negatives:

Wrap-Up

Given the wealth of data in the USDA's Food Environment Atlas, I have found that it is possible to construct a classifier that will determine whether a US county qualifies as a member of the CDC's Diabetes Belt with ~80% accuracy.

The top categories of features appear to be obesity, food access, food assistance, and race/age. Some of these may be proxies for the effect of poverty, rather than causes in of themselves, but to assess this would require domain expertise and the investment of greater material resources.

I observed comparably excellent performance from all classifiers trialed, and ended up using logistic regression, which is fast, accurate in this case, and robust against overfitting.

Recommendations

Based off of these findings, my recommendations to those who are working to identify diabetes hotspots in the US are as follows:

A high rate of obesity in a county is the single best indicator for whether diabetes is likely to be a problem. This is already well known, but it is valuable to know that these findings confirm that is the case.
Vulnerable populations such as the elderly and families without transportation or easy access to stores are at a special risk for diabetes, and where this demographic information is available, it is worth giving these populations special attention preemptively.
Data on food access and socioeconomic factors are of demonstrable use in determining level of risk of diabetes in a county. It is worth continuing to gather this data, and the more of it is available for public access, the better!

Possible Next Steps

I feel confident that with more data and finer tuning, even more accurate models might be built from this data. Specifically, with more time to devote I would consider the following steps:

More comprehensive hyperparameter tuning for SVM and Random Forest. I ran a grid search on the hyperparameters for SVM, and tried a small group of different possibilities for Random Forest, but I could have done more, and might have wrung out better performance. I'm also curious about Bayesian hyperparameter optimization, and wonder whether that might lead me in a more fruitful direction.
Engineering new features from multi-survey differences. Features that consisted of the change of a measurement from one year or another could yield useful information about whether a particular county is trending better or worse, or perhaps whether or not it is the beneficiary of some kind of investment. While these are of questionable use as they are, as a transformed categorical variable, they might yield some fruit.
Filling out features with null values instead of dropping them entirely. There were a large number of interesting features that suffered from null values, which I did not consider when building this classifier. It might be interesting to see whether any of them provide useful enough information to be worth backfilling. Ultimately, my hope is that future USDA surveys will render this unnecessary.

Thank you for reading!

calremmel / diabetes-belt-prediction Goto Github PK

diabetes-belt-prediction's Introduction

Building a Binary Classifier using the USDA's Food Atlas

Table of Contents

Introduction

Obtaining Data

Scrubbing Data

Exploratory Data Analysis

The Diabetes Belt in 2008

The Diabetes Belt in 2013

Feature Ranking

Futher Exploration

Building a Classifier

Wrap-Up

Recommendations

Possible Next Steps

diabetes-belt-prediction's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent