Help in a SNAP

Random Forest Vote Ensemble

Overview

For my capstone project at General Assembly, I wanted to do a risk assessment of demographics vulnerable to food insecurity during COVID.  I used SNAP (formerly known as food stamps) Quality Control (QC) datasets from the USDA to build a prediction model.

Technicals

I wanted to do a 10-year gap analysis, emphasizing the effect of the 2008 housing crisis, so I used datasets from 2007 and 2017 to build the model.  I first focused on New Mexico, an emerging hot spot, and Nebraska, an emerging cold spot, both identified in a previous GIS spatial analysis, to highlight two extreme, contrasting states.  Each dataset started with over 40k rows and 800 columns.  Using a combination of correlation checks, high-nullity filtering, data insight, and domain knowledge, I whittled the data down to 33 columns and a final dataset of 4k records, streamlining the work with Python reference scripts for frequently used code.  I used a voting ensemble of Random Forest, Gradient Boosting, and Bagging classifiers, which reached a final cross-validation (CV) score of 95%.
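For readers who have not built a voting ensemble before, here is a minimal sketch of that last step using scikit-learn.  It is not the project's actual code: the data is a synthetic stand-in generated with make_classification (smaller than the real 4k-record dataset so it runs quickly), and the hyperparameters, the hard-voting choice, and the 5-fold split are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned SNAP QC data (33 columns, as in the write-up).
# The real features and labels are NOT reproduced here.
X, y = make_classification(
    n_samples=1000, n_features=33, n_informative=8, random_state=42
)

# The three model families named in the write-up; settings are assumed.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ("bag", BaggingClassifier(n_estimators=10, random_state=42)),
    ],
    voting="hard",  # majority vote across the three classifiers (assumed)
)

# 5-fold cross-validated accuracy (the number of folds is an assumption).
scores = cross_val_score(ensemble, X, y, cv=5)
mean_cv_score = scores.mean()
print(f"Mean CV accuracy on synthetic data: {mean_cv_score:.3f}")
```

The score printed here describes the synthetic data only and says nothing about the 95% reported above.  Hard voting takes a simple majority across the three classifiers; soft voting, which averages predicted probabilities, would be an equally reasonable choice.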

Takeaways

I focused on an interpretable model on purpose.  I wanted to see which factors have the biggest impact on communities becoming food insecure.  I was surprised throughout my analysis, from the EDA to the final interpreted results.  The initial exploration showed me that the households requesting SNAP were not big families: they had one or two children at most, and not a lot of elderly members.  One pattern stood out early, with almost half of the households headed by single moms.  As I got further into the EDA, more patterns emerged, such as fewer working poor in 2017 than in 2007 and more assistance programs available to applicants.  At this point, I noticed that New Mexico seemed more vulnerable to economic changes than Nebraska.

Ultimately, the biggest surprise was in the interpretable coefficients of the model itself, which told me that the top 4 predictors of a household requesting SNAP all related to how expensive and how stable its housing situation was.
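As a rough illustration of how a "top 4" ranking like this can be read off a tree-based model, the sketch below ranks features by a random forest's impurity-based importances.  This is an assumption about the method, not a record of it: scikit-learn's tree ensembles expose feature_importances_ rather than signed coefficients, the data is synthetic, and the placeholder names feature_0 through feature_5 do not correspond to any SNAP QC columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (hypothetical, not SNAP QC fields).
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(6)]
X = rng.normal(size=(1000, len(feature_names)))

# By construction, the label depends only on the first two columns.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each column with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_4 = ranked[:4]

for name, importance in top_4:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 and show how much each column contributes to the forest's splits, but not the direction of the effect.  Permutation importance on held-out data is a common cross-check.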