Python | NumPy | Pandas | Statsmodels | scikit-learn |
---|---|---|---|---|
Slides live here. Clean notebooks coming soon
Classified drought levels at summer's end (2000-2013) for reservoirs in Southern California using storage data and exogenous variables.
California’s recent drought has placed unprecedented demands on our freshwater resources, renewing enthusiasm for surface water infrastructure investments such as raising dams to capture more water in wet years.
Reservoir improvements would need to consider the frequency and the extent to which these dams are depleted.
My analysis used time-series reservoir storage data from around LA County to classify storage levels at summer’s end (Sept 1), given data from earlier in the year.
Guiding Questions
- Does climate serve as a valid predictor in classifying water availability?
- Can water availability be predicted using earlier monthly storage measurements?
- Data sourced from the Department of Water Resources California Data Exchange Center (CDEC)
- Climate Data sourced from Berkeley Earth
- Climate Monthly Average Temperature around Los Angeles (Average Temperature and Avg Error from 2000-2013)
- Additional information sourced from Wikipedia: elevation of the dam, year completed, dam type (material), heights (in feet and meters), capacity (in feet and meters).
Project Notes:
- Adjusted storage measurements as proportions of the reservoir's capacity
- Made predictor features from the storage (acre-feet, AF) dataframe, using measurements from 1/1 through 6/1
- Included climate data
- Created dummy variables
- Missing values (monthly storage measurements) were filled with the mean of the two adjacent measurements
- Made classes from the storage (acre-feet, AF) dataframe, using the 9/1 measurement
- Multiclass variables for four different reservoir conditions
- Train/Test Splits (climate data only goes to 2013…)
- Train set (2000 - 2010)
- Test set (2011-2012)
- Holdout set (2013)
- Classification model
- Feature engineering
- Parameter optimization
- Future work:
- Do I download more historical data (pre-2000?)
- Do I incorporate population data?

Visualization
- Flask & D3.js
- AWS t2.micro EC2 instance with a PostgreSQL database
- Jupyter notebook
- Python 3.5
- Pandas, Matplotlib, Seaborn
- scikit-learn
- Plotly
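The preprocessing steps listed under Project Notes (storage as a proportion of capacity, neighbor-mean fill for missing months, and four summer's-end classes) could look roughly like the sketch below. This is not the project's actual code: the storage values, the capacity, the bin edges, and the class labels are all made-up assumptions for illustration, since the real thresholds are not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly storage readings (acre-feet) for one reservoir.
# The values and the capacity are invented for illustration only.
storage_af = pd.Series(
    [50_000, 52_000, np.nan, 58_000, 55_000, 51_000],
    index=pd.to_datetime(
        ["2005-01-01", "2005-02-01", "2005-03-01",
         "2005-04-01", "2005-05-01", "2005-06-01"]
    ),
    name="storage_af",
)
capacity_af = 100_000  # assumed reservoir capacity

# Fill a missing month with the mean of its two adjacent neighbors.
neighbor_mean = (storage_af.shift(1) + storage_af.shift(-1)) / 2
storage_filled = storage_af.fillna(neighbor_mean)

# Express storage as a proportion of the reservoir's capacity.
storage_prop = storage_filled / capacity_af

# Turn Sept 1 storage proportions into four classes.
# Bin edges and labels are assumptions, not the project's definitions.
sept_prop = pd.Series([0.12, 0.40, 0.63, 0.90], index=[2005, 2006, 2007, 2008])
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
labels = ["very_low", "low", "moderate", "high"]
sept_class = pd.cut(sept_prop, bins=bins, labels=labels, include_lowest=True)

# One-hot encode a categorical attribute such as dam type.
dams = pd.DataFrame({"dam_type": ["earth", "concrete", "earth"]})
dam_dummies = pd.get_dummies(dams, columns=["dam_type"])
```

Note that the neighbor-mean fill only works when both adjacent months are present; consecutive gaps would stay missing and need a different strategy.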
Might include a step-by-step series of examples that tell you how to get a development env running
- https://plot.ly/~atomahawk/48/storage-for-major-reservoirs-around-los-angeles-county/
- https://plot.ly/~atomahawk/50/reservoir-storage-capacities-in-2001/
More to come. Will explain insights gleaned, model evaluation, or patterns in visualization.
Best Performer: Random Forest Classifier
- Max Depth: 3
- Number of estimators: 3
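For reference, a minimal sketch of fitting a scikit-learn random forest with these two settings on a year-based train/test/holdout split. The data here is synthetic and the column names (`jan_prop`, `jun_prop`, `avg_temp`) are hypothetical stand-ins, so the resulting score says nothing about the real model's performance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: one row per reservoir-year (10 reservoirs, 2000-2013).
years = np.repeat(np.arange(2000, 2014), 10)
n = len(years)
X = pd.DataFrame({
    "year": years,
    "jan_prop": rng.uniform(0, 1, n),   # hypothetical Jan 1 storage proportion
    "jun_prop": rng.uniform(0, 1, n),   # hypothetical Jun 1 storage proportion
    "avg_temp": rng.normal(18, 3, n),   # hypothetical average temperature
})

# Hypothetical four-class target derived from the June proportion.
y = pd.cut(X["jun_prop"], bins=[0, 0.25, 0.5, 0.75, 1.0],
           labels=False, include_lowest=True)

# Split by year, as described in the project notes.
train = X["year"] <= 2010
test = X["year"].between(2011, 2012)
holdout = X["year"] == 2013

features = ["jan_prop", "jun_prop", "avg_temp"]

# Hyperparameters reported above: max depth 3, 3 estimators.
clf = RandomForestClassifier(max_depth=3, n_estimators=3, random_state=0)
clf.fit(X.loc[train, features], y[train])

test_acc = accuracy_score(y[test], clf.predict(X.loc[test, features]))
print(f"Test accuracy on synthetic data: {test_acc:.2f}")
```

Splitting by year rather than at random keeps later years out of training, which matches the goal of classifying a future summer's end from past data.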
More to come
- Assumes business-as-usual water demand
- No natural disasters (wildfires and earthquakes)
- A static population size
- Unchanging urban, ag, and environmental uses
- Limited to reservoirs that had data available on CDEC
- Storage is the most complete predictor variable, with most reservoirs containing public data on storage
- Reservoirs that had recent data (2000-2017) were used in this analysis (as recent years give context to contemporary population size, consumption, water demand, etc).
- Some reservoirs had storage data dating back from the 80s to 2001
- It is unclear why CDEC stopped recording monthly storage data for some reservoirs
Explain what next steps could involve
This project is licensed under the MIT License - see the LICENSE.md file for details
- Problem & Visual Inspiration: http://ww2.kqed.org/lowdown/2015/09/21/now-that-summers-over-what-do-californias-reservoirs-look-like-a-real-time-visualization/
- etc