oc-ds-p4-supervised-learning's Introduction

Anticipate the energy consumption of new commercial buildings

Overview

To achieve its goal of becoming a carbon-neutral city by 2050, the City of Seattle carried out careful readings of the total energy consumption of its buildings. However, these readings are expensive to obtain.

The aim of this project is to:

  • predict CO2 emissions and total energy consumption of new commercial buildings in Seattle, based on:
    • data available before commercial operation (size and use of buildings, date of construction, etc.)
    • (expensive) surveys already carried out in 2015 and 2016 on existing buildings
  • evaluate how useful the ENERGY STAR Score is for predicting emissions

Motivation

This is project 4 of the OpenClassrooms Master in Data Science (BAC+5 level in the French system). The project tests and compares the performance of baseline, linear, non-linear and ensemble methods of supervised regression. It covers:

  • feature engineering, log/quantile transformation and scaling the data
  • splitting the data into train and test sets, avoiding data leakage
  • using filter, wrapper and embedded methods for feature selection
  • L1, L2 regularization and hyperparameter tuning
  • creating pipelines to preprocess, select features and tune the models
  • performing gridsearch and cross-validation
  • evaluating feature importance, model learning curves

Requirements

To run the notebooks, the dataset must be placed in a DATA_FOLDER ('data/raw'). Python libraries are listed in requirements.txt. Each notebook also includes a list of its own requirements, and a procedure for pip install of any missing libraries.

Data: The dataset (2 CSV data files, 2 JSON metadata files, ~3 MB) can be downloaded from https://www.kaggle.com/datasets/city-of-seattle/sea-building-energy-benchmarking

Python libraries: numpy, pandas, matplotlib, seaborn, scikit-learn, scipy, missingno, dython, shap

Files

Note: The files are in French. Custom functions created in this project for data preprocessing, statistical analysis and data visualisation are defined within each notebook, to avoid importing and versioning custom libraries. If GitHub takes too long to render a notebook, open https://nbviewer.org/ and paste the notebook's GitHub URL.

Approach

Data cleaning

  • data merge, elimination of non-compliant/missing data
  • selection of target columns and only features available for new buildings
  • correction of whitespace, standardisation of formatting (upper/lower case)
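
The cleaning steps above can be sketched with pandas. This is a minimal illustration on tiny synthetic tables, not the notebook's code: the column names (BuildingId, PrimaryUse, TotalEnergy) and the values are hypothetical stand-ins for the 2015 and 2016 survey files.

```python
import pandas as pd

# Synthetic stand-ins for the two yearly survey files (hypothetical columns and values).
df_2015 = pd.DataFrame({
    "BuildingId": [1, 2],
    "PrimaryUse": [" Hotel", "office "],
    "TotalEnergy": [1200.0, None],
})
df_2016 = pd.DataFrame({
    "BuildingId": [3, 4],
    "PrimaryUse": ["HOSPITAL", "  Office"],
    "TotalEnergy": [9800.0, 450.0],
})

# Merge the two years into a single table.
df = pd.concat([df_2015, df_2016], ignore_index=True)

# Eliminate rows where the target value is missing.
df = df.dropna(subset=["TotalEnergy"]).reset_index(drop=True)

# Correct whitespace and standardise upper/lower case in text columns.
df["PrimaryUse"] = df["PrimaryUse"].str.strip().str.title()

print(df)
```

Standardising the text before encoding avoids treating "office " and "  Office" as two different categories.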

Data exploration and Feature Engineering

  • dimension reduction for categorical columns and one-hot encoding
  • creation of new categories via binning
  • log transformation of the features (X) and of the target (via a transformed target regressor)
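
A minimal sketch of how these steps could fit together in scikit-learn, assuming hypothetical column names (floor_area, year_built, primary_use), made-up values and arbitrary bin edges. The regressor (Ridge) is only a placeholder here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Synthetic data (hypothetical columns and values, for illustration only).
X = pd.DataFrame({
    "floor_area": [5_000, 20_000, 80_000, 300_000, 12_000, 150_000],
    "year_built": [1925, 1968, 1990, 2005, 2012, 1979],
    "primary_use": ["Office", "Hotel", "Office", "Hospital", "Retail", "Hotel"],
})
y = np.array([300.0, 1500.0, 5200.0, 41000.0, 700.0, 11000.0])

# Binning: replace the year of construction by a small number of categories.
X["era"] = pd.cut(
    X["year_built"],
    bins=[1900, 1950, 1980, 2000, 2020],
    labels=["1901-1950", "1951-1980", "1981-2000", "2001-2020"],
).astype(str)

# Log-transform the numeric feature and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("log", FunctionTransformer(np.log1p), ["floor_area"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["primary_use", "era"]),
])

# Fit on log(1 + y); predictions are mapped back to the original scale.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))]),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
pred = model.predict(X)
print(pred.round(0))
```

Wrapping the target transformation in TransformedTargetRegressor keeps predictions and error metrics in the original units.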

Feature selection

Features were selected to reduce overfitting (high variance), improve confidence in predictions, simplify the models and speed up training.

  • Filter
    • numerical: elimination of collinearities (Pearson correlation > 0.7, variance inflation factor > 5)
    • categorical: Cramér's V (chi-squared), Theil's U (conditional entropy)
  • Embedded
    • L1 (Lasso), L2 (Ridge), L1 & L2 (ElasticNet) regularisation
    • Feature importance (decision trees)
  • Wrapper (KBest)
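
The numerical filter and the L1 embedded method above can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the helper names drop_correlated and vif are hypothetical, and the variance inflation factor is computed here as the diagonal of the inverse correlation matrix rather than with a dedicated library.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic features: "floors" is nearly collinear with "area", "noise" is irrelevant.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
X = pd.DataFrame({
    "area": area,
    "floors": 0.95 * area + 0.1 * rng.normal(size=n),
    "age": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["area"] - 2.0 * X["age"] + 0.1 * rng.normal(size=n)


def drop_correlated(df, threshold=0.7):
    """Filter: keep the first column of each pair with |Pearson r| > threshold."""
    corr = df.corr(method="pearson").abs()
    cols = list(df.columns)
    to_drop = []
    for i, col in enumerate(cols):
        if col in to_drop:
            continue
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return df.drop(columns=to_drop)


def vif(df):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


X_filtered = drop_correlated(X, threshold=0.7)
print(vif(X).round(1))           # before filtering
print(vif(X_filtered).round(1))  # after filtering

# Embedded: Lasso (L1) shrinks the coefficients of weak features to exactly zero.
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X_filtered), y)
selected = [c for c, w in zip(X_filtered.columns, lasso.coef_) if w != 0]
print(selected)
```

The thresholds (0.7 and 5) are the ones quoted above; alpha=0.1 is an arbitrary value that would normally be tuned by cross-validation.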

Regression Models

GridSearch with cross-validation was used to test the following regressors:

  • Baseline (DummyRegressor)
  • Linear (Ridge, Lasso, ElasticNet)
  • Non-Linear (Support Vectors, Kernel Ridge)
  • Ensemble methods (RandomForest, Bagging)
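
A minimal sketch of such a comparison, assuming synthetic data and small, arbitrary hyperparameter grids. Only a subset of the regressors listed above is shown, and the grids used in the notebooks may differ.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear regression problem (illustration only).
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# Hold out a test set before any tuning, to avoid data leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each candidate: (estimator, grid of hyperparameters to search).
candidates = {
    "baseline": (DummyRegressor(strategy="mean"), {}),
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "kernel_ridge": (
        KernelRidge(kernel="rbf"),
        {"model__alpha": [0.1, 1.0], "model__gamma": [0.1, 1.0]},
    ),
    "random_forest": (
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"model__max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline, so it is refitted on each CV training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    results[name] = -search.best_score_  # mean cross-validated RMSE

for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:15s} CV RMSE = {rmse:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler never sees the validation fold, which is what prevents leakage during the grid search.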

Selection of model

For this set of data, the best performing model was Kernel Ridge (non-linear):

  • Performance metric: low RMSE
  • Faster than ensemble methods
  • Learning curves suggest that training this model may not scale beyond about 3,000 buildings
  • Residual analysis shows underestimation for hospitals and data centers
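
Learning curves of the kind mentioned above can be computed with scikit-learn's learning_curve. The sketch below uses synthetic data and arbitrary Kernel Ridge hyperparameters, so the numbers it prints say nothing about the Seattle dataset.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

# Synthetic regression data (illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=400)

# Cross-validated scores for increasing training set sizes.
sizes, train_scores, valid_scores = learning_curve(
    KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
    X,
    y,
    train_sizes=[0.1, 0.5, 1.0],
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE values; flip the sign and average over the folds.
train_rmse = -train_scores.mean(axis=1)
valid_rmse = -valid_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_rmse, valid_rmse):
    print(f"n={n:4d}  train RMSE={tr:.3f}  validation RMSE={va:.3f}")
```

Kernel Ridge builds an n × n kernel matrix, so its memory and training time grow quickly with the number of training samples.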

Conclusion

  • Log transformation of X and Y variables was needed to reduce the influence of outliers (hospitals and data centers)
  • Binning, simplification and one-hot encoding of categorical variables improved the performance of the model.
  • The best performance overall was using Kernel Ridge regression
  • Residual analysis showed that the energy consumption of hospitals and data centers tends to be under-estimated by the model
  • Including the ENERGY STAR Score reduced performance when predicting total energy consumption and had no impact when predicting total CO2 emissions. The property usage type and the age of the building were more important features.
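
The residual analysis by building type mentioned above can be sketched as follows. The data are synthetic and constructed so that a hypothetical "Hospital" group consumes more than its floor area alone explains. The model (a plain linear regression on log floor area) is a stand-in, not the project's Kernel Ridge model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: hospitals use more energy than floor area alone would predict.
rng = np.random.default_rng(7)
n = 200
building_type = rng.choice(["Office", "Hospital"], size=n, p=[0.9, 0.1])
log_area = rng.normal(10, 1, size=n)
log_energy = (
    log_area
    + np.where(building_type == "Hospital", 1.5, 0.0)
    + 0.2 * rng.normal(size=n)
)

# Fit a model that does not know the building type.
features = log_area.reshape(-1, 1)
model = LinearRegression().fit(features, log_energy)

# Residual = actual - predicted; a positive mean indicates underestimation.
residual = log_energy - model.predict(features)
summary = (
    pd.DataFrame({"building_type": building_type, "residual": residual})
    .groupby("building_type")["residual"]
    .mean()
)
print(summary.round(2))
```

Grouping residuals by category is a quick way to spot building types for which the model is systematically biased.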

Suggestions for Improvement

  • Create new features (datacenter_floor_area, hospital_floor_area, unheated_floor_area,...)
  • Use Recursive Feature Elimination
  • Improve interpretability using SHAP (Shapley) values
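
As an illustration of the Recursive Feature Elimination suggestion, here is a minimal sketch using scikit-learn's RFE on synthetic data. The feature names are hypothetical, and the estimator (Ridge) and number of features to keep are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Synthetic data: only two of the five features drive the target.
rng = np.random.default_rng(3)
feature_names = np.array(
    ["floor_area", "n_floors", "building_age", "latitude", "longitude"]
)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)

# RFE refits the estimator repeatedly, dropping the weakest feature each round.
selector = RFE(Ridge(alpha=1.0), n_features_to_select=2, step=1).fit(X, y)

kept = feature_names[selector.support_]
print("kept:", list(kept))
print("ranking:", dict(zip(feature_names, selector.ranking_)))
```

Unlike a univariate filter, RFE scores features jointly through the fitted model, at the cost of refitting it once per elimination step.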

Features (keywords)

  • Data cleaning (merge, missing values, outliers, whitespace)
  • Feature engineering (log, quantile, binning, one-hot encoding)
  • Scikit-learn processing pipelines, column transformers, transform target regressor
  • Feature selection (Filter, Wrapper, Embedded)
  • Supervised learning (gridsearch, cross-validation, hyperparameter tuning)
  • Linear regression with L1 and L2 regularization (Ridge, Lasso, ElasticNet)
  • Non-linear regression (support vector (SVR), Kernel Ridge)
  • Ensemble methods: Random Forest, Bagging
  • Performance evaluation, learning curves, residuals analysis
  • Feature importance, permutation importance, SHAP (Shapley) values

Skills acquired

  • Set up the supervised learning model adapted to the business problem
  • Evaluate the performance of a supervised learning model
  • Adapt the hyperparameters of a supervised learning algorithm in order to improve it
  • Transform the relevant variables of a supervised learning model

Contributors

mrcreasey
