This script performs energy modeling for buildings using combinations of relevant machine learning algorithms. It takes in cleaned CSV files of building energy data and produces trained models for energy usage prediction.
Build the Docker images using the provided bash script:
bash run_docker.sh
- Conda
- TensorFlow (steps below)
conda create -n tf-gpu --yes python==3.9
conda activate tf-gpu
conda update --yes --all
conda install --yes -c conda-forge prophet
conda install --yes numpy pandas matplotlib flask
pip3 install --upgrade pip
python3 -m pip install --upgrade setuptools
pip3 install -r requirements.txt
Linux/WSL2:
conda install --yes -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
macOS:
# There is currently no official GPU support for macOS.
python3 -m pip install tensorflow
# Verify install:
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
- Place preprocessed CSV files in a local data folder and call it data.csv (e.g., ./data/data.csv).
- Create/assign an explicit 'bldgname' column.
- Create/assign an explicit 'ts' column.
- Verify config.py to ensure it has the correct configurations for your project.
  a. Make sure 'save_preprocessed_files' is set to True for first time preprocessing.
- Run the script with the following command:
- To run the Flask app:
python3 -m flask run --host=0.0.0.0 --port=8080
- To run in a terminal:
python3 main.py [--preprocess] [--train] [--save] [--predict] [...]
The script supports the following flags:
- --preprocess: Runs the preprocessing step on the CSV files in the data folder. The preprocessed data is stored in temporary files for later use in the training step.
- --train: Runs the training step on the temporary files generated by the preprocessing step.
Examples:
- To run only the preprocessing step:
python3 main.py --preprocess
- To run only the training step using the preprocessed data:
python3 main.py --train
- To run both preprocessing and training steps sequentially:
python3 main.py --preprocess --train
Examples:
python3 main.py --preprocess --train --save --time_step 48 --datelevel hour
python3 main.py --preprocess --train --save --time_step 30 --datelevel day
python3 main.py --preprocess --train --save --time_step 12 --datelevel month
python3 main.py --preprocess --train --save --time_step 1 --datelevel year
python3 main.py --predict --building_file Stadium_Data_Extended --y_column all --time_step 48 --datelevel hour
Current Heartbeat Examples:
python3 main.py --prune --save_preprocessed_files --preprocess
python3 main.py --run_all --time_step 48 --datelevel hour --results_file 48_hour.csv
python3 main.py --run_all --time_step 30 --datelevel day --results_file 30_day.csv
python3 main.py --run_all --time_step 12 --datelevel month --results_file 12_month.csv
python3 main.py --run_all --time_step 1 --datelevel year --results_file 1_year.csv
Note: If no flags are specified, the script will display a message and exit without performing any action.
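The flag handling described above can be sketched with argparse. This is a hypothetical reconstruction, not the actual main.py: it registers only flags that appear in this README, and the value types, the allowed --datelevel choices, and the choice of which flags count as "actions" are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Boolean switches for pipeline stages and options.
    for flag in ("--preprocess", "--train", "--save", "--predict",
                 "--prune", "--save_preprocessed_files", "--run_all"):
        parser.add_argument(flag, action="store_true")
    # Options that take a value; types and choices are assumptions.
    parser.add_argument("--time_step", type=int, default=None)
    parser.add_argument("--datelevel",
                        choices=["hour", "day", "month", "year"], default=None)
    parser.add_argument("--building_file", default=None)
    parser.add_argument("--y_column", default=None)
    parser.add_argument("--results_file", default=None)
    return parser


def has_action(args: argparse.Namespace) -> bool:
    """Return True if at least one stage flag was passed (assumed set)."""
    return any((args.preprocess, args.train, args.predict,
                args.prune, args.run_all))


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    ["--preprocess", "--train", "--save", "--time_step", "48", "--datelevel", "hour"]
)
if not has_action(args):
    print("No action flags given; nothing to do.")
else:
    print(vars(args))
```

Parsing an empty list with this sketch leaves every switch False, which corresponds to the "display a message and exit" case in the note above.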
The script contains the following configurations:
- data_path: Path to the directory containing the preprocessed CSV files for building energy data.
- buildings_list: List of preprocessed CSV files for building energy data. The default value is ['Stadium_Data_Extended.csv'].
- save_model_file: Boolean indicating whether to save the trained models as .pkl files. The default value is False.
- save_model_plot: Boolean indicating whether to save the model plots. The default value is False.
- min_number_of_days: Minimum number of days required for a column to be considered for training. The default value is 365.
- memory_limit: Memory limit (in KB) for the AutoSklearnRegressor. The default value is 102400.
- exclude_column: The column to exclude from training. The default value is 'present_co2_tonh'.
- warnings.filterwarnings("ignore"): Ignore warnings during execution.
- y_columns: List of energy usage column names. The default value is ['present_elec_kwh', 'present_htwt_mmbtuh', 'present_wtr_usgal', 'present_chll_tonh', 'present_co2_tonh'].
- add_features: List of additional features to include in the model. The default value is ['temp_c', 'rel_humidity_%', 'surface_pressure_hpa', 'cloud_cover_%', 'direct_radiation_w/m2', 'precipitation_mm', 'wind_speed_ground_km/h', 'wind_dir_ground_deg'].
- header: List of column names in the preprocessed CSV files. The default value is ['ts'] + y_columns + add_features.
- model_types: List of model types to train. Each entry can be 'ensembles' or 'solos'.
- preprocessing_methods: List of preprocessing methods to use. The default value is ['linear_regression', 'linear_interpolation', 'prophet', 'lstm'].
- feature_modes: List of feature selection modes. The default value is ['rfecv', 'lassocv'].
- n_features: List of the number of features to consider. The default value is list(range(1, len(add_features))).
- n_folds: Number of folds for cross-validation. The default value is 5.
- time_steps: List of time steps to use for training. The default value is [1, 8, 12, 24].
- minutes_per_model: Maximum time in minutes for training each model. The default value is 2.
- split_rate: Train-test data split rate. The default value is 0.8.
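For orientation, the defaults listed above could be laid out in a config.py roughly as follows. This is a sketch assembled from the list, not a copy of the real file: the data_path value and the model_types default are not stated in this README and are assumptions.

```python
# Hypothetical config.py layout; values are the defaults listed in this README
# unless a comment marks them as an assumption.
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during execution

data_path = "./data/"  # assumption: only ./data/data.csv is shown as an example
buildings_list = ["Stadium_Data_Extended.csv"]
save_model_file = False
save_model_plot = False
min_number_of_days = 365
memory_limit = 102400  # unit as documented above (KB)
exclude_column = "present_co2_tonh"

y_columns = [
    "present_elec_kwh", "present_htwt_mmbtuh", "present_wtr_usgal",
    "present_chll_tonh", "present_co2_tonh",
]
add_features = [
    "temp_c", "rel_humidity_%", "surface_pressure_hpa", "cloud_cover_%",
    "direct_radiation_w/m2", "precipitation_mm", "wind_speed_ground_km/h",
    "wind_dir_ground_deg",
]
header = ["ts"] + y_columns + add_features

model_types = ["ensembles", "solos"]  # assumption: default is not stated
preprocessing_methods = ["linear_regression", "linear_interpolation", "prophet", "lstm"]
feature_modes = ["rfecv", "lassocv"]
n_features = list(range(1, len(add_features)))
n_folds = 5
time_steps = [1, 8, 12, 24]
minutes_per_model = 2
split_rate = 0.8
```

With eight entries in add_features, list(range(1, len(add_features))) evaluates to 1 through 7, so the documented default does not include the case where all eight features are used.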
The CSV files should contain the following columns:
- ts: Timestamps in YYYY-MM-DD HH:MM:SS format.
- bldgname: Building names.
- (Dependent variables aka y_column(s))
- (Optional additional features)
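Before running the pipeline, it can help to check a CSV file against the column requirements above. The following is a minimal standard-library sketch, not part of the project: validate_rows is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ("ts", "bldgname")
TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def validate_rows(handle) -> int:
    """Check required columns and timestamp format; return the row count."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        ts = row["ts"] or ""
        try:
            datetime.strptime(ts, TS_FORMAT)
        except ValueError as exc:
            raise ValueError(f"line {line_no}: bad timestamp {ts!r}") from exc
        count += 1
    return count


# Made-up sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "ts,bldgname,present_elec_kwh,temp_c\n"
    "2023-01-01 00:00:00,Stadium,120.5,3.2\n"
    "2023-01-01 01:00:00,Stadium,118.0,2.9\n"
)
print(validate_rows(sample))
```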
The data is first converted into a Pandas dataframe, sorted by building name and timestamp, and grouped by building name.
The data is then split into training and testing sets, and normalized using MinMaxScaler.
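The sort, group, split, and normalize steps can be sketched with pandas and scikit-learn as below. This is an illustration under assumptions, not the project's code: split_and_scale is a hypothetical helper, the split is assumed to be chronological, and the scaler is assumed to be fitted on the training rows only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SPLIT_RATE = 0.8  # matches the documented split_rate default


def split_and_scale(df: pd.DataFrame, value_columns: list) -> dict:
    """Return {building: (train, test)} with value columns min-max scaled."""
    df = df.copy()
    df["ts"] = pd.to_datetime(df["ts"])
    df = df.sort_values(["bldgname", "ts"])
    result = {}
    for name, group in df.groupby("bldgname"):
        cut = int(len(group) * SPLIT_RATE)  # assumed chronological split
        train, test = group.iloc[:cut].copy(), group.iloc[cut:].copy()
        # Assumption: fit on training rows only so test data does not leak.
        scaler = MinMaxScaler().fit(train[value_columns])
        train[value_columns] = scaler.transform(train[value_columns])
        test[value_columns] = scaler.transform(test[value_columns])
        result[name] = (train, test)
    return result


# Made-up demo data: ten daily readings for one building.
demo = pd.DataFrame({
    "ts": [f"2023-01-{day:02d} 00:00:00" for day in range(1, 11)],
    "bldgname": ["Stadium"] * 10,
    "present_elec_kwh": [float(v) for v in range(10, 20)],
})
train, test = split_and_scale(demo, ["present_elec_kwh"])["Stadium"]
print(len(train), len(test))
```

Because the scaler is fitted on the training rows, test values outside the training range can fall outside [0, 1].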
For each building and each energy usage column, the script trains an AutoML model using AutoSklearnRegressor.
If an energy usage column covers at least one year (365 days) of data and is not present_co2_tonh, its missing values are filled using Prophet.
The AutoML models are trained on a sliding window of the training data using various time steps specified in time_steps.
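A sliding window turns a single series into supervised training pairs. The sketch below shows one common formulation, assuming each window of time_step consecutive values is used to predict the value that follows it; make_windows is a hypothetical helper and the project's actual windowing may differ.

```python
import numpy as np


def make_windows(series, time_step: int):
    """Build (X, y): each X row holds `time_step` consecutive values and
    y holds the value immediately after each window."""
    series = np.asarray(series, dtype=float)
    if len(series) <= time_step:
        raise ValueError("series must be longer than time_step")
    X = np.stack([series[i:i + time_step]
                  for i in range(len(series) - time_step)])
    y = series[time_step:]
    return X, y


X, y = make_windows(np.arange(6), time_step=3)
print(X.shape)  # (3, 3)
print(y)        # [3. 4. 5.]
```

Larger time_step values give each training row more history but leave fewer rows, which is one reason to try several values such as those in time_steps.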
For model_type == 'solos', only one model is trained for each energy usage column. For model_type == 'ensembles', multiple models are trained for each energy usage column and their predictions are ensembled together.
The trained models are evaluated using various metrics, including:
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
- Mean absolute percentage error (MAPE)
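For reference, the three metrics can be computed as in the following standard-library sketch. The function names are illustrative, MAPE is expressed here as a percentage, and the true values are assumed to be nonzero; the project may compute these through a library instead.

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(round(rmse(actual, predicted), 3))  # 8.165
print(round(mae(actual, predicted), 3))   # 6.667
print(round(mape(actual, predicted), 3))  # 5.0
```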
The trained models are saved as .pkl files in the ./models/ folder. The evaluation metrics are saved to results.csv in the project root directory.
The temperature parameter of the adaptive_sampling function controls how many data points are selected within each group during optimization, both for exploration (random sampling) and for exploitation (selecting the best solutions). A higher temperature increases both counts, while a lower temperature reduces them. The number of data points actually available for exploration can be limited if the exploitation step uses most of the data points in a group.
Once the temperature is lowered, it has a 'sticky' effect: the number of results continues to correlate with the lowest temperature value used until it is reset. To raise the temperature after lowering it, first reset it to -1, then set it to the desired higher value.
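The reset behavior described above can be pictured with a toy model. This is not the project's adaptive_sampling code: StickyTemperature is a hypothetical class that only mimics the stated rule that the lowest value sticks until -1 is set.

```python
class StickyTemperature:
    """Toy model of the 'sticky' rule; not the project's implementation."""

    RESET = -1

    def __init__(self):
        self._lowest = None  # lowest temperature seen since the last reset

    def set(self, value):
        if value == self.RESET:
            self._lowest = None  # clear the sticky minimum
        elif self._lowest is None or value < self._lowest:
            self._lowest = value
        # A higher value without a reset is ignored: the lowest value sticks.

    @property
    def effective(self):
        return self._lowest


temp = StickyTemperature()
temp.set(10)
temp.set(3)
temp.set(8)            # no reset, so the effective value stays at 3
print(temp.effective)  # 3
temp.set(-1)           # reset first
temp.set(8)
print(temp.effective)  # 8
```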