Maths and Stats
Simon Wood
Matthew Dailey (s1837608)
Theodora Richards (s1723589)
Emma Saheed (s1858135)
Eman Wong (s1974845)
Zhihao Xu (s2135624)
This project will involve finding publicly available data on Covid-19 death rates (per 100 thousand people) for different countries, alongside variables that may be associated with the differences in death rate, such as health system spending, hospital cross infection rates, GDP, economic inequality, extent of suppression measures undertaken, etc.
The assembled data set will then be analysed using statistical regression methods in R, to investigate the strength and statistical significance of any apparent associations.
A couple of example data sources are https://www.worldometers.info/coronavirus/ and https://www.cia.gov/the-world-factbook/
Group (group size 1 - 4)
You need to be competent with applied statistics, linear regression and R. You also need to be comfortable searching out data on the web, and getting it into a usable form.
Faraway, J Linear Models with R, CRC/Taylor and Francis
This repository contains all the code that was used in the creation of our report.
There are several folders:
All Model Latex
Combined DataFrame Work
Data_Dump
GLM (Random Forest (2nd))
GLM Data and Analysis
GLM Take 2
LM Data and Analysis
Old/Middle-Man Data Frames
R_Scripts
Working_Data
The All Model Latex folder contains the LaTeX of the tables for the final model summaries used in our report.
The subfolder structure of Combined DataFrame Work is:
CSV Files
Clean
Old
Not Clean By Category
Code
Old
Clean contains our data sorted by category, after a cleaning step that keeps only countries that are entities in the United Nations.
The Old subfolder that follows Clean is deprecated.
Not Clean By Category contains our data sorted by category, before the cleaning step was done. It therefore still includes entities that are not countries, such as "Western Europe".
Code contains the code used for merging our initial data pulled from OWID, as well as for cleaning the countries included in our datasets.
The Old subfolder that follows Code is deprecated.
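The merging and cleaning steps described above could look roughly like the following in base R. This is a minimal sketch and not the code in this repository: the column names (location, deaths_per_100k, gdp), the toy values, and the two-entry un_members list are all assumptions made for illustration.

```r
# Two toy category tables; column names and values are made up for illustration.
deaths <- data.frame(
  location = c("France", "Western Europe", "Japan"),
  deaths_per_100k = c(1.2, 3.4, 0.5),
  stringsAsFactors = FALSE
)
gdp <- data.frame(
  location = c("Japan", "France", "World"),
  gdp = c(5.0, 2.9, 90.0),
  stringsAsFactors = FALSE
)

# Merge on the shared key, keeping rows that appear in either table.
merged <- merge(deaths, gdp, by = "location", all = TRUE)

# Hypothetical reference list; in practice this would hold every UN member state.
un_members <- c("France", "Japan")

# Keep only rows whose location is a recognised country.
clean <- merged[merged$location %in% un_members, , drop = FALSE]

# Record what was dropped so aggregates are removed deliberately, not silently.
dropped <- setdiff(merged$location, un_members)
```

Listing the dropped names makes it easy to check that only aggregates such as "Western Europe" were removed, rather than countries whose names are spelled differently across sources.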
The Data_Dump folder contains all the data pulled from our sources.
The GLM (Random Forest (2nd)) folder contains the code and data for our RF models, which used the R randomForest library for our imputation method.
The subfolder structure is:
Bootstrap
Categories Imputed Model
Categories Miss Model
Combined Model
RF Split Into Categories Initial Work
RF Category Imputed Full CSV
RF Category Sig Imputed CSV
RF Category Sig Miss CSV
Bootstrap contains the script used for bootstrapping our model selection process, as well as CSVs that include the tallies for each included variable under each model selection pipeline.
Categories Imputed Model contains the R script for model RF(b). This uses the data stored in GLM (Random Forest (2nd))/RF Split Into Categories Initial Work/RF Category Sig Imputed CSV.
Categories Miss Model contains the R script for model RF(c). This uses the data stored in GLM (Random Forest (2nd))/RF Split Into Categories Initial Work/RF Category Sig Miss CSV.
Combined Model contains the R scripts for model RF(a). This uses the data in GLM Take 2/Combined Model/combined_all_missing.csv.
The R scripts in RF Split Into Categories Initial Work process the clean data by category and run an initial randomForest imputation and model selection. This gives us the subset of variables to feed into the next step of the model selection pipeline for RF(b) and RF(c). The data for this subset is then stored in GLM (Random Forest (2nd))/RF Split Into Categories Initial Work/RF Category Sig Imputed CSV and GLM (Random Forest (2nd))/RF Split Into Categories Initial Work/RF Category Sig Miss CSV, to generate RF(b) and RF(c) respectively.
RF Category Imputed Full CSV contains the imputed CSVs for all variables initially included in the GLM (Random Forest (2nd))/RF Split Into Categories Initial Work scripts.
RF Category Sig Imputed CSV holds the CSVs to be used in the second stage of the RF(b) model selection pipeline.
RF Category Sig Miss CSV holds the CSVs to be used in the second stage of the RF(c) model selection pipeline.
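The idea of bootstrapping a model selection process and tallying how often each variable survives can be sketched in base R as below. This is an illustration under stated assumptions, not the script in the Bootstrap subfolder: the data are simulated, the variable names are hypothetical, and backward selection by AIC with step() stands in for the actual selection pipelines.

```r
set.seed(1)  # fixed seed so the illustration is reproducible

# Simulated data: y depends on x1 only; x2 and x3 are pure noise.
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

# Resample rows with replacement, rerun the selection, and tally the survivors.
B <- 50
tally <- c(x1 = 0, x2 = 0, x3 = 0)
for (b in seq_len(B)) {
  boot <- dat[sample(n, replace = TRUE), ]
  full <- glm(y ~ x1 + x2 + x3, data = boot, family = gaussian())
  # Stand-in selection step: backward elimination by AIC.
  chosen <- step(full, direction = "backward", trace = 0)
  kept <- attr(terms(chosen), "term.labels")
  tally[kept] <- tally[kept] + 1
}

# Number of bootstrap replicates (out of B) in which each variable was retained.
tally
```

A variable that is retained in nearly every replicate is a more stable choice than one that appears only occasionally, which is the kind of information the tally CSVs record.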
Deprecated
The GLM Data and Analysis folder contains the R scripts and CSVs that were used for our initial GLM modelling. However, we were using the missForest package and included the response along with our explanatory variables during the imputation. This produced relationships between our variables and the response that were "too good to be true", which we traced to the response variable being included in the data used for imputation. We therefore left the response out when imputing our data, and the R scripts and CSVs used in that process are in GLM Take 2.
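The fix described above, imputing the explanatory variables without the response, can be sketched as follows. This is a hedged illustration rather than the code in GLM Take 2: the helper name impute_without_response, the toy columns, and their values are hypothetical, and a simple median imputer stands in for missForest so that the example runs with base R alone.

```r
# Impute predictors only, so the response cannot leak into the imputed values.
# `impute_fn` is any function that takes and returns a data frame of predictors.
impute_without_response <- function(df, response, impute_fn) {
  predictors <- df[, setdiff(names(df), response), drop = FALSE]
  imputed <- impute_fn(predictors)       # the response column is never seen here
  imputed[[response]] <- df[[response]]  # reattach the untouched response
  imputed
}

# Stand-in imputer (column medians). With missForest installed, one could pass
#   function(x) missForest::missForest(x)$ximp
# instead.
median_impute <- function(x) {
  x[] <- lapply(x, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  x
}

# Toy data with made-up values.
toy <- data.frame(
  death_rate = c(10, 20, 30, 40),
  gdp        = c(1, NA, 3, 5),
  spending   = c(NA, 2, 4, 6)
)
completed <- impute_without_response(toy, "death_rate", median_impute)
```

Because the imputer only ever receives the predictor columns, any association later found between an imputed predictor and the response cannot have been manufactured by the imputation step itself.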
The GLM Take 2 folder contains the code and data for our MF models, which used the R missForest library for our imputation method.
The subfolder structure is:
Bootstrap
Categories Imputed Model
Categories Miss Model
Combined Model
Split Into Categories Initial Work
Category Imputed Full CSV
Category Sig Imputed CSV
Category Sig Miss CSV
Summary Statistics
Report Graphs
Bootstrap contains the script used for bootstrapping our model selection process, as well as CSVs that include the tallies for each included variable under each model selection pipeline.
Categories Imputed Model contains the R script for model MF(b). This uses the data stored in GLM Take 2/Split Into Categories Initial Work/Category Sig Imputed CSV.
Categories Miss Model contains the R script for model MF(c). This uses the data stored in GLM Take 2/Split Into Categories Initial Work/Category Sig Miss CSV.
Combined Model contains the R scripts and CSVs used for model MF(a). The final script is Full Model Imputation.R, which uses the data in combined_all_missing.csv.
The R scripts in Split Into Categories Initial Work process the clean data by category and run an initial missForest imputation and model selection. This gives us the subset of variables to feed into the next step of the model selection pipeline for MF(b) and MF(c). The data for this subset is then stored in GLM Take 2/Split Into Categories Initial Work/Category Sig Imputed CSV and GLM Take 2/Split Into Categories Initial Work/Category Sig Miss CSV, to generate MF(b) and MF(c) respectively.
Category Imputed Full CSV contains the imputed CSVs for all variables initially included in the GLM Take 2/Split Into Categories Initial Work scripts.
Category Sig Imputed CSV holds the CSVs to be used in the second stage of the MF(b) model selection pipeline.
Category Sig Miss CSV holds the CSVs to be used in the second stage of the MF(c) model selection pipeline.
Summary Statistics contains the R script used to generate the summary statistics used in the report.
Report Graphs contains the images generated by GLM Take 2/Summary Statistics/Category Summary Statistics.R to be used in the report.
Deprecated
The LM Data and Analysis folder contains the R scripts and CSVs that were used for our initial LM modelling. However, we found that the model assumptions required for linear modelling were not satisfied, so a linear model was not suitable for our purposes.
Deprecated
The Old/Middle-Man Data Frames folder contains our initial CSV files after pre-processing. These are no longer used.
The R_Scripts folder contains subfolders with code, CSV files, graphics, etc. generated by each member. The graphics used throughout the report, aside from the summary statistics, are found in R_Scripts/MatthewR. For our model selection process, several helper functions were written to drop variables based on their VIF, as well as to perform backwards selection using AICc while fitting models using the glm2 package. These helper functions are contained in R_Scripts/EmanR/automate_vif.R and R_Scripts/EmanR/step2.R.
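To make the two ideas behind these helpers concrete, here is a minimal base-R sketch of VIF-based variable dropping and of the AICc criterion. It is not the contents of automate_vif.R or step2.R: the function names (vif_values, drop_high_vif, aicc), the cutoff of 5, and the simulated data are assumptions for illustration, and base glm() is used in place of glm2.

```r
# Variance inflation factor of each predictor: 1 / (1 - R^2), where R^2 comes
# from regressing that predictor on all the other predictors.
vif_values <- function(X) {
  sapply(names(X), function(v) {
    f <- reformulate(setdiff(names(X), v), response = v)
    1 / (1 - summary(lm(f, data = X))$r.squared)
  })
}

# Repeatedly drop the predictor with the largest VIF until all fall below `cutoff`.
drop_high_vif <- function(X, cutoff = 5) {
  while (ncol(X) > 1) {
    v <- vif_values(X)
    if (max(v) < cutoff) break
    X <- X[, names(X) != names(which.max(v)), drop = FALSE]
  }
  X
}

# Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1),
# with k the number of estimated parameters and n the number of observations.
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

# Simulated example: x2 is almost a copy of x1, so one of the pair is dropped.
set.seed(2)
n <- 60
x1 <- rnorm(n)
X <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.05), x3 = rnorm(n))
y <- 1 + x1 + rnorm(n)

kept <- drop_high_vif(X, cutoff = 5)
fit <- glm(y ~ ., data = cbind(kept, y = y), family = gaussian())
aicc(fit)
```

A backward selection on AICc would refit the model with each remaining term removed in turn and keep the removal that lowers aicc() the most, stopping when no removal improves it.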
The Working_Data folder contains all the data, sorted by category, after an initial pruning process in which irrelevant datasets in Data_Dump were disregarded.