Members: Mani Hema Prashaad, Wang Yiteng, Wong Jun Hong, Yang Xinyi.
- Isolation of Breast Cancer Related Genes' Expression
- Cibersort
- Cox Analysis
- Random Survival Forest (RSF)
- Generation of Means Comparison Graphs
-
Isolation of prognostic gene expression.ipynb
- Filters each case's RNA-seq data files down to the primary tumour samples, then extracts the FPKM values for the genes-of-interest.
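
The extraction step could look roughly like the sketch below. It is an illustration, not the notebook's code. It assumes each FPKM file is a headerless, tab-separated table with a versioned Ensembl gene ID in the first column and the FPKM value in the second. The function name `extract_fpkm`, the version suffixes and the sample values are hypothetical.

```python
import csv
import io

def extract_fpkm(handle, genes_of_interest):
    """Return {gene_id: FPKM} for the requested genes from one FPKM file.

    Assumes two tab-separated columns per row: a versioned Ensembl gene ID
    (for example ENSG00000141510.15) and its FPKM value.
    """
    wanted = set(genes_of_interest)
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        gene_id = row[0].split(".")[0]  # drop the version suffix
        if gene_id in wanted:
            values[gene_id] = float(row[1])
    return values

# Illustrative in-memory file; a real file would be opened from disk.
sample = io.StringIO(
    "ENSG00000141510.15\t12.5\n"
    "ENSG00000012048.18\t3.25\n"
    "ENSG00000000003.13\t7.0\n"
)
result = extract_fpkm(sample, ["ENSG00000141510", "ENSG00000012048"])
```

Here `result` holds only the two requested genes; the third row is ignored because its ID is not in the gene list.
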
-
- FPKM data of the patients, pulled from GDC.
-
- Executed to generate the mixture file to be sent to CIBERSORT.
- Also determines the cases without FPKM data (GDC does not have FPKM records of these patients).
-
- Contains the 967 cases-of-interest that remain (from 1098 cases) after dropping cases or imputing missing values.
-
- Contains the file IDs of all files pulled from GDC.
-
- The MANIFEST.txt file does not include the case ID of each dataset.
- Using a scraped JSON, these file IDs were mapped to their respective case IDs.
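
A minimal sketch of this file-ID-to-case-ID mapping follows. The layout of the scraped JSON is not given here, so the record shape used below (a `file_id` plus a `cases` list whose entries carry a `case_id`) is an assumption, as are the function name `map_files_to_cases` and the placeholder IDs.

```python
import json

def map_files_to_cases(metadata_json):
    """Build {file_id: case_id} from a JSON array of file records.

    Assumed (hypothetical) record layout:
    {"file_id": "...", "cases": [{"case_id": "..."}]}
    """
    mapping = {}
    for record in json.loads(metadata_json):
        cases = record.get("cases") or []
        if cases:  # files with no associated case are left unmapped
            mapping[record["file_id"]] = cases[0]["case_id"]
    return mapping

# Placeholder records standing in for the scraped JSON.
example = json.dumps([
    {"file_id": "file-0001", "cases": [{"case_id": "case-A"}]},
    {"file_id": "file-0002", "cases": [{"case_id": "case-B"}]},
    {"file_id": "file-0003", "cases": []},
])
file_to_case = map_files_to_cases(example)
```

Files without a case entry are skipped rather than mapped to a placeholder, which makes unmapped files easy to spot by comparing against the manifest.
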
-
- GDC FPKM files use Ensembl gene IDs, which are incompatible with CIBERSORT's LM22 signature matrix, which requires HUGO symbols.
- Maps the Ensembl gene IDs to HUGO symbols.
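
The ID conversion can be sketched as a dictionary lookup. This Python version is only an illustration and is not the repository's implementation. The lookup table would normally come from an annotation file; here it is a two-entry stand-in. Summing values when several Ensembl IDs share one symbol is an assumption of this sketch, and `ENSG00000999999` is a made-up ID used to show an unmapped gene.

```python
def to_hugo(expression, ensembl_to_hugo):
    """Re-key {ensembl_id: value} by HUGO symbol.

    Genes without a known symbol are dropped. If several Ensembl IDs map
    to the same symbol, their values are summed (assumption of this sketch).
    """
    converted = {}
    for ensembl_id, value in expression.items():
        # Strip the version suffix before looking the ID up.
        symbol = ensembl_to_hugo.get(ensembl_id.split(".")[0])
        if symbol is None:
            continue
        converted[symbol] = converted.get(symbol, 0.0) + value
    return converted

# Stand-in lookup table with two real Ensembl-to-HUGO pairs.
lookup = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}
mixture_row = to_hugo(
    {
        "ENSG00000141510.15": 12.5,
        "ENSG00000012048.18": 3.25,
        "ENSG00000999999.1": 1.0,  # made-up ID with no symbol
    },
    lookup,
)
```
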
-
- The output mixture file after executing CibersortProcess.java.
- Uploaded to CIBERSORT.
-
- Output from CIBERSORT after 100 permutations were run, using LM22 and the mixture file.
- Contains the cellular proportions of each case. These were used as features for predicting survival.
-
- Cox and penalized Cox model for a combination of clinical and gene expression data after feature selection.
-
- Data processing including trimming and imputation for clinical data.
- Cox and penalized Cox model for only clinical features.
-
- Cox and penalized Cox model for a combination of clinical and TIIC data after feature selection.
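
For readers unfamiliar with the difference between the Cox and penalized Cox models, the sketch below computes the objective that such models minimise: the negative Cox log partial likelihood, optionally with a penalty on the coefficients. The penalty form used by the notebooks is not stated here, so the ridge (L2) penalty, the function name and the toy data are assumptions. Tied event times are handled in the simplest (Breslow-style) way.

```python
import math

def penalized_neg_log_partial_likelihood(times, events, X, beta, alpha=0.0):
    """Negative Cox log partial likelihood plus a ridge (L2) penalty.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    X      : one row of feature values per patient
    beta   : one coefficient per feature
    alpha  : penalty strength (0 gives the unpenalized Cox objective)
    """
    # Linear predictor (log relative hazard) for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk)
    penalty = alpha * sum(b * b for b in beta)
    return -loglik + penalty

# Toy data: three patients, one feature, all events observed.
times, events, X = [1, 2, 3], [1, 1, 1], [[0.0], [1.0], [2.0]]
plain = penalized_neg_log_partial_likelihood(times, events, X, [1.0])
ridge = penalized_neg_log_partial_likelihood(times, events, X, [1.0], alpha=0.5)
```

The penalized objective differs from the plain one only by the penalty term, which pulls coefficients towards zero. This helps when many clinical, gene expression or TIIC features are fitted at once.
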
-
- Contains the 962 patients' clinical data and survival status.
-
Expression profile of Prognostic genes
- Contains the expression data of the prognostic genes for the 962 patients.
-
- Contains the code for Random Survival Forest model.
- Graphing.ipynb
- Plots bar plots, overlaid with scatter points of the individual data points, of the training and testing accuracy metrics for the survival analysis models used.
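
Which accuracy metric is plotted is not stated here. A common choice for survival models is Harrell's concordance index, so the sketch below assumes that metric purely for illustration. The function name and toy data are hypothetical, and pairs with tied survival times are ignored for simplicity.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    patient i also has the higher predicted risk; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: higher risk scores go with shorter survival times.
perfect = concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0])
reversed_order = concordance_index([1, 2, 3], [1, 1, 1], [1.0, 2.0, 3.0])
```

A value of 1 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0 means every pair is ranked the wrong way round.
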