This study presents a machine learning-based approach to predict the primary disease type of samples,
focusing on breast cancer, lung squamous cell carcinoma, and lung adenocarcinoma. Gene expression
data from "train_data.tsv" and "train_label.tsv" were employed for model training and evaluation.
Performance metrics, including F1 score, Accuracy, AUC, Recall, Precision, Kappa, and MCC, were
utilized to gauge model performance. Leveraging AutoML and ensemble techniques, an effective
predictive model was developed. The analysis encompassed exploratory data analysis, dimensionality
reduction, and feature selection techniques to enhance the prediction process. The results underscore the
effectiveness of the model in accurately classifying cancer types, emphasizing the significance of robust
pre-processing and thoughtful model selection.
The objective of this analysis is to perform a comprehensive investigation of differential gene expression
and pathway enrichment between "Tumor" and "Normal" samples. Leveraging the DESeq2 package [1],
differential gene expression analysis is conducted. Subsequently, the GAGE package [3] is employed to
perform pathway enrichment analysis using the KEGG pathway database. The gene expression data
sourced from "TCGA-BRCA.htseq_counts_gene_name.tsv" is in log2-transformed RNAseq count
format, with corresponding sample labels available in the "TCGA-BRCA.pheno.tsv" file.
Materials and Methods: