Alzheimers_Diagnosis_StageDetermination

EECS 6893 Big Data Analytics - Final Project

Project ID: 201712-18

Authors: Jing Ai (ja3130), Michael Nguyen (mn2769), Haoquan Zhao (hz2441)

Alzheimer’s Disease affects 1 in 3 seniors in the US and is one of the fastest-rising parts of the healthcare budget. Gaining a better understanding of disease patterns and achieving an accurate diagnosis that reflects disease progression are crucial problems to address. Our project aims to identify the Alzheimer’s Disease biomarker combinations with the highest diagnostic power and to examine the disease patterns of patients at different disease stages. The novelty of our project is that we performed a comprehensive analysis that integrated clinical, genomic and imaging data and included patients at multiple disease stages (Normal, Early Mild Cognitive Impairment, Late Mild Cognitive Impairment and Diagnosed Alzheimer’s), whereas previous studies have focused their analysis on one modality and binary phenotypes (diseased/not diseased).

Dataset: ADNI data collection

The Alzheimer's Disease Neuroimaging Initiative (ADNI) data collection is a publicly available collection of clinical, genetic and imaging datasets based on studies of approximately 1,550 participants, including Alzheimer’s disease patients, mild cognitive impairment subjects and elderly controls, across 3 multi-year cohorts (ADNI1, ADNI GO, ADNI2) between 2004 and 2017. More information on the data collection can be found here: http://adni.loni.usc.edu/data-samples/

Analytics

Data Normalization, Preprocessing using PCA

Preprocessing

Clinical

clinical_preprocess.py

Imaging - radiology measurements

imaging_preprocess.py

Genetic

genetic_preprocess.py

Combined

combined_preprocess.py

Data-merging

Classification, Feature importance, Feature correlations and Data visualization

Multi-layer Perceptron for Merged Data Classification

mlp.py

Convolutional Neural Network (CNN) for Raw Images Classification

Random Forest for classification and assessing feature importance

Spearman’s correlation for estimating feature correlations

adni_visualization.R

tSNE for visualizing high dimensional data

tSNE.ipynb

Problem in Duplicating your study run

Hi, I am trying to replicate your project. I got the data from ADNI, but there are some issues.

(a) I assume I can run the steps one by one, i.e. preprocessing first, then merge, then ... I am concentrating on the 4 preprocessing steps and 2 merge steps first.

###The most important issue is in (c), about an error which I cannot handle, but I present my questions step by step.

(b) For the pre-merge steps, I cannot find the sources of some of the datasets. I handled it somehow, but many of the datasets cannot be found in the csv files and/or the challenge. I ended up just restoring your datasets, but that is no good, as I should generate them from the ADNI sources. In particular, I am not sure where the sources of at least the following 4 are:

Merged_clinobioimg_nona.csv     <-- seems to be generated by the 2nd merge step
adin_clin_gwas.csv              <-- ??? not sure where this is coming from
Merged_Filtered.csv             <-- seems to be generated by preprocessing step 4
MergedProcessedMRI_filtered.csv <-- seems to be generated by the 1st merge step, as well as from outside
      
For the last one, you may refer to this statement in merge step 2:

    #img = pd.read_csv('/Users/ja/Documents/BigDataAnalytics/BigData_ADNI_project/Data/ProcessedImaging/MergedProcessedMRI_filtered.csv')

I cannot find this file in the ADNI download, and I think there is another file like it. Where is the big data ADNI project? Your advice would be very helpful.

(i) VISCODE cannot be dropped, because the merge steps above generate VISCODE_x and VISCODE_y. I amended the code to drop those instead of VISCODE, but I am not sure that is right.
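
The _x and _y suffixes are what pandas appends when both frames carry a column with the same name that is not part of the merge key. A minimal sketch of that behavior, assuming the tables are keyed by both RID and VISCODE; the toy frames and the AGE and HIPPO_VOL columns are hypothetical and are not the project's data:

```python
import pandas as pd

# Hypothetical toy tables; real ADNI tables have many more columns.
clin = pd.DataFrame(
    {"RID": [1, 1, 2], "VISCODE": ["bl", "m06", "bl"], "AGE": [70.0, 70.5, 65.0]}
)
img = pd.DataFrame(
    {"RID": [1, 2], "VISCODE": ["bl", "bl"], "HIPPO_VOL": [3.1, 3.4]}
)

# Merging on RID only: VISCODE exists in both frames, so pandas renames
# the two copies to VISCODE_x and VISCODE_y.
by_rid = pd.merge(clin, img, on="RID")

# Merging on both keys keeps a single VISCODE column and pairs rows per visit.
by_visit = pd.merge(clin, img, on=["RID", "VISCODE"])

print(list(by_rid.columns))
print(list(by_visit.columns))
```

Whether dropping VISCODE_x and VISCODE_y is appropriate depends on whether the analysis needs rows aligned per visit: after a merge on RID alone, some rows pair measurements taken at different visits.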

(ii) Then these errors appear ... I cannot resolve them; would you mind having a look?

# Convert features to numpy array for PCA transformation
d_x = d_x.as_matrix()
d_x = d_x.astype(float)

# Normalize features to mean=0, variance = 1
x_mean = np.mean(d_x, axis = 0)
x_std = np.std(d_x, axis = 0)

d_x = np.subtract( d_x, np.matlib.repmat(x_mean, n_sampels, 1) )
d_x = np.divide( d_x,np.matlib.repmat(x_std, n_sampels, 1) )

--- Errors below. The first (.values) is unimportant for the moment, but I do not know how to handle the others. ---

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\_methods.py:78: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\_methods.py:130: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)

If I ignore them, the PCA is obviously not performed and nothing comes out.
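
The "Mean of empty slice" and "Degrees of freedom <= 0" warnings are what NumPy emits when the array has zero rows, so one possible cause is that an earlier filtering or NaN-dropping step removed every row. This is an assumption, since the upstream frame is not shown. A minimal sketch of the same normalization with a guard for that case; the standardize helper is hypothetical, and it uses .to_numpy() and broadcasting in place of .as_matrix() and np.matlib.repmat:

```python
import numpy as np
import pandas as pd


def standardize(features: pd.DataFrame) -> np.ndarray:
    """Scale each column to mean 0 and variance 1 (hypothetical helper)."""
    if features.shape[0] == 0:
        # A frame with zero rows is what triggers "Mean of empty slice".
        raise ValueError(
            "no rows left to normalize; check upstream merges and NaN filtering"
        )
    x = features.to_numpy(dtype=float)
    x_mean = x.mean(axis=0)
    x_std = x.std(axis=0)
    # Constant columns have std 0; leave them at 0 instead of dividing by 0.
    x_std[x_std == 0] = 1.0
    # Broadcasting applies the per-column mean and std without repmat.
    return (x - x_mean) / x_std


# Toy input; column "b" is constant on purpose.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = standardize(demo)
```

Printing d_x.shape just before the normalization would show whether the frame is already empty at that point.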

(d) Under Merge there is another notebook about imaging files, "Extract NIfTI imaging files".

I am not sure where to find the files or what the purpose of this script is.

    rootdir = '/Users/ja/Documents/BigDataAnalytics/Preprocessed_AD/Preprocessed_AD_%s'%n

and the output is

    copyfile(rootdir+'/'+fid+'/'+sub +'/'+sub2+'/'+sub3+'/'+image_name, "/Users/ja/Documents/BigDataAnalytics/Preprocessed_AD/"+image_name)

I am asking in advance in case it is useful in the future.
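
Judging only from the copyfile call, the script appears to flatten a nested download tree by copying each image into a single directory; that reading is an assumption. A minimal sketch of that idea, where the flatten_nifti helper, the .nii extension and the directory names are all hypothetical:

```python
import shutil
import tempfile
from pathlib import Path


def flatten_nifti(root_dir: Path, out_dir: Path) -> list:
    """Copy every .nii file found under root_dir into out_dir (hypothetical helper).

    Files that share a name overwrite each other, as with a plain copyfile.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(root_dir.rglob("*.nii")):
        dst = out_dir / src.name
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied


# Demo on a throwaway directory tree holding one empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "in" / "subject" / "scan" / "date" / "series"
    nested.mkdir(parents=True)
    (nested / "image_001.nii").write_bytes(b"")
    result = flatten_nifti(Path(tmp) / "in", Path(tmp) / "out")
    copied_names = [p.name for p in result]
```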


Thank you for your kind advice.
