Schizophrenia is a serious mental disorder that affects how a person thinks, feels, and behaves. It is often described as a type of psychosis, in which individuals may have difficulty distinguishing their own thoughts and ideas from reality. Symptoms can include hallucinations, delusions, and muddled thoughts and speech. Its exact cause is believed to involve a combination of genetic and environmental factors, and it is considered a brain disorder.
An electroencephalogram (EEG) is a diagnostic test that assesses the electrical activity in the brain by placing small metal discs (electrodes) on the scalp. The brain constantly communicates through electrical impulses, remaining active even during sleep. This ongoing activity is depicted as wavy lines on the EEG recording, providing valuable insights into brain function.
Integrating machine learning with EEG analysis can enhance early schizophrenia detection. Because EEG is non-invasive, it can capture brain activity under different conditions such as rest, sleep, listening, and cognitive activity. This versatility matters in schizophrenia research, where patients show distinctive EEG patterns compared to a control group, and examining EEG across varied conditions gives a more complete picture of schizophrenia-related brain activity. Machine learning algorithms are well suited to analyzing these intricate patterns and identifying potential markers for schizophrenia. Numerous studies report strong performance from machine learning models in classifying schizophrenia from diverse EEG data.
- The raw data comprises 32 participants: 17 diagnosed with schizophrenia and 15 in the control group.
- The EEG data was recorded with a reference-free montage based on the 10-20 system, using the electrodes shown in the table and image below:

| Electrode | Location |
|-----------|----------|
| Fp1 | Frontopolar, Left Hemisphere |
| Fp2 | Frontopolar, Right Hemisphere |
| F3 | Frontal, Left Hemisphere |
| F4 | Frontal, Right Hemisphere |
| C3 | Central, Left Hemisphere |
| C4 | Central, Right Hemisphere |
| P3 | Parietal, Left Hemisphere |
| P4 | Parietal, Right Hemisphere |
| O1 | Occipital, Left Hemisphere |
| O2 | Occipital, Right Hemisphere |
| F7 | Frontotemporal, Left Hemisphere |
| F8 | Frontotemporal, Right Hemisphere |
| T3 | Temporal, Left Hemisphere |
| T4 | Temporal, Right Hemisphere |
| T5 | Temporal, Left Hemisphere (posterior to T3) |
| T6 | Temporal, Right Hemisphere (posterior to T4) |
| Fz | Frontal Midline |
| Pz | Parietal Midline |
| Cz | Central Midline |

- For most of the participants, four phases of EEG data were recorded. The first and third phases were recorded while the participant was at rest, the second while the participant performed an arithmetic task, and the fourth while the participant was subjected to frequency-modulated auditory stimuli. There are multiple trials for each participant.
- All EEG data was saved in the European Data Format (EDF), a common file standard for recording multichannel biological and physical data.
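
For readers unfamiliar with EDF, the sketch below shows how the fixed-width ASCII header at the start of an EDF file can be read with the Python standard library alone. It is an illustration of the file format, not the code used in this project; `EdfHeader`, `parse_edf_header`, and `build_demo_header` are hypothetical names, and the demo header is synthetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EdfHeader:
    version: str
    n_records: int
    record_duration_s: float
    labels: List[str]


def parse_edf_header(raw: bytes) -> EdfHeader:
    """Parse the EDF header: 256 fixed bytes plus 256 bytes per signal."""
    if len(raw) < 256:
        raise ValueError("an EDF header is at least 256 bytes long")

    def field(start: int, width: int) -> str:
        # Every header field is space-padded ASCII text.
        return raw[start:start + width].decode("ascii").strip()

    n_signals = int(field(252, 4))
    header_bytes = int(field(184, 8))
    if header_bytes != 256 * (n_signals + 1) or len(raw) < header_bytes:
        raise ValueError("header size does not match the number of signals")

    return EdfHeader(
        version=field(0, 8),
        n_records=int(field(236, 8)),
        record_duration_s=float(field(244, 8)),
        # Signal labels are the first per-signal block, 16 bytes each.
        labels=[field(256 + 16 * i, 16) for i in range(n_signals)],
    )


def build_demo_header(labels, n_records=10, record_duration_s=1.0) -> bytes:
    """Build a synthetic header for demonstration (not real recording data)."""
    ns = len(labels)

    def pad(value, width):
        return str(value).ljust(width).encode("ascii")

    fixed = (
        pad("0", 8)                    # format version
        + pad("X", 80)                 # patient identification
        + pad("X", 80)                 # recording identification
        + pad("01.01.00", 8)           # start date
        + pad("00.00.00", 8)           # start time
        + pad(256 * (ns + 1), 8)       # number of bytes in the header
        + pad("", 44)                  # reserved
        + pad(n_records, 8)            # number of data records
        + pad(record_duration_s, 8)    # duration of one data record (s)
        + pad(ns, 4)                   # number of signals
    )
    # Remaining per-signal fields (240 bytes per signal) are left blank here.
    per_signal = b"".join(pad(label, 16) for label in labels) + b" " * (240 * ns)
    return fixed + per_signal


header = parse_edf_header(build_demo_header(["Fp1", "Fp2", "Cz"]))
print(header.labels)
```

Malformed input raises `ValueError`, so a caller could catch that exception to skip files that do not parse. Whether this matches the project's own criterion for an invalid EDF file is an assumption.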
The following steps were taken, with their outputs written to the `processed_data` folder:
- Each participant's trial phase data was extracted and saved as a CSV file, which serves as one datapoint in the dataset. These files are stored in the `eeg_data` folder.
- All invalid EDF files were discarded.
- `Participants Trial Data.csv` lists all datapoints with their corresponding metadata (i.e. phase, trial).
- `Participants Data.csv` lists all participants with their respective categories (i.e. Patient or Control).
- Event marker data, i.e. the markers pertaining to the fourth phase, was retrieved and stored in the `event_markers` folder.
- In this project phase, statistical analysis was applied to each datapoint within specific EEG recording phases. Mean, median, variance, standard deviation, and range were computed, and their histograms were inspected.
- The goal was to identify a statistic with a concentrated datapoint distribution, indicating consistent measurements. Outliers, potentially indicative of erroneous EEG data, were targeted for removal.
- Given the sensitivity of EEG measurements, where precision is critical, outliers were considered possible instances of inaccuracies.
- Range emerged as the most fitting statistic: its histograms showed clearly concentrated regions, which made potential outliers easy to identify. Removing them improves the overall reliability of the EEG dataset, which is crucial for the subsequent analyses and machine learning models. The result of this step was stored in `Filtered Range Participant Trial Data.csv`.
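
The filtering idea can be sketched as follows. The exact cutoff used in the project came from inspecting histograms and is not stated here, so this sketch substitutes a common rule of thumb (Tukey's fences at 1.5 times the interquartile range) as an assumption. The function names and sample values are made up for illustration.

```python
import statistics


def signal_range(samples):
    """Range of one datapoint: maximum sample minus minimum sample."""
    return max(samples) - min(samples)


def filter_by_range(datapoints, k=1.5):
    """Keep datapoints whose range lies inside Tukey's fences.

    datapoints maps a datapoint name to its list of samples.
    Returns a dict mapping each kept name to its range.
    """
    ranges = {name: signal_range(s) for name, s in datapoints.items()}
    q1, _median, q3 = statistics.quantiles(list(ranges.values()), n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {name: r for name, r in ranges.items() if low <= r <= high}


# Made-up datapoints: six with similar ranges and one with an extreme range.
datapoints = {
    "dp1": [-49.0, 3.0, 49.0],
    "dp2": [-50.0, 0.0, 49.0],
    "dp3": [-50.0, 1.0, 50.0],
    "dp4": [-51.0, -2.0, 50.0],
    "dp5": [-51.0, 4.0, 51.0],
    "dp6": [-52.0, 0.0, 51.0],
    "dp7": [-500.0, 0.0, 450.0],
}
kept = filter_by_range(datapoints)
print(sorted(kept))
```

Here `dp7` falls outside the fences and is dropped, while the six datapoints with concentrated ranges are kept.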
The following features can be derived from the EEG data and have been found to correlate with a schizophrenia diagnosis:
- Fuzzy Entropy
  - Definition: Fuzzy entropy is a measure of the degree of irregularity or fuzziness in time series data.
  - Significance: Higher fuzzy entropy values indicate increased irregularity in the EEG signal, which may be associated with schizophrenia.
- MMN (Mismatch Negativity)
  - Definition: MMN measures the mismatch between the signal power in response to a baseline auditory stimulus, which is presented frequently, and the power in response to a deviant auditory stimulus, which is presented infrequently.
  - Significance: Deviations in MMN can reveal the brain's sensitivity to unexpected stimuli, a factor often associated with schizophrenia.
- Wave Power
  - Alpha (8 to 13 Hz): Detected in a restful state.
  - Beta (12 to 30 Hz): Detected during cognitive exercises, e.g., arithmetic.
  - Gamma (30 to 100+ Hz): Detected during auditory stimulation.
  - Significance: Aberrations in wave power, especially in specific frequency ranges, can provide insights into cognitive and sensory processing abnormalities linked to schizophrenia.
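
As a rough illustration of the wave power feature, the sketch below estimates the power in each band from a plain discrete Fourier transform, using only the standard library. It is not the implementation in the notebooks. The sampling rate, the test signal, and the name `band_powers` are assumptions, and the band edges are copied from the list above (so alpha and beta overlap between 12 and 13 Hz).

```python
import cmath
import math

# Band edges in Hz, as listed above.
BANDS_HZ = {"alpha": (8.0, 13.0), "beta": (12.0, 30.0), "gamma": (30.0, 100.0)}


def band_powers(samples, fs, bands=BANDS_HZ):
    """Sum one-sided periodogram power over the DFT bins inside each band."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]  # remove the DC offset
    powers = dict.fromkeys(bands, 0.0)
    for k in range(1, n // 2):  # skip DC; stay below the Nyquist bin
        freq = k * fs / n
        hits = [name for name, (lo, hi) in bands.items() if lo <= freq < hi]
        if not hits:
            continue
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        # With this scaling, a sine of amplitude A contributes A**2 / 2.
        power = 2.0 * abs(coeff) ** 2 / n ** 2
        for name in hits:
            powers[name] += power
    return powers


# Synthetic two-second signal at an assumed 128 Hz sampling rate:
# a 10 Hz, a 20 Hz, and a 40 Hz sine with amplitudes 2, 1, and 0.5.
fs = 128.0
n = 256
signal = [
    2.0 * math.sin(2 * math.pi * 10 * t / fs)
    + 1.0 * math.sin(2 * math.pi * 20 * t / fs)
    + 0.5 * math.sin(2 * math.pi * 40 * t / fs)
    for t in range(n)
]
powers = band_powers(signal, fs)
print(powers)
```

The direct DFT is quadratic in the signal length, which is fine for a short demo. For full recordings, an FFT-based estimate from a numerical library would be the more practical choice.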
- For each of the aforementioned features, a Jupyter notebook (named in snake case) implements the computation of that feature. The notebooks are in the `dataset_generators` folder.
- A combined dataset then joins all features per participant. Each participant with a complete set of features forms one record.
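
The combination step can be pictured with the minimal sketch below: a participant becomes a record only if every feature table has a value for them. The function name, field names, participant IDs, and values are all hypothetical, and the real notebooks may structure this differently.

```python
def combine_features(feature_tables, categories):
    """Join per-feature tables into one record per participant.

    feature_tables maps a feature name to {participant_id: value}.
    categories maps a participant_id to "Patient" or "Control".
    Participants missing any feature are left out.
    """
    complete = set(categories)
    for table in feature_tables.values():
        complete &= set(table)

    records = []
    for pid in sorted(complete):
        record = {"participant": pid, "category": categories[pid]}
        for name, table in feature_tables.items():
            record[name] = table[pid]
        records.append(record)
    return records


# Made-up values for three hypothetical participants.
categories = {"P01": "Patient", "P02": "Control", "P03": "Patient"}
feature_tables = {
    "fuzzy_entropy": {"P01": 0.91, "P02": 0.74, "P03": 0.88},
    "mmn": {"P01": -1.2, "P02": -2.5},  # P03 has no MMN value
    "alpha_power": {"P01": 3.4, "P02": 4.1, "P03": 3.0},
}
records = combine_features(feature_tables, categories)
print([r["participant"] for r in records])
```

In this example `P03` lacks an MMN value, so only `P01` and `P02` end up in the combined dataset.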
- For each of the features created in the previous section (Dataset generation), a Jupyter notebook implements a machine learning prediction pipeline for diagnosing schizophrenia. The same is done for the combined dataset.
The pipeline for the model is implemented as follows (using MLflow):
- An experiment is created, and the name of the feature(s) being used to predict schizophrenia is supplied along with a description. See the MLflow documentation for more information on what an experiment is.
- A run is created, which is where the actual model logging occurs. Each run has a description that comprises:
  - Model Description: The model being used.
  - Model Rationale: The reason behind the choice of model and of the selected parameters.
  - Dataset Description: Information about the dataset.
  - Dataset Rationale: A hypothesis as to what result the dataset would yield.
- The model is then run and the relevant metrics are recorded.
- The following data is logged: model and dataset parameters, datasets, metrics.
- The results are then interpreted and logged in a conclusion.
Information about the model implementations can be found in the `models` folder.
- To install the dependencies, run the command below (preferably in a virtual environment):
pip install -r requirements.txt
- Each notebook contains detailed explanations, both programmatic and mathematical, of each step taken for each feature.
- A comprehensive report is in the works.
- References will be included in that report.