Repository for the Biological Data course project, Master's Degree in Data Science at the University of Padua.
All the required Python packages can be installed by running
pip install -r requirements.txt
from the root folder of the project.
All the remaining operations were executed on a Linux x64 machine by launching the bash scripts inside the data folder.
The databases needed to execute the code are not included in the repository due to their size; they are hosted in this OneDrive folder. After downloading them, place them inside the data/part_2/original_datasets folder.
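As a quick sanity check after the download, something like the following sketch can report which files are still missing from the datasets folder. The helper name missing_datasets and the file name in the usage line are hypothetical; the actual database file names are the ones in the OneDrive folder.

```python
from pathlib import Path


def missing_datasets(folder, expected_names):
    """Return the names from expected_names that are not present in folder.

    Hypothetical helper: expected_names must be filled in with the real
    database file names, which this sketch does not know.
    """
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).exists()]
```

For example, missing_datasets("data/part_2/original_datasets", ["example_db.fasta"]) returns an empty list once every listed file is in place (example_db.fasta is a placeholder name).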
Since computing all the metrics for all the models is quite time-consuming, we computed them only once, saved the results to .csv files, and simply read them back in the notebook. To recompute all the metrics from scratch and verify the computations, delete all the data in the parsed subfolders in data/part_1/HMMs and data/part_1/PSSMs.
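The compute-once-then-read approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code used in the notebook; load_or_compute and compute_fn are hypothetical names, and compute_fn is assumed to return a non-empty list of dicts sharing the same keys.

```python
import csv
from pathlib import Path


def load_or_compute(csv_path, compute_fn):
    """Return cached rows from csv_path, computing and saving them if absent.

    compute_fn is a hypothetical callable returning a non-empty list of
    dicts (for instance one per model), all with the same keys.
    """
    csv_path = Path(csv_path)
    if not csv_path.exists():
        # No cached results yet: run the expensive computation and save it.
        rows = compute_fn()
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        with csv_path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    # Always read from the file so both paths return the same
    # string-valued rows.
    with csv_path.open(newline="") as fh:
        return list(csv.DictReader(fh))
```

With this pattern, deleting the saved .csv file is enough to force a recomputation on the next call, which mirrors the effect of clearing the parsed subfolders.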
The main file of the project is Project.ipynb (available here); all the steps we performed can be followed and re-executed there.
report.pdf contains an in-depth explanation of what was done during the project, together with the interpretation of our results.
The code folder contains all the Python scripts used in the Jupyter notebook.
The data folder contains all the files and bash scripts used by the Jupyter notebook, as well as the files it saves.