esteinig / achilles

Artificial neural networks for taxonomic nanopore signal classification in host microbiomes *update soon* :deciduous_tree:

License: MIT License

Python 99.36% Dockerfile 0.64%

achilles's Issues

Add logo

add achilles logo

Adding dropout and recurrent dropout to bidirectional LSTM

Add dropout and test recurrent dropout, since the model is overfitting on the human 20 no norm 120k 400 40 minimal dataset.

Paperspace GPU node integration

Currently running on the Ubuntu interface of a GPU machine at Paperspace. Integrate with their Python API to run training directly from a command-line task.

Reading raw signal limits (.fast5)

Data generation in Dataset currently reads max_samples_per_class from Fast5 files and limits to the maximum per class. It would be more transparent to randomly sample nb_samples consecutive overlapping windows from max_reads Fast5 files containing raw signal arrays.
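The proposed sampling scheme could look roughly like the sketch below. It is not the Dataset implementation: the function names, the window_size and window_step parameters, and the use of plain Python lists in place of signal arrays read from Fast5 files are all assumptions made for illustration.

```python
import random


def sample_consecutive_windows(signal, nb_samples, window_size, window_step, rng):
    """Take nb_samples consecutive windows starting at a random offset in one read."""
    # Total signal length covered by nb_samples windows spaced window_step apart.
    span = window_size + (nb_samples - 1) * window_step
    if len(signal) < span:
        return []  # read too short to yield the requested number of windows
    start = rng.randrange(len(signal) - span + 1)
    return [
        signal[start + i * window_step: start + i * window_step + window_size]
        for i in range(nb_samples)
    ]


def sample_reads(signals, max_reads, nb_samples, window_size, window_step, seed=None):
    """Pick up to max_reads reads at random and sample windows from each of them."""
    rng = random.Random(seed)
    chosen = rng.sample(signals, min(max_reads, len(signals)))
    windows = []
    for signal in chosen:
        windows.extend(
            sample_consecutive_windows(signal, nb_samples, window_size, window_step, rng)
        )
    return windows
```

Windows overlap whenever window_step is smaller than window_size. Reads that are too short are skipped here, which is one possible policy among several.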

Training on pass and fail reads

Currently training data comes from passing Fast5 files only; try to include failing reads for real-world classification.

At least evaluate on failing reads too, to see the effect on accuracy.

Random seeds

Add a random seed for reproducible results in data generation to the main options of achilles --help. Also check whether this requires setting seeds for random, Keras and NumPy separately.
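A minimal sketch of such a helper follows, assuming a hypothetical function name seed_everything. Python's random module and NumPy keep separate generator state, so each is seeded explicitly; the Keras backend keeps its own state as well and would need its own call, which is left as a comment because the right call depends on the installed backend version.

```python
import random


def seed_everything(seed):
    """Seed the random number generators that data generation may touch."""
    random.seed(seed)
    try:
        import numpy as np  # optional here: only seeded if NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass
    # The Keras backend (e.g. TensorFlow) holds separate RNG state and must be
    # seeded through its own API; that call is deliberately not guessed here.
    return seed
```

Calling the helper once at start-up, with the value taken from a command-line option, makes repeated runs draw the same random windows.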

Splitting data into training and validation after data extraction

The split currently occurs independently: the validation step performs its own data extraction. While signal arrays from random scans of signal windows likely do not overlap much with the training data, the validation data must be guaranteed to be independent.
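One way to guarantee independence is to split at the level of reads (files) before any windows are extracted, so that no read contributes to both sets. The sketch below assumes a hypothetical helper split_reads operating on a list of read identifiers; it is an illustration rather than the project's code.

```python
import random


def split_reads(read_ids, validation_fraction=0.2, seed=None):
    """Shuffle read identifiers and split them into disjoint training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    # The first n_validation reads go to validation, the remainder to training.
    return shuffled[n_validation:], shuffled[:n_validation]
```

Windows are then extracted separately from each list, which rules out overlap between training and validation signal by construction.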

Speedup prediction

Replaced custom list comprehension in utils.transform_array_to_tensor with numpy.reshape to generate the 4D-array input (first residual block in Achilles).

Task for cleaning read data

Make sure that raw signal reads map to the reference genome for clean training data. Add a task for associating Fast5 with FASTQ reads (see Tombo), perhaps mapping with mappy?

Mixed datasets for simulating patient samples

Generate a mixed dataset in appropriate proportions to simulate patient data (e.g. sequencing from septic patients and bloodstream infections) with human and bacterial reads in Dataset. If a multi-class classifier is possible, it would be good to test for taxonomic domains, e.g. human, viral, bacterial.
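Mixing reads in fixed proportions might be sketched as follows. The function name mix_reads, the label strings and the example proportions are assumptions for illustration only; real proportions would have to come from the clinical scenario being simulated.

```python
import random


def mix_reads(reads_by_class, proportions, total, seed=None):
    """Draw about `total` labelled reads so that each class matches its target proportion."""
    rng = random.Random(seed)
    mixed = []
    for label, fraction in proportions.items():
        n = round(total * fraction)  # per-class counts may not sum exactly to total
        pool = reads_by_class[label]
        if n > len(pool):
            raise ValueError("not enough reads for class %r" % label)
        mixed.extend((read, label) for read in rng.sample(pool, n))
    rng.shuffle(mixed)  # interleave classes as they would appear in a mixed sample
    return mixed
```

For example, mix_reads(reads, {"human": 0.9, "bacterial": 0.1}, total=1000) would draw 900 human and 100 bacterial reads. A third class such as "viral" can be added to the proportions without changing the function.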

Code annotations

Priority. Annotate code properly.

Pre-release

Pre-release for internal testing, still to do:

Implement production env ✔️
Product of prediction slices
Live watcher Fast5 ✔️
Poremongo release (alpha) on PyPI
Experiment interface
Collections collector to create deployable YAML
Base unittests
Poremongo config CLI

Fix logging integer assignment and implement logging module

The logging module needs to be implemented. Assignment of the integer suffix after the log file name (run_id.log.1 or run_id.log.2, with run_id.log the most recent) is not functional.
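The naming scheme described here (run_id.log as the most recent file, with older logs shifted to .1, .2 and so on) matches what the standard library's logging.handlers.RotatingFileHandler produces on rollover. The sketch below forces one rollover per run instead of rotating by size; the function name get_run_logger, the logger name prefix and the backup_count default are assumptions for illustration.

```python
import logging
import logging.handlers
import os


def get_run_logger(run_id, log_dir=".", backup_count=5):
    """Return a logger writing to run_id.log, shifting any previous log to run_id.log.1."""
    path = os.path.join(log_dir, "%s.log" % run_id)
    logger = logging.getLogger("achilles.%s" % run_id)
    logger.setLevel(logging.INFO)
    # Close handlers left over from an earlier call so the file can be renamed.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    had_previous_log = os.path.isfile(path) and os.path.getsize(path) > 0
    # maxBytes=0 disables size-based rotation; rotation happens once per run instead.
    handler = logging.handlers.RotatingFileHandler(path, maxBytes=0, backupCount=backup_count)
    if had_previous_log:
        handler.doRollover()  # run_id.log -> run_id.log.1, run_id.log.1 -> run_id.log.2, ...
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Logs beyond backup_count are discarded by the handler, so the value bounds how many earlier runs are kept.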

Select task, random or largest files

The utility task asclepius select, which copies n Fast5 files from a directory (searched recursively) to a target directory, needs a flag for selecting either random (default) or largest (--largest, -l) files.
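The selection logic behind such a flag could be sketched like this. The function select_fast5 and its parameters are hypothetical and are not the asclepius implementation; files are matched by the .fast5 extension.

```python
import random
import shutil
from pathlib import Path


def select_fast5(source_dir, target_dir, n, largest=False, seed=None):
    """Copy n Fast5 files found recursively in source_dir into target_dir."""
    files = sorted(Path(source_dir).rglob("*.fast5"))
    if largest:
        # Largest files first, judged by size on disk.
        chosen = sorted(files, key=lambda p: p.stat().st_size, reverse=True)[:n]
    else:
        chosen = random.Random(seed).sample(files, min(n, len(files)))
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        # Files with the same name in different subdirectories would overwrite each other.
        shutil.copy2(path, target / path.name)
    return chosen
```

Random selection stays the default, and largest=True corresponds to the proposed --largest, -l switch.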

Resample chromosomes for independent evaluation data

Resample chromosome 14 and Mixed for independent evaluation data, since the chromosome 14 and mixed models were trained on the generated evaluation data and therefore accuracy and validation in current results are not independent.

Evaluation dataset

Use the training validation split to make a new data set (.hdf5) that reads signals exclusively from the beginning (i.e. 400x400 windows along the signal from beginning) to cheaply simulate data coming in from actual reads. Then later time it by simulating read generation with Taepr and replacing some with human reads perhaps?

Signal normalization

Test the effect of normalizing signal by subtracting the mean and dividing by the standard deviation, as in Chiron. Is normalization then also necessary over 400 signal (real-time) prediction windows? Do raw signal values without normalization prevent generalization between sequencing runs, or do they capture specific human or bacterial signal? Test with human datasets from different groups from nanopore-wgs-consortium/NA12878.
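The normalization under test is a z-score. A minimal standard-library sketch follows; the function name is hypothetical, and using the population standard deviation is an assumption, since the issue does not say which estimator to use.

```python
import statistics


def normalize_signal(signal):
    """Subtract the mean and divide by the standard deviation of the given values."""
    mean = statistics.mean(signal)
    std = statistics.pstdev(signal)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in signal]  # constant signal: avoid division by zero
    return [(value - mean) / std for value in signal]
```

The same function can be applied to a whole read or to a single 400-signal window, which is the comparison the issue raises: per-window statistics are available in real time, while per-read statistics need the complete read.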

Input file checks

Need to check for existence / validity of input files in Terminal.
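With argparse, one common pattern is to validate paths at parse time through a custom type function, so a missing file is reported before any work starts. In the sketch below the option name --input and the helper names are hypothetical and are not taken from the achilles command-line interface.

```python
import argparse
from pathlib import Path


def existing_file(value):
    """argparse type: accept only paths that point to an existing regular file."""
    path = Path(value)
    if not path.is_file():
        raise argparse.ArgumentTypeError("input file not found: %s" % value)
    return path


def build_parser():
    parser = argparse.ArgumentParser(prog="achilles-sketch")
    # --input is a hypothetical option name used only for this illustration.
    parser.add_argument("--input", type=existing_file, required=True)
    return parser
```

When the check fails, argparse prints the message together with the usage line and exits with status 2. Validity checks beyond existence, such as whether the file opens as HDF5, would need a further step.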

Real-time analysis option for predict CLI

Reminder to implement the Fast5 watcher in achilles predict

Baseline data set

Generate baseline dataset for experiments. Current working and decently performing data set is generated on 3000 B. pseudomallei and 2700+ human chromosome 20 reads with:

asclepius prep --dirs bp,human_chr20 --signal_length 400 --signal_stride 400 --data_file data.20.nn.rand100.400.400.200k.h5 -m 200000 -r

Test model on different chromosomes for validation

Currently the model trains and validates on chromosome 20 (pt. 5) of the CEPH1463 nanopore raw signal data and our Burkholderia pseudomallei reference B04 from PNG. To validate that the model learned to differentiate human from bacterial signal in the raw signal values, the performing model needs to be tested against other chromosomes and ideally other B. pseudomallei sequencing runs. First, test the recent model on data from other chromosomes plus B04.

Data exploration task

Additional task explore -> achilles explore --help

Generates Jupyter notebook for data exploration of training data, evaluations and predictions:

training data plots - distribution of pA values in reads, number and read length of sampled Fast5
evaluations against randomly and independently sampled datasets, percentage accuracy / loss tables
prediction, confusion matrices, run predictions in simulated runs, runtimes

Bug writing output history with Pandas

The DataFrame constructor is not called properly with the history output object from fit_generator for training and validation.