esteinig / achilles
Artificial neural networks for taxonomic nanopore signal classification in host microbiomes *update soon* :deciduous_tree:
License: MIT License
Add Achilles logo.
Add dropout and test recurrent dropout, since the model is overfitting on the human 20 no norm 120k 400 40 minimal dataset.
Currently running on the Ubuntu interface of a GPU machine at Paperspace. Integrate with their Python API to run training directly from a command-line task.
Data generation in `Dataset` currently reads `max_samples_per_class` from Fast5 files and limits to the maximum per class. A more transparent approach would be to randomly sample `nb_samples` consecutive overlapping windows from `max_reads` Fast5 files containing raw signal arrays.
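The sampling scheme proposed above can be sketched in plain Python. This is a minimal illustration, not the `Dataset` implementation: the function name `sample_consecutive_windows` and the parameters `window_size` and `window_stride` are hypothetical, and the raw signal is assumed to be already loaded as a sequence of numbers.

```python
import random


def sample_consecutive_windows(signal, window_size, window_stride, nb_samples, rng=None):
    """Return nb_samples consecutive windows starting at a random offset.

    Windows overlap whenever window_stride < window_size.
    An empty list is returned if the signal is too short.
    (Hypothetical helper, not part of Achilles.)
    """
    rng = rng or random.Random()
    # Total signal length covered by the requested run of windows.
    span = window_size + (nb_samples - 1) * window_stride
    if nb_samples < 1 or len(signal) < span:
        return []
    start = rng.randint(0, len(signal) - span)  # upper bound is inclusive
    return [
        signal[start + i * window_stride: start + i * window_stride + window_size]
        for i in range(nb_samples)
    ]


# Example: five overlapping windows of 400 signal values, stride 40.
windows = sample_consecutive_windows(list(range(1000)), 400, 40, 5, random.Random(1))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.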
Currently, training data comes from passing Fast5 files only; try to include failing reads for real-world classification.
At least evaluate on failing reads too and see the effect on accuracy.
Add a random seed for reproducible results in data generation to the main options of `achilles --help`; also check whether this requires setting the seed for `random`, `Keras` and `Numpy` separately.
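On the question of separate seeds: Python's `random` module and NumPy keep independent generator state, so each needs its own call. The sketch below assumes a hypothetical helper `set_global_seed` and seeds NumPy only if it is installed. Keras draws its randomness from its backend, so a TensorFlow backend would need an additional seed call (for example `tf.random.set_seed` in TensorFlow 2.x), which is left out here.

```python
import random


def set_global_seed(seed):
    """Seed the independent random generators and report which were seeded.

    Hypothetical helper: the Keras backend is NOT seeded here and would
    need its own call.
    """
    seeded = ["random"]
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    return seeded


set_global_seed(42)
first = random.random()
set_global_seed(42)
second = random.random()  # same value as `first`
```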
The split currently happens independently, by performing the data extraction again for validation. While signal arrays from random scans of signal windows likely do not overlap much with the training data, we must make sure the validation data is independent.
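One way to guarantee independence is to split at the level of read files before any windows are extracted, so no read contributes to both sets. A minimal sketch under that assumption; `split_reads` and `validation_fraction` are hypothetical names, not existing Achilles options.

```python
import random


def split_reads(read_files, validation_fraction=0.2, seed=None):
    """Split read files into disjoint training and validation lists.

    Hypothetical helper: splitting by file means windows from one read
    can never end up in both sets.
    """
    files = sorted(set(read_files))  # deterministic order before shuffling
    random.Random(seed).shuffle(files)
    n_validation = int(len(files) * validation_fraction)
    return files[n_validation:], files[:n_validation]


reads = [f"read_{i}.fast5" for i in range(10)]
train, validation = split_reads(reads, validation_fraction=0.3, seed=0)
```

Windows would then be sampled from `train` and `validation` separately.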
Replaced custom list comprehension in `utils.transform_array_to_tensor` with `numpy.reshape` to generate the 4D-array input (first residual block in Achilles).
Make sure that raw signal reads map to the reference genome for clean training data. Perhaps a task for associating Fast5 with FASTQ reads (see Tombo) and mapping with `mappy`?
Generate a mixed dataset in appropriate proportions to simulate patient data (i.e. sequencing from septic patients and bloodstream infections) with human and bacterial reads in `Dataset`. If a multi-class classifier is possible, it would be good to test for taxonomic domains, i.e. human, virus, bacterial.
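Mixing reads in fixed proportions could look like the sketch below. The helper `mix_reads`, the class labels and the 90/10 proportions are illustrative assumptions only; realistic proportions for patient samples would have to come from data.

```python
import random


def mix_reads(reads_by_class, proportions, total, seed=None):
    """Draw `total` labelled reads according to per-class proportions.

    Hypothetical helper: raises ValueError if a class has too few reads.
    """
    rng = random.Random(seed)
    mixed = []
    for label, fraction in proportions.items():
        n = round(total * fraction)
        pool = reads_by_class[label]
        if n > len(pool):
            raise ValueError(f"not enough reads for class {label!r}")
        # Sample without replacement within each class.
        mixed.extend((read, label) for read in rng.sample(pool, n))
    rng.shuffle(mixed)
    return mixed


reads_by_class = {
    "human": [f"human_{i}.fast5" for i in range(200)],
    "bacterial": [f"bp_{i}.fast5" for i in range(50)],
}
mixed = mix_reads(reads_by_class, {"human": 0.9, "bacterial": 0.1}, total=100, seed=3)
```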
Priority. Annotate code properly.
Pre-release for internal testing, still to do:
The logging module needs to be implemented. Assignment of an integer after the log file name (`run_id.log.1` or `run_id.log.2`, with `run_id.log` the most recent) is not functional.
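The naming scheme described above matches what the standard library's `logging.handlers.RotatingFileHandler` produces, so one possible approach is to force a rollover at the start of each run. A minimal sketch, assuming a hypothetical helper `get_run_logger`; the `keep` parameter and the `achilles.` logger prefix are illustrative.

```python
import logging
import logging.handlers
import os


def get_run_logger(run_id, log_dir=".", keep=5):
    """Return a logger writing to run_id.log, rotating older runs.

    Hypothetical helper: an existing run_id.log becomes run_id.log.1,
    run_id.log.1 becomes run_id.log.2, and so on up to `keep` backups.
    """
    path = os.path.join(log_dir, f"{run_id}.log")
    logger = logging.getLogger(f"achilles.{run_id}")
    # Drop handlers from an earlier call so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    had_previous_log = os.path.exists(path)
    # maxBytes defaults to 0, so rollover only happens when forced below.
    handler = logging.handlers.RotatingFileHandler(path, backupCount=keep, delay=True)
    if had_previous_log:
        handler.doRollover()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `get_run_logger("run_id")` once per run keeps the most recent log in `run_id.log`, as described in the to-do item.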
The select utility task `asclepius select` for copying `n` Fast5 files from a recursive directory to a target directory needs a flag for selecting either random (default) or largest (`--largest`, `-l`) files.
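The selection logic behind such a flag might look like the following sketch. The function `select_fast5` and its parameters are hypothetical and do not reflect the actual `asclepius select` code; files with the same name in different subdirectories would overwrite each other in the target directory.

```python
import random
import shutil
from pathlib import Path


def select_fast5(source_dir, target_dir, n, largest=False, seed=None):
    """Copy n Fast5 files found recursively in source_dir to target_dir.

    Hypothetical helper: picks randomly by default, or the n largest
    files by size when largest=True. Returns the copied paths.
    """
    files = sorted(Path(source_dir).rglob("*.fast5"))
    if largest:
        chosen = sorted(files, key=lambda p: p.stat().st_size, reverse=True)[:n]
    else:
        chosen = random.Random(seed).sample(files, min(n, len(files)))
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(path, target / path.name)) for path in chosen]
```

A command-line flag such as `--largest` would then simply map to `largest=True`.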
Resample chromosome 14 and Mixed for independent evaluation data, since the chromosome 14 and mixed models were trained on the generated evaluation data and therefore accuracy and validation in current results are not independent.
Use the training validation split to make a new data set (.hdf5) that reads signals exclusively from the beginning (i.e. 400x400 windows along the signal from the beginning) to cheaply simulate data coming in from actual reads. Then later time it by simulating read generation with `Taepr` and replacing some with human reads perhaps?
Test the effect of normalizing the signal by subtracting the mean and dividing by the standard deviation, as in Chiron. Is normalization then also necessary over 400-signal (real-time) prediction windows? Do raw signal values without normalization prevent generalization between sequencing runs, or do they capture specific human or bacterial signal? Test with human datasets from different groups from nanopore-wgs-consortium/NA12878.
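The normalization under test is a plain z-score. A minimal sketch applied per window; `normalize_window` is a hypothetical name, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def normalize_window(window):
    """Subtract the mean and divide by the standard deviation.

    Hypothetical helper: a constant window (zero deviation) is mapped
    to all zeros to avoid division by zero.
    """
    mean = statistics.mean(window)
    std = statistics.pstdev(window, mu=mean)  # population std, an assumption
    if std == 0:
        return [0.0 for _ in window]
    return [(value - mean) / std for value in window]


normalized = normalize_window([480, 510, 495, 530, 470])
```

Whether the statistics should be computed per window or per read is exactly the open question above; this sketch only shows the per-window variant.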
Need to check for existence / validity of input files in `Terminal`.
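Since Fast5 and `.h5` files are HDF5 containers, a cheap validity check is to confirm the file exists and starts with the 8-byte HDF5 signature. This sketch uses a hypothetical helper `is_valid_hdf5` and only checks offset 0, so a file with an HDF5 user block would be rejected.

```python
from pathlib import Path

# Fixed signature at the start of an HDF5 file without a user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def is_valid_hdf5(path):
    """Return True if path is an existing file with the HDF5 signature.

    Hypothetical helper: checks offset 0 only and does not validate
    the file's internal structure.
    """
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A full check would open the file with an HDF5 library; the signature test just fails fast on missing or obviously wrong inputs.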
Reminder to implement the Fast5 watcher in `achilles predict`.
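A watcher could start as a simple polling loop over the output directory. The sketch below is one possible shape, not the planned implementation: `watch_fast5`, `poll_interval` and `max_polls` are hypothetical names, and a file may be reported while the sequencer is still writing it.

```python
import time
from pathlib import Path


def watch_fast5(directory, poll_interval=1.0, max_polls=None):
    """Yield each new Fast5 file under directory exactly once.

    Hypothetical helper: polls the directory tree repeatedly; runs
    forever unless max_polls is given.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in sorted(Path(directory).rglob("*.fast5")):
            if path not in seen:
                seen.add(path)
                yield path
        polls += 1
        # Skip the final sleep when the poll limit has been reached.
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)
```

Each yielded path could then be handed to the prediction step as reads arrive.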
Generate baseline dataset for experiments. Current working and decently performing data set is generated on 3000 B. pseudomallei and 2700+ human chromosome 20 reads with:
asclepius prep --dirs bp,human_chr20 --signal_length 400 --signal_stride 400 --data_file data.20.nn.rand100.400.400.200k.h5 -m 200000 -r
Currently model trains and validates on chromosome 20 (pt. 5) of the CEPH1463 nanopore raw signal data and our Burkholderia pseudomallei reference B04 from PNG. To validate that the model learned to differentiate human from bacterial signal in the raw signal values, the performing model needs to be tested against other chromosomes and ideally other B. pseudomallei sequence runs. First, test the recent model on other chromosome + B04 data.
Additional task `explore` -> `achilles explore --help`
Generates a Jupyter notebook for data exploration of training data, evaluations and predictions:
`DataFrame` constructor not properly called with `history` output object from `fit_generator` for training and validation.