
achilles's People

Contributors

esteinig

achilles's Issues

Paperspace GPU node integration

Currently running on the Ubuntu interface of a GPU machine at Paperspace. Integrate with their Python API to run training directly from a command-line task.

Reading raw signal limits (.fast5)

Data generation in Dataset currently reads max_samples_per_class from Fast5 files and limits to the maximum per class. It would be more transparent to randomly sample nb_samples consecutive overlapping windows from max_reads Fast5 files containing raw signal arrays.
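
The proposed sampling could look roughly like the sketch below. It is only an illustration: the function name sample_windows and the parameters window_size and window_step are assumptions, not part of the existing Dataset code, and reading the signal array out of a Fast5 file is left out.

```python
import numpy as np


def sample_windows(signal, nb_samples, window_size=400, window_step=40, rng=None):
    """Sample nb_samples consecutive overlapping windows from one raw signal array.

    Hypothetical helper: names and defaults are illustrative only.
    Returns an array of shape (nb_samples, window_size), or None if the
    read is too short to hold the requested run of windows.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal)

    # Total length covered by nb_samples windows, each shifted by window_step.
    span = window_size + (nb_samples - 1) * window_step
    if signal.size < span:
        return None

    # Random start position such that the whole run fits inside the read.
    start = rng.integers(0, signal.size - span + 1)
    offsets = start + window_step * np.arange(nb_samples)
    return np.stack([signal[o:o + window_size] for o in offsets])
```

Windows overlap whenever window_step is smaller than window_size; setting both to the same value gives adjacent, non-overlapping windows.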

Training on pass and fail reads

Currently training data comes from passing Fast5 files only; try to include failing reads for real-world classification.

At least evaluate on failing reads too, to see the effect on accuracy.

Random seeds

Add a random seed for reproducible data generation to the main options of achilles --help. Also check whether the seed needs to be set separately for random, Keras and NumPy.
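
A minimal sketch of seeding the generators separately is shown below. Python's random module and NumPy's legacy global generator keep independent state, so each needs its own call. The helper name set_seeds and the seed_tensorflow switch are assumptions for illustration, and the TensorFlow branch assumes a TensorFlow 2 backend for Keras.

```python
import random

import numpy as np


def set_seeds(seed, seed_tensorflow=False):
    """Seed the random number generators used during data generation.

    Hypothetical helper, not part of the achilles code base.
    """
    random.seed(seed)      # Python standard library generator
    np.random.seed(seed)   # NumPy legacy global generator

    if seed_tensorflow:
        # Assumption: Keras runs on TensorFlow 2, where tf.random.set_seed
        # sets the global graph-level seed.
        import tensorflow as tf
        tf.random.set_seed(seed)
```

Seeding alone may not make GPU training fully deterministic, so this mainly covers the data generation step.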

Speedup prediction

Replaced custom list comprehension in utils.transform_array_to_tensor with numpy.reshape to generate the 4D-array input (first residual block in Achilles).
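
For illustration, the replacement can be sketched as follows. The function name windows_to_tensor and the (nb_windows, 1, window_size, 1) layout are assumptions; the actual shape expected by the first residual block may differ.

```python
import numpy as np


def windows_to_tensor(windows):
    """Reshape a 2D array of signal windows into a 4D tensor.

    Hypothetical sketch: assumes input of shape (nb_windows, window_size)
    and a target layout of (nb_windows, 1, window_size, 1).
    """
    windows = np.asarray(windows)
    nb_windows, window_size = windows.shape
    return np.reshape(windows, (nb_windows, 1, window_size, 1))


# Equivalent nested list comprehension, shown for comparison only.
def windows_to_tensor_slow(windows):
    return np.array([[[[value] for value in window]] for window in windows])
```

Reshaping a contiguous array does not loop over values in Python, which is where the speedup over the list comprehension comes from.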

Task for cleaning read data

Make sure that raw signal reads map to the reference genome for clean training data. Add a task for associating Fast5 with FASTQ reads (see Tombo) and mapping, perhaps with mappy?

Mixed datasets for simulating patient samples

Generate a mixed dataset in appropriate proportions to simulate patient data (e.g. sequencing from septic patients and bloodstream infections) with human and bacterial reads in Dataset. If a multi-class classifier is possible, it would be good to test for taxonomic domains, e.g. human, virus, bacterial.
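
One way to draw such a mixture is sketched below. The function mix_reads, its arguments, and the example proportions are all hypothetical; realistic human-to-bacterial ratios for patient samples would have to come from actual data.

```python
import random


def mix_reads(reads_by_class, proportions, total, seed=None):
    """Draw a shuffled list of (read, label) pairs in the given proportions.

    Hypothetical helper: reads_by_class maps a label to a list of read
    paths, proportions maps the same labels to fractions summing to 1.
    """
    rng = random.Random(seed)
    mixed = []
    for label, fraction in proportions.items():
        count = round(total * fraction)
        pool = reads_by_class[label]
        if count > len(pool):
            raise ValueError(f"Not enough reads for class {label!r}")
        # Sample without replacement so no read is used twice.
        mixed.extend((read, label) for read in rng.sample(pool, count))
    rng.shuffle(mixed)
    return mixed
```

For example, proportions of {"human": 0.9, "bacterial": 0.1} with total=50 would yield 45 human and 5 bacterial reads in random order.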

Pre-release

Pre-release for internal testing, still to do:

  • Implement production env ✔️
  • Product of prediction slices
  • Live watcher Fast5 ✔️
  • Poremongo release (alpha) on PyPI
  • Experiment interface
  • Collections collector to create deployable YAML
  • Base unittests
  • Poremongo config CLI

Select task, random or largest files

The utility task asclepius select, which copies n Fast5 files from a recursive directory to a target directory, needs a flag to select either random (default) or largest (--largest, -l) files.
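
The selection logic behind such a flag could be sketched like this, using only the standard library. The function select_fast5 and its parameters are assumptions for illustration and do not reflect the existing task implementation.

```python
import random
import shutil
from pathlib import Path


def select_fast5(source_dir, target_dir, n, largest=False, seed=None):
    """Copy n Fast5 files from source_dir (searched recursively) to target_dir.

    Hypothetical sketch: picks files at random by default, or the n
    largest files by size when largest=True.
    """
    files = sorted(Path(source_dir).rglob("*.fast5"))
    if largest:
        chosen = sorted(files, key=lambda p: p.stat().st_size, reverse=True)[:n]
    else:
        chosen = random.Random(seed).sample(files, min(n, len(files)))

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        # Files with the same name in different subdirectories would
        # overwrite each other here; a real task may need to handle that.
        shutil.copy2(path, target / path.name)
    return chosen
```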

Resample chromosomes for independent evaluation data

Resample chromosome 14 and Mixed for independent evaluation data, since the chromosome 14 and mixed models were trained on the generated evaluation data and therefore accuracy and validation in current results are not independent.

Evaluation dataset

Use the training validation split to make a new dataset (.hdf5) that reads signals exclusively from the beginning (i.e. 400x400 windows along the signal from the beginning) to cheaply simulate data coming in from actual reads. Later, time it by simulating read generation with Taepr, perhaps replacing some reads with human reads?

Signal normalization

Test the effect of normalizing the signal by subtracting the mean and dividing by the standard deviation, as in Chiron. Is normalization then also necessary over the 400-signal (real-time) prediction windows? Do raw signal values without normalization prevent generalization between sequencing runs, or do they capture specific human or bacterial signal? Test with human datasets from different groups from nanopore-wgs-consortium/NA12878.
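
The two variants in question, normalizing over a whole read versus over each prediction window, can be sketched as below. The function names and the small eps guard against division by zero are assumptions added for illustration.

```python
import numpy as np


def normalize_signal(signal, eps=1e-8):
    """Standardize a full raw signal: subtract its mean, divide by its standard deviation."""
    signal = np.asarray(signal, dtype=np.float64)
    return (signal - signal.mean()) / (signal.std() + eps)


def normalize_windows(windows, eps=1e-8):
    """Standardize each window separately.

    Assumes windows has shape (nb_windows, window_size); statistics are
    computed per row, as would be needed for real-time prediction where
    the rest of the read is not yet available.
    """
    windows = np.asarray(windows, dtype=np.float64)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    return (windows - mean) / (std + eps)
```

Per-window statistics are noisier than per-read statistics, which is one reason the two may affect accuracy differently.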

Input file checks

Need to check for existence / validity of input files in Terminal.
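
A basic check could look like the sketch below. It assumes the inputs are HDF5-based files (Fast5 or .h5) and that the HDF5 signature sits at the start of the file, which is the common case but not guaranteed when a user block is present. The function name check_input_file is hypothetical.

```python
from pathlib import Path

# First eight bytes of an HDF5 file whose superblock starts at offset 0.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def check_input_file(path):
    """Return the path if it exists and looks like an HDF5 file, else raise."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    with path.open("rb") as handle:
        if handle.read(8) != HDF5_SIGNATURE:
            raise ValueError(f"Not an HDF5 file (bad signature): {path}")
    return path
```

Running such a check in Terminal before any task starts would turn a late failure inside training or prediction into an immediate, readable error.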

Baseline data set

Generate a baseline dataset for experiments. The current working and decently performing dataset is generated from 3000 B. pseudomallei and 2700+ human chromosome 20 reads with:

asclepius prep --dirs bp,human_chr20 --signal_length 400 --signal_stride 400 --data_file data.20.nn.rand100.400.400.200k.h5 -m 200000 -r

Test model on different chromosomes for validation

Currently the model trains and validates on chromosome 20 (pt. 5) of the CEPH1463 nanopore raw signal data and our Burkholderia pseudomallei reference B04 from PNG. To validate that the model learned to differentiate human from bacterial signal in the raw signal values, the performing model needs to be tested against other chromosomes and ideally other B. pseudomallei sequencing runs. First, test the recent model on data from another chromosome plus B04.

Data exploration task

Additional task explore -> achilles explore --help

Generates Jupyter notebook for data exploration of training data, evaluations and predictions:

  • training data plots - distribution of pA values in reads, number and read length of sampled Fast5
  • evaluations against randomly and independently sampled datasets, percentage accuracy / loss tables
  • prediction, confusion matrices, run predictions in simulated runs, runtimes
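
As a rough illustration of the confusion matrices mentioned above, the sketch below builds one with NumPy from integer class labels. The function name and the label encoding (for example 0 for bacterial, 1 for human) are assumptions, not part of the planned notebook.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, nb_classes):
    """Count predictions per (true class, predicted class) pair.

    Rows are true classes, columns are predicted classes; labels are
    assumed to be integers in the range 0..nb_classes-1.
    """
    matrix = np.zeros((nb_classes, nb_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    return np.trace(matrix) / matrix.sum()
```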
