Bonito


Bonito is an open source research basecaller for Oxford Nanopore reads.

For anything other than basecaller training or method development, please use dorado.

$ pip install --upgrade pip
$ pip install ont-bonito
$ bonito basecaller [email protected] /data/reads > basecalls.bam

Bonito supports writing aligned/unaligned {fastq, sam, bam, cram}.

$ bonito basecaller [email protected] --reference reference.mmi /data/reads > basecalls.bam

Bonito will download and cache the basecalling model automatically on first use, but all models can be downloaded with:

$ bonito download --models --show  # show all available models
$ bonito download --models         # download all available models

Modified Bases

Modified base calling is handled by Remora.

$ bonito basecaller [email protected] /data/reads --modified-bases 5mC --reference ref.mmi > basecalls_with_mods.bam

See available modified base models with the remora model list_pretrained command.
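
For example, to list them from the command line:

$ remora model list_pretrained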

Training your own model

To train a model using your own reads, first basecall the reads with the additional --save-ctc flag and use the output directory as the input directory for training.

$ bonito basecaller dna_r9.4.1 --save-ctc --reference reference.mmi /data/reads > /data/training/ctc-data/basecalls.sam
$ bonito train --directory /data/training/ctc-data /data/training/model-dir

In addition to training a new model from scratch, you can also easily fine-tune one of the pretrained models.

$ bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory /data/training/ctc-data /data/training/fine-tuned-model

If you are interested in method development and don't have your own set of reads, a pre-prepared set is provided.

$ bonito download --training
$ bonito train /data/training/model-dir

All training calls use Automatic Mixed Precision to speed up training. To disable this, pass the --no-amp flag.
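
For example, reusing the training directories from above:

$ bonito train --no-amp --directory /data/training/ctc-data /data/training/model-dir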

Developer Quickstart

$ git clone https://github.com/nanoporetech/bonito.git  # or fork first and clone that
$ cd bonito
$ python3 -m venv venv3
$ source venv3/bin/activate
(venv3) $ pip install --upgrade pip
(venv3) $ pip install -e .

Interface

  • bonito view - view a model architecture for a given .toml file and the number of parameters in the network (see the example below).
  • bonito train - train a bonito model.
  • bonito evaluate - evaluate a model's performance.
  • bonito download - download pretrained models and training datasets.
  • bonito basecaller - basecaller (.fast5 -> .bam).
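
For example (the config path below is a placeholder):

$ bonito view /path/to/config.toml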

Licence and Copyright

(c) 2019 Oxford Nanopore Technologies Ltd.

Bonito is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0. If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.


bonito's Issues

bonito with openvino error

Hello,

I followed the installation instructions, but with a few modifications (I use conda envs):

conda create --name bonito python=3.7
conda activate bonito
git clone -b openvino https://github.com/dkurt/bonito

After pip-installing the requirements I get this error message:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.                                                                                                                                                                           
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.                                                                                                                                                                                                                                                                  
openvino-python 2021.1 requires numpy==1.16.3, but you'll have numpy 1.19.2 which is incompatible. 

when I do "bonito download --all" I obtain this output:

[downloading models]
[downloaded dna_r9.4.1.zip]                                                                         
[downloading training data]
[downloaded dna_r9.4.1.hdf5]                                                                        
[converting dna_r9.4.1.hdf5]
> preparing training chunks

1250345it [05:31, 3774.90it/s]
Traceback (most recent call last):
  File "/home/sysfate/.conda/envs/bonito/venv3/bin/bonito", line 11, in <module>
    load_entry_point('ont-bonito', 'console_scripts', 'bonito')()
  File "/home/sysfate/.conda/envs/bonito/bonito/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/home/sysfate/.conda/envs/bonito/bonito/bonito/download.py", line 93, in main
    File(__models__, train).download()
  File "/home/sysfate/.conda/envs/bonito/bonito/bonito/download.py", line 68, in download
    convert(args)
  File "/home/sysfate/.conda/envs/bonito/bonito/bonito/convert.py", line 118, in main
    training_chunks = filter_chunks(training_chunks)
  File "/home/sysfate/.conda/envs/bonito/bonito/bonito/convert.py", line 87, in filter_chunks
    mu, sd = np.mean(chunks.target_lengths), np.std(chunks.target_lengths)
AttributeError: 'ChunkDataSet' object has no attribute 'target_lengths'

Do you have any idea why I get this error?
Maybe when I install openvino, the numpy version is wrong? Or something else?

RuntimeError: CUDA out of memory.

Dear @iiSeymour

I have been continuing to train bonito models with some new datasets. To do so I have used subsets of a mix of bacterial species data (no longer a single dataset), with a comparable amount of training input (500,000,000 base pairs). When running the bonito train command on our HPC facility, I get a RuntimeError: CUDA out of memory. Do you have any estimate of the memory consumption/requirements for Bonito training? I have already used the --multi-gpu flag and tried 2 GPUs, but I get the same error message. Our HPC facility has 4x NVIDIA Volta V100 GPUs (32 GB GPU memory) on each of 10 nodes.

Thank you in advance.
Best regards,
Nick Vereecke

"CUDA out of memory" error

Hi,

I've encountered "CUDA out of memory" error during running bonita. Below is the error message.

RuntimeError: CUDA out of memory. Tried to allocate 1.52 GiB (GPU 0; 5.80 GiB total capacity; 2.30 GiB already allocated; 1.45 GiB free; 1.21 GiB cached)

My command was as follows:
bonito basecaller /path_to_fast5/ ./bonito/models/quartz5x5-s3-4000/

I'm running bonito on Ubuntu 18.04 equipped with GTX 1660 (NVidia Driver Version: 430.50 & CUDA Version: 10.1).

Are there any solutions for this error?

Thanks.

rebasecalling best practice

Hi,

More of a general question than an issue here.

While rebasecalling PromethION runs, should reads from failed fast5s also be rebasecalled, or should one only take the pass fast5 files?
As the fail/pass criterion is most likely based on the output of the current basecaller, Guppy, I would think fail fast5 reads should also be recalled.
Although it will be difficult to distinguish bad-quality reads from high-quality reads among the basecalled fail fast5 reads, as bonito does not report quality scores yet.

Do you think a general filter for read-length (>4000 bp) would make sense?

Thanks,
Michel

Any plans for outputting raw signal based on basecalls?

One of the great things about CTC is the signal segmentation. I'm not certain whether this is possible in Bonito, but are there any plans, or is there a way, to extract the raw signal chunks corresponding to specific subsequences in the basecalled data? It would be really useful for a lot of different applications.

Pair consensus decoding for sequences with the same UMI

Hello,

I'll soon be getting RNA-seq data with UMIs, and I was interested in running pair consensus decoding on the data. I noticed the paper's and bonito's implementation takes the reverse complement, since it's assumed the pair comes from the same molecule.
In my case it's just PCR duplicates, so would it be possible to add an option to not take the reverse complement? I'm assuming I can comment out line #149 in pair.py, or is there more to it?
Also, since I'm detecting UMIs with more than two copies, could the current model take the consensus of multiple reads during decoding? I'm assuming the runtime will be quadratic in the number of reads, but the read lengths are shorter since these are transcripts. Otherwise I was planning to run pair decoding on random pairs within each UMI pool and then take the POA of the decoded pairs.

Thanks,
Chang

Pre-trained models

Hi all,

I can't see any pre-trained models for human DNA. Did I miss them somewhere or do you plan to release some soon?

Thanks,
Liam

First Issue

How does a lame Mac user run this sans CUDA, i.e. CPU only?

File "/Users/cbrown/Dropbox/dev/python/bonito/bonito/util.py", line 34, in init
assert(torch.cuda.is_available())
AssertionError

CUDA implementation for training requires improvement

Hi @iiSeymour ,

Over the weekend, I've redesigned the training implementation for Bonito in my fork. The basic improvement was to use the DataParallel class from PyTorch. I've also changed the parameters for my convenience.

You can use AMP and DataParallel, which allows increasing the batch size from 64 to 384 on the 8-GPU machine; the ETA per epoch is ~50 minutes as opposed to ~6 hours without DataParallel.

[Runtime estimation and GPU usage screenshots were attached to the original issue.]

I think the best approach would be to use AMP with DistributedDataParallel to increase the training speed further. But first, I want to make sure that this method works.

To that end, could you please let me know which species these reads are from, so I can validate the models on a holdout species? Also, please feel free to close the issue.

Event length filter

Would we be able to customize the event length filter in bonito? It feels like it is skipping many short reads in the run.

Sequencing Summary File

Hi, bonito cannot generate a sequencing_summary.txt like Guppy does. Is there any way to generate this file?

Many thanks!

Speed up data loading and shuffling

Hi Chris,

I am currently setting up bonito for a very large chunks.npy file (>100 GB) and realized that the load_data() routine takes quite a while on my setup, especially due to shuffling the mmapped chunks file.

So while this is probably not the usual input size, maybe it's worthwhile considering speeding up the process?
The following takes something like 25 minutes instead of 5+ hours for loading and in-place shuffling over a network file system.

Best
Robert

import os

import numpy as np


def load_data(shuffle=False, limit=None, directory=None, validation=False):
    """
    Load the training data
    """
    if directory is None:
        directory = default_data  # module-level default data directory (defined elsewhere in bonito)

    # use the validation subdirectory if requested and present
    if validation and os.path.exists(os.path.join(directory, 'validation')):
        directory = os.path.join(directory, 'validation')

    chunks = np.load(os.path.join(directory, "chunks.npy"))
    chunk_lengths = np.load(os.path.join(directory, "chunk_lengths.npy"))
    targets = np.load(os.path.join(directory, "references.npy"))
    target_lengths = np.load(os.path.join(directory, "reference_lengths.npy"))

    if shuffle:
        # shuffle all four arrays in the same order by restoring the RNG state
        rng_state = np.random.get_state()
        np.random.shuffle(chunks)

        np.random.set_state(rng_state)
        np.random.shuffle(chunk_lengths)

        np.random.set_state(rng_state)
        np.random.shuffle(targets)

        np.random.set_state(rng_state)
        np.random.shuffle(target_lengths)

    if limit and limit < chunks.shape[0]:
        chunks = chunks[:limit]
        chunk_lengths = chunk_lengths[:limit]
        targets = targets[:limit]
        target_lengths = target_lengths[:limit]

    return chunks, chunk_lengths, targets, target_lengths
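
As an aside, a minimal sketch (not part of Bonito) of keeping all four arrays aligned with a single index permutation instead of four synchronized in-place shuffles; shuffle_together is a hypothetical helper:

import numpy as np

# Hypothetical sketch: shuffle the four arrays consistently with one permutation.
def shuffle_together(chunks, chunk_lengths, targets, target_lengths, seed=None):
    rng = np.random.default_rng(seed)
    order = rng.permutation(chunks.shape[0])
    return (chunks[order], chunk_lengths[order],
            targets[order], target_lengths[order])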

Multiple GPU usage

Hello, we have been trying to use multiple GPUs to basecall and have discovered some interesting behavior. When specifying --device cuda:0 it runs on our CUDA0 device as intended. When we specify --device cuda:1 it leaks onto both GPUs. Is this intended behavior? The nomenclature with Guppy for multiple GPUs is --device "cuda:0 cuda:1". Is bonito basecalling intended for multi-GPU usage?

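
A common workaround (general CUDA behaviour, not something bonito-specific) is to restrict which physical GPU the process can see; the visible device is then renumbered as cuda:0. Model and paths below are placeholders:

$ CUDA_VISIBLE_DEVICES=1 bonito basecaller dna_r9.4.1 /data/reads > basecalls.fasta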

[bug] memory leak with CPU basecalling

Hi,

thanks for this, interesting project.

I may have found a mem leak while testing:

Server Ubuntu 1604 48 GB RAM CPU only

# mem leak, starts 40-50% of 48GB RAM, uses 75% after ca 30minutes. Then killed (after 1+ h). bonito basecaller --device cpu /mnt/ngsnfs/tools/bonito/bonito/models$
 bonito basecaller --device cpu /mnt/ngsnfs/tools/bonito/bonito/models/dna_r9.4.1/ subdir1/  > bonito3.fastq &

#input data
-rw-rw-r-- 1 rcug rcug 427M Feb 19 17:08 FAL71492_420e77a4a3d53f993710f389b74f684f01c6c3d4_14.fast5
-rw-rw-r-- 1 rcug rcug 369M Feb 19 17:08 FAL71492_420e77a4a3d53f993710f389b74f684f01c6c3d4_15.fast5
-rw-rw-r-- 1 rcug rcug 372M Feb 19 17:08 FAL71492_420e77a4a3d53f993710f389b74f684f01c6c3d4_16.fast5
-rw-rw-r-- 1 rcug rcug 419M Feb 19 17:08 FAL71492_420e77a4a3d53f993710f389b74f684f01c6c3d4_17.fast5
-rw-rw-r-- 1 rcug rcug 396M Feb 19 17:08 FAL71492_420e77a4a3d53f993710f389b74f684f01c6c3d4_18.fast5

pip freeze

alembic==1.4.0
asn1crypto==1.2.0
bioepic==0.2.9
biopython==1.74
bleach==2.1.3
botocore==1.10.24
bz2file==0.98
certifi==2019.11.28
cffi==1.13.2
chardet==3.0.4
Click==7.0
cliff==2.18.0
cmd2==0.8.9
colorlog==4.1.0
colormath==3.0.0
conda==4.8.2
conda-package-handling==1.6.0
cryptography==2.3.1
cutadapt==2.6
cycler==0.10.0
Cython==0.29.14
deap==1.2.2
decorator==4.4.1
deepTools==2.5.7
diagram==0.2.25
dnaio==0.3
docopt==0.6.2
docutils==0.15.2
entrypoints==0.2.3
fast-ctc-decode==0.1.3
ForestQC==1.1.5
future==0.16.0
gittyleaks==0.0.23
h5py==2.10.0
html5lib==1.0.1
idna==2.8
ipykernel==4.8.2
ipython==6.2.1
ipython-genutils==0.2.0
jedi==0.11.1
Jinja2==2.10.3
jmespath==0.9.4
joblib==0.14.1
jsonschema==2.6.0
jupyter-client==5.2.3
jupyter-core==4.4.0
jupyterhub==0.8.1
jupyterlab==0.31.12
jupyterlab-launcher==0.10.5
lzstring==1.0.4
Mako==1.1.1
mappy==2.17
MarkupSafe==1.1.1
matplotlib==2.1.2
mistune==0.8.3
multiqc==1.0
mysql-connector-python==8.0.17
nanoQC==0.3.3
natsort==6.2.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.0
notebook==5.4.0
numpy==1.18.1
olefile==0.46
ont-bonito==0.0.4
ont-fast5-api==3.0.1
ont-tombo==1.5
ont2cram==0.0.1
optuna==1.1.0
pamela==0.3.0
pandas==0.20.3
pandocfilters==1.4.2
parallel-fastq-dump==0.6.5
parameterized==0.7.0
parasail==1.1.19
parso==0.1.1
patsy==0.5.1
pbr==5.4.4
pexpect==4.4.0
pickleshare==0.7.4
Pillow==5.1.0
prettytable==0.7.2
progressbar33==2.4
prompt-toolkit==1.0.15
ptyprocess==0.5.2
py2bit==0.3.0
pyBigWig==0.3.10
pycosat==0.6.3
pycparser==2.19
pyfaidx==0.5.5.2
Pygments==2.2.0
pyOpenSSL==19.0.0
pyparsing==2.4.6
pyperclip==1.7.0
pysam==0.11.2.2
PySocks==1.7.1
python-dateutil==2.8.1
python-editor==1.0.4
python-oauth2==1.1.0
pytz==2019.3
PyYAML==5.3
pyzmq==17.0.0
requests==2.22.0
rpy2==2.8.6
ruamel-yaml==0.11.14
scandir==1.7
scikit-learn==0.19.1
scipy==1.4.1
seaborn==0.8
Send2Trash==1.5.0
sh==1.12.14
simplegeneric==0.8.1
simplejson==3.8.1
singledispatch==3.4.0.3
sip==4.19.13
six==1.14.0
spectra==0.0.11
SQLAlchemy==1.3.13
statsmodels==0.8.0
stevedore==1.32.0
stopit==1.1.1
svim==0.4.2
terminado==0.8.1
testpath==0.3.1
toml==0.10.0
toolshed==0.4.6
torch==1.4.0
tornado==6.0.3
TPOT==0.9.1
tqdm==4.31.1
traitlets==4.3.2
typing==3.5.2.2
umi-tools==0.4.4
update-checker==0.16
urllib3==1.24.2
wcwidth==0.1.8
webencodings==0.5.1
xopen==0.7.3

details about training data

Hello,

Are there any details on how to generate the training data? If you can simply mention the steps, we will hopefully be able to figure it out.

issues in running bonito

Hello,

I was able to install bonito in an environment with python 3.6.9, then if I run it I get:

$ bonito basecaller
Traceback (most recent call last):
  File "/home/copettid/anaconda3/envs/py36/bin/bonito", line 5, in <module>
    from bonito import main
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/bonito/__init__.py", line 7, in <module>
    from bonito import train, tune
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/bonito/tune.py", line 31, in <module>
    import optuna
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/optuna/__init__.py", line 1, in <module>
    from optuna import dashboard  # NOQA
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/optuna/dashboard.py", line 23, in <module>
    import optuna.study
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/optuna/study.py", line 13, in <module>
    import pandas as pd  # NOQA
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/__init__.py", line 55, in <module>
    from pandas.core.api import (
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/api.py", line 29, in <module>
    from pandas.core.groupby import Grouper, NamedAgg
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
    from pandas.core.groupby.generic import DataFrameGroupBy, NamedAgg, SeriesGroupBy
  File "/home/copettid/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 56, in <module>
    import pandas.core.algorithms as algorithms
AttributeError: module 'pandas' has no attribute 'core'

Can you help me figure out the type of issue I am having here?

Also, my fast5 files are in a 1.8 TB file.tar archive: do I need to extract them from there or can I use this archive as an input directly?
Thanks,
Dario

Base quality from bonito v0.3?

Hello,

We were going to see whether Bonito is now suitable for the PEPPER-DeepVariant pipeline, but with the default commands it looks like Bonito is still outputting FASTA instead of FASTQ. Is there a way to output FASTQ files, on which our pipeline makes a few decisions?

Resuming a run

Hi,

I am wondering -- if bonito terminates for whatever reason during a run, is there a way to restart the run and pick up from where it left off?

Thanks!

GPU memory reset

Hey Chris, just wanted to let you know that failed / canceled Bonito runs appear to not reset the memory on our GPUs (Tesla V100, 440.64.00, CUDA: 10.2) and run as PIDs that are not visible on nvidia-smi

I usually check with lsof /dev/nvidia0 and kill them, but perhaps this is something you might want to consider (if it's a Bonito issue)

Does bonito require an existing pre-trained model for the workflow?

You mentioned in #8 that you used the prepare_mapped_reads.py script in Taiyaki to generate the training set. However, the script requires a pre-trained basecaller model.

Does that mean running bonito will also require me to fetch an existing basecalling pre-trained model, for example, to generate the training set? Sorry, I'm new to this area and trying to learn how to start my own training with my own data.

Training parameters for Bonito: benchmarking

Hi @iiSeymour
First of all, thanks for all the great work you've done on Bonito: keep it up!

I've been exploring different methods for improving basecalling accuracy through pre-processing, and am reaching the point where I would like to compare directly against current state-of-the-art models. According to the "Nanopore Community" website, Bonito achieves 96.16% modal raw read accuracy, and I've seen similar results when using it myself. Would you be willing to share details regarding how the final model was trained and evaluated so that I can replicate it, and accurately compare my own techniques?

For example, I would like to know:

  • training set (I believe you used the standard HDF5 file included in the repo, but how many chunks, and what min/max sequence length?)
  • batch size
  • test set

Did you find using half-precision or different learning rates to have much effect on accuracy? How many epochs did it take for training to converge, and how long was that on your GPU cluster?

I could do some hyperparameter searching myself, but if you'd be willing to share just the parameters chosen for the final model it would save me a lot of time and I'd greatly appreciate it. Thanks in advance for your help!

IndexError with bonito train

Hello,

I have been trying out model training with Bonito using a converted Taiyaki training dataset.

Looking at the change in the loss function, the training looks really promising, but at the end of training I get IndexError: list index out of range. Since Bonito doesn't seem to save intermediate files or model checkpoints, all model files are lost because of the error.

Here's the full error report (slightly redacted):

[1875010/1875010]: 100%|####################################################| [5:23:58, loss=0.3562]
Traceback (most recent call last):
  File "/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/.local/lib/python3.6/site-packages/bonito/__init__.py", line 39, in main
    args.func(args)
  File ".local/lib/python3.6/site-packages/bonito/train.py", line 78, in main
    val_loss, val_mean, val_median = test(model, device, test_loader)
  File "/.local/lib/python3.6/site-packages/bonito/training.py", line 209, in test
    references = [decode_ref(target, model.alphabet) for target in test_loader.dataset.targets]
  File "/.local/lib/python3.6/site-packages/bonito/training.py", line 209, in <listcomp>
    references = [decode_ref(target, model.alphabet) for target in test_loader.dataset.targets]
  File "/.local/lib/python3.6/site-packages/bonito/util.py", line 115, in decode_ref
    return ''.join(labels[e] for e in encoded if e)
  File "/.local/lib/python3.6/site-packages/bonito/util.py", line 115, in <genexpr>
    return ''.join(labels[e] for e in encoded if e)
IndexError: list index out of range
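
For what it's worth, a minimal illustration (hypothetical label values, not Bonito code) of how decode_ref can raise this error when an encoded target contains an index outside the model alphabet:

labels = ['N', 'A', 'C', 'G', 'T']                    # e.g. model.alphabet

def decode_ref(encoded, labels):
    return ''.join(labels[e] for e in encoded if e)

print(decode_ref([1, 2, 3, 4], labels))   # 'ACGT'
print(decode_ref([1, 2, 9], labels))      # IndexError: list index out of range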

Best,
Mantas

bonito on CPU - strange core selection

Hi,

running the latest bonito (14.8.2020, downloaded via pip today), I get very strange CPU usage.

This is an 80-core HT CPU.

Cores 1-20 and 61-80 are used, but not the other cores (htop, Ubuntu 20.04). Is this expected?


python3 --version
Python 3.6.3 :: Anaconda, Inc.

bonito --version
bonito 0.2.1

Thanks,
Colin

  1  [|                                   1.9%]    21 [                                    0.0%]   41 [|||||||||||||||||||||||||||||||||||92.9%]    61 [|||||||||||||||||||||||||||||||||||92.9%]
  2  [|||||||||||||||||||||||||||||||||||92.8%]    22 [                                    0.0%]   42 [||||||                             11.0%]    62 [|||||||||||||||||||||||||||||||||||92.8%]
  3  [|||||||||||||||||||||||||||||||||||92.8%]    23 [                                    0.0%]   43 [                                    0.0%]    63 [|||||||||||||||||||||||||||||||||||92.9%]
  4  [|||||||||||||||||||||||||||||||||||92.9%]    24 [                                    0.0%]   44 [                                    0.0%]    64 [|||||||||||||||||||||||||||||||||||92.9%]
  5  [||||||||||||||||||||||||||||||||||100.0%]    25 [                                    0.0%]   45 [                                    0.0%]    65 [|||||||||||||||||||||||||||||||||||92.8%]
  6  [|||||||||||||||||||||||||||||||||||92.8%]    26 [                                    0.0%]   46 [                                    0.0%]    66 [|||||||||||||||||||||||||||||||||||92.9%]
  7  [|||||||||||||||||||||||||||||||||||92.9%]    27 [                                    0.0%]   47 [                                    0.0%]    67 [|||||||||||||||||||||||||||||||||||92.8%]
  8  [|||||||||||||||||||||||||||||||||||92.9%]    28 [                                    0.0%]   48 [                                    0.0%]    68 [|||||||||||||||||||||||||||||||||||92.8%]
  9  [|||||||||||||||||||||||||||||||||||92.8%]    29 [                                    0.0%]   49 [||                                  1.9%]    69 [|||||||||||||||||||||||||||||||||||92.8%]
  10 [|||||||||||||||||||||||||||||||||||92.8%]    30 [                                    0.0%]   50 [|                                   0.6%]    70 [|||||||||||||||||||||||||||||||||||92.9%]
  11 [|||||||||||||||||||||||||||||||||||92.8%]    31 [                                    0.0%]   51 [|                                   0.6%]    71 [|||||||||||||||||||||||||||||||||||92.9%]
  12 [|||||||||||||||||||||||||||||||||||92.8%]    32 [                                    0.0%]   52 [                                    0.0%]    72 [|||||||||||||||||||||||||||||||||||92.2%]
  13 [|||||||||||||||||||||||||||||||||||92.9%]    33 [                                    0.0%]   53 [                                    0.0%]    73 [|||||||||||||||||||||||||||||||||||92.8%]
  14 [|||||||||||||||||||||||||||||||||||92.8%]    34 [                                    0.0%]   54 [                                    0.0%]    74 [|||||||||||||||||||||||||||||||||||92.9%]
  15 [|||||||||||||||||||||||||||||||||||92.8%]    35 [                                    0.0%]   55 [                                    0.0%]    75 [|||||||||||||||||||||||||||||||||||92.8%]
  16 [|||||||||||||||||||||||||||||||||||92.8%]    36 [                                    0.0%]   56 [                                    0.0%]    76 [|||||||||||||||||||||||||||||||||||92.9%]
  17 [|||||||||||||||||||||||||||||||||||92.9%]    37 [                                    0.0%]   57 [                                    0.0%]    77 [|||||||||||||||||||||||||||||||||||92.9%]
  18 [|||||||||||||||||||||||||||||||||||92.9%]    38 [                                    0.0%]   58 [                                    0.0%]    78 [|||||||||||||||||||||||||||||||||||92.9%]
  19 [|||||||||||||||||||||||||||||||||||92.2%]    39 [                                    0.0%]   59 [                                    0.0%]    79 [|||||||||||||||||||||||||||||||||||92.8%]
  20 [|||||||||||||||||||||||||||||||||||92.8%]    40 [                                    0.0%]   60 [                                    0.0%]    80 [|||||||||||||||||||||||||||||||||||92.8%]

gpu stalling

Hi,

When rebasecalling with bonito I get weird behaviour for some of the fast5 directories as bonito occasionally stalls.

The majority of jobs finish within 1000 seconds for 8000 single-read fast5 files, but some just stop processing. When looking at the GPU, there is actually no GPU utilization for those processes.

When rerunning the same directory, it stalls again at the same position / read file, so I think it's reproducible.

Do you have any idea what could be the cause or how I can debug this?

Thank you,
Michel

$squeue |grep gn-2

      17012773_397       gpu Bonito_B michelmo  R       3:08      1 gn-2
      **job_393**       gpu Bonito_B michelmo  R 1-21:37:19      1 gn-2
      17012773_368       gpu Bonito_B michelmo  R 2-04:07:09      1 gn-2

$cat logs/bonito-job_393.out |tr "|" "\n" |tail

############
 1639/8000 [03:08<13:41,  7.74it/s]^M 21%
############1
 1642/8000 [03:08<14:44,  7.19it/s]^M 21%
############1
 1643/8000 [03:08<24:08,  4.39it/s]^M 21%
############1
 1644/8000 [03:09<30:28,  3.48it/s]^M 21%
############1
 1645/8000 [03:09<36:55,  2.87it/s]^M 21%
############1
 1647/8000 [03:10<28:44,  3.68it/s]^M 21%
############1
 1648/8000 [03:10<25:07,  4.21it/s]^M 21%
###########7
 1649/8000 [03:12<1:12:59,  1.45it/s]^M 21%
############1
 1651/8000 [03:12<53:23,  1.98it/s]

$nvidia-smi

Every 2.0s: nvidia-smi                                                                                                                                                                                               Tue Apr 14 09:14:30 2020
Tue Apr 14 09:14:30 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:01:00.0 Off |                  Off |
| 33%   55C    P2   234W / 260W |  10764MiB / 48601MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:81:00.0 Off |                  Off |
| 33%   26C    P8    12W / 260W |  25854MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:A1:00.0 Off |                  Off |
| 33%   26C    P8    17W / 260W |  29608MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    202151      C   /usr/bin/python3                           10753MiB |
|    1     41510 (job_393)      C   /usr/bin/python3                           25843MiB |
|    2     15580      C   /usr/bin/python3                           29597MiB |
+-----------------------------------------------------------------------------+

Workflow Training new Bonito Model

Dear @iiSeymour

I am currently experimenting with the new Bonito basecaller (in comparison with the Guppy basecaller).

Recently, we have generated a big dataset of bacterial ONT sequences from a highly AT-rich organism. In a de novo assembly, we noticed we were not able to reach good final assembly Q-scores, in comparison to the corresponding Illumina data, when running the Guppy basecaller. As such, we started investing some time in running the Bonito basecaller, followed by de novo assembly. However, no improvement was observed. We are aware the current Bonito basecaller has not been trained on this type of data, so we first checked whether our Bonito basecaller was performing well on an in-house generated E. coli dataset, which rendered highly improved final genome assembly Q-scores (in line with Illumina Q-scores) when using Bonito in comparison with Guppy.

Now we would like to train the Bonito basecaller with our own organism dataset(s) to reach similarly improved genome assembly Q-scores as with the E. coli dataset. I have been reading through multiple posts on GitHub about Bonito, but could not figure out a clear workflow for "how to train the Bonito basecaller". Is it possible to get a (little) walk-through on how to train the bonito model with our own input data?

Thank you in advance.
Best regards,
Nick Vereecke

Runtime of Bonito compared to Guppy (CPU and GPU)

I am interested in using the Bonito basecaller, but I am finding much less documentation on it compared to Guppy unfortunately. Is anyone aware of how Bonito compares to Guppy in terms of speed?
Also, depending on its normal speed, can Bonito utilize a GPU during basecalling?

Pretrained Larger Quartznet Models

Hi!

Thanks for providing this very useful repo, and the pre-trained model. I am using the bonito model as a transfer learning model for the application of designing error correction techniques for DNA based data storage.

Did you also try training the larger (10x15) quartznet models? Would it be possible to share their weights? Any other trained models would be useful to try out! (especially the ones which work well on shorter lengths)

Thanks,
Kedar

KeyError: 'block'

Hello,
I am trying to run bonito:
bonito basecaller --fastq dna_r9.4.1 1/ > fastq
But I got an error from Python:

Traceback (most recent call last):
  File "/apps/ont-bonito/0.1.5/bin/bonito", line 11, in <module>
    load_entry_point('ont-bonito==0.1.5', 'console_scripts', 'bonito')()
  File "/apps/ont-bonito/0.1.5/lib/python3.8/site-packages/ont_bonito-0.1.5-py3.8.egg/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/apps/ont-bonito/0.1.5/lib/python3.8/site-packages/ont_bonito-0.1.5-py3.8.egg/bonito/basecaller.py", line 19, in main
    model = load_model(args.model_directory, args.device, weights=int(args.weights), half=args.half)
  File "/apps/ont-bonito/0.1.5/lib/python3.8/site-packages/ont_bonito-0.1.5-py3.8.egg/bonito/util.py", line 202, in load_model
    model = Model(toml.load(config))
  File "/apps/ont-bonito/0.1.5/lib/python3.8/site-packages/ont_bonito-0.1.5-py3.8.egg/bonito/model.py", line 44, in __init__
    self.stride = config['block'][0]['stride'][0]
KeyError: 'block'

RNA model for bonito

Dear developers,

would you be able to provide an RNA model for bonito somewhere?

bonito basecaller rna_r9.4.1 /data/reads > basecalls.fasta

Thank you
Christoph

Store the chunking parameters in the model config

The learning rate, batch size, seed, etc. are stored in the model.toml for each model trained; it would be useful (#18) to also store all the parameters given to convert-data in the model.toml. Not all of the chunking parameters are available (or can be computed) at training time, so just make the convert-data script dump a conversion.toml and read it at training time to store in the model.toml.

install CUDA 10.2 ?

Hi,
is there an easy way to install the CUDA 10.2 packages for bonito 0.3.0?
It seems like Ubuntu only has 10.1 in the package manager.

I went the "venv3" route for installing bonito, in case it matters.

thanks,
Peter

Tweaking model parameters

Hello,

I'm trying to test possible enhancements based on ContextNet, but I've noticed the current model file has now deviated from QuartzNet. I was wondering how you were able to arrive at the architecture? Was it using some type of grid search? Since ContextNet uses a similar architecture to QuartzNet in the encoder layer, I was considering using similar parameters from Bonito.

multi-gpu inference?

Hi Chris,
I'd like to harness the power of multiple GPUs for basecalling. Can you comment on whether this is possible/easily implementable?
I noticed @kishwarshafin seems to have this implemented in his fork.

any timeline on base quality?

Hey @iiSeymour ,

Internally we have been leaning toward using Bonito, but the absence of base quality is holding up a lot of integration.
I think you can derive the Q values using the CTC inputs and a linear transformation of the aggregated values from the CTC encoding? It won't be great, but it will be enough to start with. Do you know when this would be incorporated?
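
As a rough illustration of the kind of transformation being suggested (a sketch, not Bonito's implementation; phred is a hypothetical helper):

import math

def phred(prob_correct, max_q=50):
    """Map a per-base probability of being correct to a Phred-scaled quality."""
    prob_error = max(1.0 - prob_correct, 1e-5)   # avoid log(0)
    return min(int(round(-10 * math.log10(prob_error))), max_q)

print(phred(0.999))  # 30
print(phred(0.9))    # 10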

bonito speed question

Hello,
I have installed bonito v0.1.2 and re-basecalled a data set (114 multi fast5 files). With guppy v3.4.4 basecalling took ~2.5hrs while bonito took ~19hrs. The bonito command was as follows:

bonito basecaller --device cuda:0 --half /home/scott/bonito/lib/python3.6/site-packages/bonito/models/dna_r9.4.1 /home/scott/Data/BaS_P0253Lux_4B_SpoIIA_1Ar/BaS_Lux/20190516_2124_MN26548_FAK49745_133fa243/fast5_pass > /home/scott/Data/BaS_P0253Lux_4B_SpoIIA_1Ar/BaS_Lux/20190516_2124_MN26548_FAK49745_133fa243/Bonito_basecalled/BaS_P0253Lux-4B_SpoIIA-1Ar_Bonito2_GPU.fasta

I am running ubuntu 18.04 with a Quadro RTX 5000 GPU. Are there options that can be changed to increase the speed of basecalling?
Thanks,
Scott

Docker container CPU

Is there a docker container for Bonito CPU? I am having lots of dependency issues trying to install everything.

Lowered barcode recognition of bonito basecalled data with "too smart" custom-trained model

Dear @iiSeymour

I have successfully trained a species-specific bonito model reaching assembly Q-scores up to 45 for de novo assembly (which is wonderful and what I hoped to reach with bonito!). However, when performing qcat demultiplexing of my data, significantly more reads (30% to 60%) were classified as "none" compared to demultiplexing of reads basecalled with Guppy or with the standard bonito model (dna_r9.4.1).

As you guided me previously through training the bonito model in https://github.com/nanoporetech/bonito/issues/22, I was wondering if you have suggestions to get my newly trained model supplemented with this barcode information. It sounds like my own model has become too smart (?!) after its training.

Thank you in advance.
Best regards.
Nick Vereecke

basecaller cannot read fast5 file

Hello iiSeymour:

Recently, I used bonito basecaller to call fast5 files generated by a MinION. I have run the basecaller with both the model I trained myself and the released dna_r9.4.1 model; neither of them worked.
The command is:
bonito basecaller dna_r9.4.1 ./minion_phage > minion.fasta

It shows loading model and calling as well, but then it reports 0 reads, such as:

loading model
calling
0it [00:00, ?it/s]
completed reads: 0
samples per second 0.0E+00
done

The fast5 files are fine; I read them with the ont-fast5-api and they do contain reads. And I could run the basecaller fine just one week before. What could cause this? How can I fix it?

TCSConv1d stride in pointwise convolution

Hi Chris,

I may be missing something, but in TCSConv1d, when separable=True, shouldn't the pointwise convolution always have stride=1 instead of stride=stride?

Otherwise the separable convolution doesn't match a regular Conv1d in output shape for any stride larger than 1; also, I feel like a stride doesn't make sense for the pointwise part. A sketch of the expected behaviour is below.
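
A minimal sketch of a depthwise-separable Conv1d (an assumed structure for illustration, not the actual TCSConv1d code), with the stride on the depthwise part only:

import torch
import torch.nn as nn

class SeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # depthwise convolution carries the stride
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   stride=stride, padding=padding, groups=in_ch)
        # pointwise (1x1) convolution mixes channels and keeps stride=1
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 1000)
out = SeparableConv1d(16, 32, kernel_size=5, stride=2, padding=2)(x)
print(out.shape)  # torch.Size([1, 32, 500]) - matches a regular strided Conv1d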

Best
Robert

reproduce the training

Hi,

thanks for the project.

I'm trying to reproduce the model training. I have now trained a model with the training data (851972x4800). The mean_acc and median_acc on the evaluation data (100000x8192) are 95.5645% and 96.1451%, which is about 0.6% less than the model downloaded from your repository. Our training settings: 8 GPUs, bs=512, lr=1e-3, epochs=400. Could you share some details about this?

Clarification on model

This is probably a dumb question, but is the latest model we pull using the script the same one shown in the March 2020 accuracy update?

get-models does not work

Hi,
Thanks in advance for your time.
I was trying to get the trained model and I got this error.

$ ./scripts/get-models
Archive:  bonito/models/quartz5x5-s3-4000.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of bonito/models/quartz5x5-s3-4000.zip or
        bonito/models/quartz5x5-s3-4000.zip.zip, and cannot find bonito/models/quartz5x5-s3-4000.zip.ZIP, period.

Please let me know if I did something wrong.
Pablo.
