Comments (21)

iiSeymour commented on June 23, 2024

Hey @menickname

That sounds like a great idea and I'm happy to help.

Did you see the notebook directory which you can run in Google Colab?

Bonito uses the same HDF5 chunkify format that Taiyaki uses, so you'll want to start with that; there is good documentation in the Taiyaki repo. The training file shipped with Bonito has 66,000 reads split equally across conditions, so you will want to filter your own reads to take an equal sample of reads per genome. Once you have created a chunkify file using taiyaki/prepare_mapped_reads.py you can then use bonito/convert-data, which will produce 4 .npy files ready for training -

>>> # references.npy
>>> # Integer encoded target sequence {'A': 1, 'C': 2, 'G': 3, 'T': 4}
>>> # Variable length and zero padded (default range between 128 and 256).
>>> np.load('references.npy').shape
(1000000, 256)
>>> np.load('references.npy').dtype
dtype('uint8')
>>>
>>> # reference_lengths.npy
>>> # Lengths of target sequences in references.npy
>>> np.load('reference_lengths.npy').shape
(1000000,)
>>> np.load('reference_lengths.npy').dtype
dtype('uint8')
>>>
>>> # chunks.npy
>>> # Sections of squiggle that correspond with the target reference sequence
>>> # Variable length and zero padded (up to 4096 samples).
>>> np.load('chunks.npy').shape
(1000000, 4096)
>>> np.load('chunks.npy').dtype
dtype('float32')
>>>
>>> # chunk_lengths.npy
>>> # Lengths of squiggle sections in chunks.npy 
>>> np.load('chunk_lengths.npy').shape
(1000000,)
>>> np.load('chunk_lengths.npy').dtype
dtype('uint16')
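
To make the encoding concrete, here is a small sketch (based only on the file layout described above) that strips the padding from the first target sequence and decodes it back into bases:

import numpy as np

references = np.load('references.npy')
reference_lengths = np.load('reference_lengths.npy')

# Map the integer encoding back to bases and drop the zero padding
# using the stored length.
decode = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
first = references[0][:reference_lengths[0]]
print(''.join(decode[b] for b in first))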

Bonito uses CTC to train; this is a good article explaining how it works: https://distill.pub/2017/ctc/. For training, you just need to give a directory to store the model output, the model config, and use --directory to select your custom .npy training files.

$ bonito train /data/custom-model-dir config/quartznet5x5.toml --directory /data/custom-chunk-dir
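
For intuition only, here is a minimal PyTorch sketch of a CTC loss computation with Bonito-like shapes (this is not Bonito's actual training loop; the sizes are made up):

import torch
import torch.nn as nn

# 5 output classes: the CTC blank (0) plus A/C/G/T encoded as 1..4,
# matching the integer encoding used in references.npy above.
ctc = nn.CTCLoss(blank=0)

T, N, C = 400, 8, 5  # time steps, batch size, classes (illustrative)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, 5, (N, 200), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(150, 201, (N,), dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()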

Give it a shot and let me know if anything isn't clear or if you run into any problems.

Chris.


menickname commented on June 23, 2024

@iiSeymour, indeed! Performs perfectly now. The 4 .npy files have been generated and I will try to start up the training now on HPC GPU resources.

Thank you for your support.


iiSeymour commented on June 23, 2024

On the multi-GPU support:

I found DataParallel could hang multi-GPU systems without NVLink/NVSwitch, so I haven't merged it yet. Guarding the import and uses behind --multi-gpu is probably safe, so I'll look to get this merged into master today.
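
For illustration, the guard would look something like this (a sketch, not the actual patch):

# only import and wrap when multi-GPU is explicitly requested
if args.multi_gpu:
    from torch.nn import DataParallel
    model = DataParallel(model)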


DelphIONe commented on June 23, 2024

Without the chunksize argument (the default is 3600...) it works! But I have a doubt: what is the impact of this chunksize parameter?
Thanks for your answer.


menickname commented on June 23, 2024

Dear @iiSeymour

Thanks a lot for your prompt response. I will have a look into that. I think I needed this short summary to get from step A to B and so forth, up to the bonito train step. I did end up reading the notebook directory, but I got confused going between the issues on GitHub and the notebook.
I will try your proposed flow and see how it goes from there. If I encounter any further problems, I will contact you.

One small question for now:

"The training file shipped with Bonito has 66,000 reads split equal across conditions so you will want to filter your own reads to take an equal sample of reads per genome."

This means I should also stick to roughly 66,000 reads (I don't think the exact size matters here?). I would like to train the model for 1 specific species, as I think the poor basecalling performance is due to the high AT content and numerous repeat stretches of the bacteria we are currently working on. However, I have 10 independent read files of one and the same isolate, which I might split to get an equal sample read input to train Bonito? If so, I might start training first with 1 single dataset and then retrain Bonito with a mixture of all 10 independent runs?

Thank you in advance.
Best regards,
Nick


menickname commented on June 23, 2024

Dear @iiSeymour

Once again, thank you for your answer earlier today. I have been "playing" around with Taiyaki today and succeeded in getting to the final step - prepare_mapped_reads.py. For this step we have to supply a model file. Currently, I have downloaded and used the r941_dna_minion.checkpoint file from the taiyaki_walkthrough available for Guppy. I was still wondering the following:

  1. Should I use Bonito-basecalled data to perform the guppy_aligner step? I don't think so, as this should not matter?
  2. Should I use a Bonito-specific model file as input for the prepare_mapped_reads.py step? I have been looking through the files available in the Bonito downloads but could not find the correct model file for this purpose. If so, is there a suitable starting file available, or should I generate one first myself? The only model files I see in Bonito's dna_r9.4.1 directory are a conf.toml and a weights_1.tar file.

Thank you in advance.
Kind regards,
Nick Vereecke


iiSeymour commented on June 23, 2024

Right, the exact size (66,000) isn't important; it was more just to let you know that you don't need millions of reads for model training. Stick with Guppy for both questions (1 and 2); you take the reference sequence from the mapping anyway, so the caller isn't important.
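
As a rough sketch of the equal-sampling step (the directory layout and counts here are made-up placeholders, not part of Bonito):

import random
from pathlib import Path

# hypothetical layout: one directory of single-read fast5 files per genome
genome_dirs = [Path('reads/genomeA'), Path('reads/genomeB')]
per_genome = 33000  # e.g. 66,000 reads split equally across two genomes

sample = []
for d in genome_dirs:
    reads = list(d.glob('*.fast5'))
    sample.extend(random.sample(reads, min(per_genome, len(reads))))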

HTH,

Chris.


menickname commented on June 23, 2024

Dear @iiSeymour

I have successfully generated the chunkify file; however, when running the Bonito convert-data command I get the following error:

$ ./convert-data --chunks 1000000 ../../Bonito_Train/Bonito_Nick.hdf5 ../../Bonito_Train/
[ 0.07494327  0.         -0.04496622 ...  0.3897055   0.20984128
  0.52460355]
Traceback (most recent call last):
  File "./convert-data", line 226, in <module>
    main(parser.parse_args())
  File "./convert-data", line 97, in main
    print(read_id, squiggle_duration, len(pointers), mapped_off_the_end, sequence_length)
NameError: name 'mapped_off_the_end' is not defined

I went through all the previous steps again but could not figure out where things went wrong.
Thank you in advance.


iiSeymour commented on June 23, 2024

Sorry @menickname, I had left a bad debug print in the conversion script; can you do a git pull and try again?


menickname commented on June 23, 2024

Dear @iiSeymour

The first Bonito train is currently running on our HPC GPU cluster. However, with the current release we are not able to run on multiple GPUs. In the #13 (comment) issue I saw you suggested adding some patches to the train.py file. Is there a specific reason this has not been implemented in the code itself yet?

Thank you in advance.


boegel commented on June 23, 2024

@iiSeymour Do you have more information on that hang issue?

How does it manifest, etc., so we can keep an eye on it?


iiSeymour commented on June 23, 2024

There is a PyTorch thread on the issue here.

The hanging happens on the import of DataParallel, so you never see the first epoch progressing. See this comment; try running p2pBandwidthLatencyTest, which can be found in the samples directory of your CUDA install, to check your system's device-to-device latency.

This is all single-node multi-GPU b.t.w., and I can confirm Nvidia DGX systems are fine as they use NVLink/NVSwitch. Is multi-node something you are wanting to look at?


boegel commented on June 23, 2024

p2pBandwidthLatencyTest doesn't seem to be there anymore in CUDA 10.1.243...

We have 4 NVIDIA Tesla V100 GPUs in each box, so we're only looking at single-node multi-GPU for now, which would be a nice step up over single GPU.
A next step could be multi-node, since our GPU cluster has InfiniBand (double EDR), but the focus is on single-node multi-GPU for now...


boegel commented on June 23, 2024

Seems like we do have NVLink:

$ nvidia-smi nvlink -s
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-7de090ff-01c3-5997-3e7a-af13b14a1b86)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s
# similar output for other 3 GPUs


iiSeymour commented on June 23, 2024

Yes, it looks like you are good!

The samples are optional; they are here if you are interested - https://github.com/NVIDIA/cuda-samples
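
If you do want to run it, building from that repo is roughly (the sample's subdirectory varies between releases):

$ git clone https://github.com/NVIDIA/cuda-samples
$ cd cuda-samples/Samples/p2pBandwidthLatencyTest  # path may differ by release
$ make
$ ./p2pBandwidthLatencyTest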


menickname commented on June 23, 2024

Dear @iiSeymour

Bonito train is running perfectly on multi-GPU after installing the patched Bonito version. Please find the working version here: https://gist.github.com/boegel/e0c303497ba6275423b39ea1c10c7d73 from @boegel. This one works on our HPC GPU resources and fixes 2 of the aforementioned issues.

Still 2 questions:

  1. When increasing the batch size with multi-GPU resources, memory consumption does indeed increase, but speed does not. E.g. going from batch size 128 on 1 GPU to batch size 256 on 2 GPUs does not increase the speed of Bonito training.
  2. What would be required to get this running on multiple nodes? We have 10 nodes with 4 GPUs each, so we could still scale up to 8, 12, or 16 GPUs beyond single-node multi-GPU mode.

Thank you in advance.


iiSeymour commented on June 23, 2024
  1. How are you measuring speed?
  2. I was thinking about looking at RaySGD for multi-node.


menickname commented on June 23, 2024
  1. I am looking at the progress indicator: 1 GPU results in 1% progress per minute, and 2 GPUs result in 2% progress per minute, both with 659,240 processed "parts" in total per "round". When running 2 GPUs with batch size 256 instead of 128, the speed stays at 2% progress per minute.
  2. Passed on to @boegel.


boegel commented on June 23, 2024

For distributed training, I think @iiSeymour is planning to look into adding support for leveraging RaySGD from Bonito.
I wish I had time to help out with that, but I have way too much on my plate already... If the necessary changes have been made to Bonito, I'm certainly willing to help test it though!


iiSeymour commented on June 23, 2024

Version 0.1.2 is on PyPI now and includes the multi-GPU patch along with the reworked solution for including the old scripts directory.

What was scripts/convert-data is now available as bonito convert, and bonito download has been added for handling pretrained models and training data; full details here.
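
Assuming the new entry points mirror the old script's arguments (the --training flag and the paths below are illustrative, not confirmed), usage would look something like:

$ bonito download --training
$ bonito convert /data/chunkify.hdf5 /data/custom-chunk-dir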


DelphIONe commented on June 23, 2024

Hi, thanks for your tool :-) I am trying to train an RNA model... but Bonito only has DNA models, so I have used Taiyaki as a starting point with:
prepare_mapped_reads.py ../fast5/ params_per_read.tsv outputTayaki.fast5 ~/softs/taiyaki/models/mGru_flipflop.py refFromSam.fasta
Like @menickname, I have successfully generated the chunkify file. I think:

import h5py
f = h5py.File('outputTayaki.fast5', 'r')
list(f.keys())
['Reads']

This works with Taiyaki v5.0.0 only, because with the latest version (using mLstm_flipflop.py, as advised for Guppy compatibility), I had:
list(f.keys())
['Batches', 'read_ids']

So I'm happy :-) and I hope I'm on the right track.

However, when running the bonito convert command I get the following error:

bonito -v
bonito 0.6.1

$ bonito convert --chunksize 1000000 outputTayaki.fast5 bonito_convert
/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py:76: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  reads = np.random.permutation(sorted(reads.items()))

preparing training chunks
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/xxx/miniconda3/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 113, in main
    training_chunks = chunk_dataset(training, args.chunksize)
  File "/home/xxx/miniconda3/lib/python3.7/site-packages/bonito/cli/convert.py", line 70, in chunk_dataset
    chunks, targets = zip(*tqdm(take(all_chunks, num_chunks), total=num_chunks))
ValueError: not enough values to unpack (expected 2, got 0)

I have tried with --chunksize 100000 but I get the same error.
Can you help me, please?
Thank you in advance :-)
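
For reference, a quick sanity check of the per-read contents (assuming the standard Taiyaki mapped-read layout, where each read group holds Dacs, Ref_to_signal, and Reference datasets) would be something like:

import h5py

with h5py.File('outputTayaki.fast5', 'r') as f:
    reads = f['Reads']
    print(len(reads), 'reads')
    first = next(iter(reads.values()))
    # expect per-read datasets like Dacs, Ref_to_signal, Reference
    print(list(first.keys()))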

