
Deprecation notice: the functionality in this repository is superseded by integrated pairing in Dorado. It is no longer necessary to run pairing or dorado_stereo.sh to perform end-to-end duplex calling.

Duplex Tools

Duplex Tools contains a set of utilities for working with duplex sequencing data. Tools are provided to identify and prepare duplex pairs for basecalling by Dorado (recommended) or Guppy, and to recover simplex basecalls from incorrectly concatenated pairs.

Installation

Duplex Tools is written in Python and can be installed directly from PyPI. We recommend installing Duplex Tools into an isolated virtual environment as follows:

python -m venv venv --prompt duplex
. venv/bin/activate
pip install duplex_tools

after which the tools will be available through the duplex_tools command.
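
To check the installation, print the version (the exact version string will vary):

duplex_tools --version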

General Usage

Duplex Tools is run simply with:

duplex_tools --help

The available sub-commands are:

Duplex pairing

Compatible with Dorado

  • pair - a wrapper to pair duplex reads, using pairs_from_summary and then filter_pairs.
  • split_pairs - a utility for recovering and pairing duplex reads (for cases where template and complement are contained within a single MinKNOW read).

Compatible with Guppy+Dorado

  • pairs_from_summary - identify candidate duplex pairs from the sequencing summary output by Guppy, or from an unmapped SAM/BAM output by Dorado.
  • filter_pairs - filter candidate pairs using basecall-to-basecall alignment.

Additional tools

  • split_on_adapter - split non-split duplex pairs into their component simplex reads (formerly read_fillet).
    • This tool splits basecalled sequences into new sequences. Because it operates on basecalls rather than signal, basespace duplex calling is possible after using this method, but regular stereo calling is not (a minimal sketch of the idea follows this list).
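As a rough illustration of what split_on_adapter does, here is a minimal Python sketch assuming exact adapter matches (the real tool locates adapters by local alignment and tolerates sequencing errors); sub-reads inherit the parent read id with a _1, _2, ... suffix and a start->end annotation, as seen in the issues below:

def split_read_naive(read_id, seq, qual, adapter):
    """Yield (name, sub_seq, sub_qual) for the pieces between exact adapter hits."""
    pieces, start = [], 0
    while True:
        hit = seq.find(adapter, start)
        if hit == -1:
            pieces.append((start, len(seq)))
            break
        pieces.append((start, hit))
        start = hit + len(adapter)
    for n, (i, j) in enumerate(pieces, 1):
        # mirror the "<read_id>_<n> ... <i>-><j>" naming convention of the tool
        yield f"{read_id}_{n} {i}->{j}", seq[i:j], qual[i:j]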

Usage with Dorado (recommended)

Currently, pairing and calling are separate processes to allow for workflow flexibility.

For greatest duplex recovery, follow these steps:

  1. Simplex basecall with dorado (with --emit-moves)
  2. Pair reads
  3. Duplex-basecall reads

1a) Simplex basecall with dorado

This will create an unmapped BAM file which retains the mapping between the signal and the bases. --emit-moves allows additional pairs to be found in step 2b.

$ dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5s/ --emit-moves > unmapped_reads_with_moves.bam
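
As a quick sanity check that the output really carries move tables, the reads can be counted with pysam. This is a sketch assuming recent dorado versions, which store the move table in the per-read mv tag:

import pysam

total = with_moves = 0
with pysam.AlignmentFile("unmapped_reads_with_moves.bam", check_sq=False) as bam:
    for read in bam:
        total += 1
        with_moves += read.has_tag("mv")  # move table written by --emit-moves
print(f"{with_moves}/{total} reads carry a move table")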

2a) Find duplex pairs for Dorado stereo/basespace basecalling

This will detect the majority of pairs and put them in the pairs_from_bam directory.

duplex_tools pair --output_dir pairs_from_bam unmapped_reads_with_moves.bam

2b) Find additional duplex pairs in non-split reads (optional)

The steps below can recover non-split pairs and allow them to be duplex-called.

Use the BAM and a POD5 directory to create additional pairs:

$ duplex_tools split_pairs unmapped_reads_with_moves.bam pod5s/ pod5s_splitduplex/
$ cat pod5s_splitduplex/*_pair_ids.txt > split_duplex_pair_ids.txt

3) Stereo basecall all the reads

From the main pairing:

$ dorado duplex dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5s/ --pairs pairs_from_bam/pair_ids_filtered.txt > duplex_orig.sam

From the additional pairing (optional):

$ dorado duplex dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5s_splitduplex/ --pairs split_duplex_pair_ids.txt > duplex_splitduplex.sam
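
To gauge the resulting duplex yield, the stereo output can be inspected with pysam. This is a sketch assuming dorado marks duplex records with dx:i:1 and simplex records with dx:i:0:

import pysam

duplex = total = 0
with pysam.AlignmentFile("duplex_orig.sam", check_sq=False) as sam:
    for read in sam:
        total += 1
        if read.has_tag("dx") and read.get_tag("dx") == 1:
            duplex += 1
print(f"{duplex}/{total} records are duplex basecalls")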

Usage with Guppy

Preparing duplex reads for Guppy duplex basecalling

To prepare reads for duplex calling, Duplex Tools provides two programs. The first parses the sequencing summary output by the Guppy basecaller (or the metadata in a .bam or .sam from dorado) to generate candidate pairs from simple read metrics. The second analyses the basecalls of candidate reads, checking for similarity.

To run the basic pairing from a sequencing summary (or bam metadata), run:

duplex_tools pairs_from_summary <sequencing_summary.txt/dorado.bam> <output directory>

The primary output of the above will be a text file named pair_ids.txt in the user-specified output directory. Although this file can be given to Guppy to perform duplex calling, we recommend also running the basecall-to-basecall alignment filtering provided by the filter_pairs command:

duplex_tools filter_pairs <pair_ids.txt> <fastq directory/dorado.bam>

The first argument is the file described above, output by pairs_from_summary. The second should be the Guppy (or MinKNOW) or dorado output directory containing fastq or bam data --- the directory will be searched recursively for all .fastq.gz, .fastq, and .sam/.bam files.

The output of this second command will be a file named pair_ids_filtered.txt placed alongside the pair_ids.txt file.
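
For intuition, the following pandas sketch shows the kind of heuristic pairs_from_summary applies. It is illustrative rather than the actual implementation; the thresholds echo those the tool logs (a complement starting shortly after the template ends on the same channel, and similar basecall lengths):

import pandas as pd

ss = pd.read_csv("sequencing_summary.txt", sep="\t")
ss = ss.sort_values(["channel", "mux", "start_time"]).reset_index(drop=True)
nxt = ss.shift(-1)  # the next read observed on the (sorted) flow cell

same_pore = (ss["channel"] == nxt["channel"]) & (ss["mux"] == nxt["mux"])
gap = nxt["start_time"] - (ss["start_time"] + ss["duration"])
ratio = nxt["sequence_length_template"] / ss["sequence_length_template"]

candidates = ss.index[same_pore & gap.between(0, 20) & ratio.between(0.9, 1.1)]
pairs = pd.DataFrame({
    "template": ss.loc[candidates, "read_id"].values,
    "complement": ss.loc[candidates + 1, "read_id"].values,
})
pairs.to_csv("pair_ids_sketch.txt", sep=" ", header=False, index=False)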

Duplex basecalling with Guppy

The file pair_ids_filtered.txt as prepared above can be used with the original .fast5/.pod5 files produced during a sequencing run in order to calculate high quality duplex basecalls.

For example,

guppy_basecaller_duplex \
    -i <MinKNOW directory> \
    -r -s duplex_calls \
    -x 'cuda:0' -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --chunks_per_runner 416 \
    --duplex_pairing_mode from_pair_list \
    --duplex_pairing_file pair_ids_filtered.txt

will produce duplex basecalls for the read pairs listed in pair_ids_filtered.txt, using the .fast5/.pod5 files found in the user-provided MinKNOW output directory.

Duplex basecalling with Dorado

Please use duplex_tools pair unmapped_dorado.bam. This will run both the pairing and the pairwise alignment-based filtering, producing a pair_ids_filtered.txt that can be passed to dorado.

For more details, see https://github.com/nanoporetech/dorado.

Help

Licence and Copyright

© 2021- Oxford Nanopore Technologies Ltd.

Duplex Tools is distributed under the terms of the Mozilla Public License 2.0.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or to stimulate community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented via forks and pull requests. However, much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resources for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.


duplex-tools's Issues

np.bool deprecated, package no longer works

Running either sh dorado_stereo.sh or duplex_tools pair leads to the following error:

/usr/lib/python3/dist-packages/pandas/util/testing.py:28: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
  import pandas._libs.testing as _testing
Traceback (most recent call last):
  File "/home/michael/.local/bin/duplex_tools", line 5, in <module>
    from duplex_tools import main
  File "/home/michael/.local/lib/python3.8/site-packages/duplex_tools/__init__.py", line 5, in <module>
    from duplex_tools import \
  File "/home/michael/.local/lib/python3.8/site-packages/duplex_tools/assess_split_on_adapter.py", line 7, in <module>
    import pandas as pd
  File "/usr/lib/python3/dist-packages/pandas/__init__.py", line 182, in <module>
    import pandas.testing
  File "/usr/lib/python3/dist-packages/pandas/testing.py", line 7, in <module>
    from pandas.util.testing import (
  File "/usr/lib/python3/dist-packages/pandas/util/testing.py", line 28, in <module>
    import pandas._libs.testing as _testing
  File "pandas/_libs/testing.pyx", line 10, in init pandas._libs.testing
  File "/home/michael/.local/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

duplex_tools filter_pairs failed on *fastq.gz files produced by PromethION

Hello,

We noticed a glitch (documented below) where filter_pairs failed to find good pairs in the fastq.gz files produced by a PromethION.

INPUT

(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ file fastq_pass/*.gz
fastq_pass/PAH93923_pass_4ac2ad82_13.fastq.gz: gzip compressed data, max speed
fastq_pass/PAH93923_pass_4ac2ad82_14.fastq.gz: gzip compressed data, max speed

(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ cat pair_ids_test.txt 
320a55b5-ec55-40ec-98ad-2a0e2560b5c7 0ac682bd-f31e-421a-9fb0-59c04998ad94

Problem

(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ duplex_tools filter_pairs pair_ids_test.txt ./fastq_pass/
[18:44:29 - FiltPairs] Filtering Parameters: 
        nbases_per_strand_to_align=250
        align_threshold=0.6
[18:44:29 - FiltPairs] Alignment Parameters: 
        score_match=2
        score_mismatch=-1
        penalty_open=4
        penalty_extend:1
[18:44:29 - ReadFastq] Processed 0/2 input fastq files.
[18:44:29 - ReadFastq] Found 0.0% of required reads.
[18:44:29 - FiltPairs] Starting alignments.
[18:44:29 - AlignPairs] Aligning 1 pairs
[18:44:29 - AlignPairs] Good pairs: 0

Decompressing the fastq.gz works

(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ for i in fastq_pass/*.gz; do gunzip $i; done
(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ file fastq_pass/*
fastq_pass/PAH93923_pass_4ac2ad82_13.fastq: ASCII text, with very long lines
fastq_pass/PAH93923_pass_4ac2ad82_14.fastq: ASCII text, with very long lines
(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ duplex_tools filter_pairs pair_ids_test.txt ./fastq_pass/
[18:45:13 - FiltPairs] Filtering Parameters: 
        nbases_per_strand_to_align=250
        align_threshold=0.6
[18:45:13 - FiltPairs] Alignment Parameters: 
        score_match=2
        score_mismatch=-1
        penalty_open=4
        penalty_extend:1
[18:45:13 - ReadFastq] Processed 0/2 input fastq files.
[18:45:13 - ReadFastq] Found 100.0% of required reads.
[18:45:13 - FiltPairs] Starting alignments.
[18:45:13 - AlignPairs] Aligning 1 pairs
[18:45:13 - AlignPairs] Good pairs: 1

Regular Gzips work fine

(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ for i in fastq_pass/*.fastq; do gzip $i; done            
(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ file fastq_pass/*.gz                                     
fastq_pass/PAH93923_pass_4ac2ad82_13.fastq.gz: gzip compressed data, was "PAH93923_pass_4ac2ad82_13.fastq", last modified: Thu Nov 11 00:43:25 2021, from Unix
fastq_pass/PAH93923_pass_4ac2ad82_14.fastq.gz: gzip compressed data, was "PAH93923_pass_4ac2ad82_14.fastq", last modified: Thu Nov 11 00:43:30 2021, from Unix
(lims_pvenv) (lims_pvenv) [opl999@cnic-opal-utl-19001 test_duplex]$ duplex_tools filter_pairs pair_ids_test.txt ./fastq_pass/
[18:55:51 - FiltPairs] Filtering Parameters: 
        nbases_per_strand_to_align=250
        align_threshold=0.6
[18:55:51 - FiltPairs] Alignment Parameters: 
        score_match=2
        score_mismatch=-1
        penalty_open=4
        penalty_extend:1
[18:55:52 - ReadFastq] Processed 0/2 input fastq files.
[18:55:52 - ReadFastq] Found 100.0% of required reads.
[18:55:52 - FiltPairs] Starting alignments.
[18:55:52 - AlignPairs] Aligning 1 pairs
[18:55:52 - AlignPairs] Good pairs: 1

Try edlib for faster duplex matching

Hi,

I suggest trying edlib in place of parasail for the alignment of pairs, to get a speed-up for longer comparisons. Potentially this can recover more good pairs if you go from 250 bp to full-length/longer comparisons. I tried it with 5 kb and got a ~15x speedup while keeping almost the same number of good matches.
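
A minimal sketch of such an edlib-based check is below; the window size, orientation, and threshold are illustrative, not the tool's actual parameters:

import edlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def good_pair(template, complement, n=5000, threshold=0.6):
    """Accept a pair if the reverse-complemented strands align well enough."""
    a = template[:n]
    b = complement[-n:].translate(COMPLEMENT)[::-1]  # reverse complement
    result = edlib.align(a, b, mode="NW", task="distance")
    identity = 1 - result["editDistance"] / max(len(a), len(b))
    return identity >= threshold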

config/base calling models

Hi

I'm confused about the basecalling model here: should you use the same one used in simplex basecalling, or is this model somehow adapted for duplex calling? If so, how can we see which duplex models are available?

Thanks for your help,
Charlotte

TypeError: mkdir() got an unexpected keyword argument 'exist_ok'

when I run the pairs_from_summary program, I got the error like this

[10:55:41 - FindPairs] Duplex tools version: 0.2.7
Traceback (most recent call last):
  File "/export/pipeline/RNASeq/Software/Duplex_tools/bin/duplex_tools", line 11, in <module>
    load_entry_point('duplex-tools==0.2.7', 'console_scripts', 'duplex_tools')()
  File "/export/pipeline/RNASeq/Software/Duplex_tools/lib/python3.6/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/export/pipeline/RNASeq/Software/Duplex_tools/lib/python3.6/site-packages/duplex_tools/pairs_from_summary.py", line 345, in main
    max_abs_seqlen_diff=args.max_abs_seqlen_diff)
  File "/export/pipeline/RNASeq/Software/Duplex_tools/lib/python3.6/site-packages/duplex_tools/pairs_from_summary.py", line 35, in find_pairs
    outdir, prefix, prepend_seqsummary_stem, sequencing_summary_path)
  File "/export/pipeline/RNASeq/Software/Duplex_tools/lib/python3.6/site-packages/duplex_tools/pairs_from_summary.py", line 101, in prepare_output_paths
    outdir.mkdir(parents=True, exist_ok=True)
TypeError: mkdir() got an unexpected keyword argument 'exist_ok'

That's my command line:

duplex_tools pairs_from_summary sequencing_summary.txt ./output

Could you please help me check this out? Thanks!

allow multiple splits does not work with test duplication

I am trying to see if I can use this tool to split read concatemers that are generated from a proximity ligation assay.

Before going too deep into altering the underlying adapter sequences, I wanted to see how well it works with the current adapters.

I took the fastq in test/data and duplicated the sequence.

But running

duplex_tools split_on_adapter fastq_in split_at_adapter_test/test_data2 Native --print_alignment --allow_multiple_splits --debug_output

the fastq sequence is as follows:

@200bases-tailhead-200bases
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
**CTGTCTCTTATACACATCTCCTGCAGG
AGATGTGTATAAGAGACAG**
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
**CTGTCTCTTATACACATCTCCTGCAGG
AGATGTGTATAAGAGACAG**
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
TGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGTGTCGTGATCATGCTAGTGTAGCTGATGTGCG
+
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
IIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIIIIIIIIII
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
IIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIIIIIIIIII
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

the output splits at the first tail-head adapter but skips over the second tail-head within the fastq.

Duplicate reads and read splitting option in MinKNOW

Hi,

I obtained fastq and bam files from a PromethION device and realized that ~2.4% of reads were duplicates. They have the same read length and read id. I used MinKNOW with default settings, where the read splitting option was turned on. Is there any relationship between the read splitting option in MinKNOW and getting duplicate reads?

Another reason for the duplicates may be related to merging small bam files into one bigger bam file using samtools cat or samtools merge, but in both cases I got the same number of duplicates and the same read ids.

Do you have any recommendations for downstream analysis with regard to (or regardless of) the duplicate reads?

Kind regards.

Document the relationship between ont-guppy-duplex-pipeline and duplex_tools

The current advice from ONT regarding how to perform duplex basecalling is here:

https://community.nanoporetech.com/posts/guppy-v6-0-0-release (dated 6th December 2021 - login required to view)

It makes no mention of duplex-tools, but says to pip install ont-guppy-duplex-pipeline and then run the script from that package, guppy_duplex, on the original fast5 files.

As far as I can see, this script is a rather clunky wrapper that calls guppy in simplex mode, then performs the equivalent of duplex_tools pairs_from_summary (the code for this is in ont_guppy_duplex_pipeline/channel_neighbours.py and looks like it's related to your duplex_tools/pairs_from_summary.py but the logic is not quite the same) and then runs guppy_basecaller_duplex to get the final result.

My main interest just now is to get a good but quick assessment of the approximate number of duplex reads in each dataset, for QC purposes, and so duplex-tools seems the more useful approach. But to save others from having to peer through source code like I've been doing, could you please add some information to the README.md about the relationship between these two ONT-developed packages?

Cheers!

Positional arguments (especially seqkit_stats_nosecondary) in duplex_tools assess_split_on_adapter

Hi!

I am trying to asses how well duplex_tools split_on_adapter is doing its job and duplex_tools assess_split_on_adapter asks for the following positional arguments:
seqkit_stats_nosecondary
edited_reads
unedited_reads
split_multiple_times

I imagine the last three are the .pkl files that are created in the folder for split files, but I am not sure what "seqkit_stats_nosecondary" is. I have tried to provide the output of

seqkit stats path/to/file --all

but I get this error:

/media/seq-ur/65225E7076CF2AF3/basecalling_bacterias/K_oxytoca/K_oxytoca_29_03_2023/pass/split/seqkit_stats contains 1 reads
Traceback (most recent call last):
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'read'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/seq-ur/venv/bin/duplex_tools", line 33, in <module>
    sys.exit(load_entry_point('duplex-tools==0.3.2', 'console_scripts', 'duplex_tools')())
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 129, in main
    assess(
  File "/home/seq-ur/venv/lib/python3.9/site-packages/duplex_tools/assess_split_on_adapter.py", line 32, in assess
    txt = txt[txt['read'].isin(expected_read_ids)]
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 3760, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/seq-ur/venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
    raise KeyError(key) from err
KeyError: 'read'

Could you help me understand what "seqkit_stats_nosecondary" is?

Thanks!

Extracting duplex reads for multiplexed samples

Since barcoding is still under development in dorado, I was wondering what alternative approach would be a proper way to extract duplex reads for multiplexed samples. I am thinking along the lines of the following (see the sketch after this list):

  1. Demultiplex using Guppy, to get the read ids associated with each barcode.
  2. Extract sequences from duplex_orig.sam [step 3 of duplex-tools (Stereo basecall all the reads)] for reads that match the read ids associated with each barcode (from step 1 above).

Can you please suggest?
Thanks
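
A sketch of step 2 with pysam, assuming dorado names each duplex read "<template_id>;<complement_id>" (the semicolon-joined parent read ids) and that barcode_to_reads (hypothetical, built in step 1) maps each barcode to its set of simplex read ids:

import pysam

def barcode_of(duplex_name, barcode_to_reads):
    """Return the barcode whose read-id set contains either parent id."""
    if ";" not in duplex_name:  # simplex record, not a duplex read
        return None
    template_id, complement_id = duplex_name.split(";")
    for barcode, read_ids in barcode_to_reads.items():
        if template_id in read_ids or complement_id in read_ids:
            return barcode
    return None

barcode_to_reads = {"barcode01": set(), "barcode02": set()}  # fill from step 1
with pysam.AlignmentFile("duplex_orig.sam", check_sq=False) as sam:
    for read in sam:
        print(read.query_name, barcode_of(read.query_name, barcode_to_reads))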

Unexpected input file name changes output file file name format on split_on_adapter

When input FASTX file names include a dot (.) that is not part of the file extension (example: testfile.1.fastq.gz), split_on_adapter will read .1.fastq.gz as the whole suffix instead of .fastq.gz. Thus, the output file will be called testfile.fastq.gz instead of testfile.1_split.fastq.gz. This can break processes downstream in pipelines, because the output file name is not as expected when new naming schemes are introduced.

This is due to lines 123-126 in split_on_adapter.py.

For example:

>>> from natsort import natsorted
>>> from pathlib import Path
>>> fastxs = natsorted(list(Path('.').rglob('*.fastq*')), key=str)
>>> fastx = fastxs[2]
>>> fastx.name
'testfile.1.fastq.gz'
>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split').with_suffix('.fastq.gz')
PosixPath('testfile.fastq.gz')

Can be solved with this example:

>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split.fastq.gz')
PosixPath('testfile.1_split.fastq.gz')

Essentially, the current code overwrites its own addition of '_split' when an unexpected "suffix" occurs.
Accounting for these unexpected suffixes with the --pattern flag can be quite difficult (what would work for this case, assuming there will be more files in the folder named .2.fastq.gz, ..., .600.fastq.gz?), so this seems a pertinent change.

File names that include non-suffix dots can happen due to a variety of reasons. For example, when FASTQ files are split into multiple files with N number of reads in each, for better memory management.

empty output from split_pairs

I get an empty output folder:

LS4:/data2/RUN/RUN091/20230203_1010_MN41147_ANY366_3647ace0/fastq_all$ duplex_tools split_pairs fast5_all.sup.v4.0.0.wmoves.sam . pod5s_splitduplex/
[11:05:09 - SplitPairs] Find split locations.                                                                                                                                                            
[11:05:09 - SplitPairs] Self-align reads from: fast5_all.sup.v4.0.0.wmoves.sam                                                                                                                           
[11:05:14 - SplitPairs] Split/Processed reads:595/5000 (11.90%)                         
[11:05:29 - SplitPairs] Split/Processed reads:1175/10000 (11.75%)      
[11:05:34 - SplitPairs] Split/Processed reads:1697/15000 (11.31%)
[11:05:39 - SplitPairs] Split/Processed reads:2202/20000 (11.01%)                             
[11:05:42 - SplitPairs] Split/Processed reads:2621/24324 (10.78%)                                   
[11:05:42 - SplitPairs] Finished finding breakpoints.                                               
[11:05:42 - SplitPairs] Splitting 2621 reads from: . into pod5s_splitduplex/

LS4:/data2/RUN/RUN091/20230203_1010_MN41147_ANY366_3647ace0/fastq_all$ tree -Difts pod5s_splitduplex/
[       4096 Feb  6 11:05]  pod5s_splitduplex

0 directories, 0 files


Wouldn't I expect 2621 pairs in the output folder?

PromethION good pairs: 0

Hi
I'm using duplex_tools filter_pairs (duplex tools version 0.3.2) on PromethION-created fastq files, and out of 2759916 duplex pairs none are reported as good. I did find the issue where installing into a new virtual environment fixed this, however that didn't work for me. I also gunzipped all fastq files and still no good pairs are reported.
The PromethION run was created with Guppy 6.4.6 on an R10 flow cell. Any ideas what else I could try?
Bettina

filter_pairs_minLen1000_gunzip.log

issue with split_on_adapter output

I successfully ran split_on_adapter with the --allow_multiple_splits option and it split a bunch of reads for me. When converting the new mapped SAM file to BAM format, it wrote about 570 MB successfully and then failed with this error:

[E::aux_parse] B aux field type not followed by ','
[W::sam_read1_sam] Parse error at line 413764
samtools view: error reading file "sample_name.mapped.sam"

My workflow:

  1. basecall with Dorado (400bps DNA HAC w/5mC and 5hmC)
  2. convert unmapped SAM to FASTQ with samtools fastq -T '*'
  3. split_on_adapter with --allow-multiple-splits option
  4. map with minimap2
  5. convert output to BAM

This is the offending line:

789cdf07-b995-461e-84fa-bf4ef4c62f1e_1 16 chr1 151237868 60 157M1D3M1D73M35S * 0 0 ATTCCCAAAATTGCTGAGTAGTGGCAATTTTAGATTCTCTTTGGTGGAATCAGAGTGGAAGAGGTAGGCAAGAAGATTTGGAGAAAACTAGATTATAATACATACTGTAGAGAGTTCCTGGGGTTAGAGGAAGGATCTCATTTTCTCCTGTTTTTTTATGATTTTTTTCTCTTTTTGTTTTCTTGATCACTTATTATCTGACCTTCTGGTTTATGGAGGATGAGGCAGTTATGAGCAATATGATGGAACCAGGTACTAACATAAACAG @@>;>9:::A?=-,**///44564455E7200,,.02056;<>A==><=>==;;88@;?=<;;<;;<=BB@<<<=CB@=844..--,--55;<=D<<;<>=?>>>>@acs>>?>;9:=GHEBB42<==@?<;;<?CBCEHKJ@=:::;;<=>?<70.--.<;<DGJ8HB=ACB>;97732(&&&(<;<:9;<>9568899<@=D>===?CBA76<;;;<@>><<99:90//)()+.-'&&%%&&+'''&%%'%##$$'&%%&$##" NM:i:2 ms:i:454 AS:i:454 nn:i:0 tp:A:P cm:i:40 s1:i:216 s2:i:0 de:f:0.0085 rl:i:15 NM:i:4 ms:i:1108 AS:i:1108 nn:i:0 tp:A:P cm:i:84 s1:i:487 s2:i:0 de:f:0.007 rl:i:114 qs:i:15 du:f:1.43725 ns:i:5749 ts:i:10 mx:i:2 ch:i:349 st:Z:2023-04-14T03:12:29.59+00:00 rn:i:25470 f5:Z:FAW60540_pass_9382fa45_a155afd1_502.pod5 sm:f:91.767 sd:f:26.724 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C 0->268

Any ideas?

Noah

split_pod5 supported seed types error

When running:

$ duplex_tools --version
Duplex Sequencing Tools 0.3.2
$ python --version
Python 3.11.0
$ duplex_tools split_pairs --threads 2 dorado.moves.bam OESO_152_LSK114/20230322_1606_3G_PAO27011_7b4991d0/ OESO_152_LSK114_pod5s_splitduplex/

I get the following exception:

[07:46:49 - SplitPairs] Finished finding breakpoints.
[07:46:49 - SplitPairs] Splitting 47685 reads from: OESO_152_LSK114/20230322_1606_3G_PAO27011_7b4991d0/ into OESO_152_LSK114_pod5s_splitduplex/
[07:46:49 - SplitPairs] Splitting OESO_152_LSK114/20230322_1606_3G_PAO27011_7b4991d0/pass.pod5
0it [00:11, ?it/s]
Traceback (most recent call last):
  File "/lustre/scratch126/casm/team154pc/mp15/duplex-tools.venv/bin/duplex_tools", line 33, in <module>
    sys.exit(load_entry_point('duplex-tools==0.3.2', 'console_scripts', 'duplex_tools')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch126/casm/team154pc/mp15/duplex-tools.venv/lib/python3.11/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/lustre/scratch126/casm/team154pc/mp15/duplex-tools.venv/lib/python3.11/site-packages/duplex_tools/split_pairs.py", line 134, in main
    split_pairs(
  File "/lustre/scratch126/casm/team154pc/mp15/duplex-tools.venv/lib/python3.11/site-packages/duplex_tools/split_pairs.py", line 53, in split_pairs
    _ = split_pod5(
        ^^^^^^^^^^^
  File "/lustre/scratch126/casm/team154pc/mp15/duplex-tools.venv/lib/python3.11/site-packages/duplex_tools/split_pairs_steps.py", line 138, in split_pod5
    rd.seed(read.read_id)
  File "/software/python-3.11.0/lib/python3.11/random.py", line 160, in seed
    raise TypeError('The only supported seed types are: None,\n'
TypeError: The only supported seed types are: None,
int, float, str, bytes, and bytearray.

Combining reads post duplex basecalling for assembly

This may be a naïve question as I'm new to ONT sequencing and analyses. What is the best approach to combining reads from simplex and duplex basecalling for genome assembly? I have a low number of duplex reads, plus reads from simplex basecalling. I also have the reads split using split_on_adapter (which I still need to assess).

Thanks,
Surabhi

split_pairs not working

Really excited to try out the new split_pairs on some dorado-basecalled data, but I can't seem to get it to work.

I managed to use pair without problems, but split_pairs gives the following error (running on 4000 reads for speed; paths shortened for clarity):

(duplex) (base) [francesco testing]$ duplex_tools split_pairs dorado_singlex.bam pod5_test/ split_duplex/
[12:30:15 - SplitPairs] Find split locations.
[12:30:15 - SplitPairs] Self-align reads from: dorado_singlex.bam
Traceback (most recent call last):
  File "/ont-dependencies/duplex_tools/venv/bin/duplex_tools", line 33, in <module>
    sys.exit(load_entry_point('duplex-tools==0.2.20', 'console_scripts', 'duplex_tools')())
  File "/ont-dependencies/duplex_tools/venv/lib/python3.8/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/ont-dependencies/duplex_tools/venv/lib/python3.8/site-packages/duplex_tools/split_pairs.py", line 134, in main
    split_pairs(
  File "/ont-dependencies/duplex_tools/venv/lib/python3.8/site-packages/duplex_tools/split_pairs.py", line 40, in split_pairs
    split_locations = get_split_points(
  File "/ont-dependencies/duplex_tools/venv/lib/python3.8/site-packages/duplex_tools/split_pairs_steps.py", line 63, in get_split_points
    chunk3 = [next(it) for _ in range(4000)]
  File "/ont-dependencies/duplex_tools/venv/lib/python3.8/site-packages/duplex_tools/split_pairs_steps.py", line 63, in <listcomp>
    chunk3 = [next(it) for _ in range(4000)]
  File "pysam/libcalignmentfile.pyx", line 2209, in pysam.libcalignmentfile.IteratorRowAll.__next__
StopIteration

Any suggestions for ways to fix this?

Thanks so much!

-Francesco

Cannot install duplex_tools

Hi,

I keep getting the error "Could not find a version that satisfies the requirement pod5 (from duplex_tools) (from versions: )
No matching distribution found for pod5 (from duplex_tools)" when trying to install using pip.

Any ideas on how to solve or circumvent this?

Thank you,
Bert

slow pairing process

I am trying to re-basecall a large set of cDNA reads (~200M) with duplex, but it is just too slow. I am guessing I get too many candidates, since many reads will be of similar length (1M candidates from 10M reads, LSK109). This is guppy_duplex with Guppy 6.1.1.

A few questions:

  1. Are you requiring reads to be from the same pore, or is pairing just based on time?
  2. read_all_fastq seems like the slowest part, perhaps due to disk I/O. I am not sure exactly what this function does, but wonder if it could be sped up, e.g. if reads were split by pore or sorted by genomic location?
  3. Some of the runs are barcoded, so pairing needs to be done per sample. Can guppy_duplex be set to split reads by barcode before pairing, or would it be better to let guppy_barcoder split the fastq and fast5 before doing duplex calling?

Combining split_on_adapter with duplex calling

Hi,

From your readme.md it's unclear to me if/how one can combine the split_on_adapter tool with Guppy duplex calling. The output files of the first do not match the input files of the second, and I see no parameter in the Guppy duplex basecaller to indicate that reads have potentially been split.

Thanks,

Koen

Error in running split_on_adapter

Hello,

I tried running the command duplex_tools split_on_adapter fastq_pass split_res, but got the following error:

Traceback (most recent call last):
  File "/conda/duplex-tools_0.2.3/bin/duplex_tools", line 8, in <module>
    sys.exit(main())
  File "/conda/duplex-tools_0.2.3/lib/python3.6/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "/conda/duplex-tools_0.2.3/lib/python3.6/site-packages/duplex_tools/split_on_adapter.py", line 320, in main
    args.debug,
AttributeError: 'Namespace' object has no attribute 'debug'

Any ideas on what could be causing the issue?

split_on_adapter: Missing internal adapter/primer from the middle of the main read

Hi,

I am trying to use the "split_on_adapter" functionality to split chimeric ONT reads into sub-reads.

Since we are using different adapters and primers than the ones specified in the script, I edited them in the script.

The script seems to be running fine, splitting the reads into sub-reads. But the adapter matches in the middle of the main read are skipped in the output split.fastq file.

To explain better, the output sub-read headers are as follows:
The first sub-read:
x_1 None 0->563
The second sub-read:
x_2 None 672->1511

The bases from 564 to 671 are absent from the output file.

Maybe I'm doing something wrong? The manual specifies that the output would just be the split reads, without anything removed.

Low number of good pairs after filtering

Hi,

I am trying to get duplex sequencing up and running, but I get a very low number of 'good pairs' after filtering and, consequently, a very low number of called duplex reads. For example:

Total Reads 28801102
Read pairs (n) 10901142
Paired (%) 75
Good pairs 1151139
Good pairs (%) 4
BAM Duplex reads 1002726
Percentage of original reads (%) 3.48
Mapped 94%

So in this example, I am left with only 4% of the original reads.

I am using the basic usage as recommended:

duplex_tools pairs_from_summary $output_dir/sequencing_summary.txt $output_dir

duplex_tools filter_pairs $output_dir/pair_ids.txt $output_dir

nanopore_guppy guppy_basecaller_duplex \
        --input_path $input_dir \
        -r --save_path $duplex_dir \
        --device auto \
        --config $model \
        --duplex_pairing_mode from_pair_list \
        --duplex_pairing_file $output_dir/pair_ids_filtered.txt \
        --align_ref $ref \
        --bam_out

Questions:

Why do I get so few good pairs and, subsequently, so few good reads? 4% is a bit useless.
Should I skip the filtering step and run the second guppy run with pair_ids.txt instead?

Lastly, the duplex basecalling could benefit from simplification. Dorado usage looks good but I am getting errors, so it's not working at the moment. It would be great if guppy could be simplified!

Several uncertainties

Hi there,

I'm trying to run duplex basecalls on a PromethION cDNA dataset made with kit12/R10.4 chemistry. A few items of confusion:

a.) In the data/ directory of guppy 6.1.7 I don't see any configuration file called q20-fixed-2.0-ft-10M.cfg. Where are users meant to find this? Would the standard config file for 1D basecalling, given knowledge of the sequencing kit/flow cell, work in lieu?

b.) I'm trying to run guppy_basecaller_duplex from the 1d-summary as below:

bsub -q gpu-normal -o guppy.out -e guppy.err -gpu "num=1:j_exclusive=yes" -M 20000 -R "rusage[mem=20000] select[mem>20000]" "./ont-guppy/bin/guppy_basecaller_duplex --input_path 20220613_1523_2E_PAM63315_75942a9f/ -r --save_path 20220613_1523_2E_PAM63315_75942a9f/ -x "cuda:0" --config dna_r10.4_e8.1_sup.cfg --chunks_per_runner 450 --duplex_pairing_mode from_1d_summary --duplex_pairing_file 20220613_1523_2E_PAM63315_75942a9f/sequencing_summary_PAM63315_54e6d536.txt.gz"

But whenever I do I get the error:

"[guppy/error] main: Error: Incomplete row in line 3: expecting 2 columns but got 1
[guppy/warning] main: An error has occurred. Aborting."

I have no idea how to interpret this - would appreciate any input. Am I somehow formulating this command wrong?

c.) Taking a different tack and using duplex_tools first to generate the filtered pair-ids file, I find that the filter_pairs subcommand is a massive memory hog - jobs are dying with 100 GB RAM allocated (and 30 cores). Is this normal, and is there anything that can be done to control it?

d.) Finally, one of the other issues threads mentions differences in duplex calling parameters between gDNA and cDNA - is there an up-to-date set of recommendations ONT can provide for cDNA duplexes?

Thanks very much in advance,

Chris L

pairs.txt file empty, but pairs_from_bam/pair_ids.txt not empty

See below:

LS4:/data2/RUN/RUN091/20230203_1010_MN41147_ANY366_3647ace0/fastq_all$ duplex_tools pair *.sam > pairs.txt
[10:53:04 - FindPairs] Duplex tools version: 0.3.0
[10:53:04 - FindPairs] Creating seqsummary from bam
24324it [00:01, 13652.28it/s]
[10:53:05 - FindPairs] Calculating metrics.
[10:53:05 - FindPairs] No alignment information found for validation.
[10:53:05 - FindPairs] Classifying pairs.
[10:53:05 - FindPairs] 24242 pairs after filtering on duration between reads. (max 200000 s)
[10:53:05 - FindPairs] 4852 pairs after filtering on relative sequence length difference. (max 10.0% difference)
[10:53:05 - FindPairs] 4851 pairs after absolute sequence length filtering. (max 5000 bp)
[10:53:05 - FindPairs] 4495 pairs after qscore filtering. (min qscore = 6)
[10:53:05 - FindPairs] Found 4495 pairs within 24324 reads. (37.0% of reads are part of a pair).
[10:53:05 - FindPairs] Values above 100% are allowed since reads can be either template or complement
[10:53:05 - FindPairs] Writing files into pairs_from_bam directory
[10:53:05 - FiltPairs] Duplex tools version: 0.3.0
[10:53:05 - FiltPairs] Filtering Parameters:
        nbases_per_strand_to_align=250
        align_threshold=0.6
[10:53:05 - FiltPairs] Alignment Parameters:
        score_match=2
        score_mismatch=-1
        penalty_open=4
        penalty_extend:1
[10:53:07 - ReadFastq] Processed 0/1 input fastq/bam files.
[10:53:07 - ReadFastq] Found 100.0% of required reads.
[10:53:07 - FiltPairs] Starting alignments.
[10:53:07 - AlignPairs] Aligning 4495 pairs
[10:53:07 - AlignPairs] Good pairs: 1799
[10:53:09 - Pair] Initial reads: 24324
[10:53:09 - Pair] Created pairs: 1799
[10:53:09 - Pair] Paired reads:  3598
[10:53:09 - Pair] Approximate duplex rate for fast5_all.sup.v4.0.0.wmoves.sam: 14.79%

The results are:

[          0 Feb  6 10:53]  ./pairs.txt
[       4096 Feb  6 10:53]  ./pairs_from_bam
[     332630 Feb  6 10:53]  ./pairs_from_bam/pair_ids.txt
[    1858079 Feb  6 10:53]  ./pairs_from_bam/pair_stats.txt
[     133126 Feb  6 10:53]  ./pairs_from_bam/pair_ids_filtered.txt
[     358563 Feb  6 10:53]  ./pairs_from_bam/pair_ids_scored.csv
[    2661432 Feb  6 10:53]  ./pairs_from_bam/read_segments.pkl

Could it be that, the way the command was run, it writes the pairs into the pairs_from_bam/pair_ids.txt file rather than to STDOUT?

Thanks

problem with Fastx

I did the following:

Create a conda environment:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
/bin/bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda create --name nanopore
conda activate nanopore

Install the duplex-tools package (https://github.com/nanoporetech/duplex-tools):

conda install -c bioconda duplex-tools

If I now test the installation, it gives the following error:

(nanopore) sheucke@PTH2D537:~/nanopore/nanopore$ duplex_tools --help
Traceback (most recent call last):
  File "/home/sheucke/miniconda3/envs/nanopore/bin/duplex_tools", line 6, in <module>
    from duplex_tools import main
  File "/home/sheucke/miniconda3/envs/nanopore/lib/python3.6/site-packages/duplex_tools/__init__.py", line 5, in <module>
    from duplex_tools import \
  File "/home/sheucke/miniconda3/envs/nanopore/lib/python3.6/site-packages/duplex_tools/split_on_adapter.py", line 14, in <module>
    from pyfastx import Fastx
ImportError: cannot import name 'Fastx'

Question - Split on adapter

Hello,
I am duplex basecalling with dorado. Can split_on_adapter accept unmapped bam files for input/output?
Thanks

filter_pairs fails if both fastq and aligned bam are present

Hi again,
I ran the pairing on a guppy output folder with modbases in aligned bam files and got almost no valid pairs. Moving the fastq files to a separate folder solved this, with 63% good pairs vs 1% when the bam is present.

[09:09:08 - AlignPairs] Processed/Skip/Good: 94%/0%/1%
[09:09:09 - AlignPairs] Good pairs: 3505

[09:17:28 - AlignPairs] Processed/Skip/Good: 95%/0%/63%
[09:17:29 - AlignPairs] Good pairs: 199603

question on split pairs

When running split_pairs on pod5s and the basecalled sam, duplex tools generates a new folder with split pod5 files and associated read ids. Does this folder contain all the reads that were split, including non-duplex reads, or just the ones that were identified as duplex?

Specify model?

Hi Team,

Any chance of specifying the calling model when running duplex_tools? I manually edited the scripts to use the R10.4.1 models but it crashes halfway through...

Andrew

KeyError: 'sequence_length_template' when basecalling is turned off

Hi there,

I'm trying to run duplex basecalling on data from a kit14/R10.4.1 Mk1B run. During run setup, we set basecalling to OFF in MinKNOW, as we basecall with Guppy on a separate server after the run is complete.

When I run duplex_tools pairs_from_summary using the output sequencing_summary.txt file, I get the following KeyError:

[12:15:18 - FindPairs] Duplex tools version: 0.3.1
[12:15:18 - FindPairs] Loading sequencing summary.
[12:15:30 - FindPairs] Calculating metrics.
Traceback (most recent call last):
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'sequence_length_template'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "conda_envs/duplex_tools_v0.3.1/bin/duplex_tools", line 8, in <module>
    sys.exit(main())
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/duplex_tools/__init__.py", line 39, in main
    args.func(args)
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/duplex_tools/pairs_from_summary.py", line 367, in main
    find_pairs(
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/duplex_tools/pairs_from_summary.py", line 86, in find_pairs
    seqsummary = calculate_metrics_for_next_strand(seqsummary)
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/duplex_tools/pairs_from_summary.py", line 267, in calculate_metrics_for_next_strand
    seqsummary["sequence_length_template"].shift(-1)
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "conda_envs/duplex_tools_v0.3.1/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'sequence_length_template'

I realized this is because our sequencing_summary.txt file only has these headers, and is missing the sequence_length_template header (among others):

filename_fastq filename_fast5 filename_pod5 parent_read_id read_id run_id channel mux minknow_events start_time duration pore_type experiment_id sample_id end_reason

This only seems to happen for runs where we have explicitly turned off basecalling; for runs where basecalling in MinKNOW is enabled, all required headers in the sequencing_summary.txt file are present.

Is there a way to recover the sequence_length_template information somehow to get duplex_tools to run?

specify custom adapters to split_on_adapter

We see an extensive number of chimeras with no intermediate sequencing adapters using ligation sequencing kit 110. We use SQK-LSK110 on multiplexed samples that have 24-nt barcodes at each end. After barcoding (in an index PCR) the samples are pooled and used as input to the SQK-LSK110 kit. We have up to 24 different barcode sets.

Chimeric sequences have structures like:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'

Where ADAPTER-Y-TOP is the ONT sequencing adapter, and barcodes X and Y are the barcodes introduced by an index PCR targeting the PRIMER sites. The primer sites are ligated to the sequence of interest (SEQ). We have observed up to 14 barcode pairs in a single read.

We would like to split at sites where the barcodes are adjacent to each other, so that the read above becomes:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-3' and 5'-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'
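
A rough sketch of that splitting logic, with placeholder barcode sequences (hypothetical; real matching should be alignment-based, e.g. via edlib, to tolerate sequencing errors):

import re

BARCODES = ["A" * 24, "C" * 24]  # placeholders for the real 24-nt barcodes
JUNCTION = re.compile("|".join(bx + by for bx in BARCODES for by in BARCODES))

def split_at_barcode_junctions(seq):
    """Cut between back-to-back barcodes so each piece keeps its own barcode."""
    pieces, start = [], 0
    for match in JUNCTION.finditer(seq):
        cut = match.start() + 24  # boundary between BARCODEX and BARCODEY
        pieces.append(seq[start:cut])
        start = cut
    pieces.append(seq[start:])
    return pieces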

False positives in filter_pairs when run on linearized plasmid

Hi there, I'm attempting to do duplex basecalling on a linearized plasmid sample. When I ran filter_pairs on this dataset, it told me 90% of my reads were duplex pairs, which seems unlikely given I'm using Kit 12.

Based on the filtering steps filter_pairs reported, it looks like this tool identifies duplex pairs based on whether two consecutive reads are of roughly the same length, and whether the second strand followed the first through the same pore less than 20 seconds after the first strand finished.

I suspect this method of filtering has a high false-positive rate when the sample is something like a linearized plasmid or PCR amplicon, where almost all of the DNA molecules are the same length. This might even be an issue for variant libraries of plasmids or amplicons if the 'same length' requirement isn't stringent (I think it only checks that the read length difference is less than 10% of the total read length).

Is there any other method of filtering you might recommend or add to the tool, that could address this challenge? For example, does the second strand in a duplex typically have a different (smaller?) mean Q score than the first strand? Is there a quick-and-dirty alignment tool you'd recommend running on the simplex fastq reads of potential duplex pairs, to confirm that one strand is the reverse complement of the other?

Thanks!
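
One quick-and-dirty check along the lines asked about, using minimap2 via mappy (a sketch, not part of duplex-tools): map the reverse complement of the putative complement strand against the template and require that most of the shorter read aligns.

import mappy

def looks_like_duplex(template, complement, min_frac=0.8):
    aligner = mappy.Aligner(seq=template, preset="map-ont")
    for hit in aligner.map(mappy.revcomp(complement)):
        # require a forward-strand primary hit covering most of both reads
        if (hit.is_primary and hit.strand == 1
                and hit.mlen >= min_frac * min(len(template), len(complement))):
            return True
    return False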

split_on_adapter no more than one core?

Hi Duplex tools team,

I've been running duplex_tools split_on_adapter with --threads but cannot get more than one core and 2 threads utilised. I assume multithreading is supported for this script given the option, but please correct me if not.

e.g. duplex_tools split_on_adapter <fast_dir> <output_dir> Native --threads 20

I did try editing split_on_adapter.py to use ThreadPoolExecutor instead of ProcessPoolExecutor, which helped in (#9), but saw no performance gains nor more threads utilised.

Any advice greatly appreciated!
Kev

Sorted - sorry issue on my end!

guppy_basecaller_duplex generated six fastq files from 252 fast5 files....

Hi,
I used guppy_basecaller_duplex on data produced with the kit SQK-NBD112-24 on FLO-MIN112 flowcell.

The following error appears, and out of 252 fast5 files only six fastq files were produced, which I feel is a big loss of data if this is expected. The error was repeated many, many times with different read ids:
[guppy/error] basecalling::PriorityReadQueue::pass_read_pair: Cannot process duplex read pair 9b230638-706f-43b7-a1b4-f2ff63fe410d, 4a1468c9-0093-40b5-8785-328b331cff11 as sample count 324630 is over maximum supported by caller of 160000, putting on output queue

Despite the above error, guppy_basecaller_duplex reported that basecalling was successful:
********* Caller time: 1135416 ms, Samples called: 185914147, samples/s: 163741 Finishing up any open output files. Basecalling completed successfully.

This was my call
guppy_basecaller_duplex -i fast5_pass/ -r -s duplex_calls -x auto --kit SQK-NBD112-24 --flowcell FLO-MIN112 --chunks_per_runner 16 --duplex_pairing_mode from_pair_list --duplex_pairing_file duplex_output/pair_ids_filtered.txt --trim_barcodes --compress_fastq

NB. The config file q20-fixed-2.0-ft-10M.cfg was not available in guppy version 6.1.2+e0556ff, hence I used the --kit and --flowcell flags. I also tried the config file dna_r10.4_e8.1_sup.cfg. No improvement.

Can this split a read into more than 2 subreads?

We have reads generated from proximal ligation with internal adapters. I was wondering if this tool could search for an adapter and then split the read based on multiple internal adapters being found.

Which release of guppy has guppy_basecaller_duplex?

Hi

I cannot see guppy_basecaller_duplex mentioned in the guppy 5.0.16 release notes. Which version introduced this? I cannot find it in 5.0.15 (guppy_basecall_client, guppy_basecaller, guppy_basecaller_supervisor, guppy_basecall_server).

Best regards
Rasmus

Unexpected base in duplex call

Hello,
I extracted fastq from the duplex_orig.sam and compared output to the original raw reads.
In the following alignment file, The top sequence is the output for the duplex read. Below it are the two corresponding reads. Last read is reverse complemented.
The highlighted region shows 'A' in the duplex output while it shows 'C' in read1 and 'T' in reverse complement of read2.
My understanding is, for this locus, the duplex should show either C or T but not A.
Can you please share some insights?
Thanks
-Dev

Split_on_adapter and sample type

I am trying to find, in the documentation for duplex_tools, the reasoning behind the Sample type flag in the split_on_adapter command. I see that it is Native or PCR, but how do I decide which flag to use, and what is the reason for having this flag?

base quality scores

For duplex reads I see base quality scores above what I would naively expect. For example, many bases have '{', which is ASCII 123, or a qscore of 90. My naive expectation would be few qscores above 60. Average qscores across the read are about as expected (Q30+), and I understand that average qscores are overwhelmingly affected by the lowest qscores in the read.

Long story short, I am wondering whether the per-base qscores have been benchmarked by mapping to reference sequences?

Thanks!
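
For reference, FASTQ quality characters are Phred+33, so the observation above checks out; the implied error probability at Q90 is:

q = ord("{") - 33
print(q, 10 ** (-q / 10))  # 90 1e-09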

guppy_duplex ValueError: not enough values to unpack (expected 4, got 3)

When I try to run guppy_duplex on my fast5 files, it successfully completes simplex basecalling but then gives the error ValueError: not enough values to unpack (expected 4, got 3) when it attempts to generate the duplex pairs files. Here is the full readout before the error:

INFO:guppy_duplex:Using guppy_basecaller at guppy_basecaller
INFO:guppy_duplex:Using guppy_basecaller_duplex at guppy_basecaller_duplex
INFO:guppy_duplex:728396 reads loaded from simplex summary file
INFO:guppy_duplex:Candidate pair generation took 3.07 seconds, 32558 pairs found
Traceback (most recent call last):
  File "/home/minion/.conda/envs/nanop.v1/bin/guppy_duplex", line 8, in <module>
    sys.exit(main())
  File "/home/minion/.conda/envs/nanop.v1/lib/python3.9/site-packages/ont_guppy_duplex_pipeline/guppy_duplex.py", line 279, in main
    duplex_pipeline(args.basecaller_exe, args.duplex_basecaller_exe, args.input_path, args.save_path,
  File "/home/minion/.conda/envs/nanop.v1/lib/python3.9/site-packages/ont_guppy_duplex_pipeline/guppy_duplex.py", line 203, in duplex_pipeline
    simplex_summary = _build_pairs_file(save_path, split_reads)
  File "/home/minion/.conda/envs/nanop.v1/lib/python3.9/site-packages/ont_guppy_duplex_pipeline/guppy_duplex.py", line 175, in _build_pairs_file
    filter_candidates_to_file(neighbours, sequence_file_path, logger, split_reads=split_reads)
  File "/home/minion/.conda/envs/nanop.v1/lib/python3.9/site-packages/ont_guppy_duplex_pipeline/candidate_filtering.py", line 80, in filter_candidates_to_file
    for name, sequence, _, _ in fastx:
ValueError: not enough values to unpack (expected 4, got 3)

Thanks
Domenique

Not all sub reads from multiple splits are written to FASTQ output file

TL;DR: is there some other filtering applied when writing sub-reads to FASTQ? Some reads in middle_fasta do not end up in the final FASTQ.

I have been using split_on_adapter as a standard first step to split any reads that result from a failure to split during sequencing, or from ligation.

I allow for multiple splits and it seems to work as expected.

duplex_tools split_on_adapter split_on_adapter_test/LambdaDNA_3/barcode11/input split_on_adapter_test/LambdaDNA_3/barcode11/NanoporeAdapter Native --debug_output --threads 12 --debug --print_alignment --allow_multiple_splits --trim_start 20 --trim_end 20 > split_on_adapter_test/LambdaDNA_3/barcode11/alignment.txt

so for instance

2a0ed51b-3d0c-4877-b065-81f8d9be8739 becomes:

@2a0ed51b-3d0c-4877-b065-81f8d9be8739_1 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=45762 ch=177 start_time=2021-11-13T20:10:00Z barcode=barcode11 0->3552
@2a0ed51b-3d0c-4877-b065-81f8d9be8739_2 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=45762 ch=177 start_time=2021-11-13T20:10:00Z barcode=barcode11 3590->3838

I have then adapted the adapters.py file so that it can search for my own internal adapters and then split the reads accordingly. These internal adapters are also created by ligation between two or more DNA fragments. Its like proximity ligation of fix chromatin with bridge adapters inserted in between ligation sites.

duplex_tools split_on_adapter split_on_adapter_test/LambdaDNA_3/barcode11/NanoporeAdapter split_on_adapter_test/LambdaDNA_3/barcode11/APACseqAdapter Native --debug_output --threads 12 --debug --print_alignment --allow_multiple_splits --trim_start 20 --trim_end 20 > split_on_adapter_test/LambdaDNA_3/barcode11/alignment2.txt

Checking through some reads, I can see that where it makes sense it does a good job of splitting reads into the expected number of sub-reads. It may need some tuning, but one issue is that, although several sequences are created in the middle.fasta file, for some reason some of them are not then written to the _split.fastq.gz file.

Example might be here:

You can see a read that was first split by the default Nanopore sequencing adapters; one of the pieces was then further split when my custom adapter was found, hence the _x_x read ID nomenclature.

012a9a4d-353c-4d94-a60c-ba20c4af3bdf_1
012a9a4d-353c-4d94-a60c-ba20c4af3bdf_2_1
012a9a4d-353c-4d94-a60c-ba20c4af3bdf_2_2

Or a read that did not contain an internal Nanopore sequencing adapter but did contain internal custom adapters, and so was split into three reads.

0a9ac4fe-70c2-4c03-80a4-a0190dc4c153_1
0a9ac4fe-70c2-4c03-80a4-a0190dc4c153_2
0a9ac4fe-70c2-4c03-80a4-a0190dc4c153_3

And then there are cases where not all subreads are written out when multiple custom adapters are found internally.

131b78a4-a711-43b2-b60a-1de3985c53bd_1
131b78a4-a711-43b2-b60a-1de3985c53bd_2
131b78a4-a711-43b2-b60a-1de3985c53bd_3
131b78a4-a711-43b2-b60a-1de3985c53bd_5

I noticed the missing subread because the successive numbering breaks between _3 and _5.

Looking at the reads in the middle FASTA, I am unsure why the fourth sequence is not written out.

❯ grep "131b78a4-a711-43b2-b60a-1de3985c53bd" APACseqAdapter/fastq_runid_e08a91d5edc5ac90e3d8c8b2faebd0af477edb35_barcode11_pass_split_middle.fasta -A 1
>131b78a4-a711-43b2-b60a-1de3985c53bd
CTGTCTCTTATACACATCTCCTGCAGGAGATGTGTATAAGAGACAGACATGGGGATCATCCCGCCGAAGGAGTCCTTCCTGCCCCCCTGTTGCAGCAGGATCAGCCACGGACTTTGCCCGCCTGCAAGCTGCGTGGCCACGTCAAATTAATTGTGCAGGCAGCATACGCATGGCGGCTTTATACTGCCCGACGGAAATCCCCGCTTTCTGTGCGGCCAGCGCCTGTCGGCTCAGCGACTGTTCAACGACTGCCGCTGTTTTTTCGCATCACTTTCCGTACCAGAAAAATGACGCCTGACTCTGGCCATCTGCTCGTCAAATCTGGCCGCATCCAGACTCAAATCAACGACCAGATCGCCTACCGGTTCAGCCATACCGGACTCCTCCTGCGATCCTTCTGATACTGTCATCAGCATTACGTCATCCTCCGTCATGTCCGCCACATCGGGAAGCGGGATAACTTCATTCCGTCCGGGCAAAGCGGACACCTCCGGCAAGCCCTGCCGCTTTCTGCATCAG
>131b78a4-a711-43b2-b60a-1de3985c53bd
GTGTATAAGCTGGAACAGGCACACTTGAATCATGGCTTTATGACGTAACATCCGTTTGGGATGCGACTGCCACGGCCCGTGATTTCTCTGCCTTCGCGAGTTTTGAATGGTTCGCGGCGGCATTCATCCATCCATTCGGTAACGCGAATCGGATGATTACGGTCCTTGCGGTAAATCCGGCATGTACAGGATTCATTGTCCTGCTCAAAGTCCATGCCATCAAACTGCTGGTTTTCATTGATGATGCGGGACCAGCCATCAACGCCCACCACCGGAACGATGCCATTCTGCTTATCAGGAAAGGCGTAAATTTCTTTCGTCCACGGATTAAGGCCGTACTGGTTGGCAACGATCAGTAATGCGATGAACTGCACGTAGCTGGCATCACCTTTAAATGCCGTCTGGCGAAGTGGTGATCAGTTCCTGTGGGTCGACAGAATCCATGCCGACACGTTCAGCCAGCTTCCCAGCCAGCGTTGCGAGTGCAGTACTCATTCGTTTTATACCCTCTGAATCAATATCAACCTGGTGGTGAGCAATGGTTTCAACCATGTACCGGATGTGTTCTGCCATGCGCTCCTGAAACTCAACATCGTCATCAAACGCACGGGTAATGGATTTTTTGCTGGCCCCGTGGCGTTGCAAATGATCGATGCATAGCGATTCAAACAGGTGCTGGGGCAGGCCTTTTTCCATGTCGTCTGCCAGTTCTGCCT
>131b78a4-a711-43b2-b60a-1de3985c53bd
GCGCCCAGCCTTTCTGAGCCTCAAGACGATCCTGAATGTAATAGCGTTCATGGCTGAACTCCTGAAATAGCTGTGAAAATATCGCCCGCGAAATGCCGGGCTGATTAGGAAAACAGGAAAGGGGGTTAGTGAATGCTTTTGCTTGATCTCAGTTTCAGTATTAATATCCATTTTATAAGCGTCGACGGCTTCACGAAACATCTTTTCATCGCCAATAAAAGTGGCGATAGTGAATTTAGTCTGGATAGCCATAAGTGTTTGATCCATTCTTTGGGACTCCTGGCTGATTAAGTATGTCGATAAGGCGTTTCCATCCGTCGTAATTTACGGGTGATTCGTTCAAGTAAAGATTCGGAAGGGCAGCCAGCAACAGGCCACCCTGCAATGGCATATTGCATGGTGTGCTCCTTATTTATACATAACGAAAAACGCCTCGAGTGAAGCGTTATTGGTATGCGGTAAAACCGCACTCAGGCGGCCTTGATAGTCATATCATCTGAATCAAATATTCCTGATGTATCGATATCGGTAATTCTTATTCCTTCGCTACCATCCATTGGAGGCCATCCTTCCTGACCATTTCCATCATTCCAGTCGAACTCACACACAACACCATATGCATTAGAGTCGCTTGAAATTGCTATAAGCAGAGCATGTTGCGCCAGCATGATTAATACAGCATTTAATACAGAGCCGTGTTTATTGAGTCGGTATTCAGAGTCTGACCAGGTATTGATCTGGTGAAGTTTTTTCCTCTGTCATTACGTCATGGTCGATTTCAATTTCTATTGATGCTTTCCAGTCGTAATCAATGATGTATTTTTTGATGTTTGACATCTGTTCATATCCTCACAGATAAAAAATCGCCCTCACACTGGAGGGCAAAGAAGATTTCCATCAATCCAGAACAAGTCGGCTCCTGTTTAGTAACGAGCGACATTGCTCCGTGTATTCACTCGTTGGAATGAATACAC
>131b78a4-a711-43b2-b60a-1de3985c53bd
TTCCATCAATCCAGAACAAGTCGGCTCCTGTTTAGTAACGAGCGACATTGCTCCGTGTATTCACTCGTTGGAATGAATACACCTGTCTCTTATACACATCTCCTGCAGGAGATCTTATACACATCTCCTGCAGGAGATGTGTATAAGAGACGGATAACCTGCCGGACCACCGACTG
>131b78a4-a711-43b2-b60a-1de3985c53bd
CGAGCGACATTGCTCCGTGTATTCACTCGTTGGAATGAATACACCTGTCTCTTATACACATCTCCTGCAGGAGATGTGTATAAGAGACGGATAACCTGCCGGACCACCGACTG

I can see that there is some overlap between subreads 3 and 4 when written out to FASTQ (see the sketch after the output below).

gunzip -c APACseqAdapter/fastq_runid_e08a91d5edc5ac90e3d8c8b2faebd0af477edb35_barcode11_pass_split_split.fastq.gz| grep "131b78a4-a711-43b2-b60a-1de3985c53bd" -A 1
@131b78a4-a711-43b2-b60a-1de3985c53bd_1 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=57364 ch=26 start_time=2021-11-13T21:51:20Z barcode=barcode11 0->519
CTGTCTCTTATACACATCTCCTGCAGGAGATGTGTATAAGAGACAGACATGGGGATCATCCCGCCGAAGGAGTCCTTCCTGCCCCCCTGTTGCAGCAGGATCAGCCACGGACTTTGCCCGCCTGCAAGCTGCGTGGCCACGTCAAATTAATTGTGCAGGCAGCATACGCATGGCGGCTTTATACTGCCCGACGGAAATCCCCGCTTTCTGTGCGGCCAGCGCCTGTCGGCTCAGCGACTGTTCAACGACTGCCGCTGTTTTTTCGCATCACTTTCCGTACCAGAAAAATGACGCCTGACTCTGGCCATCTGCTCGTCAAATCTGGCCGCATCCAGACTCAAATCAACGACCAGATCGCCTACCGGTTCAGCCATACCGGACTCCTCCTGCGATCCTTCTGATACTGTCATCAGCATTACGTCATCCTCCGTCATGTCCGCCACATCGGGAAGCGGGATAACTTCATTCCGTCCGGGCAAAGCGGACACCTCCGGCAAGCCCTGCCGCTTTCTGCATCAG
--
@131b78a4-a711-43b2-b60a-1de3985c53bd_2 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=57364 ch=26 start_time=2021-11-13T21:51:20Z barcode=barcode11 549->1265
GTGTATAAGCTGGAACAGGCACACTTGAATCATGGCTTTATGACGTAACATCCGTTTGGGATGCGACTGCCACGGCCCGTGATTTCTCTGCCTTCGCGAGTTTTGAATGGTTCGCGGCGGCATTCATCCATCCATTCGGTAACGCGAATCGGATGATTACGGTCCTTGCGGTAAATCCGGCATGTACAGGATTCATTGTCCTGCTCAAAGTCCATGCCATCAAACTGCTGGTTTTCATTGATGATGCGGGACCAGCCATCAACGCCCACCACCGGAACGATGCCATTCTGCTTATCAGGAAAGGCGTAAATTTCTTTCGTCCACGGATTAAGGCCGTACTGGTTGGCAACGATCAGTAATGCGATGAACTGCACGTAGCTGGCATCACCTTTAAATGCCGTCTGGCGAAGTGGTGATCAGTTCCTGTGGGTCGACAGAATCCATGCCGACACGTTCAGCCAGCTTCCCAGCCAGCGTTGCGAGTGCAGTACTCATTCGTTTTATACCCTCTGAATCAATATCAACCTGGTGGTGAGCAATGGTTTCAACCATGTACCGGATGTGTTCTGCCATGCGCTCCTGAAACTCAACATCGTCATCAAACGCACGGGTAATGGATTTTTTGCTGGCCCCGTGGCGTTGCAAATGATCGATGCATAGCGATTCAAACAGGTGCTGGGGCAGGCCTTTTTCCATGTCGTCTGCCAGTTCTGCCT
--
@131b78a4-a711-43b2-b60a-1de3985c53bd_3 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=57364 ch=26 start_time=2021-11-13T21:51:20Z barcode=barcode11 1298->2272
GCGCCCAGCCTTTCTGAGCCTCAAGACGATCCTGAATGTAATAGCGTTCATGGCTGAACTCCTGAAATAGCTGTGAAAATATCGCCCGCGAAATGCCGGGCTGATTAGGAAAACAGGAAAGGGGGTTAGTGAATGCTTTTGCTTGATCTCAGTTTCAGTATTAATATCCATTTTATAAGCGTCGACGGCTTCACGAAACATCTTTTCATCGCCAATAAAAGTGGCGATAGTGAATTTAGTCTGGATAGCCATAAGTGTTTGATCCATTCTTTGGGACTCCTGGCTGATTAAGTATGTCGATAAGGCGTTTCCATCCGTCGTAATTTACGGGTGATTCGTTCAAGTAAAGATTCGGAAGGGCAGCCAGCAACAGGCCACCCTGCAATGGCATATTGCATGGTGTGCTCCTTATTTATACATAACGAAAAACGCCTCGAGTGAAGCGTTATTGGTATGCGGTAAAACCGCACTCAGGCGGCCTTGATAGTCATATCATCTGAATCAAATATTCCTGATGTATCGATATCGGTAATTCTTATTCCTTCGCTACCATCCATTGGAGGCCATCCTTCCTGACCATTTCCATCATTCCAGTCGAACTCACACACAACACCATATGCATTAGAGTCGCTTGAAATTGCTATAAGCAGAGCATGTTGCGCCAGCATGATTAATACAGCATTTAATACAGAGCCGTGTTTATTGAGTCGGTATTCAGAGTCTGACCAGGTATTGATCTGGTGAAGTTTTTTCCTCTGTCATTACGTCATGGTCGATTTCAATTTCTATTGATGCTTTCCAGTCGTAATCAATGATGTATTTTTTGATGTTTGACATCTGTTCATATCCTCACAGATAAAAAATCGCCCTCACACTGGAGGGCAAAGAAGATTTCCATCAATCCAGAACAAGTCGGCTCCTGTTTAGTAACGAGCGACATTGCTCCGTGTATTCACTCGTTGGAATGAATACAC
--
@131b78a4-a711-43b2-b60a-1de3985c53bd_5 runid=e08a91d5edc5ac90e3d8c8b2faebd0af477edb35 sampleid=LamdaDNA_3 read=57364 ch=26 start_time=2021-11-13T21:51:20Z barcode=barcode11 2316->2341
GGATAACCTGCCGGACCACCGACTG
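
One hypothesis consistent with the coordinates above: subreads whose span overlaps an already-accepted subread are dropped before writing. The fourth middle-FASTA sequence is essentially a suffix of the third plus a short extension, so an overlap filter would discard it and leave the _3 to _5 numbering gap. A minimal sketch of that kind of filter follows (an illustration of the hypothesis, not the actual duplex_tools code):

def drop_overlapping(subreads):
    """Keep (start, end) intervals in order, dropping any interval that
    starts before the end of the last kept interval."""
    kept = []
    last_end = -1
    for start, end in subreads:
        if start >= last_end:
            kept.append((start, end))
            last_end = end
        # Overlapping intervals are silently dropped, which would leave
        # numbering gaps like _3 -> _5 if numbering happens first.
    return kept

# Spans for _1, _2, _3 and _5 are taken from the FASTQ headers above;
# the span for the dropped fourth subread is hypothetical, chosen to
# overlap the end of _3 as its sequence suggests.
print(drop_overlapping([(0, 519), (549, 1265), (1298, 2272),
                        (2196, 2372),  # hypothetical _4
                        (2316, 2341)]))
# -> [(0, 519), (549, 1265), (1298, 2272), (2316, 2341)]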

How to use assess_split_on_adapter

I have finished running the split_on_adapter program, but I don't know how to use the assess_split_on_adapter program to assess the splitting, since assess_split_on_adapter prints very little help information when I run duplex_tools assess_split_on_adapter -h:

usage: Duplex Sequencing Tools assess_split_on_adapter [-h] [--debug | --quiet] [--suffix SUFFIX]
                                                       seqkit_stats_nosecondary edited_reads unedited_reads split_multiple_times

positional arguments:
  seqkit_stats_nosecondary
  edited_reads
  unedited_reads
  split_multiple_times

optional arguments:
  -h, --help            show this help message and exit
  --debug               Verbose logging of debug information.
  --quiet               Minimal logging; warnings only).
  --suffix SUFFIX

Could you please help me out with what parameters I should pass to assess_split_on_adapter?
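
For reference: going only by the positional argument names in the help text, the command presumably expects a seqkit stats table plus the read sets produced by splitting, along the lines of the hypothetical invocation below (all four file names are placeholders; the expected file formats are not documented in the help output):

duplex_tools assess_split_on_adapter seqkit_stats_nosecondary.txt edited_reads.txt unedited_reads.txt split_multiple_times.txt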

Couldn't install on Linux or PC

Hello, I am getting an error trying to set up duplex-tools on both Linux and Windows. I tried updating Python and pip, setuptools, and wheel, and making sure the directory was on the PATH, all to no avail. Please help.

Only works with gzipped fastq files

In testing this on Ubuntu, I noticed that read splitting only works with gzipped fastq files.

If the files aren't gzipped, it does not show an error but rather just says:

0it [00:00, ?it/s]
Split 0 reads kept 0 reads

Just raising it here in case it hasn't been noted before.
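
This behaviour is consistent with input selection keying on a *.fastq.gz file pattern (an assumption; the selection logic is not shown here), so plain .fastq files are silently skipped and zero reads are reported. The simplest workaround is to compress the inputs first, e.g. gzip *.fastq. A small helper of the kind that would tolerate both, included only as a sketch:

import gzip

def open_fastq(path):
    """Open a FASTQ file whether or not it is gzip-compressed,
    deciding by the two-byte gzip magic number rather than the
    file extension."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open
    return opener(path, "rt")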
