jfjlaros / demultiplex

Versatile FASTA/FASTQ demultiplexer.

License: MIT License

Python 100.00%

ngs fasta fastq demultiplex

demultiplex's Introduction

Demultiplex: FASTA/FASTQ demultiplexer

Versatile NGS demultiplexer with the following features:

Support for FASTA and FASTQ files.
Support for gzip and bzip2 compressed files.
Support for multiple reads per fragment, e.g., paired-end.
Handles barcodes in the header and in the reads.
Handles barcodes at unknown locations in reads (e.g., PacBio or Nanopore barcodes).
Support for selection of part of a barcode.
Allows for mismatches, insertions and deletions.
Barcode guessing by frequency or fixed amount.
Handles large numbers (over one million) of barcodes.

Please see ReadTheDocs for the latest documentation.

demultiplex's Issues

Degenerated constructs

Greetings!

I was wondering if demultiplex can work with degenerate constructs, e.g. TTMTRGRACAGGCTCCTC.
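
Whether demultiplex itself interprets degenerate (IUPAC) bases is not answered in this thread. One possible workaround, sketched here under that assumption, is to expand the degenerate construct into all concrete sequences before writing a barcodes file. The IUPAC table is standard; the function name and the `construct_N` naming scheme are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(sequence):
    """Return every concrete sequence encoded by a degenerate sequence."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


variants = expand_degenerate("TTMTRGRACAGGCTCCTC")
for index, variant in enumerate(variants, start=1):
    # Hypothetical naming scheme: construct_1, construct_2, ...
    print(f"construct_{index}\t{variant}")
```

The example construct contains one M and two R positions, so it expands to 2 x 2 x 2 = 8 sequences. Constructs with many N positions grow quickly and may be impractical to expand this way.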

All reads go into unknown

Hi jfjlaros, I've got an issue that I can run the command successfully, but all my reads go into unknown.

My barcode is in the sequence, so I ran the command "demultiplex demux -r -s 1 -e 1"

My barcode file looks like this:
inline1 ACTGCGAA
inline7 ACTCCTAC

My fastq file is:
@M01065:233:000000000-CT4R7:1:1102:19932:1263 1:N:0:ATCACG
AACTGCGAAGGACTACTCGGGTTTCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATATCTACGCATTTCACTGCTACACGTGGAATTCTACCCCCCTCTGACACACTCTAGCCGTGCAGTCACAAATGCAATTCCCAGGTTGAGCCCGGGGATTTCACATCTGTTTTACACAACCGCCTGCGCACGCTGTACGCCCAGTAATTCCGGTTAACGCTTGCACCCTCCGTATTACCGCGGCGGCTGACA
+
CCCCCGGGGGGFGGFG9FFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGG7FGGFGGGGGEFEE<AEGGGGGFGGD@=CFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGGGGGCECGGGGGGGGCEFGGGGGGFDGFGGGGGGGGGFF9FF?FEGC9CCCF8C47FGGGFCAFGGGGGGEFFGEGC:(1:A7C@C3FC+40++:.=FG:>FB?:>4(4996::<97893:6():=89<:*29:2);81+27))4(4:(,-.(.))94907()((-4:@0

Thanks.
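
In the FASTQ record quoted above, the inline1 barcode does not begin at the first base of the read. A quick way to check where a barcode actually sits is to search the read for it. This is a minimal sketch with hypothetical names; it only finds exact matches and reports 0-based offsets, so mapping them onto the -s and -e options depends on how those options count positions, which is not established here.

```python
def barcode_offsets(read, barcodes):
    """Map each barcode name to the 0-based offset of its first exact match (-1 if absent)."""
    return {name: read.find(sequence) for name, sequence in barcodes.items()}


barcodes = {"inline1": "ACTGCGAA", "inline7": "ACTCCTAC"}
# First 20 bases of the read quoted above.
read_start = "AACTGCGAAGGACTACTCGG"

for name, offset in barcode_offsets(read_start, barcodes).items():
    if offset >= 0:
        length = len(barcodes[name])
        print(f"{name}: 0-based slice [{offset}:{offset + length}]")
    else:
        print(f"{name}: no exact match")
```

Here inline1 is found one base into the read and spans eight bases, so a selection that covers only a single base cannot contain the whole barcode.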

demultiplexing illumina with forward and reverse tag

Hi,

I would like to demultiplex Illumina sequences of the ITS2 region. The barcodes are in the head of the sequence and I have two separate files: gITS7, which is the forward, and ITS4, which is the reverse. Could you please tell me how to deal with two barcode files, one containing the forward tags and the other the reverse?

Thanks a lot!

Parallel processing with multiple cores

I am a new user of this tool, and the options for long reads are exactly what I was looking for. However, I didn't find any option to use multiple CPUs/cores so that the processing could run in parallel. This could come in really handy. Or maybe the tool automatically uses all the processors available?

pairwise2.py doesn't work. Biopython deprecated it.

Hello,

I have installed Demultiplex 1.2.2 on Linux using the commands below. My account is on a cluster.

git clone https://github.com/jfjlaros/demultiplex
cd demultiplex
pip install .

I ran the command below:
demultiplex guess -o Hrp_barcodes_i7_demultiplex --format=x -n 1000 clean_HRP_Pooled_R1.fastq

Then I get this error:

/home/yeserin/.local/lib/python3.9/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.

I tried to install the software using Python 3.6 instead, but the problem remains.

I really need to use your software. I couldn't find a good alternative. How can I solve the problem? Do I need to install Bio.Align.PairwiseAligner? If so, which file should I modify so that Demultiplex uses Bio.Align.PairwiseAligner instead of pairwise2.py?

Thanks,
Yeserin.
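
The message quoted above is a deprecation warning rather than an exception, so on its own it does not stop a run. For a script of one's own that imports Biopython, the notice can be hidden with the standard `warnings` module. This is only a sketch: it silences the message and changes nothing about the underlying alignment code.

```python
import warnings

# Hide only the pairwise2 deprecation notice; other warnings still show.
# The message argument is a regular expression matched against the
# start of the warning text.
warnings.filterwarnings(
    "ignore",
    message=r"Bio\.pairwise2 has been deprecated",
)
```

The filter has to be installed before the module that triggers the warning is imported.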

Biopython Warning

Hi!

really great package, thank you!

I am getting a BiopythonWarning when I use demux, see example below.

$ demultiplex demux -m 0 -r -s 1 -e 12 barcodes.tsv Read1.fastq Read2.fastq
~/virtualenvs/demultiplex_test/lib/python3.7/site-packages/Bio/Seq.py:163: BiopythonWarning: Biopython Seq objects now use string comparison. Older versions of Biopython used object comparison. During this transition, please use hash(id(my_seq)) or my_dict[id(my_seq)] if you want the old behaviour, or use hash(str(my_seq)) or my_dict[str(my_seq)] for the new string hashing behaviour.
  "the new string hashing behaviour.", BiopythonWarning)

So far, demuxing seems to work, but would be great if this could be fixed, as we want to use it in one of our pipelines :)

Thanks!

Best,
Bela

Does demultiplex pick up where it left off?

I am running demultiplex as a job on an HPCC cluster. The job time limit was set to 25 hours and it is about to max out and stop running the job. Will demultiplex pick up where it left off if I resubmit the same job script and have it continue demultiplexing or should I delete the output files and restart the job with a longer time limit?

Thank you in advance!

Demultiplexing not working?

Hey,
Maybe I got something wrong, but it's not demultiplexing my FASTA file. I have a big file of PacBio reads, with PacBio primers + individual barcode. I formatted it as follows in a barcodes.csv:

D408 GCAGTCGAACATGTAGCTGACTCAGGTCACCACATATCAGAGTGCGGGTAGT
FAP1360A GCAGTCGAACATGTAGCTGACTCAGGTCACACACACAGACTGTGAGGGTAGT
Ky226 GCAGTCGAACATGTAGCTGACTCAGGTCACACACATCTCGTGAGAGGGTAGT
P092 GCAGTCGAACATGTAGCTGACTCAGGTCACCACGCACACACGCGCGGGTAGT
W64A GCAGTCGAACATGTAGCTGACTCAGGTCACCACTCGACTCTCGCGTGGTAGT

(I erased the reverses, but am not sure if your program automatically reverses barcodes)
Then run:
demultiplex demux barcodes.csv myinput.fasta -r

I only get one output: myinput_UNKNOWN.fasta with all the reads. What's wrong? I've also tried running it with only one barcode in the .csv file. Same problem.

Cheers,

dual indexes.

Hello,

For example: In bold dual barcode

#R1 read
@SOLEXA1_0069_FC:3:1:1673:948#ACAGTG/1
GACTAACCGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATGTTAGCCGTCGGGCAGTATACTGTTCGG
+
BMMQNTWSWWb_____b_bb__________Y_________YYYYY[[[Y[__________XXRWXVVVVTYYYYYT

#R2 read
@SOLEXA1_0069_FC:3:1:1673:948#ACAGTG/2
CTGAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGT
+
ghgaggfghhhhhhhhhhghhhhhhhhhhfhhhghfWffch[hhgahhedffddR[^W^Zc^\_cac[Wb]^W^

The barcodes are here (16 possible):
TCAGTCAG
CTGACTGA
TCAGGACT
GACTGACT
AGTCAGTC
GACTTCAG
GACTAGTC
GACTCTGA
TCAGCTGA
AGTCTCAG
AGTCGACT
CTGAAGTC
CTGAGACT
AGTCCTGA
TCAGAGTC
CTGATCAG

Is there an option to trim the barcode after the split?

Hi, Is there an option to trim the barcode after the split? if not, is there an option to add this feature?
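
Whether demultiplex offers such an option is not answered in this thread. As a post-processing step, trimming can be sketched as below, assuming the barcode position and length are known and each record is held as a (header, sequence, quality) tuple. The function name is hypothetical.

```python
def trim_barcode(record, barcode_length, start=0):
    """Remove barcode_length bases beginning at 0-based offset start.

    record is a (header, sequence, quality) tuple; the quality string is
    trimmed in step with the sequence so both stay the same length.
    """
    header, sequence, quality = record
    end = start + barcode_length
    return (
        header,
        sequence[:start] + sequence[end:],
        quality[:start] + quality[end:],
    )


record = ("@read1", "ACTAGGCCTTAA", "FFFFFFFFFFFF")
print(trim_barcode(record, barcode_length=4))
```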

How does one set the mismatch, deletion settings?

I've looked at the documentation and could not find how to define the number of allowed mismatches and deletions. Any advice?

Thanks!

All reads go to UNKNOWN

Hi, thank you so much for your help with understanding how demultiplex works. Now I am facing a new problem: since every read was falling into UNKNOWN, I subsampled the first 2500 to do some tries on those. My barcodes file looks like this now (it is a very big file, this is just the first four of them):

P1004_En	TAGGAACTGGCC+TCCCTTGTCTCC
P1022_En	CTAGCGAACATC+TCCCTTGTCTCC
P1019_En	GACAGGAGATAG+TCCCTTGTCTCC
P1015_En	ATTCCTGTGAGT+TCCCTTGTCTCC

And my first reads look like this:

@A00406:67:HK3KJDRXX:1:2101:1090:1000 1:N:0:TGCAGTGTGGAG+NGAGTACGGTTT
NTGTCAGCCGCCGCGGTAATACGTAGGGAGCAAGCGTTGTACGGATTTAATGGGCGTAAAGCGCGAGTAGGCGGCCCAGAAGGTCAGCTGTGAAATCTCGGGGCTAAACTACGATCTGTCAATGGAAACAGCATTGCTAGAGTGCGGAAGTGGAAACAGGAATTCTAGGTGTAGCGGTGTAATGCGAAGATATCGGGAGGAACACCGGTGGCGAAGGCGGCGTACTGGAACGCAACTGACGCTGATGAGCG
+
#::F:F:,FF::FFFF,F,,FF:FFF,F,F:FFFFF,FF,F:FF:FFF:FF,:FFF,FFF:,FFF,FFFFFF:FFF,,F:F,FF:,,,FFF:FF,:FFF,,F::FFFF:,:,FF,F,FFF,,F,::F::FFF,,FFFFF:FFF,F,:FFFFFF,:F,::FFF:FF,,FFF:FFFF,FFF,:FFFFF:FFFFFFF,FFFFF:FFF:F,F::F:F::F,:FFFF,,FFFF,FF,,,FFFF::,FFF,:,,:FF
@A00406:67:HK3KJDRXX:1:2101:1108:1000 1:N:0:GAGGAAATTAAG+NTCACAAGTTTT
NTGTCAGCCGCCGCGGTAATACGAAGGGTGCAAGCGTTTATCGGAATTACTGGGCGTAAAGCGAGCGAAGGCGGATGTGCAAGACAGGTGTGAAATCACAGGGCTTAACAAGGGAACTTCACTTGTGACTGCACGGCTGGAGTTCGGAAGAGGGGGATGGAATTCGTCGTGTAGCAGTGAAATGCGTAGATATGAGGAGGAACACCGGTGGCGAAGGCAGTCACCTGGGCCAGGACTGACGCTCATGAACG
+
#:FFFFFFFFFFFFFF:F,FFFF,FFFF,FFF::F::F,,FFFFF,FFFFFFFFFFFFFFFFF:FFF,FFFFFF,F,FF:FFF:FF:,FFFFFFFFF,F,F:F:FFF:FF,FF::,,:::FFFFFFFF::FFF,,FFF,FFFF,FF,,FF,:F:,F,,FFFFFFF,:,FFFFF:FFFFFFFFF:FFFFFFFFFFFFF:FFF:FFFFF,:FFFF,FFFFFF,,,FFFFF,FF,,,FFFFF:,,F,F,F,,:F
@A00406:67:HK3KJDRXX:1:2101:1253:1000 1:N:0:AGCTGGAAGTCC+NTCACCAGGAGT
NTGCCAGCAGCCGCGGTAATACGTCGGGTGCAAGCGTGGATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGCAAGACAGATGGGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGCATGGCTCGAGTACGGCAGAGGGGGCTGGCATTCCGCGTGTAGCAGTGAAATGCGTAGATCTGCGGAGGAACCCCCATGGCGCCGGCAAACCCCAGGGCCTGTACTGACGCGCCGGCACG
+
#,:F:,:FF:F:F::F:,,FFFF:FFFFF:FFFFFF:,,FFFFFFFF:,:,:FFFF:F:F,FF,FFFF:FFFFF:F:F,FF:F,F,F::F:FFF,FFFFFF:FFFFF:FF,F,FF,FFFF,,,:FF:F:,::FF,FF,:F,F,,,,:FFF,FFF,:,F,F,F,,FFFF:FF,,FFFF,F,,F,:F:,:F,F:,:FFFF:F,:,::F,F,FFF:::,:F,:,FFF:,,FFFF::,:F,:FF:F,FF,FF,F,
@A00406:67:HK3KJDRXX:1:2101:1416:1000 1:N:0:ACGAACCCATAA+NTCTAAAAGCCA
NTGTCAGCAGCCGCGGTGACACGTAGGCACCAAGCGTTGTCCGGATTTACTGGGCGTAAAGGGATTGCAGGCTGCCCCTCAAGTGGTGCATGAAAGGGCTCGGCTCAACCCCGCTAGGTTATGCCAGACGGAGGGGCTAGAGATCGAGAGCGGGACGTGGAATTCCGGGTGTAGTGGTGAAATGCGTAGAGATCCGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGCTCGCATCTGACGCTCAGACACG
+

The code I was planning to use is the following:
demultiplex demux -e 22 lane1_barcodes.tsv subsample/Lane_1_Undetermined_I984_L1_R1.subsample.fastq subsample/Lane_1_Undetermined_I984_L1_R2.subsample.fastq

But it still characterizes all the reads as unknown. When I tried the guess command, it gave me the following list of possible barcodes:

1 GGGGGGGGGGGG+ACGAGACTGATT
2 GGGGGGGGGGGG+AGCGGAGGTTAG
3 GGGGGGGGGGGG+AGTTACGAGCTA
4 GGGGGGGGGGGG+ATCGCACAGTAA
5 GGGGGGGGGGGG+GCGGGCCCGCCC
6 GGGGGGGGGGGG+GCTGTACGGATT
7 GGGGGGGGGGGG+GGGGGGGGGGGG
8 GGGGGGGGGGGG+GTCGTGTAGCCT
9 GGGGGGGGGGGG+TCTTTCCCTACA
10 GGGGGGGGGGGG+TGGTCAACGATA
11 NNNNNNNNNNNN+NCTNNNNNNNNN
12 TCCTCGTCGACA+TCCCTTGTCTCC

But these do not match the ones in the headers. Do you know why that might be happening?
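
A quick sanity check for a case like this is to count the barcodes that actually occur in the headers and compare them with the barcodes file. The sketch below assumes the barcode is the text after the last colon of the header line, as in the reads quoted above; the function names are hypothetical and only two headers and two barcodes from the post are used.

```python
from collections import Counter


def header_barcode(header):
    """Return the text after the last ':' of a FASTQ header line."""
    return header.strip().rsplit(":", 1)[-1]


def count_header_barcodes(headers):
    """Count how often each header barcode occurs."""
    return Counter(header_barcode(header) for header in headers)


headers = [
    "@A00406:67:HK3KJDRXX:1:2101:1090:1000 1:N:0:TGCAGTGTGGAG+NGAGTACGGTTT",
    "@A00406:67:HK3KJDRXX:1:2101:1108:1000 1:N:0:GAGGAAATTAAG+NTCACAAGTTTT",
]
expected = {"TAGGAACTGGCC+TCCCTTGTCTCC", "CTAGCGAACATC+TCCCTTGTCTCC"}

counts = count_header_barcodes(headers)
matching = [barcode for barcode in counts if barcode in expected]
print(counts.most_common())
print("header barcodes present in the barcodes file:", matching)
```

For the two headers shown, neither barcode appears in the barcodes file, which is consistent with the reads being reported as unknown.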

demultiplex unknown results

Hi, I use demultiplex to separate my long-read file, and it is a useful tool, but I have some questions about its usage. After demultiplexing, I found that some reads assigned to the unknown file also exist in one of the barcode result files, and that some reads occur repeatedly in the unknown files. Why does this happen?

positions for barcodes do not work at all!

The version in pip does not seem to work at all! I manually checked barcodes and they exist, but I get everything UNKNOWN for:

demultiplex demux -r -s 3 -e 9 barcodes.txt S3987Nr1.1.fastq S3987Nr1.2.fastq

barcodes.txt

[Uploading
S3987Nr1.2_sample4-n10000.fastq.txt
S3987Nr1.1_sample4-n10000.fastq.txt…]()

Dual barcodes with match

I am demultiplexing nanopore reads using dual barcodes and the match function, as these barcodes are located inside the reads at unknown locations. I've noticed that when I increase the "-m" option to 2, reads are being assigned to more than one barcode and therefore duplicated in multiple files. Is there an option to assign multiple matching reads to the unknown output?
I also wanted to ask, when the mismatch is assigned as 1, does this allow for one mismatch per barcode in the dual set or for the 2 barcodes in total?

Thanks.

Where is the barcode ?

Hi,

just a simple question to clarify my mind...
Is the barcode supposed to be in the label of the read or in the sequence of the read ?
Because from your data example, the barcode is not in the sequence. It's only as a suffix of the read name => @HWI-ABCDE:0:0:0000:0000#ACTA/1

So, I have no idea how to demultiplex my reads when they look like this:

@read1
ACTA NNNNNNNNNNNNNNNNNNNNN
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@read2
ACTA NNNNNNNNNNNNNNNNNNNNN
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@read3
ACTA NNNNNNNNNNNNNNNNNNNNN
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Cannot demultiplex based on barcode in header

I have paired end sequencing data of a NextSeq run.
I want to demultiplex my libraries based on an adapter in the header - since this tool should be able to do this I used this.
I unzipped my fastq-files of read 1 and read 2 and generated a file with the barcodes used.
I generated a .txt file which looks like the following example:
adaptername1 sequence1
adaptername2 sequence2
adaptername3 sequence3
--> I am using a textfile because when I am using a .csv file as recommended on the webpage it is not working.

Then I used the demultiplex demux command. The command is running but in the end I only get the UNKNOWN fastq files of read 1 and read 2.
Is there anything I have missed?

Thank you!

Is it possible for the barcode to contain non-ATCG characters?

Hi jfjlaros,

I want to use the demultiplex demux command to separate my fastq files that contains non-ATCG character barcode in my header.

My header looks like this:
@M01065:238:000000000-KJ23L:1:1101:23857:16859:AACCTCTCCGGAACCTACGCCTCATCT_1
@M01065:238:000000000-KJ23L:1:1106:20412:2690:AACCTCTCCGGAACCTACGCCTCATCT_2

and my barcode file looks like this:
index1 AACCTCTCCGGAACCTACGCCTCATCT_1
index2 AACCTCTCCGGAACCTACGCCTCATCT_2

Is there a way for me to use your code to demultiplex this? Thank you so much!

can't run demultiplex with SyntaxError

Hi,
I have installed the latest version of demultiplex, but I cannot run it successfully.
My python version is 3.6.7. Any help?

$ demultiplex -h
Traceback (most recent call last):
  File "/home/hulab/chen/miniconda2/bin/demultiplex", line 6, in <module>
    from demultiplex.cli import main
  File "/home/hulab/chen/miniconda2/lib/python3.6/site-packages/demultiplex/cli.py", line 92
    except IOError , error:
                   ^
SyntaxError: invalid syntax
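
The line flagged in the traceback, `except IOError , error:`, is Python 2 syntax that Python 3 rejects, so the installed `cli.py` contains code written for Python 2. For comparison, this small sketch shows the form Python 3 accepts; the function is hypothetical and not taken from demultiplex.

```python
def read_first_line(path):
    """Return the first line of a file, or None if it cannot be opened."""
    try:
        with open(path) as handle:
            return handle.readline()
    # Python 2 also accepted "except IOError, error:"; Python 3 requires "as".
    except IOError as error:
        print(f"could not open {path}: {error}")
        return None
```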

Paired end support for match.

Hi,

Really like your tool, super useful, but I was just wondering if it would be at all possible to add paired-end support for the match functionality, or if you could point out how to alter the code to do this.

Thanks

All reads categorized as UNKNOWN

Hi, I have barcode sequence in the header that I would like to use to split my reads accordingly. But it would seem that all my reads got categorized as UNKNOWN. In the example below, AACGGT is the forward barcode sequence, but demultiplex isn't able to recognize it.

@NB502048:452:HVVMCAFXY:1:11101:2591:1020:AACGGT+AAGCCT 1:N:0:GTATCGTCGT
TTATGGACAACAGTCAAACAACAATTCTTTGTACTTTTTTTTTCCTTAGTCTTTCTTTGAAGCAGCAAGTATGATGAGCAAGCTTTCTCACAAGCATTTGGTTTTAAATTATGGAGTATGTTTCTGTGGAGACGAGAGTAGGT
+

Please advise? Thanks!

Demultiplex v1.2.2 misses barcodes starting with a (series of) DELetions aka supposedly incompletely amplified PCR products

Hi,
when analyzing why 55% of the barcodes in PCR-amplicons are not recognized with demultiplex demux -d -m 1 -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq.gz I see an algorithmic bug. The PCR-amplicons were possibly obtained during too short PCR amplification time, IMO. I will need to check that with wet lab.
Edit: The extension time was 17 seconds, Kapa High Fidelity polymerase, PCR product size ~455 nt.

Nevertheless, at least the leading DELetion should have been caught by -d -m 1 -r -e 8. There are 3 1-nt spanning DELetions inside of the barcode too, which should have been caught by the Levenshtein distance (see rows with counts 27, 35, 145 below).

$ cat A.tsv
fwd_barcode_name fwd_barcode_sequence
JZ71_WU_F GCCCCTCT
JZ73_WUA_F CCGTGGGT
JZ74_WUME_F TAAGTAGA
JZ75_WUMN_F GCAACGTG
$

BTW, searching for a REV barcode in the R2 read can rescue 5% of the data, but still 50% of reads are left in the UNKNOWN_UNKNOWN group. I feel the PCR products will be in the other half of the data rotated incl. the barcodes or something like that. Need to dig into that still.
Edit: Indeed, the Kapa High Fidelity polymerase is a proof-reading polymerase so the ~455 PCR products have blunt ends, hence the sequencing library inserts are in both orientations. That supposedly explains the remaining 50% of the cases.
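
To make the arithmetic behind the report concrete, here is a small Levenshtein distance sketch applied to one barcode from A.tsv. The read is hypothetical, and nothing here claims to reproduce how demultiplex compares barcodes internally.

```python
def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions as 1 each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # match or substitute
            ))
        previous = current
    return previous[-1]


barcode = "GCCCCTCT"          # JZ71_WU_F from A.tsv
read = "CCCCTCTACGT"          # hypothetical read that lost the leading G

print(levenshtein(barcode, read[:7]))  # compare against 7 bases
print(levenshtein(barcode, read[:8]))  # compare against a fixed 8-base window
```

Against the first seven bases the distance is 1 (the missing G). Against a fixed eight-base window it is 2, because the deletion also pulls an unrelated base into the window. If a tool compares a fixed-length window with a limit of one edit, a leading deletion could be missed for this reason, but whether that is what happens here is not established by the report.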

Demultiplex cannot separate read-pairs by R1 barcode and R2 barcode simultaneously if at least either of the two is found

Hi,
I tried to demultiplex both R1 and R2 files at once, but when checking that R2 barcodes are NOT in the UNKNOWN files, it turned out they are. It seems demultiplex throws the whole read pair into UNKNOWN if the R1 barcode is not found, possibly not even checking for a barcode in R2.

What I am after is to get into the UNKNOWN files only reads for which neither the R1 barcode nor the R2 barcode has been found. I assume this should be a runtime option. Some users may be happier with only reads that have both R1 and R2 barcodes readable, but that is not my case. Either R1 barcode or R2 barcode detection is enough to assign the sample.

demultiplex demux -d -m 1 -r -e 8 two-column-barcode-names-barcode-seqs.tsv ../foo_R1.fastq.gz ../foo_R2.fastq.gz

The documentation at https://demultiplex.readthedocs.io/en/latest/usage.html is somewhat difficult to grasp. I had to figure out I should follow "Other files" section with -r argument for other platform, to force search for the barcode in the read sequence itself, not in just the FASTA/Q header.

Further, I failed to understand the "Multiple barcodes" section, seems I cannot provide in the CSV/TSV file for barcode1 and barcode2 barcodes for the R1 and R2 files at once.

# fwd_barcode_names fwd_barcode_sequences rev_barcode_names rev_barcode_sequences
barcode1 ATCG barcode2 TGCA

So in the end my above problem stems from the fact that I did not provide the reverse read R2 barcodes to demultiplex at all. My bad. But is there a way for that at all?

Also, for me it is more convenient to name the output files:

foo_"$fwd_barcode_sequence"+"$rev_barcode_sequence"_R1.fastq.gz
foo_"$fwd_barcode_sequence"+"$rev_barcode_sequence"_R2.fastq.gz

or at least

foo_"$fwd_barcode_sequence"_R1.fastq.gz
foo_"$rev_barcode_sequence"_R2.fastq.gz

so in brief, not renaming them using the barcode names parsed from the CSV/TSV file(s) would be better.

Also to say, it is maybe not even written in the docs that the barcode sequences in the "other" mode are searched at the very beginning of the particular read, assumingly record.seq.startswith(some_barcode_in_verbatim_as_provided_in_TSV_file). Fortunately the behavior matches my expectations. However, please clarify that in the docs.
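
The behaviour requested in this issue (a pair goes to UNKNOWN only when neither mate carries a barcode, with barcodes matched verbatim at the start of each read) can be sketched as follows. This is an illustration of the request with hypothetical function names, not a description of what demultiplex does.

```python
def assign_pair(r1_sequence, r2_sequence, fwd_barcodes, rev_barcodes):
    """Return (fwd_name, rev_name); a name is None when no barcode matches.

    Barcodes are matched verbatim at the start of each read.
    """
    def match(sequence, barcodes):
        for name, barcode in barcodes.items():
            if sequence.startswith(barcode):
                return name
        return None

    return match(r1_sequence, fwd_barcodes), match(r2_sequence, rev_barcodes)


def is_unknown(assignment):
    """A pair is unknown only when neither mate carried a recognised barcode."""
    return assignment == (None, None)


fwd = {"barcode1": "ATCG"}
rev = {"barcode2": "TGCA"}
pair = assign_pair("ATCGTTTTGGGG", "CCCCAAAATTTT", fwd, rev)
print(pair, is_unknown(pair))
```

In the example only R1 carries a barcode, so the pair is still assigned rather than sent to UNKNOWN.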

Thank you.

demultiplex match outputs all input reads to each barcode

I am trying to demultiplex some reads that have internal barcodes inserted through restriction digest and then ligation.

I gave the full length of the bridge adapter containing the barcodes so they share some common sequences across them but differ at barcode sequence of course:

name,seq 
barcode1,DpnII-BC1(sense orientation)-EcoR1-BC1(revcom)
barcode2,DpnII-BC2(sense)-EcoR1-BC2(revcom)
....
....
....

KVII_F_R_1,GATC**GAGCTCGA**GAATTC**TCGAGCTC**GATC

When I run it, it assigns all the input reads to each of the output FASTQ files. I am unsure if I am misunderstanding something or doing something very wrong. I tried supplying just the 8 nt index in the barcode file, but it does the same thing.

demultiplex match barcodes.csv reads.fastq

Issue deleted

tags 5' and 3' in two different files usage

Hi,

I am trying to demultiplex Illumina fastq files R1 and R2 and I have two files for the tags:
forward tags:
gITS7_tag_41;CCACGTCACT
gITS7_tag_42;CCACTATCGT
gITS7_tag_44;CCAGATACTT
gITS7_tag_4;CACATAGTCT
and reverse:
ITS4a_tag_41;CCGACTGTC
ITS4a_tag_42;CCGATACTG
ITS4a_tag_44;CCGCTATAC
ITS4a_tag_4;CACATGTCG

How can I use these two files to demultiplex my fastq files?
Thanks a lot!

anaconda2 errors

demultiplex demux barcodes.txt demultiplex.fq
Traceback (most recent call last):
  File "/home/raw937/anaconda2/bin/demultiplex", line 7, in <module>
    from demultiplex.cli import main
  File "/home/raw937/anaconda2/lib/python2.7/site-packages/demultiplex/__init__.py", line 13, in <module>
    from .demultiplex import Extractor, count, demultiplex
  File "/home/raw937/anaconda2/lib/python2.7/site-packages/demultiplex/demultiplex.py", line 6, in <module>
    from fastools import guess_file_format, guess_header_format
  File "/home/raw937/anaconda2/lib/python2.7/site-packages/fastools/__init__.py", line 5, in <module>
    from .fastools import *
  File "/home/raw937/anaconda2/lib/python2.7/site-packages/fastools/fastools.py", line 5, in <module>
    from urllib.error import HTTPError
ImportError: No module named error

demultiplex does not find any barcodes if -e NUM is larger than the length of the shortest barcode provided in a TSV file

Hi,
I probably misunderstood the sparse documentation and used demultiplex demux -d -m 4 -r -e 22 to search for 16 to 19 nt long barcodes followed by a primer sequence. Everything ends up in UNKNOWN. It seems I can only run demultiplex demux -d -m 4 -r -e 16 if the shortest provided sequence in the TSV file is 16 nt long. I think it is a bug.

It is unclear what demultiplex demux -r does alone, without -s START -e STOP; please improve the manual at least. The -h runtime option provides barely anything useful either.

Thank you,

Demultiplex with i7 index on dual index FASTQ

I am trying to demultiplex the FASTQ data with just the i7 index. The library is dual index and the BCL data is no longer available. Is this something I can accomplish using this application ?

Example

Index on read
CCGCTAGCGG+TGCATGGCCA

Want to demux only by

CCGCTAGCGG on the barcodes.csv sheet.

There are 1000's of unique i5 reads and I want them binned just to the limited i7 indexes.
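
Whether demultiplex can select only the first half of a dual index is not answered in this thread (the feature list mentions selecting part of a barcode). As an illustration of the binning that is being asked for, the sketch below groups headers by the part of the index before the `+`. It assumes the index pair is the last colon-separated field of the header and that the i7 index comes first, as in the example above; the read names are hypothetical.

```python
def i7_of(index_pair):
    """Return the part before '+' of a dual index such as 'CCGCTAGCGG+TGCATGGCCA'."""
    return index_pair.split("+", 1)[0]


def bin_by_i7(headers):
    """Group FASTQ header lines by the i7 part of the index in their last ':' field."""
    bins = {}
    for header in headers:
        index_pair = header.strip().rsplit(":", 1)[-1]
        bins.setdefault(i7_of(index_pair), []).append(header)
    return bins


# Hypothetical headers; only the first index pair comes from the example above.
headers = [
    "@read1 1:N:0:CCGCTAGCGG+TGCATGGCCA",
    "@read2 1:N:0:CCGCTAGCGG+AAAAAAAAAA",
]
print(bin_by_i7(headers))
```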

Output

Dear all,

is there a way to specify an output directory for the demultiplexed files?

All best and thanks in advance,
Kevin

demultiplex: error: invalid barcodes file format

Hi,
I have been having the following error when I run demultiplex:

nid00014(1006)$ demultiplex demux barcodes.csv Lane_1_L1_R1.fastq Lane_1_L1_R2.fastq 
usage: demultiplex [-h] [-v] {guess,demux,match} ...
demultiplex: error: invalid barcodes file format

The file barcodes.csv looks like this:

nid00014(1009)$ head barcodes.csv
P1004_En TCCCTTGTCTCC TAGGAACTGGCC
P1022_En TCCCTTGTCTCC CTAGCGAACATC
P1019_En TCCCTTGTCTCC GACAGGAGATAG
P1015_En TCCCTTGTCTCC ATTCCTGTGAGT
P1003_En TCCCTTGTCTCC GAGGCTCATCAT
P1002_En TCCCTTGTCTCC TCCTCTGTCGAC
P1013_En TCCCTTGTCTCC CTATTTGCGACA
P1011_En TCCCTTGTCTCC AGTAGAGGGATG
P1009_En TCCCTTGTCTCC CGCAGCGGTATA
P1020_En TCCCTTGTCTCC AATGCCTCAACT

I have also tried to use a comma separated file and a tab separated file. Do you know what may have been causing the problem/how to solve it?

Any help will be much appreciated!

Thank you
Maria

dealing with conflicts

Judging by the no. of reads in each barcode FASTQ output it would seem reads may be assigned to multiple barcodes. I am guessing that conflicts are ignored?

Is there a way to output a file showing the mappings of reads to barcode FASTQs to easily check multi-assignments?
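
Independently of what the tool offers, multi-assignments can be checked after the fact by collecting read IDs from each output file. A minimal sketch, assuming four-line FASTQ records and hypothetical function names:

```python
from collections import defaultdict


def read_ids(fastq_lines):
    """Yield the read ID (first word of each header, without '@') from FASTQ lines."""
    for number, line in enumerate(fastq_lines):
        if number % 4 == 0:  # header is the first line of every 4-line record
            yield line.split()[0].lstrip("@")


def multi_assigned(outputs):
    """Return {read_id: sorted barcode names} for reads found in more than one output.

    outputs maps a barcode name to an iterable of FASTQ lines.
    """
    seen = defaultdict(set)
    for barcode, lines in outputs.items():
        for read_id in read_ids(lines):
            seen[read_id].add(barcode)
    return {rid: sorted(names) for rid, names in seen.items() if len(names) > 1}


# Hypothetical in-memory outputs; with real files pass open file handles instead.
outputs = {
    "barcode1": ["@r1 extra", "ACGT", "+", "FFFF", "@r2", "ACGT", "+", "FFFF"],
    "barcode2": ["@r2", "ACGT", "+", "FFFF"],
}
print(multi_assigned(outputs))
```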

Working with compressed or piped files

I'm demultiplexing files from several NovaSeq runs. I really like this demultiplex software for its ease of use. But one feature that would be really handy is for the program to read and write compressed data. Having to decompress the file takes up a lot of space and leads to redundancy on the server. Alternatively, reading the sequencing file (at least for single end) from STDIN could be very helpful as well.
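
The feature list at the top of this page mentions support for gzip and bzip2 compressed files. For readers who want to inspect compressed FASTQ files in their own scripts without decompressing them on disk, transparent opening can be sketched with the standard library. This is not taken from demultiplex; the function name is hypothetical.

```python
import bz2
import gzip


def open_reads(path):
    """Open a plain, gzip or bzip2 file for reading text, chosen by magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic[:2] == b"\x1f\x8b":      # gzip signature
        return gzip.open(path, "rt")
    if magic == b"BZh":               # bzip2 signature
        return bz2.open(path, "rt")
    return open(path, "rt")
```

Checking the first bytes rather than the file extension means a misnamed file is still opened correctly.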

Does this tool work for cell hashing data generated for cell ranger multi?

I have multiplexed scFFPE data for cell ranger multi. Can I use the demultiplex tool to demux these data?
I have tried using the following settings, but the program has been running for almost a week and is still not done. So I wonder if it is doing it correctly.

My command to run the program is-
demultiplex match -m 3 -d samplesheet_re.tsv sample1_S1_R1_001.fastq.gz sample1_S1_R2_001.fastq.gz

My barcode file looks like this-

BC001	ACTTTAGG	ACTTTAGG
BC001	CTTTAGGC	ACTTTAGG
BC001	CGAGGGTA	ACTTTAGG
BC001	GAGGGTAC	ACTTTAGG
BC001	GACACTAC	ACTTTAGG
BC001	ACACTACC	ACTTTAGG
BC001	TTGCACCT	ACTTTAGG
BC001	TGCACCTC	ACTTTAGG
BC002	AACGGGAA	AACGGGAA
BC002	ACGGGAAC	AACGGGAA
BC002	CGAATTGC	AACGGGAA
BC002	GAATTGCC	AACGGGAA
BC002	GTTCCATT	AACGGGAA
BC002	TTCCATTC	AACGGGAA
BC002	TCGTACCG	AACGGGAA
BC002	CGTACCGC	AACGGGAA
BC003	AGTAGGCT	AGTAGGCT
BC003	GTAGGCTC	AGTAGGCT
BC003	CTGTACGA	AGTAGGCT
BC003	TGTACGAC	AGTAGGCT
BC003	GCACCAAG	AGTAGGCT
BC003	CACCAAGC	AGTAGGCT
BC003	TACGTTTC	AGTAGGCT
BC003	ACGTTTCC	AGTAGGCT
BC004	ATGTTGAC	ATGTTGAC
BC004	TGTTGACC	ATGTTGAC
BC004	CACCAACG	ATGTTGAC
BC004	ACCAACGC	ATGTTGAC
BC004	GCTACCGA	ATGTTGAC
BC004	CTACCGAC	ATGTTGAC
BC004	TGAGGTTT	ATGTTGAC
BC004	GAGGTTTC	ATGTTGAC
BC005	ACAGACCT	ACAGACCT
BC005	CAGACCTC	ACAGACCT
BC005	CGGTCGAA	ACAGACCT
BC005	GGTCGAAC	ACAGACCT
BC005	GTCCTTTC	ACAGACCT
BC005	TCCTTTCC	ACAGACCT
BC005	TATAGAGG	ACAGACCT
BC005	ATAGAGGC	ACAGACCT

Thank you for your help!

Does demultiplex support hamming distance?

Does it?
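
For context, Hamming distance counts only substitutions between two sequences of equal length, whereas the feature list above mentions mismatches, insertions and deletions. Whether the tool has a dedicated Hamming mode is not answered in this thread. A minimal sketch of the measure itself, using the two barcodes from an earlier issue:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("ACTGCGAA", "ACTCCTAC"))
```

These two barcodes differ at three positions, so they stay distinguishable when a single mismatch is allowed.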