bcgsc / rna-bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

License: Other

Java 99.95% Shell 0.05%

denovo-assembly rna-seq single-cell-rna-seq bioinformatics-tool nanopore-sequencing bulk-rna-seq pacbio-sequencing

rna-bloom's Introduction

RNA-Bloom is a fast and memory-efficient de novo transcript sequence assembler. It is designed for the following sequencing data types:

single-end/paired-end bulk RNA-seq (strand-specific/agnostic)
paired-end single-cell RNA-seq (strand-specific/agnostic)
long-read RNA-seq (ONT cDNA/direct RNA, PacBio cDNA)

Written by Ka Ming Nip 📧

©️ 2018-present Canada's Michael Smith Genome Sciences Centre, BC Cancer

Dependency 📌

Java SE Development Kit (JDK) 11 (JDK 17 is slightly faster)
External software used:

software	short reads	long reads
minimap2 >=2.22	required	required
Racon	not used	required
ntCard >=1.2.1	required	required

⚠️ Their executables must be accessible from your PATH!
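A quick way to confirm the PATH requirement before starting a run is to look up each executable by name. The sketch below is a minimal check, assuming the executables are called `minimap2`, `racon`, and `ntcard` (these names are assumptions; adjust them if your installation differs). `missing_tools` is a hypothetical helper and not part of RNA-Bloom.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = {
    "short": ["minimap2", "ntcard"],
    "long": ["minimap2", "racon", "ntcard"],
}


def missing_tools(read_type, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[read_type] if which(tool) is None]


if __name__ == "__main__":
    for read_type in ("short", "long"):
        missing = missing_tools(read_type)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{read_type} reads: {status}")
```

This only checks that the executables can be found; it does not verify the minimum versions listed in the table.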

Installation 🔧

RNA-Bloom can be installed in two ways:

(A) install with `conda` or `mamba`:

conda install -c bioconda rnabloom

mamba install -c bioconda rnabloom

All dependent software (listed above) will be installed. RNA-Bloom can be run as rnabloom ...

(B) download from GitHub:

Download the binary tarball rnabloom_vX.X.X.tar.gz from the releases section.
Extract the downloaded tarball with the command:

tar -zxf rnabloom_vX.X.X.tar.gz

RNA-Bloom can be run as java -jar /path/to/RNA-Bloom.jar ...

Quick Start for Short Reads 🏃

⚠️ Input reads must be in either FASTQ or FASTA format and may be compressed with GZIP.

ℹ️ Note that -left, -right, -sef, and -ser can each accept multiple file paths separated by whitespace.

(A) assemble bulk RNA-seq data:

paired-end reads only
- when left reads are sense and right reads are antisense, use -revcomp-right to reverse-complement right reads
- when left reads are antisense and right reads are sense, use -revcomp-left to reverse-complement left reads
- for non-stranded data, use either -revcomp-right or -revcomp-left

java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -t THREADS -outdir OUTDIR

single-end reads only
- use -sef for forward reads and -ser for reverse reads

java -jar RNA-Bloom.jar -sef SE.fastq -t THREADS -outdir OUTDIR

paired-end and single-end reads

java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -sef SE.fastq -t THREADS -outdir OUTDIR
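If you launch RNA-Bloom from a script, the bulk RNA-seq commands above can be assembled as argument lists. This is a minimal sketch that uses only the options shown in this section; `build_bulk_command` is a hypothetical helper, and the default JAR path is an assumption.

```python
import shlex


def build_bulk_command(left, right, outdir, threads, single_end=(), jar="RNA-Bloom.jar"):
    """Build the argument list for a paired-end (plus optional single-end) run.

    `left`, `right`, and `single_end` are lists of file paths, since these
    options accept more than one file.
    """
    cmd = ["java", "-jar", jar, "-left", *left, "-right", *right, "-revcomp-right"]
    if single_end:
        cmd += ["-sef", *single_end]
    cmd += ["-t", str(threads), "-outdir", outdir]
    return cmd


cmd = build_bulk_command(["LEFT.fastq"], ["RIGHT.fastq"], "OUTDIR", 8,
                         single_end=["SE.fastq"])
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
```

Passing the list to `subprocess.run(cmd, check=True)` would start the assembly. Building a list rather than a single string avoids quoting problems with unusual file names.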

final output files:

file name	description
`rnabloom.transcripts.fa`	assembled transcripts longer than length threshold (default: 200)
`rnabloom.transcripts.short.fa`	assembled transcripts shorter than length threshold
`rnabloom.transcripts.nr.fa`	assembled transcripts with redundancy reduced

(B) assemble multi-sample RNA-seq data with pooled assembly mode:

java -jar RNA-Bloom.jar -pool READSLIST.txt -revcomp-right -t THREADS -outdir OUTDIR

This is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. Pooled assembly mode is not supported for long-read data (-long) at this time.

file format for the `-pool` option:

This is a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.

Column header is on the first line, leading with #
Columns are separated by space/tab characters
Each sample can have more than one line; lines sharing the same name will be grouped together during assembly

column	description
`name`	sample name
`left`	path to one left read file
`right`	path to one right read file
`sef`	path to one single-end forward read file
`ser`	path to one single-end reverse read file

(i) paired-end reads only:

Only name, left, and right columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.

#name left right
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq

(ii) paired and unpaired reads:

In addition to the name, left, and right columns, either sef, ser, or both are specified for a total of 4 or 5 columns.

#name left right sef ser
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq /path/to/cell1/sef.fastq /path/to/cell1/ser.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq /path/to/cell2/sef.fastq /path/to/cell2/ser.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq /path/to/cell3/sef.fastq /path/to/cell3/ser.fastq
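For datasets with many cells, the `-pool` list can be generated rather than typed by hand. The sketch below is one possible way to do it, assuming each sample is described by a dictionary keyed by the column names above; `pool_list_lines` is a hypothetical helper. Because the source does not say how to leave a column empty for one sample, the sketch refuses to write a row that lacks a value for any column in use.

```python
def pool_list_lines(samples):
    """Return the lines of a `-pool` reads list.

    Each sample is a dict with 'name', 'left', 'right' and optionally
    'sef' and/or 'ser'. Optional columns are added only if some sample uses them.
    """
    samples = list(samples)
    columns = ["name", "left", "right"]
    columns += [c for c in ("sef", "ser") if any(c in s for s in samples)]

    lines = ["#" + " ".join(columns)]  # header line leads with '#'
    for sample in samples:
        missing = [c for c in columns if c not in sample]
        if missing:
            raise ValueError(f"sample {sample.get('name')!r} lacks columns: {missing}")
        values = [str(sample[c]) for c in columns]
        # Columns are whitespace-separated, so values cannot contain whitespace.
        if any(len(v.split()) != 1 for v in values):
            raise ValueError("values must be non-empty and contain no whitespace")
        lines.append(" ".join(values))
    return lines


cells = [
    {"name": "cell1", "left": "/path/to/cell1/left.fastq", "right": "/path/to/cell1/right.fastq"},
    {"name": "cell2", "left": "/path/to/cell2/left.fastq", "right": "/path/to/cell2/right.fastq"},
]
print("\n".join(pool_list_lines(cells)))
```

Writing the returned lines to a file such as `READSLIST.txt` gives the input for `-pool`.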

final output files per cell:

file name	description
`rnabloom.transcripts.fa`	assembled transcripts longer than length threshold (default: 200)
`rnabloom.transcripts.short.fa`	assembled transcripts shorter than length threshold
`rnabloom.transcripts.nr.fa`	assembled transcripts with redundancy reduced

(C) strand-specific assembly:

java -jar RNA-Bloom.jar -stranded ...

The -stranded option indicates that input reads are strand-specific.

Strand-specific reads are typically in the F2R1 orientation, where /2 denotes left reads in forward orientation and /1 denotes right reads in reverse orientation.

Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:

-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right

and for scRNA-seq data:

cell1 /path/to/cell1/reads_2.fastq /path/to/cell1/reads_1.fastq

(D) reference-guided assembly:

java -jar RNA-Bloom.jar -ref TRANSCRIPTS.fasta ...

The -ref option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (-long) at this time.

Quick Start for Long Reads 🏃

⚠️ It is strongly recommended to trim adapters in your reads before assembly. For example, see Porechop for more information.

⚠️ Input reads must not have purely integer IDs (e.g. 1, 2, 3), which could be in conflict with RNA-Bloom's sequence IDs. Please rename your read IDs (with seqtk rename) if necessary.
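To find out whether a FASTQ file is affected, you can scan its read IDs before assembly. The sketch below is a minimal check, assuming the common four-lines-per-record FASTQ layout; `fastq_read_ids`, `integer_only_ids`, and `open_reads` are hypothetical helpers and not part of RNA-Bloom.

```python
import gzip


def fastq_read_ids(lines):
    """Yield the ID (first word of each header line) from FASTQ lines.

    Assumes four lines per record, so headers are lines 0, 4, 8, ...
    """
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:
            fields = line[1:].split()  # drop the leading '@'
            if fields:
                yield fields[0]


def integer_only_ids(lines, limit=10):
    """Return up to `limit` read IDs that consist only of ASCII digits."""
    found = []
    for read_id in fastq_read_ids(lines):
        if read_id.isascii() and read_id.isdigit():
            found.append(read_id)
            if len(found) >= limit:
                break
    return found


def open_reads(path):
    """Open a plain or gzip-compressed read file as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")
```

For example, `with open_reads("LONG.fastq") as handle: print(integer_only_ids(handle))` lists the first offending IDs. If the list is not empty, rename the reads (for instance with seqtk rename) before running RNA-Bloom.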

ℹ️ Note that -long, -sef, and -ser can each accept multiple file paths separated by whitespace.

(A) assemble long-read cDNA sequencing data:

Default presets for -long are intended for ONT data. Please add the -lrpb flag for PacBio data.

java -jar RNA-Bloom.jar -long LONG.fastq -t THREADS -outdir OUTDIR

Input reads are expected to be in a mix of both forward and reverse orientations.

Options -pool and -ref are not supported for long-read data at this time.

(B) assemble nanopore direct RNA sequencing data:

java -jar RNA-Bloom.jar -long LONG.fastq -stranded -t THREADS -outdir OUTDIR

Input reads are expected to be only in the forward orientation.

By default, uracil (U) is written as T. Use the -uracil option to write U instead of T in the output assembly.

ntCard v1.2.1 supports uracil in reads.

(C) assemble long-read sequencing data with short-read polishing:

cDNA data:

java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT.fastq -t THREADS -outdir OUTDIR

direct RNA data:

java -jar RNA-Bloom.jar -stranded -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR

final output files:

file name	description
`rnabloom.transcripts.fa`	assembled transcripts longer than min. length threshold (default: 200)
`rnabloom.transcripts.short.fa`	assembled transcripts shorter than min. length threshold

General Settings ⚙️

(A) set Bloom filter sizes automatically:

If ntcard is found in your PATH, then the -ntcard option is automatically turned on to count the number of unique k-mers in your reads.

java -jar RNA-Bloom.jar -fpr 0.01 ...

This sets the size of Bloom filters automatically to accommodate a false positive rate (FPR) of ~1%.

Alternatively, you can specify the exact number of unique k-mers:

java -jar RNA-Bloom.jar -fpr 0.01 -nk 28077715 ...

This sets the size of Bloom filters automatically to accommodate 28,077,715 unique k-mers for a FPR of ~1%.

As a rule of thumb, a lower FPR may result in a better assembly but requires more memory for a larger Bloom filter.
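The trade-off between FPR and memory can be illustrated with the textbook Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. The sketch below is only an illustration of that general relationship; it is an assumption that it approximates RNA-Bloom's behaviour, since the exact sizing RNA-Bloom uses internally is not described here and may differ. The function names are hypothetical.

```python
import math


def bloom_filter_bits(num_items, fpr):
    """Textbook minimum number of bits for `num_items` at false positive rate `fpr`.

    Assumes an optimal number of hash functions; this is the standard formula,
    not necessarily what RNA-Bloom computes.
    """
    if num_items <= 0 or not 0 < fpr < 1:
        raise ValueError("num_items must be positive and fpr must be between 0 and 1")
    return math.ceil(-num_items * math.log(fpr) / (math.log(2) ** 2))


def optimal_hash_count(num_items, num_bits):
    """Textbook number of hash functions that minimises the FPR."""
    return max(1, round(num_bits / num_items * math.log(2)))


if __name__ == "__main__":
    unique_kmers = 28_077_715  # the -nk value used in the example above
    for fpr in (0.05, 0.01, 0.005):
        bits = bloom_filter_bits(unique_kmers, fpr)
        print(f"FPR {fpr}: {bits / 8 / 1e6:.1f} MB for a single filter")
```

Under this formula, a 1% FPR costs roughly 9.6 bits per k-mer, and the cost grows as the FPR is lowered. RNA-Bloom uses more than one Bloom filter, so its total memory will be higher than a single-filter estimate.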

(B) set the total size of Bloom filters:

java -jar RNA-Bloom.jar -mem 10 ...

This sets the total size to 10 GB. If none of -nk, -ntcard, or -mem is used, then the total size is configured based on the size of the input read files.

(C) stop at an intermediate stage:

java -jar RNA-Bloom.jar -stage N ...

N	short reads	long reads
1	construct graph	construct graph
2	assemble fragments	correct reads
3	assemble transcripts	assemble transcripts

This is a very useful option if you only want to assemble fragments or correct long reads (i.e. with -stage 2)!

(D) list all available options in RNA-Bloom:

java -jar RNA-Bloom.jar -help

(E) limit the size of Java heap:

java -Xmx2g -jar RNA-Bloom.jar ...

or if you installed with conda:

export JAVA_TOOL_OPTIONS="-Xmx2g"
rnabloom ...

This limits the maximum Java heap to 2 GB with the -Xmx option. Note that Java options have no effect on Bloom filter sizes.

See documentation for other JVM options.

Implementation 📝

RNA-Bloom is written in Java with Apache NetBeans IDE. It uses the following libraries:

Citing RNA-Bloom 📜

If you use RNA-Bloom in your work, please cite our manuscript(s).

Long-read RNA-seq assembly:

Ka Ming Nip, Saber Hafezqorani, Kristina K. Gagalova, Readman Chiu, Chen Yang, René L. Warren, and Inanc Birol. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nature Communications. 2023 May 22;14(1):2940. doi: 10.1038/s41467-023-38553-y

Short-read RNA-seq assembly:

Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, and Inanc Birol. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020 Aug;30(8):1191-1200. doi: 10.1101/gr.260174.119. Epub 2020 Aug 17.


rna-bloom's Issues

How to restart a failed run

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version: v1.4.3
version of java with java -version: 11.0.9.1
exact command used to run RNA-Bloom

Hi,
I was running RNA-Bloom with long and short reads and it failed during the "getting transcripts" step. I'd like to know how I can restart the run from the latest checkpoint.

Thank you.

Jèssica

Release memory before redundancy reducing step.

Hi,

Is it possible to run only the redundancy reduction step, or to release memory before the redundancy reduction step? Running RNA-Bloom and minimap2 together may double the memory consumption.

Best,
Kun

Errors (reading file and I/O) for one of 9 pooled assemblies

I ran RNA-Bloom in pool mode without a reference. Each replicate from a species/condition was included in a pool. Nine pools were submitted as separate array jobs. All nine pooled assemblies ran to completion (apparently), but when I checked the error logs, one of the nine contained a lot of ominous messages starting with "rnabloom.io.FileFormatException: Error reading file" and I/O errors and exceptions. I'm wondering if this is just noise or something pernicious.

[2.0.1] version of RNA-Bloom with java -jar RNA-Bloom.jar -version
[17.0.6] version of java with java -version
[java -Xmx10g -jar ~/bin/thirdparty/RNA-Bloom_v2.0.1/RNA-Bloom.jar -stranded -ntcard -fpr 0.005 -k 25-75:5 -extend -t 32 -outdir $outdir -rcr -pool "$sample_list"] exact command used to run RNA-Bloom

I've attached my sample list:
20231109_rnabloom_pool.pdf

I've also attached log files containing output and error messages captured by SLURM for the affected samples. The out.pdf contains the RNA-Bloom progress report output after a bunch of details that I routinely log. The err.pdf file contains the error messages that have me concerned:
12875770_3_rnabloom.out.pdf
12875770_3_rnabloom.err.pdf

Thank you!

Racon Installation

Hi Ka Ming,

I am trying to install the dependencies for RNA-Bloom, namely Racon, and keep seeing the following error:

CMake Error at CMakeLists.txt:51 (add_subdirectory):
  The source directory

    /Research/Programs/Racon/racon-master/vendor/bioparser

  does not contain a CMakeLists.txt file.

I followed the instructions:

git clone --recursive https://github.com/lbcb-sci/racon.git racon
cd racon
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..

I have the following compiler versions in my PATH:

CMake/3.2.3-goolf-1.4.10
GCC/4.7.2

I am on a cluster and trying to install locally.

You may have some suggestions to try.

Thanks
Deep

Sequence header

Hi,

I am so sorry for this question, but I didn't find an explanation of the sequence header of RNA-Bloom output.

All my sequences have this type of header: >E4.L.4 l=306. Please, confirm if I am right:

E4 is the gene number and L4 is the isoform number?

Many thanks in advance!

RNA-Bloom v1.3.1
java version 11.0.9
exact command used to run RNA-Bloom

rnabloom -left sample_1_R1.fastq.gz -right sample_1_R2.fastq.gz -revcomp-right -ntcard -t 80 -outdir sample1_transcriptome

Filter output by coverage?

Hi,

Just wondering if it's possible to get an average coverage for each assembled transcript? Or can we filter the output transcripts based on a coverage threshold? Thanks!

RNA-Bloom2 version

Hi,
I found your RNA-Bloom2 manuscript on bioRxiv, and I would like to make sure that I correctly understand the versioning:
Is the RNA-Bloom2 of the manuscript identical to RNA-Bloom v2.0.1 here on GitHub or on Bioconda?

Best regards

Transcript headers follow different formats

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
version of java with java -version
exact command used to run RNA-Bloom

Trying to run RNA-Bloom indiscriminately on input files to see if they assemble. I don't check the files before as I want to leave it to RNA-Bloom to decide if it can assemble anything. Interestingly, RNA-Bloom produces different header formats in FASTA for different outputs.

Sometimes I get:
>3 l=228 c=1.1 s=8
other times I get:
>s1

Note that these are with different inputs. Is it possible to output the same header format each time? In the latter format, does coverage=1?
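Until the header format is uniform, downstream scripts can parse both forms defensively. The sketch below is a minimal parser, assuming only what the two examples above show: an ID followed by optional `key=value` fields. The meaning of each key is not documented on this page, so values are kept as strings without interpretation; `parse_header` is a hypothetical helper.

```python
def parse_header(header):
    """Split a FASTA header into (sequence ID, dict of key=value fields).

    Fields without '=' are ignored; a header with only an ID yields an empty dict.
    """
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError("empty FASTA header")
    seq_id, attributes = fields[0], {}
    for field in fields[1:]:
        key, separator, value = field.partition("=")
        if separator:
            attributes[key] = value
    return seq_id, attributes


for example in (">3 l=228 c=1.1 s=8", ">s1"):
    print(parse_header(example))
```

Code that needs a particular field can then use `attributes.get("c")` and decide how to treat headers where that field is absent.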

Thanks!!

RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

Command:

rnabloom -outdir rnabloom_out -t 8 -long input.fastq -ntcard

Sample input read to reproduce single-element header:

@read1
AATTTGGGTGTTTAACCAGTCATCGCCTACCGTGACTTCGGATTCATCGTGTTTCGTTTTCGTGCGCCGCTTCAACATGGGGCTAATCATTGCTTTCGTGCGCCATTCAACATGGAATAATCATTGCTTTTTCGTGCGCCGCTTCAACATGGGGGGCCACGCGCGCGTCCCCCGAAGGCGCGTAACGCTGTGGCGGCCTGCTT
+
%*'('((,./;:3,''%%&#$%(*$$&(*-30441004/*.1110)*.06{?;?<)57??@76341{9334?C9B@:999JA?;88<@::7610/--+224.,,'&&''-612105'&&,127<<820.-:::34475{;545-?8454;==??8877...F{{{{<//101/.*,/12{{1.'&&$$$$%$'('''$%&&&'

How to remove the isoforms in the transcript.fa file?

Hello,
I am doing a de novo transcriptome assembly. Please guide me on how to remove the isoforms in the transcript.fa file.
The transcript names in the transcript.fa output start with E1, E2, E3.
What does that mean?

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
RNA-Bloom v1.3.1
version of java with java -version
openjdk version "1.8.0_265"
exact command used to run RNA-Bloom
transcript.fa

Fix bug when read names contain `/`

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
2.0.0 and earlier versions
version of java with java -version
jre-11-openjdk-11.0.16.0.8-1.el7_9.x86_64
exact command used to run RNA-Bloom
long read assembly mode (-long)

Names in long reads containing suffixes /1 and /2 would be trimmed as if they were paired-end reads.
For example, name/1 and name/2 would be trimmed to name, resulting in two names sharing the same name. This would lead to an error in later stages of the assembly.
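The collision can be reproduced outside RNA-Bloom to check whether a set of read names is at risk. The sketch below mimics the trimming described above with a regular expression; it is an illustration under that assumption and not RNA-Bloom's actual code. `trim_pair_suffix` and `colliding_names` are hypothetical helpers.

```python
import re
from collections import Counter

# Matches a trailing "/1" or "/2", the paired-end suffixes described above.
PAIR_SUFFIX = re.compile(r"/[12]$")


def trim_pair_suffix(name):
    """Remove a trailing /1 or /2 from a read name."""
    return PAIR_SUFFIX.sub("", name)


def colliding_names(names):
    """Return trimmed names that more than one input name maps to."""
    counts = Counter(trim_pair_suffix(name) for name in names)
    return sorted(name for name, count in counts.items() if count > 1)


print(colliding_names(["name/1", "name/2", "other"]))
```

Any name reported here is shared by at least two reads after trimming, which is the situation that led to the error in later stages.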

Java out of memory exception

I ran RNA-Bloom on multiple FASTQ files together, giving it 200 GB of RAM. It gave me an out of memory exception midway. I was running the command using a salloc allocation on the server, and for that I need to specify the amount of memory I am allocating for it on the server. I gave it 200 GB to begin with, but I guess it wasn't enough. Is there a way to see how much memory it would need for all the jobs beforehand?

Also, I want to produce a transcriptome.fq file for each of my 17 input files. I put the path to all input files inside a txt file and gave that txt file as the input. Will it go through each file and create a transcriptome.fq file for each one separately (desired output) or would it create only one transcriptome.fq file at the end?

Inputting multiple long-read files at once

All the files I need to run this on are in a directory. Is there a way I can give the path to the directory in -long <path/to/directory> rather than listing out all files like -long <FILEA FILEB ....> ?

Also, is there a way to run bloom with snakemake?
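On the directory question: since -long accepts multiple file paths, one workaround is to expand the directory into a file list before calling RNA-Bloom. The sketch below is a minimal example of that idea; the list of file extensions is an assumption, and `select_read_files` and `long_read_args` are hypothetical helpers.

```python
from pathlib import Path

# File extensions treated as read files; this list is an assumption.
READ_SUFFIXES = (".fastq", ".fq", ".fasta", ".fa",
                 ".fastq.gz", ".fq.gz", ".fasta.gz", ".fa.gz")


def select_read_files(paths, suffixes=READ_SUFFIXES):
    """Return the sorted paths whose names end with a read-file extension."""
    return sorted(str(p) for p in paths if str(p).endswith(suffixes))


def long_read_args(directory):
    """Expand a directory into `-long FILE1 FILE2 ...` arguments."""
    files = select_read_files(p for p in Path(directory).iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"no read files found in {directory}")
    return ["-long", *files]
```

The returned list can be appended to the rest of the command, for example `["java", "-jar", "RNA-Bloom.jar", *long_read_args("reads/"), "-t", "8", "-outdir", "OUTDIR"]`.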

RNA-Bloom Generates Empty FASTA Without Error

As per title. Input file: test.fastq.gz

Command:

rnabloom -t 2 -outdir test_out -long test.fastq -ntcard

It should probably again report too little input data? Big thanks for all of your help!!

RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

Question: De-novo assembly of long reads ONLY - Java memory error

Hello,
I have a question about the use case of RNA-Bloom.
I have some PacBio CCS reads and nanopore long reads for a certain maize genotype, which I am using for a de novo transcriptome assembly. I tried to do so but was met with an "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space" error. I did try to increase/decrease the heap size and play around with memory settings on our cluster (it uses a Slurm job manager).

I wonder if I am doing something wrong - in fact, I am not sure if I can only use long reads? or do I need rna-seq as well?

null exception

Hi, I have been trying to run java -jar RNA-Bloom.jar -stage 3 -long data.fastq -ref Arabidopsis_thaliana.TAIR10.47.gtf -o ${path} with java 11, but that fails with the following output:

> Stage 1: Construct graph from reads (k=17)
[1] Parsing ${path}/sra_2_data.fastq...
Parsed 490,413 reads in total.
Augmenting graph with reference transcripts...
[2] Parsing ${path}/Arabidopsis_thaliana.TAIR10.47.gtf...
DBG Bloom filter FPR:                 1.3991246 %
Counting Bloom filter FPR:            0.06179431 %
> Stage 1 completed in 55.178s

> Stage 2: Correct long reads for "rnabloom"
Parsing ${path}/sra_2_data.fastq...
Corrected Read Lengths Sampling Distribution (n=1000)
  min  q1  med  q3  max
  202  401  923  1847  10000
ERROR: null
java.util.NoSuchElementException
  at rnabloom.io.FastxSequenceIterator.nextWithName(FastxSequenceIterator.java:94)
  at rnabloom.RNABloom.correctLongReadsMultithreaded(RNABloom.java:3412)
  at rnabloom.RNABloom.correctLongReads(RNABloom.java:4498)
  at rnabloom.RNABloom.main(RNABloom.java:6264)

I noticed it creates the file rnabloom.longreads.corrected.short.fa but the file is empty.

I tried also to run with java8, but then it gets stuck here:

> Stage 2: Correct long reads for "rnabloom"
Parsing ${path}/sra_2_data.fastq...

Not sure what it is doing, but it is using 12 threads and only 3 to 4% CPU.

FastQ gzipped files not accepted as input?

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
Previous command results in: Error: Unable to access jarfile RNA-Bloom.jar
So command used for RNA-Bloom version: rnabloom -v
RNA-Bloom v1.4.3
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018-present
version of java with java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
exact command used to run RNA-Bloom
rnabloom -pool readslist.txt -revcomp-right -ntcard -t 10 -mergepool -outdir /home/auesro/Desktop/RNA_Bloom/output -k 20

Error:

RNA-Bloom v1.4.3
args: [-pool, readslist.txt, -revcomp-right, -ntcard, -t, 10, -mergepool, -outdir, /home/auesro/Desktop/RNA_Bloom/output, -k, 20]

name:   rnabloom
outdir: /home/auesro/Desktop/RNA_Bloom/output
Pooled assembly mode is ON!
Parsing pool reads list file `readslist.txt`...
ERROR: Unsupported file format detected in input file `left`. Only FASTA and FASTQ formats are supported.
rnabloom.io.FileFormatException: Unsupported file format detected in input file `left`. Only FASTA and FASTQ formats are supported.
	at rnabloom.RNABloom.checkInputFileFormat(RNABloom.java:344)
	at rnabloom.RNABloom.main(RNABloom.java:6357)

My fastq files are gzipped, but according to the Readme.md, gzip compression is supported: "Input reads must be in either FASTQ or FASTA format and may be compressed with GZIP."

Any ideas?

Thanks!

ERROR: Incorrect FASTA header format

Hi @kmnip,

Thanks for your support on my other issues. Here's another interesting one. I'm pretty sure the input FASTQ is valid and again this is just too few reads/too short causing some kind of FASTA invalid error. Thanks for your help!

root@06a8b6dc9fba:/data/retry# rnabloom -outdir rnabloom_out -t 8 -long filtered.fastq -ntcard
RNA-Bloom v1.4.3
args: [-outdir, rnabloom_out, -t, 8, -long, filtered.fastq, -ntcard]

name:   rnabloom
outdir: rnabloom_out
WARNING: Output directory does not exist!
Created output directory at `rnabloom_out`

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 17 -c 65535 -p rnabloom_out/rnabloom @rnabloom_out/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `rnabloom_out/rnabloom_k17.hist`...
Unique k-mers (k=17):     2,368
Unique k-mers (k=17,c>1): 192
K-mer counting completed in 3.973s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       5.232985E-6
k-mer counting:        3.3946708E-6
====================================
Total:                 8.627656E-6

> Stage 1: Construct graph from reads (k=17)
Parsing `filtered.fastq`...
Parsed 41 sequences in 0.013s
DBG Bloom filter FPR:                 1.56 %
Counting Bloom filter FPR:            0.81 %
> Stage 1 completed in 0.024s

> Stage 2: Correct long reads for "rnabloom"
Parsing `filtered.fastq`...
Corrected Read Lengths Sampling Distribution (n=26)
        min     q1      med     q3      max
        18      23      63      92      213
Parsed 41 sequences.
        Kept:      26   (63.4 %)
        Discarded: 15   (36.6 %)
Corrected reads in 0.292s
Extracting seed sequences...
Bloom filter FPR:       0.0119 %
before: 1       after: 1 (100.0 %)
Extraction completed in 0.104s
> Stage 2 completed in 0.397s

> Stage 3: Assemble long reads for "rnabloom"
ERROR: Incorrect FASTA header format
rnabloom.io.FileFormatException: Incorrect FASTA header format
        at rnabloom.io.FastaReader.nextWithComment(FastaReader.java:240)
        at rnabloom.RNABloom.splitFastaByLength(RNABloom.java:5269)
        at rnabloom.RNABloom.main(RNABloom.java:7083)

Feature Request: More verbose logging?

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
version of java with java -version
exact command used to run RNA-Bloom

Same software versions as #46

Thanks for the amazing tool. For the following job, I suspect it failed because there wasn't enough input data, but the log message is fairly vague. Was it a corrupt FASTQ? Did I OOM? I don't know based on this message. Is it possible to make it a bit clearer on what exactly went wrong, so I can decide when I need to investigate further? I think there is a separate error message that is sometimes triggered when there isn't enough input data. Thanks!

rnabloom -outdir rnabloom_out -t 8 -long filtered.fastq -ntcard
--
ERROR:root:stdout: RNA-Bloom v2.0.0
args: [-outdir, rnabloom_out, -t, 8, -long, filtered.fastq, -ntcard]
name:   rnabloom
outdir: rnabloom_out
WARNING: Output directory does not exist!
Created output directory at `rnabloom_out`
K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 25 -c 65535 -p rnabloom_out/rnabloom @rnabloom_out/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `rnabloom_out/rnabloom_k25.hist`...
Unique k-mers (k=25):     448
Unique k-mers (k=25,c>1): 0
WARNING: 0 non-singleton (c>1) k-mers detected!
K-mer counting completed in 3.367s
Bloom filters          Memory (GB)
====================================
de Bruijn graph:       9.901123E-7
k-mer counting:        7.9208985E-6
====================================
Total:                 8.911011E-6
> Stage 1: Construct graph from reads (k=25)
Parsing `filtered.fastq`...
Parsed 4 sequences in 0.004s
DBG Bloom filter FPR:                 1.06 %
Counting Bloom filter FPR:            0.0241 %
> Stage 1 completed in 0.009s
> Stage 2: Correct long reads for "rnabloom"
Parsing `filtered.fastq`...
Corrected Read Lengths Sampling Distribution (n=4)
min	q1	med	q3	max
153	155	160	166	170
Parsed 4 sequences.
Kept:      4	(100.0 %)
Discarded: 0	(0.0 %)
Corrected reads in 0.224s
Extracting seed sequences...
strobemers: n=3, k=11, wMin=12, wMax=61, depth=3
Bloom filter FPR:	0.389 %
before: 4	after: 4 (100.0 %)
too short: 0
Extraction completed in 0.109s
> Stage 2 completed in 0.333s
> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 0 overlap records in 0.0s
total reads:    4
- unique:      0	(0.0 %)
- multi-seg: 0
Unique reads extracted in 0.001s
ERROR: Error assembling long reads!

FastqReader Error?

I am trying to run RNA-Bloom for Nanopore cDNA sequencing reads, but it seems like it is failing....

Below is the message I am getting if I run the program with an input fastq (uncompressed) file.

ERROR: Unsupported file format detected in input file `sample.fastq`. Only FASTA and FASTQ formats are supported.
rnabloom.io.FileFormatException: Unsupported file format detected in input file `sample.fastq`. Only FASTA and FASTQ formats are supported.
        at rnabloom.RNABloom.checkInputFileFormat(RNABloom.java:309)
        at rnabloom.RNABloom.main(RNABloom.java:4750)

Below is the message I am getting if I run the program with a bgzip-compressed input fastq file. It runs for some time and then it dies. There are a bunch of intermediate files.

...
Parsed 4,308,866 sequences.
        Corrected: 4,308,203(99.98461%)
        Discarded: 663(0.015386879%)
Reads corrected in 54m 50s
Clustering long reads for "rnabloom"
ERROR: null
java.lang.NullPointerException
        at rnabloom.io.FastaReader.<init>(FastaReader.java:44)
        at rnabloom.RNABloom.clusterLongReads(RNABloom.java:2210)
        at rnabloom.RNABloom.clusterLongReads(RNABloom.java:3740)
        at rnabloom.RNABloom.main(RNABloom.java:5150)

Below are the args :
args: [-ntcard, -c, 3, -k, 17, -indel, 10, -e, 3, -p, 0.8, -long, sample.fastq, -t, 16, -outdir, .]

Below is the java version :

java version "11.0.2" 2019-01-15 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)

I've tried with different cDNA samples (all Nanopore) but they all give the same error.

Reg: Normalization of PE datasets

Hi,

I have RNA-seq data for around 20 samples, with ~25 Gb of data in each sample. I generally normalize the dataset beforehand to a k-mer target depth of 100 at k=25 (using bbnorm, diginorm, etc.) so that I can easily hand it off to various assemblers and save time.

So this raises the question: should I normalize the dataset before handing it off to RNA-Bloom for parity with other assemblers, or is this not required, given that ntCard is going to recompute the Bloom filters anyway?

Also, just curious, since ntCard is being used, have you tried a multi-kmer assembly?

instruction about the output file

Hi,

Thanks for making this useful tool. I just finished running a reference-free pooled assembly. I can see there is a lot of output under the subdirectory.

It looks like each cell has 3 FASTA files: "transcripts.fa", "transcripts.nr.fa", and "transcripts.short.fa". The FASTA headers also seem to carry some meaning. Are there any instructions or explanations about the output? And what do we need to do next if we want to get a universal transcriptome reference sequence?

Thanks in advance.

Bests

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Hello, I am getting an out of memory error when trying to assemble my reads. I am using 80 threads and 900 GB of RAM for the assembly. There are a total of 4,389,293 reads in my FASTQ file.

Version info:
RNA-Bloom v2.0.1
openjdk 20-internal 2023-03-21

The command I used is:
rnabloom -long assembly/cleaned.reads.fastq -t 80 -mem 900 -o assembly/bloom

Log:

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 2,964,907,952 overlap records in 1d 0h 19m 53s
total reads:    4,324,473
 - unique:	1,868,715	(43.2 %)
   - multi-seg: 1,477,350
Unique reads extracted in 10m 30s
Overlapping sequences...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOfRange(Arrays.java:3822)
        at java.base/java.lang.StringLatin1.newString(StringLatin1.java:763)
        at java.base/java.lang.String.substring(String.java:2725)
        at java.base/java.lang.String.subSequence(String.java:2758)
        at java.base/java.util.regex.Matcher.getSubSequence(Matcher.java:1789)
        at java.base/java.util.regex.Matcher.group(Matcher.java:661)
        at rnabloom.util.PafUtils.hasGoodAlignment(PafUtils.java:83)
        at rnabloom.olc.Layout.hasGoodAlignment(Layout.java:149)
        at rnabloom.olc.Layout.populateGraphFromOverlaps(Layout.java:3028)
        at rnabloom.olc.Layout.extractSimplePaths(Layout.java:3350)
        at rnabloom.olc.OverlapLayoutConsensus.layoutSimple(OverlapLayoutConsensus.java:824)
        at rnabloom.olc.OverlapLayoutConsensus.overlapWithMinimapAndLayoutSimple(OverlapLayoutConsensus.java:545)
        at rnabloom.olc.OverlapLayoutConsensus.uniqueOLC(OverlapLayoutConsensus.java:1180)
        at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:3314)
        at rnabloom.RNABloom.main(RNABloom.java:7430)

Questions about single-cell RNA-seq assembly output

version of RNA-Bloom with java -jar RNA-Bloom.jar -version

root@0d73be8b2e5c:/data/output# rnabloom -version
RNA-Bloom v1.3.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018

version of java with java -version

root@0d73be8b2e5c:/data# java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)

exact command used to run RNA-Bloom
root@0d73be8b2e5c:/datat# rnabloom -pool readslist.txt -revcomp-right -ntcard -mergepool -outdir output/

Hi RNA-Bloom Team,
I've made a first tiny test with your assembler and I have a few questions regarding the output:

I've noticed that, for each cell, it generates 4 fasta files: cell.transcripts.fa, cell.transcripts.nr.fa, cell.transcripts.nr.short.fa, and cell.transcripts.short.fa. What is the difference between them?
What are all the .nbits files generated?
What I'm trying to do is to establish a workflow to analyze single-cell RNAseq data for non-model organisms. So what I want to do is to reconstruct the whole transcriptome by pooling the reads for all cells and assembling them, so later I can perform a gene prediction and use this as a reference transcriptome to obtain the gene expression matrix. So my question is if I can use the output of the -mergepool option to perform this analysis or it was meant for other things, because I don't really understand well what this option is doing: is it reconstructing the transcriptome taking into account the reads coming from all cells? or is it only merging the transcriptomes obtained by each cell independently?

I hope that I explained my questions clearly!
Marta.

Error occurs during RNA-Bloom command

I used this command: java -jar RNA-Bloom.jar -long combined.fasta -stranded -ntcard -t 30 -outdir output

Counting Bloom filter FPR: 1.0123985 %

Stage 1 completed in 3m 35s

Stage 2: Correct long reads for "rnabloom"
Parsing combined.fasta...
Corrected Read Lengths Sampling Distribution (n=1000)
min q1 med q3 max
200 423 671 1076 1895
Parsed 1,231,437 sequences.
Kept: 1,146,017(93.063385%)
Discarded: 85,420(6.9366117%)
Artifacts: 3(2.4361782E-4%)
Stage 2 completed in 11m 9s
Stage 3: Assemble long reads for "rnabloom"
Overlapped sequences: 261,062
- discarded: 105,394
- artifacts: 1
- unique: 119,308
- dovetail: 234
G: |V|=234 |E|=311
G: |V|=234 |E|=72
before: 1,146,004 after: 119,236
ERROR: Error assembling long reads!

The rnabloom.longreads.assembly_map.paf.gz.log file contains:
[M::mm_idx_gen::2.894*1.35] collected minimizers
[M::mm_idx_gen::3.013*2.39] sorted minimizers
[M::main::3.013*2.39] loaded/built the index for 119236 target sequence(s)
[M::mm_mapopt_update::3.031*2.38] mid_occ = 61895
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 119236
[M::mm_idx_stat::3.043*2.37] distinct minimizers: 787299 (67.99% are singletons); average occurrences: 34.539; average spacing: 5.415; total length: 147235275
Killed

The rnabloom.transcripts.fa.log file contains:
[racon::Polisher::initialize] loaded target sequences 0.707022 s
[racon::Polisher::initialize] loaded sequences 6.976079 s
[racon::Polisher::initialize] error: empty overlap set!

So it seems there is no overlap set, which is why the final output transcript file is empty. Could you confirm this?
If you have any suggestions for getting a result, please let me know.
Thanks in advance

3'UTR transgene identification from 10x scRNAseq reads

Hi all,

I have a set of 10x scRNAseq (nuclei instead of cells, to be precise) datasets derived from a mouse line expressing several transgenes (sfGFP and Bgal, in this case). I only know (and only hypothetically) the sequence for the CDS of these transgenes. Given that 10x mRNA sequencing mostly targets the most 3' end of the 3'UTR of mRNAs, I need to be able to reconstruct as much as possible of the transgene sequence in order to correctly map reads (most of them targeting the 3'UTR) and identify positive nuclei expressing the transgenes.

Any suggestions?

Thanks!

Problem parsing fasta file

Dear author of RNA-Bloom

I am using your software to assemble some direct RNA reads for different species; however, I am getting different errors for some of them.

Input file and command:

rnabloom -long $READS -stranded -t 8 -outdir $OUTDIR

The output that I get is the following:

RNA-Bloom v2.0.0
args: [-long, Input.fa, -stranded, -t, 8, -outdir, Output]

name:   rnabloom
outdir: Output

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 25 -c 65535 -p Output/rnabloom @Output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Output/rnabloom_k25.hist`...
Unique k-mers (k=25):     57,234,431
Unique k-mers (k=25,c>1): 11,105,972
K-mer counting completed in 21.059s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.12647936
k-mer counting:        0.19634001
====================================
Total:                 0.32281935

> Stage 1: Construct graph from reads (k=25)
Parsing `Input.fa`...
Parsed 477,740 sequences in 1m 0s
DBG Bloom filter FPR:                 1.06 %
Counting Bloom filter FPR:            1.17 %
> Stage 1 completed in 1m 1s

> Stage 2: Correct long reads for "rnabloom"
Parsing `Input.fa`...
Index -1 out of bounds for length 4
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:605)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
Corrected Read Lengths Sampling Distribution (n=4528)
	min	q1	med	q3	max
	239	776	1112	1635	5315
ERROR: null
java.lang.ArrayIndexOutOfBoundsException

Program version:

RNA-Bloom v2.0.0
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS)
OpenJDK 64-Bit Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Any help that you can provide would be appreciated.
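
One thing worth ruling out (a guess, not a confirmed diagnosis): `Index -1 out of bounds for length 4` would be consistent with a four-entry nucleotide lookup meeting a character other than A, C, G or T, and direct RNA reads may be written with U instead of T. A minimal Python sketch for checking this, assuming an uncompressed FASTA input; `count_non_acgt` and `report` are hypothetical helpers, not part of RNA-Bloom:

```python
from collections import Counter

def count_non_acgt(fasta_lines):
    """Count characters other than A/C/G/T (case-insensitive) in FASTA sequence lines."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and headers
        for base in line.upper():
            if base not in "ACGT":
                counts[base] += 1
    return counts

def report(path):
    """Print each non-ACGT character found in a FASTA file with its count."""
    with open(path) as handle:
        for base, n in count_non_acgt(handle).most_common():
            print(f"{base}\t{n}")
```

Calling `report("Input.fa")` would list any unexpected characters; an empty listing would suggest the cause lies elsewhere.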

ERROR: Error during redundancy reduction!

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
/Bio/User/software/anaconda3/envs/rnabloom/bin/rnabloom
RNA-Bloom v2.0.0
version of java with java -version

/Bio/User/software/anaconda3/envs/rnabloom/bin/java -version
openjdk version "17.0.3-internal" 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

exact command used to run RNA-Bloom

source /Bio/User/software/anaconda3/bin/activate /Bio/User/software/anaconda3/envs/rnabloom;/Bio/User/software/anaconda3/envs/rnabloom/bin/rnabloom -revcomp-right -t 10 -ntcard -prefix R21051817_000 -left /state/partition1/WORK/Bio/Project/xiek/project/GDDXXXX_std_4/pipe/pipe3/02.mrna/04.assemble/denovoasm/R21051817_1.fq.gz -right /state/partition1/WORK/Bio/Project/xiek/project/GDDXXXX_std_4/pipe/pipe3/02.mrna/04.assemble/denovoasm/R21051817_2.fq.gz -outdir /state/partition1/WORK/Bio/Project/xiek/project/GDDXXXX_std_4/pipe/pipe3/02.mrna/04.assemble/denovoasm/R21051817_rnabloom -q 3 -Q 20 -c 1 -e 1 -bound 1000 -extend -stratum e0

Reducing redundancy in assembled transcripts...
Parsed 0 overlap records in 0.011s
before: 88,142 after: 88,142
ERROR: Error during redundancy reduction!

Best,
Kun

Assemble multiple bulk RNA samples

Hi,

If I have many bulk RNA samples (from different tissues or different samples), what is the best way to assemble these data?

merge all FASTQ files with cat (zcat *.R1.fq.gz|gzip -c > merge_1.fq.gz; zcat *.R2.fq.gz|gzip -c > merge_2.fq.gz;) and then use RNA-Bloom to assemble the merged FASTQ files, or
use RNA-Bloom to assemble each sample separately and then merge the assemblies
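
For the first option, a minimal Python sketch of the merge step, assuming the per-sample files follow the `*.R1.fq.gz` / `*.R2.fq.gz` naming used in the command above. A gzip file may consist of several concatenated members, so the compressed files can be joined byte-for-byte without decompressing and recompressing. Sorting both lists the same way keeps the two mate files in matching sample order. The function names and output prefix are illustrative, and the sketch says nothing about which of the two strategies gives the better assembly.

```python
import glob
import os
import shutil

def concat_gzip(inputs, output):
    """Join gzip files byte-for-byte; the result is a valid multi-member gzip file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

def merge_pairs(directory, prefix="merge"):
    """Merge all *.R1.fq.gz and *.R2.fq.gz files in `directory` into two files."""
    # Sort both lists so that sample order is identical for R1 and R2.
    r1_files = sorted(glob.glob(os.path.join(directory, "*.R1.fq.gz")))
    r2_files = sorted(glob.glob(os.path.join(directory, "*.R2.fq.gz")))
    if len(r1_files) != len(r2_files):
        raise ValueError("every sample needs both an R1 and an R2 file")
    out1 = os.path.join(directory, f"{prefix}_1.fq.gz")
    out2 = os.path.join(directory, f"{prefix}_2.fq.gz")
    concat_gzip(r1_files, out1)
    concat_gzip(r2_files, out2)
    return out1, out2
```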

Best,
Kun

RNABloom not creating transcriptome files

I ran rnabloom on several of my input files individually. Out of my 10 samples, it did not produce any transcriptome file for 2 of them. Why is that? It finished running, but there are no output files.
I ran the following command: rnabloom -long sample.fastq -t 48 -outdir .../.../sample -k 10 -e 5

These are my versions:

RNA-Bloom v2.0.1
openjdk 20.0.2-internal 2023-07-18

c=null in output FASTA

Thanks for the amazing software! Ran some cDNA long-read sequencing through assembly:
rnabloom -long input.fastq -ntcard -t 8 -outdir assembled

It all succeeded; however, when looking at my output data, I see that c=null in some of the FASTA sequence headers. I understood that the c= tag was meant to hold the coverage, and most of these tags look correct!

Example:

>3 l=228 c=null s=8
GCTGAAAGCATCTAAGTGTGAAACCCACCTCAAGATGAGATTTCCCATGATTTTATATCAGTAAGACTATCCTCAGTGGGAAATCTGTCTTGCCCTCCCTCCCGGGACCCCCCTAGGCCCGCCCCGGCATTTATCCCTTCCCCCCCGGCGGAACAACGAGTACGCCGGCGGTAAATCCACTCTGTCCTCTCGCGCAAAACGGATCGGCCTCCGCCCGCGACGGATAGA
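
To see how many records are affected, the headers can be scanned. A minimal Python sketch, assuming every header follows the `>id l=... c=... ` layout of the example above; `parse_header` and `headers_with_null_coverage` are illustrative names, not RNA-Bloom functions:

```python
def parse_header(header):
    """Split a FASTA header such as '>3 l=228 c=null s=8' into (id, {key: value})."""
    fields = header.lstrip(">").split()
    tags = dict(field.split("=", 1) for field in fields[1:] if "=" in field)
    return fields[0], tags

def headers_with_null_coverage(fasta_lines):
    """Return the IDs of records whose c= tag is missing or not a number."""
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        seq_id, tags = parse_header(line.strip())
        try:
            float(tags.get("c", "null"))
        except ValueError:
            bad.append(seq_id)
    return bad
```

Passing an open file handle for the output FASTA returns the list of affected sequence IDs.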

Please report

test$ java -version
openjdk version "11.0.9.1-internal" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 11.0.9.1-internal+0-adhoc..src, mixed mode)

test$ rnabloom -v
RNA-Bloom v1.4.3
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018-present

Feature request: long-read reference-guided assembly

It would be very helpful to have a reference-guided assembly available for the long-read functionality of rnabloom.

ERROR: Infinite or NaN

Thanks for the great tool! Hitting a bit of an error...I suspect it may be because I have so few reads or they're too short but it's a bit cryptic...

root@f069f7fa8f9e:/data/retry# rnabloom -outdir rnabloom_out -t 8 -long filtered.fastq -ntcard
RNA-Bloom v1.4.3
args: [-outdir, rnabloom_out, -t, 8, -long, filtered.fastq, -ntcard]

name:   rnabloom
outdir: rnabloom_out
WARNING: Output directory does not exist!
Created output directory at `rnabloom_out`

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 17 -c 65535 -p rnabloom_out/rnabloom @rnabloom_out/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `rnabloom_out/rnabloom_k17.hist`...
Unique k-mers (k=17):     1,600
Unique k-mers (k=17,c>1): 0
K-mer counting completed in 3.977s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       3.5357662E-6
====================================
Total:                 3.5357662E-6

> Stage 1: Construct graph from reads (k=17)
Parsing `filtered.fastq`...
Exception in thread "Thread-6" Exception in thread "Thread-5" Exception in thread "Thread-3" Exception in thread "Thread-2" java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
Exception in thread "Thread-7" Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
Exception in thread "Thread-4" Exception in thread "Thread-1" java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
java.lang.RuntimeException: java.lang.ArithmeticException: / by zero
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArithmeticException: / by zero
        at rnabloom.bloom.CountingBloomFilter.getIndex(CountingBloomFilter.java:103)
        at rnabloom.bloom.CountingBloomFilter.increment(CountingBloomFilter.java:132)
        at rnabloom.graph.BloomFilterDeBruijnGraph.add(BloomFilterDeBruijnGraph.java:411)
        at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:606)
        ... 1 more
Parsed 3 sequences in 0.015s
DBG Bloom filter FPR:                 0.0452 %
DBG Bloom filter FPR:                 0.0452 %
ERROR: Infinite or NaN
java.lang.NumberFormatException: Infinite or NaN
        at java.base/java.math.BigDecimal.<init>(BigDecimal.java:923)
        at java.base/java.math.BigDecimal.<init>(BigDecimal.java:900)
        at rnabloom.util.Common.roundToSigFigs(Common.java:36)
        at rnabloom.util.Common.convertToRoundedPercent(Common.java:32)
        at rnabloom.RNABloom.populateGraph2(RNABloom.java:1322)
        at rnabloom.RNABloom.main(RNABloom.java:6823)
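
The log above reports `Parsed 3 sequences` and zero k-mers with c>1, which fits the reporter's suspicion that the input is very small. A quick pre-flight summary of the input can be sketched in Python as follows, assuming an uncompressed FASTQ with exactly four lines per record; `fastq_summary` is an illustrative helper, not an RNA-Bloom feature:

```python
def fastq_summary(lines):
    """Return (read_count, total_bases, shortest_read) for four-line FASTQ records."""
    lengths = []
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the sequence is the second line of each record
            lengths.append(len(line.strip()))
    if not lengths:
        return 0, 0, 0
    return len(lengths), sum(lengths), min(lengths)
```

Passing an open handle for `filtered.fastq` would show at a glance whether only a handful of reads survived filtering.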

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
version of java with java -version

root@f069f7fa8f9e:/data/retry# java --version
openjdk 11.0.9.1-internal 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 11.0.9.1-internal+0-adhoc..src, mixed mode)

exact command used to run RNA-Bloom

Output error message for pooled assembly of single-end reads

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
2.0.1
version of java with java -version
20
exact command used to run RNA-Bloom
rnabloom -pool readslist.txt

If an input readslist file for the -pool option contains only single-end reads, stage 1 will still run to the end.

Technically, RNA-Bloom should exit with an error indicating "pooled assembly of single-end reads is not supported."
I can also emphasize this (and explain why) in the readme section for -pool option:
https://github.com/bcgsc/RNA-Bloom#file-format-for-the--pool-option

Correct Strand Information and Pairing

Hi Ka Ming,

The strandedness confuses me (and possibly others), so I just want to get this part right. I have Illumina PE reads, which are stranded, and PacBio long reads, which are not stranded.

Typically, before alignment and processing, I consult the following to make sure I am giving the right orientation: https://github.com/igordot/genomics/blob/master/notes/rna-seq-strand.md
The reverse option applies to a typical dUTP library (Illumina TruSeq, NEB, etc.).

In the detailed README you have mentioned that:
-left /path/to/reads_2.fastq -right /path/to/reads_1.fastq

So does the order you give above, left (R2) and right (R1), correspond to the reverse orientation mentioned above?
Thanks!!

Error assembling long reads

Hi Ka Ming,

I tried to do hybrid assembly with Illumina PE reads and PacBio long reads. Got "Error assembling long reads".
Here is a log that may be of help

RNA-Bloom v1.3.0
args: [-left, ../Sample_363_364_R2.fastq, -right, ../Sample_363_364_R1.fastq, -long, ../../../ZMW12345_Control.polished.hq.fasta, -ntcard, -t, 10, -outdir, Hybrid_Control_RNABloom]

name:   rnabloom
outdir: Hybrid_Control_RNABloom
WARNING: Output directory does not exist!
Created output directory at `Hybrid_Control_RNABloom`

K-mer counting with ntCard...
Running command: `ntcard -t 10 -k 17 -c 65535 -p Hybrid_Control_RNABloom/rnabloom @Hybrid_Control_RNABloom/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Hybrid_Control_RNABloom/rnabloom_k17.hist`...
Unique k-mers (k=17): 713,045,380
Min k-mer coverage threshold: 2
K-mer counting completed in 9m 27s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       1.5757214
k-mer counting:        5.7748213
====================================
Total:                 7.3505425

> Stage 1: Construct graph from reads (k=17)
[1] Parsing `../Sample_363_364_R2.fastq`...
[3] Parsing `../../../ZMW12345_Control.polished.hq.fasta`...
[2] Parsing `../Sample_363_364_R1.fastq`...
[3] Parsed 22,862 sequences.
[2] Parsed 217,746,749 sequences.
[1] Parsed 217,746,749 sequences.
Parsed 435,516,360 reads in total.
DBG Bloom filter FPR:                 0.9953146 %
Counting Bloom filter FPR:            1.0024519 %
> Stage 1 completed in 2h 40m 34s

> Stage 2: Correct long reads for "rnabloom"
Parsing `../../../ZMW12345_Control.polished.hq.fasta`...
Corrected Read Lengths Sampling Distribution (n=1000)
        min     q1      med     q3      max
        5540    6298    6778    7486    11727
Parsed 22,862 sequences.
        Kept:      22,659(99.11206%)
        Discarded: 203(0.8879363%)
        Artifacts: 43(0.18808503%)
> Stage 2 completed in 3m 2s

> Stage 3: Assemble long reads for "rnabloom"
Overlapped sequences: 1
         - artifacts: 1
         - unique:    1
before: 22,659  after: 22,659
ERROR: Error assembling long reads!

I am not sure what the actual error is from here.

Searching through more logs, I came across "rnabloom.transcripts.fa.log":

[racon::Polisher::initialize] loaded target sequences 0.433007 s
[racon::Polisher::initialize] loaded sequences 0.469123 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence transcript!

I am not sure what this means. PacBio reads will be of different lengths; the Illumina PE reads are all the same length (2×150 bp).
Let me know what else I can provide to help you.
Maybe I will have to wait for you to release a PacBio-optimized version.

Thanks!

Coverage does not add up

Thanks again for your help!

When using RNA-Bloom, I expected (perhaps naively) that the sum over all contigs of coverage × length would not exceed the number of bases in my input dataset, or at least would be close, allowing for some margin of error (i.e., each read would contribute to the coverage of only one transcript, since it originated from only one transcript).

However, this does not seem to be the case.

Here is my calculation of input bases from the transcripts FASTA:
assembly_info.xls
Sum of coverage of each contig X length of same contig = 354,613,891bp

assembly_size# seqkit stats input.fastq 
file            format  type  num_seqs      sum_len  min_len  avg_len  max_len
input.fastq  FASTQ   DNA    415,469  200,885,818      100    483.5    4,601

So: is a single read having its bases assigned to multiple transcripts, and therefore increasing the coverage of multiple transcripts?
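
The comparison can be reproduced from the FASTA headers alone. A minimal Python sketch, assuming headers carry `l=` (length) and `c=` (coverage) tags and skipping records whose coverage is not numeric. Whether `c` is a per-base read depth or a k-mer-based value is not stated in this report, so the ratio is only indicative; `covered_bases` is an illustrative name.

```python
def covered_bases(fasta_lines):
    """Sum length x coverage over all headers that carry numeric l= and c= tags."""
    total = 0.0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        tags = dict(f.split("=", 1) for f in line.split()[1:] if "=" in f)
        try:
            total += int(tags["l"]) * float(tags["c"])
        except (KeyError, ValueError):
            continue  # tag missing or not numeric (e.g. c=null)
    return total

# Figures quoted in the report above.
assembly_bases = 354_613_891  # sum of coverage x length over all contigs
input_bases = 200_885_818     # sum_len reported by seqkit stats
ratio = assembly_bases / input_bases  # roughly 1.77
```

With the quoted figures, the assembly accounts for roughly 1.77 times the number of input bases.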

Please report

Command and version is the same as in this issue: #18

ERROR: For input string: "9223372036854775808"

Another error which seems to be from too few/too short reads as input? But pretty cryptic. Thanks again!!

root@633c29844365:/data/retry# rnabloom -outdir rnabloom_out -t 8 -long filtered.fastq -ntcard
RNA-Bloom v1.4.3
args: [-outdir, rnabloom_out, -t, 8, -long, filtered.fastq, -ntcard]

name:   rnabloom
outdir: rnabloom_out
WARNING: Output directory does not exist!
Created output directory at `rnabloom_out`

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 17 -c 65535 -p rnabloom_out/rnabloom @rnabloom_out/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `rnabloom_out/rnabloom_k17.hist`...
ERROR: For input string: "9223372036854775808"
java.lang.NumberFormatException: For input string: "9223372036854775808"
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.base/java.lang.Long.parseLong(Long.java:696)
        at java.base/java.lang.Long.parseLong(Long.java:817)
        at rnabloom.util.NTCardHistogram.<init>(NTCardHistogram.java:44)
        at rnabloom.RNABloom.getNTCardHistogram(RNABloom.java:5568)
        at rnabloom.RNABloom.main(RNABloom.java:6631)
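
The string in the message is exactly 2^63, one more than the largest value a signed 64-bit integer (Java's `long`) can hold, which is why `Long.parseLong` rejects it. A minimal Python sketch for spotting such values in the histogram file; it assumes the `.hist` file is whitespace-separated text, and `find_long_overflows` is a hypothetical helper:

```python
JAVA_LONG_MAX = 2**63 - 1  # 9223372036854775807

def find_long_overflows(lines):
    """Return (line_number, token) for integer tokens too large for a signed 64-bit long."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for token in line.split():
            if token.isdecimal() and int(token) > JAVA_LONG_MAX:
                hits.append((number, token))
    return hits
```

Running it over the lines of `rnabloom_out/rnabloom_k17.hist` would show which entry overflowed.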

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
version of java with java -version

root@633c29844365:/data/retry# java --version
openjdk 11.0.9.1-internal 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 11.0.9.1-internal+0-adhoc..src, mixed mode)

exact command used to run RNA-Bloom

conda install url not working

Please report

Hello, I was trying to install RNA-Bloom with the mamba install command, but it gives me an error message about an invalid URL and does not install. I am not sure what is wrong.
I used the following command: mamba install -c bioconda rnabloom

It says: URL rejected: Malformed input to URL function
Download error (3) URL using bad/illegal format or missing URL

Do we really need to keep E0.L. sequences?

Hi,

I found many E0.L. sequences (2 to 3 or more times as many as other E sequences) in RNA-Bloom's assemblies (in all samples I have tested), but the BUSCO results with or without E0.L. sequences are almost the same. Is it necessary to keep E0.L. sequences?

Or is there an option to tell rna-bloom not to output E0.L. sequences?

Best,
Kun

questions about how to get genes from the output

Please report

version of RNA-Bloom with java -jar RNA-Bloom.jar -version
RNA-Bloom v2.0.1
version of java with java -version
openjdk version "18.0.1" 2022-04-19
exact command used to run RNA-Bloom
rnabloom -long ${FILE} -t 48 -outdir ${NAME}

Hi Ka Ming,

I'm using RNA-Bloom2 to assemble long-read cDNA RNA-seq data, and I have a question about the output. I can see that the transcripts.fa files contain the sequence of each transcript, but how can I tell which transcripts come from the same gene?
I don't see that information contained in the header. Some example headers are shown here:

>rb_90719 l=1982 c=0.25546062 path=[94775+,95098+]
>rb_90720 l=407 c=0.21744472 s=103012

Also, I'm not sure why some headers show s while others show path. Is there any difference?

I would be very grateful if you could explain this.

Cheers,
Alex

Running time

Hello,

I was running RNA-Bloom to assemble nanopore PCR cDNA sequencing data. I used the default options as suggested, but Stage 2 (Correct long reads for "rnabloom", Parsing 'xxx.fasta'...) never finishes (> 7 days). Could you give me any advice on this? I have attached the log below. Thanks.

'''
RNA-Bloom v1.3.1
args: [-long, /xxx.fasta, -ntcard, -t, 8, -outdir, nanopore_4]

name: rnabloom
outdir: nanopore_4
WARNING: Output directory does not exist!
Created output directory at nanopore_4

K-mer counting with ntCard...
Running command: ntcard -t 8 -k 17 -c 65535 -p nanopore_4/rnabloom @nanopore_4/rnabloom.ntcard.readslist.txt...
Parsing histogram file nanopore_4/rnabloom_k17.hist...
Unique k-mers (k=17): 1,572,283,379
Min k-mer coverage threshold: 2
K-mer counting completed in 2m 23s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       3.4745061
k-mer counting:        8.826317
====================================
Total:                 12.300823

Stage 1: Construct graph from reads (k=17)
[1] Parsing /xxx.fasta...
[1] Parsed 12,632,257 sequences.
Parsed 12,632,257 reads in total.
DBG Bloom filter FPR: 0.9975167 %
Counting Bloom filter FPR: 1.0151776 %
Stage 1 completed in 23m 46s

Stage 2: Correct long reads for "rnabloom"
Parsing /xxx.fasta...
'''

Execution halted at mergepool stage

Hi,

I am running a pooled assembly with -mergepool and I get the following at the last stage.

>> Merging transcripts from all samples...
/home/kgagalova/miniconda3/envs/py3.6/bin/rnabloom: line 2: 2782284 Killed                  java -jar "/home/kgagalova/miniconda3/envs/py3.6/lib/rnabloom-v1.3.1.jar" "$@"

These are all the files in my working directory:

total 25843292
drwxrwxr-x 9 kgagalova kgagalova          28 Oct 15 22:12 ./
drwxrwxr-x 5 kgagalova kgagalova           9 Oct 22 13:25 ../
-rw-rw-r-- 1 kgagalova kgagalova           0 Oct 15 16:50 DBG.DONE
-rw-rw-r-- 1 kgagalova kgagalova           0 Oct 15 20:50 FRAGMENTS.DONE
-rw-rw-r-- 1 kgagalova kgagalova  1134665888 Oct 15 22:12 rnabloom.all.fa
-rw-rw-r-- 1 kgagalova kgagalova         426 Oct 15 22:13 rnabloom.all_nr.fa.log
-rw-rw-r-- 1 kgagalova kgagalova          98 Oct 15 16:48 rnabloom.graph
-rw-rw-r-- 1 kgagalova kgagalova 21385170463 Oct 15 16:49 rnabloom.graph.cbf
-rw-rw-r-- 1 kgagalova kgagalova          44 Oct 15 16:49 rnabloom.graph.cbf.desc
-rw-rw-r-- 1 kgagalova kgagalova 15643353278 Oct 15 16:49 rnabloom.graph.dbgbf
-rw-rw-r-- 1 kgagalova kgagalova          44 Oct 15 16:48 rnabloom.graph.dbgbf.desc
-rw-rw-r-- 1 kgagalova kgagalova 15643353278 Oct 15 16:50 rnabloom.graph.rpkbf
-rw-rw-r-- 1 kgagalova kgagalova          45 Oct 15 16:49 rnabloom.graph.rpkbf.desc
-rw-rw-r-- 1 kgagalova kgagalova      543760 Oct 15 15:46 rnabloom_k25.hist
-rw-rw-r-- 1 kgagalova kgagalova      541968 Oct 15 15:47 rnabloom_k30.hist
-rw-rw-r-- 1 kgagalova kgagalova      540583 Oct 15 15:49 rnabloom_k35.hist
-rw-rw-r-- 1 kgagalova kgagalova      538669 Oct 15 15:50 rnabloom_k40.hist
-rw-rw-r-- 1 kgagalova kgagalova      537163 Oct 15 15:51 rnabloom_k45.hist
-rw-rw-r-- 1 kgagalova kgagalova      535593 Oct 15 15:52 rnabloom_k50.hist
-rw-rw-r-- 1 kgagalova kgagalova        2573 Oct 15 15:45 rnabloom.ntcard.readslist.txt
-rw-rw-r-- 1 kgagalova kgagalova         253 Oct 15 15:52 STARTED

I believe it crashes after minimap. This is the log file - rnabloom.all_nr.fa.log:

[M::mm_idx_gen::15.105*1.58] collected minimizers
[M::mm_idx_gen::15.898*3.57] sorted minimizers
[M::main::15.898*3.57] loaded/built the index for 1528110 target sequence(s)
[M::mm_mapopt_update::16.192*3.52] mid_occ = 659
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1528110
[M::mm_idx_stat::16.333*3.50] distinct minimizers: 24112798 (28.61% are singletons); average occurrences: 15.092; average spacing: 2.999

Any suggestions? I ran a test with the pooled assembly downsized by about half, and I still get the same error.

RNA-Bloom version: 1.3.1
Java version: 11.0.8
Java options: -Xmx200g
Command:

rnabloom -k 25-75:5 -fpr 0.005 -extend -savebf -t 48 -ref ${refs} -pool ${readslist} -mergepool -rcr -ntcard -stranded -outdir ${outdir}

java.lang.RuntimeException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1

Hi,

I got these errors and RNA-Bloom hangs forever. What's wrong?
Exception in thread "Thread-16" Exception in thread "Thread-23" Exception in thread "Thread-21" Exception in thread "Thread-22" Exception in thread "Thread-20" Exception in thread "Thread-19" Exception in thread "Thread-17" Exception in thread "Thread-18" java.lang.RuntimeException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
at java.base/java.lang.String.charAt(String.java:693)
at rnabloom.bloom.hash.NTHash.NTPC64(NTHash.java:471)
at rnabloom.bloom.hash.NTHash.NTMC64(NTHash.java:702)
at rnabloom.bloom.hash.CanonicalPairedNTHashIterator.next(CanonicalPairedNTHashIterator.java:41)
at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:574)
... 1 more
java.lang.RuntimeException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:617)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
at java.base/java.lang.String.charAt(String.java:693)
at rnabloom.bloom.hash.NTHash.NTPC64(NTHash.java:471)
at rnabloom.bloom.hash.NTHash.NTMC64(NTHash.java:702)
at rnabloom.bloom.hash.CanonicalPairedNTHashIterator.next(CanonicalPairedNTHashIterator.java:41)
at rnabloom.RNABloom$FastqToGraphWorker.run(RNABloom.java:574)
... 1 more

Best,
Kun

Q. Error in extracting unique reads at Stage 3 (Assembling long reads for "rnabloom")

Hello, I can't seem to get past Stage 2 whenever I try to use RNA-Bloom2 with ONT long-read RNA-Seq data.

RNA-Bloom2 Input CMD:

#n.b.1, `.` directory contains combined reads of ONT long-read RNA-Seq FASTQ files and output directory "rnabloom_assembly"
#n.b.2, `ntCard` and `racon` were both present in /my_path/ environment to run RNA-Bloom2
$ java -jar /my_path/RNA-Bloom_v2.0.1/RNA-Bloom.jar -version
RNA-Bloom v2.0.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018-present
$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
$ java -jar /my_path/RNA-Bloom_v2.0.1/RNA-Bloom.jar -long ./*.fastq -t 36 -outdir ./rnabloom_assembly/

CMD log & error prompts:

name:   rnabloom
outdir: ./rnabloom_assembly/

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 36 -k 25 -c 65535 -p ./rnabloom_assembly//rnabloom @./rnabloom_assembly//rnabloom.ntcard.readslist.txt`...
Parsing histogram file `./rnabloom_assembly//rnabloom_k25.hist`...
Unique k-mers (k=25):     693,428,654
Unique k-mers (k=25,c>1): 98,918,078
K-mer counting completed in 6.905s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       1.5323714
k-mer counting:        1.7487507
====================================
Total:                 3.2811222

> Stage 1: Construct graph from reads (k=25)
Parsing `./pass_barcode01_0.fastq`...
Parsed 4,000 sequences in 0.467s
Parsing `./pass_barcode01_1.fastq`...
Parsed 4,000 sequences in 0.451s
Parsing `./pass_barcode01_2.fastq`...
Parsed 4,000 sequences in 0.147s
... etc ...
Parsed 6,183,653 sequences from 1552 files.
DBG Bloom filter FPR:                 0.721 %
Counting Bloom filter FPR:            0.8 %
> Stage 1 completed in 4m 22s

> Stage 2: Correct long reads for "rnabloom"
Parsing `./pass_barcode01_0.fastq`...
Parsing `./pass_barcode01_1.fastq`...
Parsing `./pass_barcode01_2.fastq`...
Corrected Read Lengths Sampling Distribution (n=10000)
	min	q1	med	q3	max
	46	215	298	462	2916
Parsing `./pass_barcode01_3.fastq`...
Parsing `./pass_barcode01_4.fastq`...
Parsing `./pass_barcode01_5.fastq`...
... etc ...
Parsed 6,183,653 sequences.
	Kept:      6,181,644	(100.0 %)
	Discarded: 2,009	(0.0325 %)
	Artifacts: 23,076	(0.37317747%)
Corrected reads in 3m 51s
Extracting seed sequences...
strobemers: n=3, k=11, wMin=12, wMax=61, depth=3
Bloom filter FPR:	5.51 %
before: 6,073,495	after: 1,485,544 (24.5 %)
too short: 0
Extraction completed in 11m 14s
> Stage 2 completed in 15m 6s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 0 overlap records in 0.003s
total reads:    1,485,544
 - unique:      0	(0.0 %)
   - multi-seg: 0
Unique reads extracted in 5.006s
ERROR: Error extracting unique reads!
ERROR: Error assembling long reads!

output directory:

$ ll rnabloom_assembly/
total 1193474
-rw-rw-r-- 1 ~ ~~         0 Aug 25 09:16 LONGREADS.CORRECTED
-rw-rw-r-- 1 ~ ~~    517516 Aug 25 08:55 rnabloom_k25.hist
-rw-rw-r-- 1 ~ ~~        21 Aug 25 08:55 rnabloom_k25.hist.log
-rw-rw-r-- 1 ~ ~~        20 Aug 25 09:19 rnabloom.longreads.assembly1.nr.fa.gz
-rw-rw-r-- 1 ~ ~~        31 Aug 25 09:19 rnabloom.longreads.assembly1.nr.fa.gz.log
-rw-rw-r-- 1 ~ ~~ 956597733 Aug 25 09:04 rnabloom.longreads.corrected.long.fa.gz
-rw-rw-r-- 1 ~ ~~     40421 Aug 25 09:04 rnabloom.longreads.corrected.long.lengths.txt
-rw-rw-r-- 1 ~ ~~ 253420242 Aug 25 09:16 rnabloom.longreads.corrected.long.seed.fa.gz
-rw-rw-r-- 1 ~ ~~   4358775 Aug 25 09:04 rnabloom.longreads.corrected.polya.txt.gz
-rw-rw-r-- 1 ~ ~~    172811 Aug 25 09:04 rnabloom.longreads.corrected.repeats.fa.gz
-rw-rw-r-- 1 ~ ~~   6566759 Aug 25 09:04 rnabloom.longreads.corrected.short.fa.gz
-rw-rw-r-- 1 ~ ~~     82849 Aug 25 09:19 rnabloom.ntcard.readslist.txt
-rw-rw-r-- 1 ~ ~~     82889 Aug 25 09:19 STARTED

Thanks.
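In the listing above, `rnabloom.longreads.assembly1.nr.fa.gz` is 20 bytes, which is consistent with a gzip stream that holds no data, and the log reports `Parsed 0 overlap records`. A minimal sketch for counting the FASTA records in the intermediate files, assuming a POSIX shell with `gzip` and `awk`; `count_fasta_records` is a hypothetical helper name, and the paths are taken from the listing:

```shell
# Hypothetical helper: count FASTA records (lines starting with '>')
# in a gzip-compressed file. Prints 0 for an empty file.
count_fasta_records() {
    gzip -cd "$1" | awk '/^>/ { n++ } END { print n + 0 }'
}

# File names copied from the directory listing in the post.
for f in \
    rnabloom_assembly/rnabloom.longreads.corrected.long.seed.fa.gz \
    rnabloom_assembly/rnabloom.longreads.assembly1.nr.fa.gz
do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta_records "$f") records"
    fi
done
```

A non-zero count for the seed file combined with a zero count for the assembly file would point at the overlap step rather than at read correction.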

Improved README: Add output files description

Hi @kmnip,

Thanks for the great tool. I believe RNA-Bloom would benefit from a README section describing the output files.
I do not understand the output files well enough to open a PR myself, but I can help if wanted.

Thanks.
Anicet

Program doesn't finish on tiny test dataset (Illumina PE & ONT)

Thanks for your active work on this program! In my opinion, hybrid-capable assemblers are the future. Sadly, I encountered a problem with RNA-Bloom.
For a tiny test dataset (Illumina PE and Nanopore cDNA reads), the program neither finishes nor terminates on its own. It runs until the job scheduler (Slurm) kills it on timeout. Some errors occur early on (see attached). I'm using v1.3, but the same errors and behaviour also occurred in v1.2.3. Any advice on this?

Installation and version:

installed via conda
version: 1.3 and 1.2.3

INPUT:
A pool of two tissues with two replicates each, totalling only 400 paired-end reads. Separately, 100 long Nanopore reads are given.

SETTINGS:
rnabloom --pool "$samplesFile" \
  -long "$ONP_FL" \
  --threads 4 \
  --memory 12 \
  -prefix "$prefix" \
  -ntcard \
  --outdir "$outdir"

Attachments:
rnabloom_out.txt
rnabloom_err.txt

Improve hairpin/palindrome removal in long-read assembly

Please report

- version of RNA-Bloom with `java -jar RNA-Bloom.jar -version`: 2.0.1
- version of java with `java -version`: 20
- exact command used to run RNA-Bloom: `rnabloom -long reads.fastq.gz`

Currently, hairpins are still kept if there is read support. The default behavior needs to be more aggressive.

Transcriptome assembly from ONT direct RNA sequencing seems to be incomplete

After sequencing total RNA with Oxford Nanopore direct RNA sequencing, I preprocessed the reads with fastp and assembled the result with RNA-Bloom2 using the following command:
rnabloom -long /path/file.fastq -outdir /path

All the other parameters are set as default.

I ran BUSCO on the RNA-Bloom assembly to assess the completeness of the transcriptome, but the completeness is below 10%.

What is your advice on this matter?

Thank you.

Error with conda installation

Hey Ka Ming,
I get the following error when installing with conda. I am using Python 3.6.

conda install -c bioconda rnabloom
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: / 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                        

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

I'm not sure what's going on.
Thank you in advance for the help!

Illumina PE (Stranded) and PacBio Reads for RNA-Seq

Hello,
While looking for a hybrid assembler, I came across RNA-Bloom. I tested rnaSPAdes and was not happy with the assembly. I read about RNA-Bloom, and it is mentioned that ONT reads can be used. Has RNA-Bloom been tested with PacBio reads, and can I use it as a hybrid assembler (Illumina + PacBio) for transcriptome assembly?
Thanks,
Deep
