eudoraleer / scasa

SCASA: Single cell transcript quantification tool

License: GNU General Public License v3.0

R 75.86% Perl 21.75% Shell 2.39%

single-cell-analysis single-cell-rna-seq single-cell transcriptomics transcript-quantification

scasa's Issues

fail in docker: Total 0 white-listed Barcodes

Thanks for your work.
Everything works well with the demo test data. However, when I tested scRNA-seq data from 10x Genomics 3' v2, the first part ran well, but all reads were filtered out because no barcodes were white-listed, as shown in the alevin.log file below:

[2024-03-18 20:40:50.427] [alevinLog] [info] Found 70629 transcripts(+0 decoys, +0 short and +0 duplicate names in the index)
[2024-03-18 20:40:50.466] [alevinLog] [info] Filled with 70629 txp to gene entries
[2024-03-18 20:40:50.472] [alevinLog] [info] Found all transcripts to gene mappings
[2024-03-18 20:40:50.479] [alevinLog] [info] Processing barcodes files (if Present)

[2024-03-18 20:54:52.178] [alevinLog] [info] Done barcode density calculation.
[2024-03-18 20:54:52.178] [alevinLog] [info] # Barcodes Used: 316403175 / 316403175.
[2024-03-18 20:54:52.178] [alevinLog] [info] Done importing white-list Barcodes
[2024-03-18 20:54:52.178] [alevinLog] [warning] Skipping 200 Barcodes as no read was mapped
[2024-03-18 20:54:52.178] [alevinLog] [info] Total 0 white-listed Barcodes
[2024-03-18 20:54:52.178] [alevinLog] [info] Sorting and dumping raw barcodes
[2024-03-18 20:54:58.507] [alevinLog] [warning] Total 100% reads will be thrown away because of noisy Cellular barcodes.
[2024-03-18 20:54:58.507] [alevinLog] [info] Done populating Z matrix
[2024-03-18 20:54:58.507] [alevinLog] [warning] 0 Whitelisted Barcodes with 0 frequency
[2024-03-18 20:54:58.507] [alevinLog] [info] Total 0 CB got sequence corrected
[2024-03-18 20:54:58.507] [alevinLog] [info] Done indexing Barcodes
[2024-03-18 20:54:58.507] [alevinLog] [info] Total Unique barcodes found: 892257
[2024-03-18 20:54:58.507] [alevinLog] [info] Used Barcodes except Whitelist: 0
I used the whitelist file from cellranger, and here are my docker parameters:
#!/bin/bash

# main parameters

INPUT="/mnt/d/data_analysis/Analysis_2023_NAFLD_sc/9.isoform.analysis_v2/5.NAFLD.docker/1.fq/GSM4041150"
OUTPUT="/mnt/d/data_analysis/Analysis_2023_NAFLD_sc/9.isoform.analysis_v2/4.test.docker/ScasaOut_v10"
ref="/mnt/d/data_analysis/Analysis_2023_NAFLD_sc/9.isoform.analysis_v2/4.test.docker/refMrna.fa.gz"
index="YES" #when index="YES", scasa will index the reference fasta file and write in index_dir. This index_dir can be reused for other runs
#index_dir="/path/to/PreBuilt_REF_INDEX" #when index="NO", scasa will use directly the reference indexing in index_dir
nthreads=12
tech="10xv2"
whitelist="/mnt/d/data_analysis/Analysis_2023_NAFLD_sc/9.isoform.analysis_v2/4.test.docker/Test_Dataset/737K-august-2016.txt"
cellthreshold="none"
project="My_Project"

# other parameters

samplesheet="NULL"
mapper="salmon_alevin"
xmatrix="alevin"
postalign_dir=""
createxmatrix="NO"

May I ask for your help? Thanks a lot!
KR
Lin

Quantification Fails

Hi,

Thank you for the tool. I am using it for isoform quantification. As a test run I am testing it on one sample from 10xV2.

The command that I am using

scasa --fastq ERX3806131/4861STDY7462259_R1.fastq.gz,ERX3806131/4861STDY7462259_R2.fastq.gz --ref $refPath --nthreads 8 --out Scasa_out

Processing message that I get

##############################################################

SCASA V1.0.0

SINGLE CELL TRANSCRIPT QUANTIFICATION TOOL

Version Date: 2021-04-07

FOR ANY ISSUES, CONTACT: [email protected]

https://github.com/eudoraleer/scasa/

##############################################################

mkdir: cannot create directory ‘Scasa_out/SCASA_My_Project_20220527085718/’: File exists

Preparing for alignment..
Indexing reference..
Directory Scasa_out/SCASA_My_Project_20220527085718/0PRESETS//REF_INDEX/ already exists. Writing into existing directory..
Version Info: ### PLEASE UPGRADE SALMON ###

A newer version of salmon with important bug fixes and improvements is available.

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###[2022-05-27 08:57:18.759] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
[2022-05-27 08:57:18.759] [jLog] [info] building index
out : Scasa_out/SCASA_My_Project_20220527085718/0PRESETS//REF_INDEX/
[2022-05-27 08:57:18.759] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2022-05-27 08:57:26.609] [puff::index::jointLog] [warning] Removed 237 transcripts that were sequence duplicates of indexed transcripts.
[2022-05-27 08:57:26.609] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2022-05-27 08:57:26.610] [puff::index::jointLog] [info] Replaced 5 non-ATCG nucleotides
[2022-05-27 08:57:26.610] [puff::index::jointLog] [info] Clipped poly-A tails from 12501 transcripts
wrote 70629 cleaned references
[2022-05-27 08:57:27.256] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2022-05-27 08:57:30.321] [puff::index::jointLog] [info] ntHll estimated 84081876 distinct k-mers, setting filter size to 2^31
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 2147483648
Capacity = 2
Files:
Scasa_out/SCASA_My_Project_20220527085718/0PRESETS//REF_INDEX/ref_k31_fixed.fa

Round 0, 0:2147483648
Pass Filling Filtering
1 25 69
2 4 0
True junctions count = 266516
False junctions count = 404633
Hash table size = 671149
Candidate marks count = 4093104

Reallocating bifurcations time: 0
True marks count: 2954071
Edges construction time: 4

Distinct junctions = 266516

allowedIn: 12
Max Junction ID: 308126
seen.size():2465017 kmerInfo.size():308127
approximateContigTotalLength: 65012593
counters for complex kmers:
(prec>1 & succ>1)=25336 | (succ>1 & isStart)=59 | (prec>1 & isEnd)=73 | (isStart & isEnd)=10
contig count: 417773 element count: 96576272 complex nodes: 25478

# of ones in rank vector: 417772

[2022-05-27 08:59:28.244] [puff::index::jointLog] [info] Starting the Pufferfish indexing by reading the GFA binary file.
[2022-05-27 08:59:28.244] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory Scasa_out/SCASA_My_Project_20220527085718/0PRESETS//REF_INDEX
size = 96576272

| Loading contigs | Time = 9.0059 ms

size = 96576272

| Loading contig boundaries | Time = 5.0433 ms

Number of ones: 417772
Number of ones per inventory item: 512
Inventory entries filled: 816
417772
[2022-05-27 08:59:28.456] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
[2022-05-27 08:59:28.460] [puff::index::jointLog] [info] contig count for validation: 417772
[2022-05-27 08:59:28.648] [puff::index::jointLog] [info] Total # of Contigs : 417772
[2022-05-27 08:59:28.648] [puff::index::jointLog] [info] Total # of numerical Contigs : 417772
[2022-05-27 08:59:28.676] [puff::index::jointLog] [info] Total # of contig vec entries: 3035777
[2022-05-27 08:59:28.676] [puff::index::jointLog] [info] bits per offset entry 22
[2022-05-27 08:59:28.787] [puff::index::jointLog] [info] Done constructing the contig vector. 417773
[2022-05-27 08:59:28.924] [puff::index::jointLog] [info] # segments = 417772
[2022-05-27 08:59:28.924] [puff::index::jointLog] [info] total length = 96576272
[2022-05-27 08:59:28.957] [puff::index::jointLog] [info] Reading the reference files ...
[2022-05-27 08:59:29.688] [puff::index::jointLog] [info] positional integer width = 27
[2022-05-27 08:59:29.688] [puff::index::jointLog] [info] seqSize = 96576272
[2022-05-27 08:59:29.688] [puff::index::jointLog] [info] rankSize = 96576272
[2022-05-27 08:59:29.688] [puff::index::jointLog] [info] edgeVecSize = 0
[2022-05-27 08:59:29.688] [puff::index::jointLog] [info] num keys = 84043112
for info, total work write each : 2.331 total work inram from level 3 : 4.322 total work raw : 25.000
[Building BooPHF] 100 % elapsed: 0 min 10 sec remaining: 0 min 0 sec
Bitarray 440364608 bits (100.00 %) (array + ranks )
final hash 0 bits (0.00 %) (nb in final hash 0)
[2022-05-27 08:59:39.507] [puff::index::jointLog] [info] mphf size = 52.4956 MB
[2022-05-27 08:59:39.580] [puff::index::jointLog] [info] chunk size = 48288136
[2022-05-27 08:59:39.580] [puff::index::jointLog] [info] chunk 0 = [0, 48288136)
[2022-05-27 08:59:39.580] [puff::index::jointLog] [info] chunk 1 = [48288136, 96576242)
[2022-05-27 08:59:52.325] [puff::index::jointLog] [info] finished populating pos vector
[2022-05-27 08:59:52.325] [puff::index::jointLog] [info] writing index components
[2022-05-27 08:59:52.728] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2022-05-27 08:59:52.766] [jLog] [info] done building index
Finnished indexing reference..
Begins pseudo-alignment..
nohup: redirecting stderr to stdout

The ERROR that I am getting as soon as the quantification step starts is below

Congratulations! Pseudo-alignment has completed in 1590 seconds!
Scasa quantification has started..
Begin Scasa quantification for sample 4861STDY7462259..
Loading required package: iterators
Loading required package: parallel
Error in { : task 1 failed - "NA/NaN argument"
Calls: %dopar% ->
Execution halted
Loading required package: iterators
Loading required package: parallel
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file '/home/jupyter/Scasa_out/SCASA_My_Project_20220527085718/2QUANT/4861STDY7462259_quant/Sample_eqClass.RData', probable reason 'No such file or directory'
Execution halted
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file 'Scasa_out/SCASA_My_Project_20220527085718/2QUANT//4861STDY7462259_quant//scasa_isoform_expression.RData', probable reason 'No such file or directory'
Execution halted
Congratulations! Scasa single cell RNA-Seq transcript quantification has completed in 30 seconds!
All done!

I have installed all the R packages and I am not sure why the quantification is not being performed.

Could you please help.

Thank you

Smart-seq2 full length RNA-seq data

Dear Author,

Thank you for the wonderful tool.
I wonder if it can also analyze Smart-seq2 full-length single-cell RNA-seq data.
If it can, what value should be given for --tech?

Best
Heo

Scasa:v1.0.1 docker shows error

I am trying to use the Scasa v1.0.1 docker image on a Mac M1 system, and I get the following error when running the test dataset. The main issue is that it reports that it cannot read the input file right before the line "Starting the Pufferfish indexing by reading the GFA binary file."

I have already changed the "/path/to" to the scasa path in the docker_params.sh file and changed the -v argument of the if statements in the runScasaDocker.sh file to the -n argument.
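
The `-v` to `-n` change is probably needed because `[[ -v name ]]` was only added in bash 4.2, while macOS ships bash 3.2 by default. The two tests are not equivalent: `-v` asks whether a variable is set, `-n` whether its value is non-empty. A minimal sketch of a check that also works on older bash is shown here; `is_set` and the variable values are hypothetical and not taken from runScasaDocker.sh.

```shell
#!/bin/bash
# Hypothetical helper, not part of scasa: succeed if the variable named in $1
# is set, even when it is set to the empty string (what [[ -v name ]] tests).
is_set() {
  case "$1" in
    ''|[0-9]*|*[!A-Za-z0-9_]*) return 2 ;;  # refuse anything that is not a plain variable name
  esac
  eval "[ -n \"\${$1+x}\" ]"                # ${name+x} expands to "x" only when name is set
}

index_dir=""        # set, but empty
unset whitelist     # not set at all

is_set index_dir && echo "index_dir is set"
is_set whitelist || echo "whitelist is unset"
[ -n "$index_dir" ] || echo "index_dir is empty, so a -n test treats it like an unset variable"
```

If the script only needs to know whether a parameter has a usable value, the plain `-n` test is enough; the helper matters only where an empty but set value must be told apart from an unset one.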

The entire console log is shown below:

                 You are running Scasa v1.0.1 using docker ....

Loading parameters from file...
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

##############################################################

SCASA V1.0.1

SINGLE CELL TRANSCRIPT QUANTIFICATION TOOL

Version Date: 2022-03-24

FOR ANY ISSUES, CONTACT: [email protected]

https://github.com/eudoraleer/scasa/

##############################################################

Directory /source/output already exists. Writing into existing directory..
mkdir: cannot create directory '/source/output/SCASA_My_Project_20230322024150/': File exists

Preparing for alignment..
Indexing reference..
Directory /source/output/SCASA_My_Project_20230322024150/0PRESETS//REF_INDEX/ already exists. Writing into existing directory..
Version Info: ### PLEASE UPGRADE SALMON ###

A newer version of salmon with important bug fixes and improvements is available.

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###[2023-03-22 02:41:50.723] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
[2023-03-22 02:41:50.725] [jLog] [info] building index
out : /source/output/SCASA_My_Project_20230322024150/0PRESETS//REF_INDEX/
[2023-03-22 02:41:50.730] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2023-03-22 02:42:49.745] [puff::index::jointLog] [warning] Removed 237 transcripts that were sequence duplicates of indexed transcripts.
[2023-03-22 02:42:49.746] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
[2023-03-22 02:42:49.751] [puff::index::jointLog] [info] Replaced 5 non-ATCG nucleotides
[2023-03-22 02:42:49.751] [puff::index::jointLog] [info] Clipped poly-A tails from 12501 transcripts
wrote 70629 cleaned references
[2023-03-22 02:43:20.214] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2023-03-22 02:43:29.656] [puff::index::jointLog] [info] ntHll estimated 84081876 distinct k-mers, setting filter size to 2^31
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 2147483648
Capacity = 2
Files:
/source/output/SCASA_My_Project_20230322024150/0PRESETS//REF_INDEX/ref_k31_fixed.fa

Round 0, 0:2147483648
Pass Filling Filtering
1 126 216
2 41 1
True junctions count = 266516
False junctions count = 405083
Hash table size = 671599
Candidate marks count = 4093845

Error: Can't read from a temporary file
allowedIn: 12
error: Can't read the input file
[2023-03-22 02:50:25.299] [puff::index::jointLog] [info] Starting the Pufferfish indexing by reading the GFA binary file.
[2023-03-22 02:50:25.300] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory /source/output/SCASA_My_Project_20230322024150/0PRESETS//REF_INDEX
size = 0

| Loading contigs | Time = 732.71 us

size = 0

| Loading contig boundaries | Time = 794.58 us

Number of ones: 0
Number of ones per inventory item: 512
Inventory entries filled: 1
274911049145
[2023-03-22 02:50:25.303] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
Finnished indexing reference..
Begins pseudo-alignment..
nohup: redirecting stderr to stdout
Congratulations! Pseudo-alignment has completed in 30 seconds!
Scasa quantification has started..
Begin Scasa quantification for sample Sample_01_S1_L001..
Error in file(con, "r") : cannot open the connection
Calls: readLines -> file
In addition: Warning message:
In file(con, "r") :
cannot open file '/source/output/SCASA_My_Project_20230322024150/1ALIGN//Sample_01_S1_L001_alignout/alevin/bfh.txt': No such file or directory
Execution halted
Loading required package: iterators
Loading required package: parallel
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file '/source/output/SCASA_My_Project_20230322024150/2QUANT/Sample_01_S1_L001_quant/Sample_eqClass.RData', probable reason 'No such file or directory'
Execution halted
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file '/source/output/SCASA_My_Project_20230322024150/2QUANT//Sample_01_S1_L001_quant//scasa_isoform_expression.RData', probable reason 'No such file or directory'
Execution halted

no mapped reads in test on SRA sample

Hi, I was trying to run a test of scasa on the following SRA sample: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23717237&display=metadata .

There is an error in the alignment log (My_Project_20230408030128.align.SRR23717237_S1_L001.20230408030128.o), which is captured below. It seems that no reads are being aligned.

I am using the default reference and tried two different whitelist files, one the full whitelist file downloaded from https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz, and second just the barcodes from the sample, but with the same error message.

Error: See the warning message below regarding CB+UMI length, and the mapping rate of 0%.

[2023-04-08 03:05:40.692] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2023-04-08 03:05:40.692] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2023-04-08 03:05:40.692] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2023-04-08 03:05:40.700] [jointLog] [warning] Found 145185909 reads with CB+UMI length smaller than expected.
Please report on github if this number is too large
[2023-04-08 03:05:40.700] [jointLog] [info] Mapping rate = 0%

[2023-04-08 03:05:40.700] [jointLog] [info] finished quantifyLibrary()
[2023-04-08 03:05:40.725] [alevinLog] [info] Starting optimizer

[2023-04-08 03:05:41.434] [alevinLog] [info] Total 0.00 UMI after deduplicating.
[2023-04-08 03:05:41.434] [alevinLog] [info] Total 0 BiDirected Edges.
[2023-04-08 03:05:41.434] [alevinLog] [info] Total 0 UniDirected Edges.
[2023-04-08 03:05:41.434] [alevinLog] [warning] Skipped 1051 barcodes due to No mapped read
[2023-04-08 03:05:41.437] [alevinLog] [info] Starting dumping cell v gene counts in mtx format
[2023-04-08 03:05:41.437] [alevinLog] [error] Can't import Binary file quants.mat.gz, it doesn't exist
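
The warning about CB+UMI length suggests checking how long the barcode reads actually are. For reference, 10x 3' v2 chemistry uses a 16 bp cell barcode plus a 10 bp UMI (26 bp), and v3 uses 16 bp plus 12 bp (28 bp); the 3M-february-2018 whitelist belongs to v3. Whether a length or chemistry mismatch is the cause here is an assumption, not a confirmed diagnosis. A minimal sketch for inspecting read lengths, where `read_length_histogram` is a hypothetical helper and the demo file is synthetic:

```shell
#!/bin/bash
# Hypothetical helper, not part of scasa or salmon: print "length count" for
# the first N reads (default 10000) of a plain or gzip-compressed FASTQ file.
read_length_histogram() {
  local fastq="$1" n="${2:-10000}"
  local reader="cat"
  case "$fastq" in
    *.gz) reader="gzip -dc" ;;
  esac
  # Sequence lines are every 4th line starting at line 2.
  $reader "$fastq" | head -n "$((n * 4))" |
    awk 'NR % 4 == 2 { c[length($0)]++ } END { for (l in c) print l, c[l] }' |
    sort -n
}

# Demo on a tiny synthetic file with one 26 bp read and one 28 bp read.
demo=$(mktemp)
printf '@r1\nACGTACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIII\n' > "$demo"
printf '@r2\nACGTACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' >> "$demo"
read_length_histogram "$demo"
rm -f "$demo"
```

Running the helper on each FASTQ file of the sample shows which file holds the short barcode reads and whether their length fits the chemistry passed to `--tech`.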

Using your tool for single cell data barcoded with split-seq technology and not droplet based

Hello,
thank you for the very interesting tool. I have single-cell data barcoded with the split-seq technology, and I would like to use the short reads to predict potential splicing events. Does the tool support this type of sequencing data? I saw only 10x technologies in the settings.

Best regards
VK

mkdir: cannot create directory ‘.//SCASA_My_Project_20240627144303/’: File exists

You can just change system("mkdir $file_variable") to system("mkdir -p $file_variable") in order to avoid this issue.

Also, you are not checking the exit status of system, so if the subprocess fails, the parent process will not fail. I believe this is true for system in both Perl and R. That could be very problematic for the code at

scasa/scasa/SCRIPTS/Xmatrix_Gen/scasa_simulate_xmatrix_step1_v1.0.0.R

Line 100 in b86c4a7

system(paste0("cp ",output_dir_chunk,"/*.fasta ",output_dir))
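
Both suggestions can be illustrated at the shell level, which is where the `mkdir` and `cp` calls end up running. This is only a sketch of the idea, not a patch for the scasa scripts: `checked` is a hypothetical wrapper and the paths are throwaway demo paths. In Perl and R the equivalent step would be to test the value returned by `system`.

```shell
#!/bin/bash
# Hypothetical helper, not part of scasa: run a command, report a non-zero
# exit status on stderr, and pass that status back to the caller.
checked() {
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "command failed (exit $status): $*" >&2
  fi
  return "$status"
}

base=$(mktemp -d)                      # throwaway directory for the demo
out_dir="$base/SCASA_demo"

checked mkdir -p "$out_dir" || exit 1
checked mkdir -p "$out_dir" || exit 1  # directory already exists: -p makes this a silent no-op

# A failing copy (the source file does not exist) is detected through its exit status.
if ! checked cp "$base/missing.fasta" "$out_dir/" 2>/dev/null; then
  echo "copy step failed; a real pipeline would stop here"
fi
rm -rf "$base"
```

With a check like this, a failed subprocess stops the run at the step that broke instead of surfacing later as a missing file.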

Symlink to binary not supported but no error provided

Running scasa from a symlink is not possible as Scasa expects to be able to access the scasa directory via the "which scasa" command (see SCASA_WRAP_V1.0.0.pl line 36), however, no error message is provided. It is my understanding that using "readlink -f" on the output of the "which scasa" command will return the appropriate directory even if the which scasa command returns the location of the symlink. See https://unix.stackexchange.com/questions/22128/how-to-get-full-path-of-original-file-of-a-soft-symbolic-link

This error is not obvious to the user; Scasa simply "finishes" running without generating an index or count matrix.
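
The `readlink -f` idea can be sketched as follows. This assumes a `readlink` that supports `-f` (GNU coreutils, or a recent macOS); `resolve_install_dir` and `scasa_demo_tool` are made-up names for the demonstration and do not come from SCASA_WRAP_V1.0.0.pl.

```shell
#!/bin/bash
# Hypothetical helper: print the directory holding the real file behind a
# command found on PATH, following any symlinks on the way.
resolve_install_dir() {
  local found
  found=$(command -v "$1") || return 1
  dirname "$(readlink -f "$found")"
}

# Demo with a stand-in tool installed behind a symlink.
demo=$(mktemp -d)
mkdir -p "$demo/install" "$demo/bin"
printf '#!/bin/sh\necho demo\n' > "$demo/install/scasa_demo_tool"
chmod +x "$demo/install/scasa_demo_tool"
ln -s "$demo/install/scasa_demo_tool" "$demo/bin/scasa_demo_tool"
PATH="$demo/bin:$PATH"

dirname "$(command -v scasa_demo_tool)"   # directory of the symlink (.../bin)
resolve_install_dir scasa_demo_tool       # directory of the real file (.../install)
rm -rf "$demo"
```

Returning a non-zero status when the command is not found also gives the caller a place to print an explicit error rather than finishing silently.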

Row names with multiple transcripts

Hi,

Thanks for developing this tool. I'm using scasa for isoform quantification in 10x data, and I used the Homo_sapiens_GENCODE_42 annotation file you provided. However, in the quantification result, the row names may contain multiple isoforms like this:

I checked that the isoforms on the same line belong to the same gene. I'm wondering whether I did something wrong, or how I should interpret these results.

。

Hi,
The error message "FASTQ files found in your input directory did not come in pairs! Please check and resubmit" occurs regardless of whether I use the test data you provided or other data.
scasa --fastq Sample_01_S1_L001_R1_001.fastq, Sample_01_S1_L001_R2_001.fastq --ref /home/hfli/long_read_scRNA/Mrna/refMrna.fa --whitelist Sample_01_Whitelist.txt --nthreads 32

##############################################################

SCASA V1.0.1

SINGLE CELL TRANSCRIPT QUANTIFICATION TOOL

Version Date: 2022-03-24

FOR ANY ISSUES, CONTACT: [email protected]

https://github.com/eudoraleer/scasa/

##############################################################

Directory ./ already exists. Writing into existing directory..
mkdir: cannot create directory './/SCASA_My_Project_20240511164911/': File exists

Preparing for alignment..
Indexing reference..
Directory .//SCASA_My_Project_20240511164911/0PRESETS//REF_INDEX/ already exists. Writing into existing directory..
Version Info: ### PLEASE UPGRADE SALMON ###

A newer version of salmon with important bug fixes and improvements is available.

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
###[2024-05-11 16:49:12.158] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
[2024-05-11 16:49:12.159] [jLog] [info] building index
out : .//SCASA_My_Project_20240511164911/0PRESETS//REF_INDEX/
[2024-05-11 16:49:12.159] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2024-05-11 16:49:20.266] [puff::index::jointLog] [warning] Removed 237 transcripts that were sequence duplicates of indexed transcripts.
[2024-05-11 16:49:20.266] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
[2024-05-11 16:49:20.267] [puff::index::jointLog] [info] Replaced 5 non-ATCG nucleotides
[2024-05-11 16:49:20.267] [puff::index::jointLog] [info] Clipped poly-A tails from 12,501 transcripts
wrote 70629 cleaned references
[2024-05-11 16:49:21.041] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2024-05-11 16:49:23.797] [puff::index::jointLog] [info] ntHll estimated 84081876 distinct k-mers, setting filter size to 2^31
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 2147483648
Capacity = 2
Files:
.//SCASA_My_Project_20240511164911/0PRESETS//REF_INDEX/ref_k31_fixed.fa

Round 0, 0:2147483648
Pass Filling Filtering
1 33 79
2 5 0
True junctions count = 266516
False junctions count = 404655
Hash table size = 671171
Candidate marks count = 4091391

Reallocating bifurcations time: 0
True marks count: 2954071
Edges construction time: 5

Distinct junctions = 266516

allowedIn: 12
Max Junction ID: 308126
seen.size():2465017 kmerInfo.size():308127
approximateContigTotalLength: 65012593
counters for complex kmers:
(prec>1 & succ>1)=25336 | (succ>1 & isStart)=72 | (prec>1 & isEnd)=60 | (isStart & isEnd)=10
contig count: 417773 element count: 96576272 complex nodes: 25478

# of ones in rank vector: 417772

[2024-05-11 16:51:40.180] [puff::index::jointLog] [info] Starting the Pufferfish indexing by reading the GFA binary file.
[2024-05-11 16:51:40.180] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory .//SCASA_My_Project_20240511164911/0PRESETS//REF_INDEX
size = 96576272

| Loading contigs | Time = 21.717 ms

size = 96576272

| Loading contig boundaries | Time = 10.621 ms

Number of ones: 417772
Number of ones per inventory item: 512
Inventory entries filled: 816
417772
[2024-05-11 16:51:40.405] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
[2024-05-11 16:51:40.415] [puff::index::jointLog] [info] contig count for validation: 417,772
[2024-05-11 16:51:40.603] [puff::index::jointLog] [info] Total # of Contigs : 417,772
[2024-05-11 16:51:40.603] [puff::index::jointLog] [info] Total # of numerical Contigs : 417,772
[2024-05-11 16:51:40.662] [puff::index::jointLog] [info] Total # of contig vec entries: 3,035,777
[2024-05-11 16:51:40.662] [puff::index::jointLog] [info] bits per offset entry 22
[2024-05-11 16:51:40.816] [puff::index::jointLog] [info] Done constructing the contig vector. 417773
[2024-05-11 16:51:40.925] [puff::index::jointLog] [info] # segments = 417,772
[2024-05-11 16:51:40.925] [puff::index::jointLog] [info] total length = 96,576,272
[2024-05-11 16:51:40.993] [puff::index::jointLog] [info] Reading the reference files ...
[2024-05-11 16:51:41.701] [puff::index::jointLog] [info] positional integer width = 27
[2024-05-11 16:51:41.702] [puff::index::jointLog] [info] seqSize = 96,576,272
[2024-05-11 16:51:41.702] [puff::index::jointLog] [info] rankSize = 96,576,272
[2024-05-11 16:51:41.702] [puff::index::jointLog] [info] edgeVecSize = 0
[2024-05-11 16:51:41.702] [puff::index::jointLog] [info] num keys = 84,043,112
for info, total work write each : 2.331 total work inram from level 3 : 4.322 total work raw : 25.000
[Building BooPHF] 100 % elapsed: 0 min 14 sec remaining: 0 min 0 sec
Bitarray 440364608 bits (100.00 %) (array + ranks )
final hash 0 bits (0.00 %) (nb in final hash 0)
[2024-05-11 16:51:56.145] [puff::index::jointLog] [info] mphf size = 52.4956 MB
[2024-05-11 16:51:56.356] [puff::index::jointLog] [info] chunk size = 48,288,136
[2024-05-11 16:51:56.356] [puff::index::jointLog] [info] chunk 0 = [0, 48,288,136)
[2024-05-11 16:51:56.356] [puff::index::jointLog] [info] chunk 1 = [48,288,136, 96,576,242)
[2024-05-11 16:52:13.341] [puff::index::jointLog] [info] finished populating pos vector
[2024-05-11 16:52:13.342] [puff::index::jointLog] [info] writing index components
[2024-05-11 16:52:13.720] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2024-05-11 16:52:13.854] [jLog] [info] done building index
Finnished indexing reference..
Begins pseudo-alignment..
FASTQ files found in your input directory did not come in pairs! Please check and resubmit.
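
In the command above there is a space after the comma in the `--fastq` value, so the shell passes `Sample_01_S1_L001_R1_001.fastq,` and `Sample_01_S1_L001_R2_001.fastq` to scasa as two separate arguments. Whether this is what triggers the pairing error is an assumption, not something confirmed in this thread. A minimal pre-flight sketch, where `check_fastq_pairs` is a hypothetical helper that is not part of scasa:

```shell
#!/bin/bash
# Hypothetical pre-flight check, not part of scasa: validate a comma-separated
# --fastq value before handing it to the tool.
check_fastq_pairs() {
  local value="$1"
  case "$value" in
    *[[:space:]]*) echo "whitespace inside --fastq value: '$value'" >&2; return 1 ;;
  esac
  local IFS=','
  local -a files
  read -r -a files <<< "$value"          # split the value on commas
  if [ "${#files[@]}" -eq 0 ] || [ $(( ${#files[@]} % 2 )) -ne 0 ]; then
    echo "expected an even number of files, got ${#files[@]}" >&2
    return 1
  fi
  echo "${#files[@]} files, $(( ${#files[@]} / 2 )) pair(s)"
}

# Value as intended: both files in one comma-separated argument.
check_fastq_pairs "Sample_01_S1_L001_R1_001.fastq,Sample_01_S1_L001_R2_001.fastq"

# Value as the shell delivers it when a space follows the comma.
check_fastq_pairs "Sample_01_S1_L001_R1_001.fastq," || echo "rejected"
```

Removing the space, so that both file names form a single argument, is a cheap first thing to try.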

mkdir errors

Hello,

I'm trying to run your tool and can't get it to work. I am running in "already aligned" mode since salmon alevin has already been run. However, it does not get beyond the output below. These directories do not exist before executing the command.

scasa --postalign_dir salmon/ --align NO --quant NO --in cDNA/ --fastq test_R1.gz,test_R2.gz --ref reference/gencode.v34.transcripts.fa --out /mnt/disks/big_data/calico/scasa

##############################################################
#	SCASA V1.0.0
#	SINGLE CELL TRANSCRIPT QUANTIFICATION TOOL
#	Version Date: 2021-04-07
#	FOR ANY ISSUES, CONTACT: [email protected]
#	https://github.com/eudoraleer/scasa/
##############################################################

Directory scasa already exists. Writing into existing directory..
mkdir: cannot create directory ‘scasa/SCASA_My_Project_20220321141047/’: File exists
All done!

Even on your tutorial example, there are similar errors, but it keeps going:

mkdir: cannot create directory ‘Scasa_out/SCASA_My_Project_20220321143943/’: File exists

Preparing for alignment..
Indexing reference..
Directory Scasa_out/SCASA_My_Project_20220321143943/0PRESETS//REF_INDEX/ already exists. Writing into existing directory..

scasa cli docs: currently scasa only supports hg38

The scasa cli docs state:

    --ref,-r            Provide a directory to reference fasta file <STRING,
                        required, provide a fasta reference file, currently
                        scasa only supports hg38. Users could download fasta
                        reference via scasa Github. No default>

Doesn't https://github.com/eudoraleer/scasa/wiki/How-to-run-Scasa-for-a-new-annotation show that scasa can be run with a reference other than hg38, or am I not understanding the How-to-run-Scasa-for-a-new-annotation docs?

So many Error messages: please help

Directory ./ already exists. Writing into existing directory..
mkdir: cannot create directory ‘.//SCASA_testscasaHNVC02_20230414001259/’: File exists

Preparing for alignment..
Indexing reference..
Directory .//SCASA_testscasaHNVC02_20230414001259/0PRESETS//REF_INDEX/ already exists. Writing into existing directory..
Version Info: ### PLEASE UPGRADE SALMON ###

A newer version of salmon with important bug fixes and improvements is available.

The newest version, available at https://github.com/COMBINE-lab/salmon/releases
contains new features, improvements, and bug fixes; please upgrade at your
earliest convenience.

Sign up for the salmon mailing list to hear about new versions, features and updates at:
https://oceangenomics.com/subscribe
[2023-04-14 00:12:59.520] [jLog] [warning] The salmon index is being built without any decoy sequences. It is recommended that decoy sequence (either computed auxiliary decoy sequence or the genome of the organism) be provided during indexing. Further details can be found at https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode.
[2023-04-14 00:12:59.520] [jLog] [info] building index
out : .//SCASA_testscasaHNVC02_20230414001259/0PRESETS//REF_INDEX/
[2023-04-14 00:12:59.527] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2023-04-14 00:13:07.009] [puff::index::jointLog] [warning] Removed 236 transcripts that were sequence duplicates of indexed transcripts.
[2023-04-14 00:13:07.010] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
[2023-04-14 00:13:07.012] [puff::index::jointLog] [info] Replaced 4 non-ATCG nucleotides
[2023-04-14 00:13:07.012] [puff::index::jointLog] [info] Clipped poly-A tails from 11,186 transcripts
wrote 76267 cleaned references
[2023-04-14 00:13:07.789] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2023-04-14 00:13:10.356] [puff::index::jointLog] [info] ntHll estimated 85097693 distinct k-mers, setting filter size to 2^31
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 2147483648
Capacity = 2
Files:
.//SCASA_testscasaHNVC02_20230414001259/0PRESETS//REF_INDEX/ref_k31_fixed.fa

Round 0, 0:2147483648
Pass Filling Filtering
1 36 77
2 5 0
True junctions count = 277411
False junctions count = 422333
Hash table size = 699744
Candidate marks count = 4646414

Reallocating bifurcations time: 0
True marks count: 3337299
Edges construction time: 6

Distinct junctions = 277411

TwoPaCo::buildGraphMain:: allocated with scalable_malloc; freeing.
TwoPaCo::buildGraphMain:: Calling scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0);
allowedIn: 12
Max Junction ID: 318881
seen.size():2551057 kmerInfo.size():318882
approximateContigTotalLength: 66002535
counters for complex kmers:
(prec>1 & succ>1)=26025 | (succ>1 & isStart)=63 | (prec>1 & isEnd)=73 | (isStart & isEnd)=10
contig count: 433949 element count: 98078572 complex nodes: 26171

of ones in rank vector: 433948

[2023-04-14 00:15:32.167] [puff::index::jointLog] [info] Starting the Pufferfish indexing by reading the GFA binary file.
[2023-04-14 00:15:32.167] [puff::index::jointLog] [info] Setting the index/BinaryGfa directory .//SCASA_testscasaHNVC02_20230414001259/0PRESETS//REF_INDEX
size = 98078572

| Loading contigs | Time = 47.228 ms

size = 98078572

| Loading contig boundaries | Time = 25.94 ms

Number of ones: 433948
Number of ones per inventory item: 512
Inventory entries filled: 848
433948
[2023-04-14 00:15:32.408] [puff::index::jointLog] [info] Done wrapping the rank vector with a rank9sel structure.
[2023-04-14 00:15:32.412] [puff::index::jointLog] [info] contig count for validation: 433,948
[2023-04-14 00:15:32.736] [puff::index::jointLog] [info] Total # of Contigs : 433,948
[2023-04-14 00:15:32.736] [puff::index::jointLog] [info] Total # of numerical Contigs : 433,948
[2023-04-14 00:15:32.756] [puff::index::jointLog] [info] Total # of contig vec entries: 3,427,302
[2023-04-14 00:15:32.756] [puff::index::jointLog] [info] bits per offset entry 22
[2023-04-14 00:15:32.870] [puff::index::jointLog] [info] Done constructing the contig vector. 433949
[2023-04-14 00:15:33.302] [puff::index::jointLog] [info] # segments = 433,948
[2023-04-14 00:15:33.303] [puff::index::jointLog] [info] total length = 98,078,572
[2023-04-14 00:15:33.331] [puff::index::jointLog] [info] Reading the reference files ...
[2023-04-14 00:15:34.093] [puff::index::jointLog] [info] positional integer width = 27
[2023-04-14 00:15:34.093] [puff::index::jointLog] [info] seqSize = 98,078,572
[2023-04-14 00:15:34.093] [puff::index::jointLog] [info] rankSize = 98,078,572
[2023-04-14 00:15:34.093] [puff::index::jointLog] [info] edgeVecSize = 0
[2023-04-14 00:15:34.093] [puff::index::jointLog] [info] num keys = 85,060,132
for info, total work write each : 2.331 total work inram from level 3 : 4.322 total work raw : 25.000
[Building BooPHF] 100 % elapsed: 0 min 8 sec remaining: 0 min 0 sec
Bitarray 445693632 bits (100.00 %) (array + ranks )
final hash 0 bits (0.00 %) (nb in final hash 0)
[2023-04-14 00:15:41.958] [puff::index::jointLog] [info] mphf size = 53.1308 MB
[2023-04-14 00:15:42.025] [puff::index::jointLog] [info] chunk size = 49,039,286
[2023-04-14 00:15:42.025] [puff::index::jointLog] [info] chunk 0 = [0, 49,039,286)
[2023-04-14 00:15:42.025] [puff::index::jointLog] [info] chunk 1 = [49,039,286, 98,078,542)
[2023-04-14 00:15:53.934] [puff::index::jointLog] [info] finished populating pos vector
[2023-04-14 00:15:53.934] [puff::index::jointLog] [info] writing index components
[2023-04-14 00:15:54.455] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2023-04-14 00:15:54.494] [jLog] [info] done building index
Finnished indexing reference..
Begins pseudo-alignment..
nohup: redirecting stderr to stdout
Congratulations! Pseudo-alignment has completed in 30 seconds!
Scasa quantification has started..
Begin Scasa quantification for sample SRR10340946..
Error in file(con, "r") : cannot open the connection
Calls: readLines -> file
In addition: Warning message:
In file(con, "r") :
cannot open file './/SCASA_testscasaHNVC02_20230414001259/1ALIGN//SRR10340946_alignout/alevin/bfh.txt': No such file or directory
Execution halted
Loading required package: iterators
Loading required package: parallel
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file '/network/rit/lab/conklinlab/Renae/HNVC/HNVC02/SRR10340946/SCASA_testscasaHNVC02_20230414001259/2QUANT/SRR10340946_quant/Sample_eqClass.RData', probable reason 'No such file or directory'
Execution halted
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file './/SCASA_testscasaHNVC02_20230414001259/2QUANT//SRR10340946_quant//scasa_isoform_expression.RData', probable reason 'No such file or directory'
Execution halted
Congratulations! Scasa single cell RNA-Seq transcript quantification has completed in 30 seconds!
All done!

Alevin Failed

When I used the example data to test the program, I got the following error:

Error in file(con, "r") : cannot open the connection
Calls: readLines -> file
In addition: Warning message:
In file(con, "r") :
  cannot open file './/SCASA_My_Project_20221014093239/1ALIGN//Sample_01_S1_L001_alignout/alevin/bfh.txt': No such file or directory
Execution halted
Loading required package: iterators
Loading required package: parallel
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file '/storage/data/GYD/Softwares/scasa/Test_Dataset/SCASA_My_Project_20221014093239/2QUANT/Sample_01_S1_L001_quant/Sample_eqClass.RData', probable reason 'No such file or directory'
Execution halted
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file './/SCASA_My_Project_20221014093239/2QUANT//Sample_01_S1_L001_quant//scasa_isoform_expression.RData', probable reason 'No such file or directory'
Execution halted
Congratulations! Scasa single cell RNA-Seq transcript quantification has completed in 30 seconds!
All done!

I then found an alevin log file, alevin.log, in the result directory.
Its contents are shown below:

[2022-10-14 09:34:38.938] [alevinLog] [info] Found 76267 transcripts(+0 decoys, +0 short and +0 duplicate names in the index)
[2022-10-14 09:34:38.972] [alevinLog] [info] Filled with 72304 txp to gene entries
[2022-10-14 09:34:38.972] [alevinLog] [error] ERROR: Can't find gene mapping for : NM_001384956 w/ index 56214
[2022-10-14 09:34:38.972] [alevinLog] [error] ERROR: Txp to Gene Map not found for 3963 transcripts. Exiting

Thank you for helping me solve this problem.
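
The alevin error above says that some indexed transcripts (for example NM_001384956) have no entry in the transcript-to-gene map. A minimal sketch for listing such transcripts before running the pipeline is given below. It assumes a FASTA reference whose header's first token is the transcript ID and a two-column, tab-separated map (transcript, then gene); the function names and the toy data are hypothetical and are not part of scasa.

```python
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of each FASTA header."""
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def mapped_ids(handle):
    """Collect transcript IDs from column 1 of a tab-separated map."""
    ids = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            ids.add(fields[0])
    return ids


def missing_from_map(fasta_handle, map_handle):
    """Return FASTA transcript IDs that have no row in the map."""
    known = mapped_ids(map_handle)
    return [tid for tid in fasta_ids(fasta_handle) if tid not in known]


# Toy data standing in for the real reference and map (assumed formats).
demo_fasta = io.StringIO(">NM_000001 example\nACGT\n>NM_001384956\nGGCC\n")
demo_map = io.StringIO("NM_000001\tGENE1\n")
print(missing_from_map(demo_fasta, demo_map))
```

With real files, the two handles would come from `open(...)` on the reference FASTA and the map. A non-empty result points to transcripts that need to be added to the map or dropped from the reference.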

QOL change: reset command line color changed by die commands in perl script

Hi,

In the Perl script "SCASA_WRAP_V1.0.0.pl", 'die' commands that specify a color leave the command-line interface in that color after the scasa program exits. Please see https://stackoverflow.com/questions/5691570/coloring-a-perl-die-message for a way to avoid this issue. It appears to be a fairly simple change.

Thanks!
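
The underlying idea is that every colored message should end with the ANSI reset sequence, so the terminal returns to its default color once the program exits. The sketch below illustrates this in Python rather than in the wrapper's Perl; the helper names `colorize` and `die` are hypothetical and are not taken from the scasa script.

```python
import sys

RESET = "\033[0m"  # SGR 0: restore the terminal's default attributes
RED = "\033[31m"   # SGR 31: red foreground


def colorize(message, color=RED):
    """Wrap a message in a color code and always append the reset code."""
    return f"{color}{message}{RESET}"


def die(message, stream=sys.stderr):
    """Write a colored message that resets the color, then exit non-zero."""
    stream.write(colorize(message) + "\n")
    raise SystemExit(1)


# Show the raw escape sequences; the string ends with the reset code.
print(repr(colorize("ERROR: input directory not found")))
```

Because the reset code is part of the message itself, the color cannot leak into the shell prompt even when the program terminates immediately afterwards.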

my $scasa_dir = `which scasa`;

my $scasa_dir = `which scasa`; does not fail if scasa is not in the user's PATH, so the pipeline continues, leading to assorted downstream warnings/errors such as "Use of uninitialized value $scasa_dir in concatenation (.) or string".
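
A fail-fast check avoids this: look the executable up and stop with a clear message when it is missing, instead of continuing with an empty path. The sketch below shows the pattern in Python rather than in the wrapper's Perl; `require_tool` is a hypothetical helper, and only the standard-library `shutil.which` is assumed.

```python
import shutil


def require_tool(name):
    """Return the path of `name` found on PATH, or raise if it is absent."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"'{name}' was not found in PATH; install it or adjust PATH")
    return path


# Stop early with a readable message rather than an uninitialized value later.
try:
    print(require_tool("scasa"))
except RuntimeError as err:
    print(f"error: {err}")
```

The same check could be applied to each external tool the pipeline depends on before any work starts.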


