rwilton / arioc Goto Github PK

Arioc: GPU-accelerated DNA short-read alignment

License: BSD 3-Clause "New" or "Revised" License

bioinformatics genomics read-aligners sequence-alignment dna-sequence-alignment gpu bisulfite-sequencing methylation cuda c-plusplus

arioc's Introduction

Arioc: GPU-accelerated DNA short-read alignment

Download the current release here.

Download the Arioc user guide here or as part of the current release.

Arioc-encoded reference genomes

The Arioc aligners use binary encoded lookup tables as reference genome indexes. For convenience, several Arioc-encoded reference genomes are available for download at ftp://ftp.ccb.jhu.edu/pub/data/Arioc, the FTP server of the Center for Computational Biology at Johns Hopkins University:

genome	subdirectory	content
H sapiens	H_sapiens	NCBI GRCh38: patch 14 (WGS); patch 13 (WGS, WGBS)
M musculus	M_musculus	NCBI GRCm39
S cerevisiae	S_cerevisiae	S288C
T aestivum	T_aestivum	NCBI GCA_002220425.3 (Triticum 4.0)

Related publications

Wilton R, Budavari T, Langmead B, Wheelan SJ, Salzberg SL, Szalay AS. Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space. PeerJ. 2015, 3:e808. DOI:10.7717/peerj.808

Wilton R, Li X, Feinberg AP, Szalay AS. Arioc: GPU-accelerated alignment of short bisulfite-treated reads. Bioinformatics. 2018, 34(15):2673–2675. DOI:10.1093/bioinformatics/bty167

Wilton R, Szalay AS. Arioc: High-concurrency short-read alignment on multiple GPUs. PLoS Comput Biol. 2020, 16(11):e1008383. DOI:10.1093/10.1371/journal.pcbi.1008383

Wilton R, Szalay AS. Performance optimization in DNA short-read alignment. Bioinformatics. 2022, 38(8):2081–2087. DOI:10.1093/bioinformatics/btac066

Wilton R, Szalay AS. Short-read aligner performance in germline variant identification. Bioinformatics. 2023, 39(8):1–11. DOI:10.1093/bioinformatics/btad480

arioc's People

Contributors

Stargazers

Watchers

Forkers

bioinformaticsarchive sandy4321 animesh a7032018 iamh2o schaudge neta9119 nick-eagles

arioc's Issues

AriocP later than AriocP v1.50.3040.21310 (release) +2021-11-06T22:04:41 segfaults when run in Docker

Dear Richard !
I run AriocP in a docker container. This seems to be broken in releases after 2021-11-06.
Releases after that produce a segfault and coredump immediately after loading the LUTs.

docker run -it --gpus all -v /scratch:/scratch -v $PWD:/analysis fc918299b19f bash -c "cd /scratch/tmp; AriocP /analysis/ariocP.1054455"
=> NOT working
docker run --gpus all -v /scratch:/scratch -v $PWD:/analysis d0a059138d42 bash -c "cd /scratch/tmp; AriocP /analysis/ariocP.1054455"
=> working

<none>                                                                          <none>               dde502acd4df   7 weeks ago     5.85GB NOT working
<none>                                                                          <none>               fc918299b19f   7 weeks ago     5.85GB NOT working
<none>                                                                          <none>               d0a059138d42   7 weeks ago     5.85GB working
<none>                                                                          <none>               de703eee3b01   7 weeks ago     5.85GB working

devarea@sx253:~/2021-10_NovaSeq_Pipelines/Analysis/A274_16$ for i in fc918299b19f d0a059138d42; do docker run -v $PWD:/analysis $i bash -c "/bin/AriocP";done
150704980 [00000001] AriocP v1.50.3053.21328 (release) +2021-11-24T11:46:19
150704986 [00000001] Copyright (c) 2015-2021 Johns Hopkins University.  All rights reserved.
150704986 [00000001]  compiled             : 2021-11-25T11:20:16 (GNU g++ v10.2.1)
150704986 [00000001]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
150704986 [00000001]  computer name        : 63d58b0d2f78
150704986 [00000001]  operating system     : Debian GNU/Linux 11 (bullseye)
150704986 [00000001]  executable file      : /bin/AriocP
150704988 [00000001] ApplicationException ([0x00000001] AriocCommon/AppGlobalCommon.cpp 175): cannot open configuration file './AriocP.cfg'
150704988 [00000001] AriocP ends (1)
150705893 [00000001] AriocP v1.50.3040.21310 (release) +2021-11-06T22:04:41
150705896 [00000001] Copyright (c) 2015-2021 Johns Hopkins University.  All rights reserved.
150705896 [00000001]  compiled             : 2021-11-13T12:33:01 (GNU g++ v10.2.1)
150705896 [00000001]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
150705896 [00000001]  computer name        : be8375a0bdce
150705896 [00000001]  operating system     : Debian GNU/Linux 11 (bullseye)
150705896 [00000001]  executable file      : /bin/AriocP
150705897 [00000001] ApplicationException ([0x00000001] AriocCommon/AppGlobalCommon.cpp 175): cannot open configuration file './AriocP.cfg'
150705897 [00000001] AriocP ends (1)

Any idea what to do ? Do you know what changed ?
Cheers,
KK

Disabling softclipping

Hello again! I was wondering if there was a way to disable soft clipping for alignment, or heavily penalize its use.

Im trying to find the differences in alignments between Arioc and Bowtie2, so that I can pitch its use to my team, but Arioc likes to softclip in our variable regions with the default recommended settings. I can give additional points awarded for correct matches to counteract this, but it ends up with very different results from how we use Bowtie2 (and from what I expect to see) and it adds a confounding variable to measuring how its performing.

We currently use global alignment with bowtie2 (Since our reference genomes and reads are almost the same length) and we utilize its ability to not penalize "N"s in the reference genome. If that's all out of scope I understand and will try and narrow down a scoring system that gets the results we expect to see!

Consider putting source code under version control

First I want to say thanks for publishing this project, for making the source code for stable releases available, and for your engagement on this issue tracker.

I'm curious if the decision to keep the source code out of version control is deliberate. I'm sure you're aware of the some of the convenient tools Github provides: browsing source code, viewing diffs between commits, commenting on source code or on commits/edits to the source code, supporting patch submissions, etc. If there were concerns about privacy or access control I would assume the source code for each release wouldn't be public. Are there other concerns about putting Arioc's source code under version control?

Correct way to encode hg38 ?

I try to encode the latest masked hg38 reference and face some problems. the file i want to encode is here: GRCh38
It contains maskings for some problems in the original hg38, see this blogpost and paper.

It has 195 contigs, using wildcard matching for the individual contigs like:

<file SN="*" subId="1">GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta</file>

yields:

worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.ungapped.cfg 215111040 [0001217e] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 215111046 [0001217e] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 215111046 [0001217e] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 215111046 [0001217e] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 215111046 [0001217e] computer name : sx253 215111046 [0001217e] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 215111046 [0001217e] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.ungapped.cfg 215112579 [0001217e] AriocE::encodeR: encoding 1 file (32 CPU threads available)... 220200556 [00012183] ApplicationException ([0x00074115] AriocE/tuEncodeFASTA.cpp 64): Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823: /scratch/karl/arioc/R/GRCh38_masked_v2/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta

splitting the fasta into individual chromosomes and using one file line for each chromosome (n=195) of the reference yields:

worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.gapped.cfg 230517831 [000126de] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 230517838 [000126de] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 230517838 [000126de] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 230517838 [000126de] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 230517838 [000126de] computer name : sx253 230517838 [000126de] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 230517838 [000126de] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.gapped.cfg 230518180 [000126de] AriocE::encodeR: encoding 195 files (32 CPU threads available)... 230543830 [000126de] RaiiFile::internalOpen: Unable to open file: /scratch/karl/arioc/R/GRCh38_masked_v2/chromosomes/chrUn_KI270414v1.fa (error 24: Too many open files) Segmentation fault

So, how do i correctly encode the hg38 reference with all its masking chromosomes ?

Issue while running AriocU and AriocP

When I tried to use AriocU for alignment, I got this error:

48:57.530 [00001277] ApplicationException ([0x4727] CppCommon/RaiiDirectory.cpp 321): GetFilenames: scandir failed (errno=2)

Strange, since AriocE binaries are working fine, and I have double checked the paths.

XA Tag

Other aligners (bwa in this case) use the XA tag to output alternative alignments for a read, like:

HWI-A01349_BSF_0941:1:2472:14507:8234#A274_16_S81560 1171 chr1 9997 0 51S100M = 10032 -65 CCCACCACTACCCTCCTCCAGCGCCGACGGCTGCGCCTGAGGCGTATTATACCGATAACCCTAACCCTTACCCTAACACTATCCCTTACACTACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTAACCC F,,::F,,,,::F,F,,,,,,:,,F:,,,,F,,F:,F,,,,,,,,:,,:,,,,,,:F,:,F,F,F,F,,:,,,,,FF,FFF,,:,F,:,,F,,,:,F:,,,,FF:F,,,,:F,F,FF,,FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF XA:Z:chrX,+156030605,96M55S,6;chr5,-10067,55S96M,7;chr5,-10085,55S96M,7;chr5,-10025,55S96M,7;chr5,-10001,55S96M,7;chr5,-10061,55S96M,7;chr5,-10055,55S96M,7;chr5,-10103,55S96M,7;chr5,-10007,55S96M,7;chr5,-10337,55S96M,7;chr5,-10037,55S96M,7;chr5,-10049,55S96M,7;chr5,-10097,55S96M,7;chr5,-10091,55S96M,7;chr5,-11357,55S96M,7;chr5,-10043,55S96M,7;chr5,-10079,55S96M,7;chr5,-10073,55S96M,7;chr5,-10013,55S96M,7;chr5,-10019,55S96M,7;chr5,-10343,55S96M,7;chr5,-10031,55S96M,7;chrX,+156030683,59M92S,1;chr3,+198173892,59M92S,1;chr12,-10195,92S59M,1;chr3,-10005,94S52M1I4M,1;chr12_GL877875v1_alt,-195,92S59M,1;chr3_KI270784v1_alt,-61610,92S59M,1; MC:Z:79M72S MD:Z:17A8C3A4A2C3A46A10 PG:Z:MarkDuplicates RG:Z:A274_16 NM:i:7 MQ:i:0 AS:i:65 XS:i:66

I cant find any of these tags in my Arioc mapped BAM.
Are they not supported ?
I have several downstream tools that ask for these tags.

Unable to encode very short reference sequences.

Hello!

I've been trying to Align against some small reference sequences but run into the error "maximum subunit ID in nongapped J table: 127; in gapped J table: 1". As far as I can tell, this is due to the nongapped encoding requiring 84 consecutive NT's without any N's. When I pad the reads with ACTGs such that there is 84 NT's on either side of the string of N's, it works. Is there a way to circumvent this without padding the reference sequences?

It's very possible what I need out of an aligner doesn't mesh with Arioc, but the potential speed up was too much to ignore!

Edit: Changed name of file to avoid confusion
reference_templates.fa

>mutant
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg
>wt
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg

Encode_gapped.cfg

<AriocE gpuMask="0x00000001" seed="hsi18_0_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="/mnt/ramdisk/kraken/experiments/Experiment">
    <file SN="(*)" subId="1">reference_templates.fa</file>
  </dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</path>
  </dataOut>
</AriocE>

Encode_nongapped.cfg

<AriocE gpuMask="0x00000001" seed="ssi84_2_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="/mnt/ramdisk/kraken/experiments/Experiment">
    <file SN="(*)" subId="1">reference_templates.fa</file>
  </dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</path>
  </dataOut>
</AriocE>

Encode_Reads.cfg

<AriocE gpuMask="0x00000001">
<dataIn sequenceType="Q" srcId="1" QNAME="(*.*) */*">
<file subId="2">/mnt/ramdisk/kraken/experiments/Experiment/demux/sample_S_01_S1_R1_001.fastq</file>
</dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/reads</path>
  </dataOut>
</AriocE>

Align_upaired.cfg

<AriocU gpuMask="0x00000001" batchSize="256k">
  <R>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</R>
  <nongapped seed="ssi84_2_30" maxJ="1" />
  <gapped seed="hsi32_0_30" maxJ="*" Wmxgs="4_2_5_3" Vt="L, 0.6,0.6" AtO="99999999" />

  <Q filePath="/mnt/ramdisk/kraken/experiments/Experiment/enc/reads">
    <unpaired srcId="1" subId="2">
      <file>Sample_S_01_S1_R1_001</file>
    </unpaired>
  </Q>

  <A overwrite="true" mapqVersion="3" maxAperRead="1">
    <sam report="m">/mnt/ramdisk/kraken/experiments/Experiment/bam</sam>
  </A>

</AriocU>

Align log output:

Running arioc with command: AriocU /mnt/ramdisk/kraken/experiments/Experiment/align_reads.cfg
Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
 compiled             : 2024-05-29T01:43:11 (GNU g++ v11.4.0)
 data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
 computer name        : 2b1c291e4294
 operating system     : Ubuntu 22.04.4 LTS
 executable file      : /Arioc.x.151/src/AriocU
 configuration file   : /mnt/ramdisk/kraken/experiments/Experiment/align_reads.cfg
AriocBase::loadR: load R from disk...
AriocBase::loadR: loaded 1 R sequence: 19144 bytes (0.0GB) from 2 files in 0ms (infMB/s)
AriocBase::loadHJ: load H and J LUTs from disk...
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 16
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 12 bytes, ... ) on thread 17
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 18
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 1412 bytes, ... ) on thread 19
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
ApplicationException ([0x00000007] AriocBase/AriocBase.cpp 749): maximum subunit ID in nongapped J table: 127; in gapped J table: 1
AriocU ends (1)

Non Gapped Encoding log output

Running arioc with command: AriocE /mnt/ramdisk/kraken/experiments/Experiment/encode_nongapped_reference.cfg
AriocE v1.51.3140.23137 (release) +2023-05-17T01:06:49
Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
 compiled             : 2024-05-29T01:37:38 (GNU g++ v11.4.0)
 data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
 computer name        : 706d38edd182
 operating system     : Ubuntu 22.04.4 LTS
 executable file      : /Arioc.x.151/src/AriocE
 configuration file   : /mnt/ramdisk/kraken/experiments/Experiment/encode_nongapped_reference.cfg
AriocE::encodeR: encoding 1 file (48 CPU threads available)...
AriocE::encodeR: encoded 1 file
AriocE::sortCgpu: C list sort on GPU is unsupported.  Sorting on CPU instead.
AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
AriocE::sortCcpu: sorted C list
AriocE::computeJlistSizes: m_nJ=0 (0+0) m_iCshort=0 m_iCzero=0
AriocE::computeJlistSizes: log2(nJ)  # J-lists
AriocE::computeJlistSizes: completed: J table uses 0 elements for 0 J values ( -nan% for list counts in J lists)
AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
AriocE::buildJlists: built J lists for 1 files (2 sequences)
AriocE::sortJgpu: sorting J lists on GPU...
AriocE::sortJgpu: J list sort complete
AriocE::countBBJ: counting big-bucket lists...
AriocE::countBBJ: counted 0 big-bucket lists
AriocE::validateHJ: start
AriocE::validateHJ: H table uses 0/1073741824 (0.000%) hash keys
AriocE::validateHJ: J table contains 0 J values (list counts in lists: 0 ( -nan%))
AriocE::validateHJ: maximum nJ=0 for hash key 0x00000000
AriocE::validateHJ: completed
AriocE::validateSubIds: start
AriocE::validateSubIds: validated 1073741824 hash values
AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
AriocE::validateSubIds:  subId Jf Jrc
AriocE::validateSubIds: completed
AriocE::finalizeJ: finalizing J lists on 48 worker threads...
AriocE::finalizeJ: completed
AriocE ends (0)

Cuda pinned memory error when switching to 2 GPUs

I've been benchmarking AriocP with one GPU (GTX1070) and things were going fine, however, when I added a second GPU (also 1070) to the system I get the following error:

ApplicationException ([0x00006008] C:\Projects VS120\Arioc\CudaCommon\CudaPinnedPtr.h 106): Unable to allocate page-locked system memory (CUDA "pinned" memory). Please ensure that there is sufficient system memory to execute this program.

I get this even when only using one or the other GPU. The system this is on has 32GB RAM, but the reference is fairly small (hg38 chr10 only) so I thought it would work. I've also tried with the S. cerevisae example data, with the same error. I've also played around with the batchSize. Note this is on Windows at the moment.

Issues encoding hg38 with AriocE

Hello,

Thanks for this piece of software, it is nice to have a GPU-accelerated read mapper.
I am trying to identify methylated cytosines in hg38. However, I saw that AriocE has troubles when encoding the 24 hg38 chromosomes simultaneously.

The two configuration files that I am using with AriocE are available here and here. I am using Arioc 1.51.

Basically, AriocE finishes without errors, but I get an error when running AriocP. Inspecting the output of AriocE, I found that it creates a folder named ssi84_2_30_CT where there is a configuration file that, I think, looks strange as is shown below

<?xml version="1.0" encoding="UTF-8"?>
<SAM fn="ssi84_2_30_CT">
    <HD VN="1.6"/>
    <SQ srcId="0" subId="001" rm="chr1" UR="" LN="248956422" AS="hg38" M5="2648ae1bacce4ec4b6cf337dcae37816" SN="chr1"/>
    <SQ srcId="0" subId="002" rm="chr2" UR="" LN="242193529" AS="hg38" M5="4bb4f82880a14111eb7327169ffb729b" SN="chr2"/>
    <SQ srcId="0" subId="003" rm="" UR="" LN="198295559" AS="hg38" M5="a48af509898d3736ba95dc0912c0b461" SN="chr3"/>
    <SQ srcId="0" subId="004" rm="" UR="" LN="190214555" AS="hg38" M5="3210fecf1eb92d5489da4346b3fddc6e" SN="chr4"/>
    <SQ srcId="0" subId="005" rm="" UR="" LN="181538259" AS="hg38" M5="f7f05fb7ceea78cbc32ce652c540ff2d" SN="chr5"/>
    <SQ srcId="0" subId="006" rm="" UR="" LN="170805979" AS="hg38" M5="6a48dfa97e854e3c6f186c8ff973f7dd" SN="chr6"/>
    <SQ srcId="0" subId="007" rm="chr7" UR="" LN="159345973" AS="hg38" M5="94eef2b96fd5a7c8db162c8c74378039" SN="chr7"/>
    <SQ srcId="0" subId="008" rm="chr8" UR="" LN="145138636" AS="hg38" M5="c67955b5f7815a9a1edfaa15893d3616" SN="chr8"/>
    <SQ srcId="0" subId="009" rm="chr9" UR="" LN="138394717" AS="hg38" M5="addd2795560986b7491c40b1faa3978a" SN="chr9"/>
    <SQ srcId="0" subId="003" rm="chr3" UR="" LN="0"/>
    <SQ srcId="0" subId="004" rm="chr4" UR="" LN="0"/>
    <SQ srcId="0" subId="005" rm="chr5" UR="" LN="0"/>
    <SQ srcId="0" subId="006" rm="chr6" UR="" LN="0"/>
    <SQ srcId="0" subId="010" rm="chr10" UR="" LN="133797422" AS="hg38" M5="907112d17fcb73bcab1ed1c72b97ce68" SN="chr10"/>
    <SQ srcId="0" subId="011" rm="chr11" UR="" LN="135086622" AS="hg38" M5="1511375dc2dd1b633af8cf439ae90cec" SN="chr11"/>
    <SQ srcId="0" subId="012" rm="chr12" UR="" LN="133275309" AS="hg38" M5="e81e16d3f44337034695a29b97708fce" SN="chr12"/>
    <SQ srcId="0" subId="013" rm="chr13" UR="" LN="114364328" AS="hg38" M5="17dab79b963ccd8e7377cef59a54fe1c" SN="chr13"/>
    <SQ srcId="0" subId="014" rm="chr14" UR="" LN="107043718" AS="hg38" M5="acbd9552c059d9b403e75ed26c1ce5bc" SN="chr14"/>
    <SQ srcId="0" subId="015" rm="chr15" UR="" LN="101991189" AS="hg38" M5="f036bd11158407596ca6bf3581454706" SN="chr15"/>
    <SQ srcId="0" subId="016" rm="chr16" UR="" LN="90338345" AS="hg38" M5="24e7cabfba3548a2bb4dff582b9ee870" SN="chr16"/>
    <SQ srcId="0" subId="017" rm="chr17" UR="" LN="83257441" AS="hg38" M5="a8499ca51d6fb77332c2d242923994eb" SN="chr17"/>
    <SQ srcId="0" subId="018" rm="chr18" UR="" LN="80373285" AS="hg38" M5="11eeaa801f6b0e2e36a1138616b8ee9a" SN="chr18"/>
    <SQ srcId="0" subId="019" rm="chr19" UR="" LN="58617616" AS="hg38" M5="b0eba2c7bb5c953d1e06a508b5e487de" SN="chr19"/>
    <SQ srcId="0" subId="020" rm="chr20" UR="" LN="64444167" AS="hg38" M5="b18e6c531b0bd70e949a7fc20859cb01" SN="chr20"/>
    <SQ srcId="0" subId="021" rm="chr21" UR="" LN="46709983" AS="hg38" M5="2f45a3455007b7e271509161e52954a9" SN="chr21"/>
    <SQ srcId="0" subId="022" rm="chr22" UR="" LN="50818468" AS="hg38" M5="221733a2a15e2de66d33e73d126c5109" SN="chr22"/>
    <SQ srcId="0" subId="023" rm="chrX" UR="" LN="156040895" AS="hg38" M5="49527016a48497d9d1cbd8e4a9049bd3" SN="chrX"/>
    <SQ srcId="0" subId="024" rm="chrY" UR="" LN="57227415" AS="hg38" M5="b2b7e6369564d89059e763cd6e736837" SN="chrY"/>
    <PG ID="AriocE (ssi84_2_30_CT)" PN="AriocE" VN="1.51.3127.22294" CL="/home/ae/exp/ba-nvbs/ba-hg38.m/ba-arioc.m/SRR770598.m/ba-encode.m/fasta1.cfg" dt="2022-11-05T13:49:56" ms="738209" mJ="*"/>
</SAM>

For example, the chromosome subId=003 appears twice. If I edited manually this configuration file automatically generated by AriocE , by removing the duplicated elements, I can get AriocP works, but it never finishes the mapping. Other things that I tried was to use AriocE with individual chromosomes. Here, everything works well. So it seems that the issue is not with the reference genome.

Do you know what I am doing wrong?

Thanks in advance.

AriocP hangs when aligning WGBS data

[@aedera This is a followup to issue #29.]

I am experiencing problems when using AriocP with the encoding generated by the new build. It runs without throwing errors but it gets stuck during the mapping.

This is the conf file used by AriocP

<?xml version="1.0" encoding="utf-8"?>
<AriocP batchSize="48k">
  <R>../ba-encode.m</R>
  <nongapped seed="ssi84_2_30_CT" />
  <gapped seed="hsi25_0_30_CT" Wmxgs="2_6_5_3" Vt="L,-25,2" />
  <Q filePath="../ba-encode.m">
    <paired subId="1">
      <file>SRR770598_1</file>
      <file>SRR770598_2</file>
    </paired>
  </Q>
  <A overwrite="true" cigarFormat="MIDS">
    <sam report="cdru">.</sam>
  </A>
</AriocP>

And this is the output before stopping it after being there for a couple of hours.

120428160 [00064cd1] AriocP v1.51.3129.22314 (release) +2022-11-10T17:27:27
120428164 [00064cd1] Copyright (c) 2015-2022 Johns Hopkins University.  All rights reserved.
120428164 [00064cd1]  compiled             : 2022-11-11T11:14:17 (GNU g++ v9.4.0)
120428164 [00064cd1]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
120428164 [00064cd1]  computer name        : camus
120428164 [00064cd1]  operating system     : Ubuntu 20.04.4 LTS
120428164 [00064cd1]  executable file      : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/AriocP
120428164 [00064cd1]  configuration file   : /mnt/ssd1/aedera/ba-nvbs/ba-hg38.m/ba-arioc.m/testing_new_version_of_arioc.m/ca-mapping.m/file.cfg
120438233 [00064cd1] AriocBase::loadR: load R from disk...
120438233 [00064cd1] AriocBase::loadR: reference file (subId 0) ignored: ../ba-encode.m/SRR770598_1$a21.sbf
120438233 [00064cd1] AriocBase::loadR: reference file (subId 0) ignored: ../ba-encode.m/SRR770598_2$a21.sbf
120439568 [00064cd1] AriocBase::loadR: loaded 24 R sequences: 2353273816 bytes (2.2GB) from 48 files in 1335ms (1681.1MB/s)
120439568 [00064cd1] AriocBase::loadHJ: load H and J LUTs from disk...
120439568 [00064cf1] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 412913
120442406 [00064cf1] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120444310 [00064d03] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29387864448 bytes, ... ) on thread 412931
120455980 [00064d03] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120506723 [00064d73] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 413043
120509885 [00064d73] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120512550 [00064d87] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 29407356656 bytes, ... ) on thread 413063
120526951 [00064d87] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
120554069 [00064cd1] AriocBase::loadHJ: loaded H and J LUTs from disk in 74501ms (890.1MB/s)
120554069 [00064dfc] void tuWatchdog::main: watchdog interval 0s
120555059 [00064dfd] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
Command terminated by signal 2
21419.89user 82.51system 5:59:50elapsed 99%CPU (0avgtext+0avgdata 70373596maxresident)k
7560inputs+14464outputs (33major+21161134minor)pagefaults 0swaps

It seems that two files are being skipped, but not sure if this is expected. Is my conf file okay?

No Source Code Available?

I'm having an issue compiling on centos because my glibc is 2.12 and it requires 2.14 in the precompiled version. So I went to download the source code and all 3 versions(both zips per) don't have the source code and only have a manual and some attribute files.

Am I missing something or do you have an idea on how to run despite the wrong glibc version?

Mapping accuracy ?

Dear Richard !

I try to use Arioc to re-map a BWA mapped WGS BAM File, but i get very weird results. In attached screenshot Arioc BAM is on top, bottom lane is the BWA BAM. I get many regions that show presumably wrongly mapped reads in Arioc.
Arioc mapping was done with these settings:

<nongapped seed="ssi84_2_30" />
<gapped seed="hsi20_0_30" maxJ="64" seedDepth="4" Wmxgs="2,6,5,3" Vt="100" />

Any idea what could be wrong ?

Allow named pipes as input to AriocE

First of all, thanks for creating Arioc and publishing it under a permissive license !

I have seen a closed issue about fastq.gz and i am in a similar situation, i would like to remap bam files without having to go
bam > uncompressed fastq > very large arioc format > uncompressed sam > bam. With my WGS data this really is a pain both in disk space and in network utilization.

I figured that i can use a named pipe to get the output file from AriocP directly into samtools, that works great. However, i could not figure out a way to do that with the input for AriocE. Whenever i use a named pipe as input it balks at me with "Bad address".

My IT guy tells me that you open the input as "file" not as "stream" so this can not easily work. Thus my question: Is there a chance you could make AriocE accept piped fastq data ? As far as i understand that its just a different way of opening the input data, i presume you don't do any random access on the input fastq(s).

This would make AriocE accept any input that one can convert to fastq and stuff into a pipe, which would be great!

Error when running AriocP

I'm trying to get Arioc working with the the example data for GRCh38. I'm getting this error when running AriocP for the paired end data using the provided cfg file:

ApplicationException ([0x03680784] AriocCommon/InputFileGroup.cpp 612): inconsistent value for attribute srcId= for ../../Q/G
RCh38/enc.paired/i100p_1$a21.sbf: expected 1 specified 0

Please check indel normalisation works appropriately for reverse orientation reads.

First thank you for writing a gpu aligner that actually offers appreciable benefits. (Even without nvlink).
I have been checking how the alignments compare to bwa2 alignments and noticed the following difference in detected variant calls.

The top alignment is from arioc the bottom alignment is from bwa2. The insertion event is from a single G to two G's.
It looks like arioc has left normalized the reverse reads differently to the forward aligned reads.

Allocation error

Greetings!

I'm testing Arioc on several single-end read sets. Encoding the reference and the Fastq files completed with no apparent errors, but the AriocU invocation terminated in seconds with the following error message.

134823012 [0000bdac] AriocU v1.30.2427.18290 (release)
134823013 [0000bdac] Copyright (c) 2015-2018 Johns Hopkins University.  All rights reserved.
134823013 [0000bdac]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5
134823013 [0000bdac]  executable file      : 
134823013 [0000bdac]  configuration file   : /scratch/standage/temp/config/align-unpaired.xml
134823013 [0000bdac] ApplicationException ([0x48556] CppCommon/WinGlobalPtr.h 134): realloc failed to allocate 0 bytes
134823013 [0000bdac] AriocU ends (1)

I tried fiddling with the batchSize configuration a bit, but this doesn't seem to make a difference. I wasn't hopeful anyway since the problem looks like it's related to attempting to allocate zero bytes as opposed to attempting to allocate too many bytes.

Is this an error with which you're familiar? Are there any obvious mistakes in the configuration files?

Config files

Makefile in linux source for powerpc machine could not work correctly

In the makefile in Linux source, there is a condition sentence to see whether it work on X86 or PowerPC architecture.

TARGET_ARCH := $(shell $(GPP) -dumpmachine)
# ppc64
ifneq (,$(findstring ppc64,$(TARGET_ARCH)))
......

I met some trouble when I run "make", so I looked the makefile and found above sentence. I checked this by running "g++ -dumpmachine" in os shell, output was like this:

root@ubuntu:/home/songdm/Arioc/src# g++ -dumpmachine
powerpc64le-linux-gnu
root@ubuntu:/home/songdm/Arioc/src#

I noticed it returned "powerpc64...." instead of "ppc64".
I changed ppc64 to powerpc64 in makefile and re-make, then it worked correctly.
My os is Ubuntu 16.4. I also tested this command in Redhat 7.5, it returned "ppc64le-linux-gnu".
So I think the condition sentence should include the both two conditions. Or use "uname -m" command to get machine type, it always return "ppc64" or "ppc64le" for PowerPC machine, regardless any os distro.

Arioc underestimating CpG methylation

Hello,

I have been testing Arioc for a few days and I can see some clear speed advantages compared to Bowtie2. Also, by tweaking the parameters, I can get a higher percentage of mapped reads with Arioc. However, my main motivation to use Arioc is to use it in bisulfite sequencing alignment since those can take days with Bismark depending on the number of samples and sequencing depth.

I compared Arioc and Bismark using the same mock BS-Seq data created using the package Sherman:
Sherman --length 150 --number_of_seqs 100000 --genome_folder genome --quality 40 --minfrag 200 --maxfrag 500 --CG_conversion 80 --CH_conversion 2 --error_rate 1 --truth_set --colorspace --paired_end

I iterated through multiple Arioc parameters:
batchSize
maxJ (reads)
maxJ (reference)
seedDepth
AtO
seed (hsi25_0_30_CT vs hsi25_0_31_CT)
maxBQS
Vt

But I could never get the same results as Bismark after Bismark methylation extractor with minimum alignment score as default or --local:
79.7% CpG methylation
2.3% Non-CpG methylation

The best I got with Arioc was:
78.3 to 79% CpG methylation
2.2 to 2.4% Non-CpG methylation
Mostly by increasing maxBQS, which reduces the number of analysed Cs ending up with less analysed Cs than Bismark, and setting At0 to 6.

For the remaining settings I kept:
batchSize=64k
maxJ=16 (reads)
maxJ=16 (reference)
seedDepth=6
Vt="L,-32,2"
seed = hsi25_0_31_CT & ssi84_2_31_CT

Changing these setting had little effect on the CpG methylation percentage.

I don't know what can I do to improve these results. I haven't tried setting different error rates in Sherman which could change the results. Still, I was expecting that Arioc would perform better than Bismark in assigning methylated CpGs. I really would like to implement this package into our pipelines but it needs to be comparable to Bismark. Is there a way to improve the CpG methylation results in Arioc?

tuning alignment soft-clipping for variant recall

I have noticed that Arioc produces slightly more soft clips in the alignments than an alternative (bwa mem).
The IGV screenshot shows an example from the Arioc test data. BWA alignment in the centre: two reads carry the alternative alignment A.
Arioc at the bottom soft clips resulting in a single A and a soft clip.
My untuned simple test bcftools variant calling pipeline then produces an A variant call from BWA but not from Arioc.

Overall I have noticed that Arioc alignments result in precise variant calls (in very high agreement to BWA alignment variant calls) but that BWA alignments will generate more lower quality snps in a subsequent variant calling stage. This is likely caused by the default settings for bcftools variant calling being compatible with BWA. I attribute many of these to the MAPQ differences. However my users would be concerned by the lower recall.

The Arioc papers mention that the non gapped aligner will produce soft clips under certain conditions. Is there a user controllable parameter to tune soft clipping?
Suggestions on using arioc as a drop in replacement aligner for a variant calling pipeline in either a high-precision or a high-recall variant calling setting might be helpful to users.

Append fastq comment to SAM output like bwa-mem -C option

Hi !
I need to preprocess my raw reads to extract molecular barcodes.
The preprocessing puts them into the fastq ID line like this:
@K00336:237:HL2CKBBXY:1:1101:1539:1349 BC:Z:ATGGTAGC+NGATCGAC ZA:Z:NNNNN ZB:Z:NNNNN RX:Z:NNN-NNN QX:Z:$$$ $$$ GAGGCCCTTTGAATGTAATGAATGTGGGAAATCTTTTGGCAGGAAGTAACAACTCATCCTACATACAAGAACACACACTGGNGANANACC +
Running this fastq through arioc leads to incorrect SAM file like:
K00336:237:HL2CKBBXY:1:1102:26545:1666 BC:Z:ATGGTAGC+CGATCGAC ZA:Z:NNNNN ZB:Z:CTAT RX:Z:NNN-CTA QX:Z:$$$ JFJ 83 chr11 110579883 60 90= = 110579757 -216 CATGGTGTCCTTCTTTGTATAGGCTGGGCGGCTGCAAGCCTGCCCTGATGAGGGACCGGGCATTCCGGAAACATGGCTGGCATTGCTAAA JJJJJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:180 NM:i:0 MD:Z:90 YT:Z:CP MQ:i:60 Na:i:1 Nb:i:1 RG:Z:RG c3:i:53 YS:i:182

BWA has the -C option which does the following
Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.

Is there any way to make Arioc behave like bwa-mem with -C option ? I.e. append the fastq comment to the end of the sam line ?

HOWTO?

I have two fastq files for a paired WGBS read, as well as hg38.fa.

It's not clear what the sequence of operations is, to go from this, to a SAM or a BAM file. Could you assemble a quick HOWTO or a very simple cfg? For example, how do I generate gapped or ungapped seeds?

Suppress @SQ metadata header for aggregated FASTA file

Ok, used this reference for mapping a WGS file, seemed to work OK, even did GATK base quality recalibration on the resulting BAM. However, after that i want to use Manta to find structural variations, but it balks at the bam with:
Reference genome mismatch: Reference fasta file is missing a chromosome found in the Tumor BAM/CRAM file: '0.101#3341'
Looking into the header of the bam i do find a SQ line:
@SQ SN:0.101#3341 LN:129060516 UR:https://console.cloud.google.com/storage/browser/_details/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta/others.fa rm:(multiple reference sequences)
This @sq describes the file i put all the decoys in. The line is followed by @sq lines for all the chromosomes in that file so that looks OK, but i am pretty sure it should not add an extra @sq line for the file itself.
I tried to remove the superfluous @sq line from the bam with samtools reheader (just removing the one line) but then indexing the bam fails with:
samtools index -@16 K274_16.bam samtools index: failed to create index for "K274_16.bam": Numerical result out of range
i checked with "samtools view K274_16.bam 0.101#3341", and there are no reads mapped to this @sq.
So, it seems currently i cant use this bam for Manta. I cant remove the @sq line.
Dont know what to do now.

Originally posted by @karlkashofer in #20 (comment)

Erroneous characters generated in Sam

I used Arioc to map reads to my reference genome twice but both sam files has erroneous characters generated and it cause samtools view to abort with error [W::sam_read1] Parse error at line 2602870 and [W::sam_read1] Parse error at line 2644696. Those two lines in the sam file are:
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 20253514/* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 0515640 /* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252

Another example is:
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 151M = 42* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CP Na:i:872 Nb:i:1 c3:i:30 YS:i:302
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 JJJFJJJJF* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CPNa:i:872 Nb:i:1 c3:i:30 YS:i:302

They seem to be generated by the same read. Could you help explain the occurance of "/*"? I'm using the Arioc v1.51.

Does Arioc deal with Fastq.gz files?

Hello Team,
Is there a way to make use of the compressed fastq (paired-end) files in AriocE program.

Thanks,
Rajneesh

Memory usage requirements for large genomes

Dear developers,

This looks like a very interesting tool and is timely for many research groups now that Nvidia RTX 30x0 generation GPUs are becoming available.

I am working on a genome project centered on a species with a very large genome (18-19Gbp) and we need to map a lot of short-read data against this genome (approximately 140x coverage). While we have access to many CPUs in conventional clusters, it would be interesting to run Arioc on our GPU machines.

I wonder what kind of memory requirements we need to consider when 1) making the lookup table and 2) running the actual application. For example, is a machine with 256GB RAM and Nvidia RTX 3090 GPU expected to be able to carry out any of these two step in an efficient manner?

pthread_mutex_unlock returned 22 for AriocE

Hi,
I was testing Arioc with provided sample dataset. I created the index for yeast reference and saw a warning message for both gapped and ungapped indexes (ApplicationException ([0x00079555] CppCommon/RaiiMutex.cpp 107): pthread_mutex_unlock returned 22). Below is the log for making gapped reference index.
I was able to generate SAM files for both unpaired and paired reads. Just want to know if the pthread warning is critical and may affect the result. I am planning to use Arioc for a gigantic dataset, and hopefully it won't affect the result.
During the compilation, I did receive the warning message but was able to get the binary files. I am wondering if it is the reason I got the pthread error. I have CUDA 11.2 installed in Ubuntu 14.04 LTS

/usr/local/cuda/include/cub/util_cpp_dialect.cuh:129:13: warning:
CUB requires C++14. Please pass -std=c++14 to your compiler. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
CUB_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);"

Thanks!

162707206 [000131a1] AriocE v1.43.2991.21102 (release) +2021-04-12T13:07:35
162707206 [000131a1] Copyright (c) 2015-2021 Johns Hopkins University. All righ ts reserved.
162707206 [000131a1] compiled : 2021-04-14T15:55:58 (GNU g++ v5.4.0 )
162707206 [000131a1] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8= 8 JtableHeader=5 (little-endian)
162707206 [000131a1] computer name : CBU5143610
162707206 [000131a1] executable file : /home/apps/arioc/bin/AriocE
162707206 [000131a1] configuration file : /home/apps/arioc/bin/R/S_ce revisiae/AriocE.gapped.cfg
162707219 [000131a1] AriocE::encodeR: encoding 17 files (34 sequences) (112 CPU threads available)...
162709417 [000131a1] AriocE::encodeR: encoded 17 files (34 sequences)
162711552 [000131a1] AriocE::computeJlistSizes: sorting C list (1073741824 hash keys)...
163102797 [000131a1] AriocE::computeJlistSizes: sorted C list
163105773 [000131a1] AriocE::computeJlistSizes: m_nJ=24485122 (220907+24264215) m_iCshort=5225 m_iCzero=22744289
163105773 [000131a1] AriocE::computeJlistSizes: log2(nJ) # J-lists
163105773 [000131a1] AriocE::computeJlistSizes: 0 21950821
163105773 [000131a1] AriocE::computeJlistSizes: 1 704636
163105773 [000131a1] AriocE::computeJlistSizes: 2 43501
163105773 [000131a1] AriocE::computeJlistSizes: 3 23287
163105773 [000131a1] AriocE::computeJlistSizes: 4 16819
163105773 [000131a1] AriocE::computeJlistSizes: 5 4622
163105773 [000131a1] AriocE::computeJlistSizes: 6 573
163105773 [000131a1] AriocE::computeJlistSizes: 7 26
163105773 [000131a1] AriocE::computeJlistSizes: 9 4
163112301 [000131a1] AriocE::computeJlistSizes: completed: J table uses 24490347 elements for 24485122 J values (0.021% for list counts in J lists)
163114647 [000131a1] AriocE::buildJlists: building J lists for 17 files (34 sequences) (112 CPU threads available)...
163116165 [000131a1] AriocE::buildJlists: built J lists for 17 files (34 sequences)
163116190 [000131a1] AriocE::sortJcpu: sorting J lists on 112 worker threads...
163206607 [000131a1] AriocE::sortJcpu: J list sort complete
163238533 [000131a1] AriocE::validateHJ: start
163247005 [000131a1] AriocE::validateHJ: H table uses 22744289/1073741824 ( 2.118%) hash keys
163247005 [000131a1] AriocE::validateHJ: J table contains 24490347 J values (list counts in lists: 5225 (0.021%))
163247005 [000131a1] AriocE::validateHJ: maximum nJ=892 for hash key 0x295C5954
163247005 [000131a1] AriocE::validateHJ: completed
163247006 [000131a1] AriocE::validateSubIds: start
163249042 [000131a1] AriocE::validateSubIds: validated 1073741824 hash values
163249042 [000131a1] AriocE::validateSubIds: # subIds # hash keys avg J-list size
163249042 [000131a1] AriocE::validateSubIds: 0 0 0.0
163249042 [000131a1] AriocE::validateSubIds: 1 22193080 1.0
163249042 [000131a1] AriocE::validateSubIds: 2 438996 2.1
163249042 [000131a1] AriocE::validateSubIds: 3 45131 3.1
163249042 [000131a1] AriocE::validateSubIds: 4 12937 4.3
163249042 [000131a1] AriocE::validateSubIds: 5 6546 5.7
163249042 [000131a1] AriocE::validateSubIds: 6 5218 7.4
163249042 [000131a1] AriocE::validateSubIds: 7 5482 9.1
163249042 [000131a1] AriocE::validateSubIds: 8 8909 12.1
163249042 [000131a1] AriocE::validateSubIds: 9 5069 13.6
163249042 [000131a1] AriocE::validateSubIds: 10 3334 16.9
163249042 [000131a1] AriocE::validateSubIds: 11 3647 20.4
163249042 [000131a1] AriocE::validateSubIds: 12 7642 28.0
163249042 [000131a1] AriocE::validateSubIds: 13 6322 21.2
163249042 [000131a1] AriocE::validateSubIds: 14 1197 39.2
163249042 [000131a1] AriocE::validateSubIds: 15 359 56.8
163249042 [000131a1] AriocE::validateSubIds: 16 416 84.3
163249042 [000131a1] AriocE::validateSubIds: 17 4 474.5
163249042 [000131a1] AriocE::validateSubIds: completed
163249042 [000131a1] AriocE::finalizeJ: finalizing J lists on 112 worker threads...
163309783 [000131a1] AriocE::finalizeJ: completed
163316296 [000131a1] RaiiMutex::~RaiiMutex: pthread_mutex_destroy returned 16
163316672 [000136c3] ApplicationException ([0x00079555] CppCommon/RaiiMutex.cpp 107): pthread_mutex_unlock returned 22

Problem with calling methylation marks from AriocP Post aligned sam files

Hi Team,
I am trying to extract the methylation marks from the AriocP post aligned sam file by using the recommended bismark_methylation_extractor script.
It is showing an error:

The IDs of Read 1 (A01204:9:HHFNMDSXY:1:1101:1018:2754 1:N:0:AGTCAACA) and Read 2 (A01204:9:HHFNMDSXY:1:1101:1018:2754 2:N:0:AGTCAACA) are not the same. 
This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. 
Please use an unsorted file instead or sort the file using 'samtools sort -n' (by read name). 
This may also occur using samtools merge as it does not guarantee the read order. 
To properly merge files please use 'samtools merge -n' or 'samtools cat'

However, as per recommendation by Bismark, I already sorted (sort -n) the files. Still it is giving the above error.

Below I am pasting the 1st two reads. I am not sure where the bismark_methylation_extractor script is finding it incorrect.

A01204:9:HHFNMDSXY:1:1118:28103:32534 1:N:0:AGTCAACA    83      chr3    67684399        44     151M     =       67684223        -327    TATCCCCCCTCTAAATTAAAATAAAATACACAAAATACTACCAACTATTTTTCCGACTCTTCCAAAATAAAATATATCCAAAATACTACTAATAAATTAAACCACATTAAAACATATAAACATTTAAAATCACTTACAACTTTTTTACTAT,FFFFFFFFFFFF:FF,FFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:302        NM:i:0  MD:Z:151       YT:Z:CP  MQ:i:44 Na:i:3  Nb:i:1  c3:i:43 XM:Z:.u.........................h........h..x......x.......Z.............hh.h............h.....x..hh...h.h.........h...................................h..x.    XR:Z:CT XG:Z:GA YS:i:294
A01204:9:HHFNMDSXY:1:1118:28103:32534 2:N:0:AGTCAACA    163     chr3    67684223        44     4I147M   =       67684399        327     GGGGAATCCTATCAATAAACTATTTTTCTCCTCCCGAACCCTATTATTATTCATCCTATTTCCCTCCCTAACAAATATAACAATTATAACTATTATTATTTTTCTCAAATAAATTTAAATTATAAATTACATCAATTCGTCACAACCAAAAFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFF:::FFFF:FFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:294        NM:i:0  MD:Z:147       YT:Z:CP  MQ:i:44 Na:i:2  Nb:i:1  c3:i:45 XM:Z:.....u..........h.h..x.............Zxh....x..h..h....................xh.....h.h...x..h..h..x..h...............h.h........h.h..........x...Z.....x.....h    XR:Z:GA XG:Z:GA YS:i:302

Many thanks in advance!
Rajneesh

Source code?

There doesn't appear to be any source code in the repo. Could you double-check that you've pushed appropriately?

Hard-coded architecture

The Linux makefile currently hardcodes the architecture:

-gencode arch=compute_37,code=\"compute_37,sm_37\"

This is problematic for two reasons:

The 3.7 compute capability is deprecated in CUDA 11.
More seriously, this doesn't offer forward compatibility with most GPUs (from the docs):

The NVIDIA CUDA C++ compiler, nvcc, can be used to generate both architecture-specific cubin files and forward-compatible PTX versions of each kernel. Each cubin file targets a specific compute-capability version and is forward-compatible only with GPU architectures of the same major version number. For example, cubin files that target compute capability 3.0 are supported on all compute-capability 3.x (Kepler) devices but are not supported on compute-capability 5.x (Maxwell) or 6.x (Pascal) devices. For this reason, to ensure forward compatibility with GPU architectures introduced after the application has been released, it is recommended that all applications include PTX versions of their kernels.

Please provide an example for the creation of a cfg file from a fasta containing multiple sequences with AriocE

many thanks for providing and supporting Arioc.
So far I do fail to create a cfg file with AriocE from a fasta containing multiple sequences. I failed to derive the correct "parenthesized wildcard specification".
I can not overcome this error:
unable to open or create directory ./enc/ssi84_2_30 (errno=2: No such file or directory).

I would highly appreciate an example code.

Error with encoding reference genome

I'm trying out Arioc for the first time and I'm having trouble with the reference genome encoding. Running AriocE version v1.25.2401.18201 I get the following output while doing the nongapped encoding.

165230676 [000076b9] AriocE v1.25.2401.18201 (release)
165230677 [000076b9] Copyright (c) 2015-2018 Johns Hopkins University.  All rights reserved.
165230677 [000076b9]  data type sizes      : int=4 long=8 *=8
165230677 [000076b9]  executable file      : 
165230679 [000076b9]  configuration file : /path/to/my/projdir/config/encode-refr-nongapped.xml
165230691 [000076b9] encodeR: encoding 43 files (86 sequences) (8 CPU threads available)...
170714855 [000076b9] encodeR: encoded 43 files (86 sequences)
172715421 [000076b9] computeJlistSizes:  log2(nJ)  # J-lists
172715422 [000076b9] computeJlistSizes:         0      10713
172715422 [000076b9] computeJlistSizes:         1     341060
172715422 [000076b9] computeJlistSizes:         2   17427473
172715422 [000076b9] computeJlistSizes:         3  365417880
172715422 [000076b9] computeJlistSizes:         4  560730395
172715422 [000076b9] computeJlistSizes:         5   93906187
172715422 [000076b9] computeJlistSizes:         6   22111708
172715422 [000076b9] computeJlistSizes:         7    7880260
172715422 [000076b9] computeJlistSizes:         8    3306299
172715422 [000076b9] computeJlistSizes:         9    1535308
172715422 [000076b9] computeJlistSizes:        10     705079
172715422 [000076b9] computeJlistSizes:        11     246196
172715422 [000076b9] computeJlistSizes:        12      87600
172715422 [000076b9] computeJlistSizes:        13      28644
172715422 [000076b9] computeJlistSizes:        14       5676
172715422 [000076b9] computeJlistSizes:        15        511
172715422 [000076b9] computeJlistSizes:        16          9
172723704 [000076b9] ApplicationException ([0x30393] AriocE/AriocE.R.cpp 408): computeJlistSizes: too many J values (28464018949) for J table (maximum = 8589934591)

The manual discusses the performance benefits of higher vs lower choices of the J parameter, but it is unclear whether the issue is due to parameter selection or an issue with the software. Please advise.

This is the configuration file I used.

<?xml version="1.0" encoding="utf-8"?>

<AriocE seed="ssi84_2_30" maxDOP="8" maxJ="*">
    <dataIn sequenceType="R" srcId="0" filePath="/path/to/my/projdir/refr" uriPath="https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/iwgsc_refseqv1.0_all_chromosomes.zip">
        <file subId="1">Taes_chr1A_part1.fasta</file>
        <file subId="2">Taes_chr1A_part2.fasta</file>
        <file subId="3">Taes_chr1B_part1.fasta</file>
        <file subId="4">Taes_chr1B_part2.fasta</file>
        <file subId="5">Taes_chr1D_part1.fasta</file>
        <file subId="6">Taes_chr1D_part2.fasta</file>
        <file subId="7">Taes_chr2A_part1.fasta</file>
        <file subId="8">Taes_chr2A_part2.fasta</file>
        <file subId="9">Taes_chr2B_part1.fasta</file>
        <file subId="10">Taes_chr2B_part2.fasta</file>
        <file subId="11">Taes_chr2D_part1.fasta</file>
        <file subId="12">Taes_chr2D_part2.fasta</file>
        <file subId="13">Taes_chr3A_part1.fasta</file>
        <file subId="14">Taes_chr3A_part2.fasta</file>
        <file subId="15">Taes_chr3B_part1.fasta</file>
        <file subId="16">Taes_chr3B_part2.fasta</file>
        <file subId="17">Taes_chr3D_part1.fasta</file>
        <file subId="18">Taes_chr3D_part2.fasta</file>
        <file subId="19">Taes_chr4A_part1.fasta</file>
        <file subId="20">Taes_chr4A_part2.fasta</file>
        <file subId="21">Taes_chr4B_part1.fasta</file>
        <file subId="22">Taes_chr4B_part2.fasta</file>
        <file subId="23">Taes_chr4D_part1.fasta</file>
        <file subId="24">Taes_chr4D_part2.fasta</file>
        <file subId="25">Taes_chr5A_part1.fasta</file>
        <file subId="26">Taes_chr5A_part2.fasta</file>
        <file subId="27">Taes_chr5B_part1.fasta</file>
        <file subId="28">Taes_chr5B_part2.fasta</file>
        <file subId="29">Taes_chr5D_part1.fasta</file>
        <file subId="30">Taes_chr5D_part2.fasta</file>
        <file subId="31">Taes_chr6A_part1.fasta</file>
        <file subId="32">Taes_chr6A_part2.fasta</file>
        <file subId="33">Taes_chr6B_part1.fasta</file>
        <file subId="34">Taes_chr6B_part2.fasta</file>
        <file subId="35">Taes_chr6D_part1.fasta</file>
        <file subId="36">Taes_chr6D_part2.fasta</file>
        <file subId="37">Taes_chr7A_part1.fasta</file>
        <file subId="38">Taes_chr7A_part2.fasta</file>
        <file subId="39">Taes_chr7B_part1.fasta</file>
        <file subId="40">Taes_chr7B_part2.fasta</file>
        <file subId="41">Taes_chr7D_part1.fasta</file>
        <file subId="42">Taes_chr7D_part2.fasta</file>
        <file subId="43">Taes_chrUn.fasta</file>
    </dataIn>
    <dataOut>
        <path>/path/to/my/projdir/refr-encoded</path>
    </dataOut>
</AriocE>

AriocE/baseEncode.cpp 1263: buffer capacity exceeded

Hello again!

I've been running into the title as an error and can't see why exactly its happening. Other references run fine!

AriocE v1.52.3142.24159 (release) +2024-06-07T20:55:42
Copyright (c) 2015-2024 Johns Hopkins University.  All rights reserved.
 compiled             : 2024-06-13T15:49:58 (GNU g++ v11.4.0)
 data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
 computer name        : 94e17f383007
 operating system     : Ubuntu 22.04.4 LTS
 executable file      : /Arioc.x.152/src/AriocE
 configuration file   : /mnt/ramdisk/kraken/experiments/Kraken_Test_V3/encode_gapped_reference.cfg
AriocE::encodeR: encoding 1 file (48 CPU threads available)...
ApplicationException ([0x00000019] AriocE/baseEncode.cpp 1263): buffer capacity exceeded (136/112 bytes)
Command exited with return code 1

Here is the cfg and the reference.fa file. I had to add a .txt to the end of the file name to be able to upload them to github

encode_gapped_reference.cfg.txt
reference_templates.fa.txt

rwilton / arioc Goto Github PK

arioc's Introduction

Arioc: GPU-accelerated DNA short-read alignment

Arioc-encoded reference genomes

Related publications

arioc's People

Contributors

Stargazers

Watchers

Forkers

arioc's Issues

Config files

/usr/local/cuda/include/cub/util_cpp_dialect.cuh:129:13: warning: CUB requires C++14. Please pass -std=c++14 to your compiler. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message. CUB_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);"

Recommend Projects

Recommend Topics

Recommend Org

/usr/local/cuda/include/cub/util_cpp_dialect.cuh:129:13: warning:
CUB requires C++14. Please pass -std=c++14 to your compiler. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
CUB_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);"