morispi / hg-color

Hybrid method based on a variable-order de bruijn Graph for the error Correction of Long Reads

License: GNU Affero General Public License v3.0

Shell 17.13% Makefile 2.53% Python 2.41% C++ 76.83% C 1.11%

ngs correction long-reads long reads hybrid hybrid-correction alignment de-bruijn-graphs dbg

hg-color's Introduction

HG-CoLoR

HG-CoLoR (Hybrid method based on a variable-order de bruijn Graph for the error Correction of Long Reads) is a hybrid method for the error correction of long reads. It combines two strategies in a seed-and-extend approach: it aligns the short reads to the long reads, and it uses a variable-order de Bruijn graph built from the short reads. The seeds, found by aligning the short reads to the long reads, serve as anchor points on the graph, which is traversed to find paths that link seeds together. These paths dictate the corrections for the regions of the long reads that are not covered by seeds.

Requirements

A Linux-based operating system.
Python 3.
Emboss binaries accessible through your PATH environment variable (http://emboss.sourceforge.net/download/).
KMC3 binaries accessible through your PATH environment variable (https://github.com/refresh-bio/KMC).
QuorUM binary accessible through your PATH environment variable (https://github.com/gmarcais/Quorum).
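Before installing, it can help to confirm that the external tools are actually reachable through PATH. The following is a minimal sketch, not part of HG-CoLoR. The binary names are assumptions: `revseq` (Emboss), `kmc`, `kmc_tools` and `kmc_dump` (KMC3), and `quorum` (QuorUM). Adjust the list to match your installation.

```python
import shutil

# Assumed binary names; they are not specified in the requirements list above.
REQUIRED_TOOLS = ["revseq", "kmc", "kmc_tools", "kmc_dump", "quorum"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found through the PATH variable."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

An empty result only shows that the executables were found, not that their versions are compatible.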

Dependencies

The blasr binary comes from the blasr software. Copyright notice is given in the file bin/blasr-license.

Installation

Clone the HG-CoLoR repository with:

git clone https://github.com/morispi/HG-CoLoR

Then run the install.sh script:

./install.sh

And add the HG-CoLoR and PgSA bin folders to your $PATH:

export PATH=$PWD/bin/:$PATH
export PATH=$PWD/PgSA/dist/pgsagen/GNU-Linux-x86/:$PATH

Running HG-CoLoR

To run HG-CoLoR, run the following command:

./HG-CoLoR --longreads LR.fasta --shortreads SR.fastq --out resultPrefix -K maxK

Input

LR.fasta: fasta file of long reads, one sequence per line.
SR.fastq: fastq file of short reads. Warning: only one file must be provided. If using paired reads, please concatenate them into one single file.
resultPrefix: prefix of the fasta files to which the corrected, trimmed and split long reads are written.
maxK: Maximum K-mer size of the variable-order de Bruijn graph.
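Because the long reads are expected as fasta with one sequence per line, a multi-line fasta file has to be flattened first. Below is a minimal sketch of such a conversion; the function names are hypothetical and this is not a script shipped with HG-CoLoR.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_linear_fasta(src_path, dst_path):
    """Rewrite src_path so that every record takes exactly two lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for header, sequence in linearize_fasta(src):
            dst.write(header + "\n" + sequence + "\n")
```

For example, `write_linear_fasta("LR_multiline.fasta", "LR.fasta")` would produce a file suitable for `--longreads` (the first file name is a placeholder).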

Output format

The corrected reads are output in fasta format, with one sequence per line. The header of each corrected read consists of 5 components, as follows:

>id_len_seedsBases_graphBases_rawBases

where

id is the original read header
len is the original read length
seedsBases is the number of bases of the corrected long read coming from seeds
graphBases is the number of bases of the corrected long read coming from the traversals of the variable-order de Bruijn graph
rawBases is the number of (uncorrected) bases of the corrected long read, coming from the original, raw long read
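The header layout described above can be parsed as follows. This is a minimal sketch: the class and function names are hypothetical, and splitting from the right is a choice made here so that underscores inside the original read header are preserved.

```python
from typing import NamedTuple


class CorrectedHeader(NamedTuple):
    read_id: str
    length: int
    seeds_bases: int
    graph_bases: int
    raw_bases: int


def parse_header(header):
    """Parse '>id_len_seedsBases_graphBases_rawBases' into its five components."""
    # Split on the last four underscores only, since the id may contain underscores.
    read_id, length, seeds, graph, raw = header.lstrip(">").strip().rsplit("_", 4)
    return CorrectedHeader(read_id, int(length), int(seeds), int(graph), int(raw))
```

The three base counts can then be used, for instance, to filter out reads that are mostly made of raw, uncorrected bases.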

Options

  --minorder INT, -k INT:       Minimum order of the variable-order de Bruijn graph (default: K/2).
  --solid INT, -S INT:          Minimum number of occurrences to consider a k-mer as solid (default: 1).
                                This parameter should be set according to the short reads coverage and accuracy,
                                and to the chosen maximum order of the graph.
                                It should only be increased when using high coverage of short reads, or a small maximum order.
  --seedsoverlap INT, -o INT:   Minimum overlap length to allow the merging of two overlapping seeds (default: maxorder - 1).
  --seedsdistance INT, -d INT:  Maximum distance to consider two consecutive seeds for merging (default: 10).
  --branches INT, -b INT:       Maximum number of branches explored (default: 1,500).
                                Raising this parameter will result in fewer split corrected long reads.
                                However, it will also increase the runtime, and may create chimeric links between the seeds.
  --seedskips INT, -s INT:      Maximum number of seed skips (default: 5).
  --mismatches INT, -m INT:     Allowed mismatches when attempting to link two seeds together (default: 3).
  --bestn INT, -n INT:          Top alignments to be reported by BLASR (default: 50).
                                This parameter should be set according to the short reads coverage.
                                Its default value is adapted for a 50x coverage of short reads.
                                It should be decreased with higher coverage, and increased with lower coverage.
  --nproc INT, -j INT:          Number of processes to run in parallel (default: number of cores).
  --tmpdir STRING, -t STRING:   Path where to store the directory containing temporary files (default: working directory).
  --kmcmem INT, -r INT:         Maximum amount of RAM for KMC, in GB (default: 12).
  --help, -h:                   Print this help message.
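When HG-CoLoR is launched from a script, the options above can be assembled into an argument list. The sketch below is illustrative only: `build_command` and `derived_defaults` are hypothetical helpers, and the use of floor division for the K/2 default is an assumption, since the rounding for odd K is not stated.

```python
def derived_defaults(max_order):
    """Defaults that depend on the maximum order K, as listed in the options."""
    return {
        "minorder": max_order // 2,     # documented as K/2; floor division is assumed
        "seedsoverlap": max_order - 1,  # documented as maxorder - 1
    }


def build_command(long_reads, short_reads, out_prefix, max_order, **options):
    """Return an argument list; keyword names are the long option names."""
    cmd = ["HG-CoLoR", "--longreads", long_reads, "--shortreads", short_reads,
           "--out", out_prefix, "-K", str(max_order)]
    for name, value in sorted(options.items()):
        cmd += ["--" + name, str(value)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an exception if the command exits with a non-zero status.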

Short reads coverage and accuracy

HG-CoLoR's default parameters are tuned for a set of short reads with 50x coverage and a 1% error rate. If your short reads have a much higher coverage and/or a markedly different error rate, please adjust the parameters as indicated above, in particular --solid and --bestn.
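To judge how far a data set is from the 50x reference point, the average coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes strict four-line fastq records and a genome size supplied by the user; the function names are hypothetical, and it deliberately does not suggest values for --solid or --bestn.

```python
def total_bases_in_fastq(lines):
    """Sum the sequence lengths, assuming strict four-line fastq records."""
    return sum(len(line.strip()) for i, line in enumerate(lines) if i % 4 == 1)


def estimate_coverage(total_bases, genome_size):
    """Average coverage = sequenced bases / genome size (both in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size
```

For instance, 250,000,000 sequenced bases on a 5,000,000 bp genome give `estimate_coverage(250_000_000, 5_000_000)`, that is, 50x.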

Notes

HG-CoLoR has been developed and tested on x86-64 GNU/Linux.
Support for any other platform has not been tested.

Authors

Pierre Morisse, Thierry Lecroq and Arnaud Lefebvre.

Reference

Pierre Morisse, Thierry Lecroq, Arnaud Lefebvre; Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph, Bioinformatics, Volume 34, Issue 24, 15 December 2018, Pages 4213–4222, https://doi.org/10.1093/bioinformatics/bty521

Contact

You can report problems and bugs to pierre[dot]morisse2[at]univ-rouen[dot]fr

hg-color's Issues

Error While Running QuorUM

I am trying to run HG-CoLoR for error correction of PacBio reads with 10x Linked-Reads (short reads). I installed QuorUM with bioconda. I got this error when running HG-CoLoR:

----- QuorUM -----
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "C.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
terminate called after throwing an instance of 'std::runtime_error'
what(): Hash is full
terminate called recursively
Creating the mer database failed. Most likely the size passed to the -s switch is too small. at /home/mborche2/miniconda3/envs/hg_color_env/bin/quorum line 155.

Any recommendations? Thanks a ton for putting out this software and helping others!

Running time?

Hey @morispi,

I am trying to use HG-CoLoR (thank you for the tool) to correct a 5 Mbp stretch of ONT reads from chromosome 1 of HG002. It's about 15K reads with an N50 around 50 kbp. I ran the tool like this:

HG-CoLoR --longreads HG002._ONT.fq --shortreads HG002_Illumina.fq --out HG002_ONT_corrected -K 100 --nproc 24 --kmcmem 24

The command has been running for 17h now and it hasn't finished. Do you know if something went wrong or is it an expected running time for that data set?

Thank you for the help!
Guillaume

Run fails and only prints 'Correcting the short reads'

Hello,

I'm running HG-CoLoR (installed from the GitHub repo) to correct Nanopore reads (~30x) using Illumina reads (~150x):
/home/bin/HG_CoLoR/HG_CoLoR --longreads ONT_filtered.fasta --shortreads short.fastq --tmpdir ./tmp --out ONT_HgColor --solid 2 --bestn 30 --nproc 48 -K 101

However, it only runs for one second and stops, and prints the message: Correcting the short reads. The output directory doesn't contain any corrected reads.

What could be causing this issue?

Thanks,
Julia

tmp files deleted before job is finished

Hello. I am trying to run the following code in a SGE script:
HG-CoLoR --longreads $long_reads --shortreads $short_reads --tmpdir ./tmp --out HGCcor_long_reads.fasta --nproc 48

I get the following error, which I think happens because the temporary files are deleted before processing is complete:
Traceback (most recent call last):
  File "/home/home02/bsjld/.conda/envs/genomeassembly/bin/filterShortReads.py", line 7, in <module>
    f = open(sys.argv[1])
FileNotFoundError: [Errno 2] No such file or directory: './tmp/quorum_corrected.fa'
sed: can't read ./tmp/RC_long_quorum_corrected.fa: No such file or directory
cat: ./tmp/RC_long_quorum_corrected.fa: No such file or directory
mv: cannot stat ‘./tmp/good_long_quorum_corrected.fa’: No such file or directory
/home/home02/bsjld/.conda/envs/genomeassembly/bin/HG-CoLoR: line 217: 130788 Floating point exception(core dumped) PgSAgen_hgcolor $tmpdir/"$k-mers-$SR" $tmpdir/"$k-mers-$SR" >> HG-CoLoR.stdout 2>> HG-CoLoR.stderr
Traceback (most recent call last):
  File "/home/home02/bsjld/.conda/envs/genomeassembly/bin/filterOutShortAlignments.py", line 36, in <module>
    out.write(finalString)
NameError: name 'out' is not defined
Could you please help me with this issue?
Thank you in advance.

Disk quota exceeded when there is enough space on disk

Hi,

I ran HG-CoLoR on my data set and got the following error. It is pretty weird, since I do have enough disk space.

My short reads are 150 bp long and the sequencing depth is 373x. That's why I set 'bestn' to 20 and 'solid' to 2, as suggested in other issues.

/project/HG-CoLoR/HG-CoLoR --longreads /ecoli_ont_2D_tmp/ecoli_ont_2D.fasta --shortreads /ecoli/miseq/ecoli.fastq --out /ecoli_ont_2D_output/hg-color_output/corrected_ecoli_ont_2D.fasta -K 135 --nproc 28 --kmcmem 64 --bestn 20 --solid 2 --tmpdir /ecoli_ont_2D_tmp/hg-color_tmp
[Wed Dec 12 20:37:10 EST 2018] Correcting the short reads
[Wed Dec 12 20:38:50 EST 2018] Removing short reads containing weak K-mers
[Wed Dec 12 20:42:15 EST 2018] Building the graph
[Wed Dec 12 20:44:15 EST 2018] Preparing the raw long reads temporary files
[Wed Dec 12 20:44:32 EST 2018] Aligning the short reads on the long reads
[Wed Dec 12 21:33:01 EST 2018] Preparing the alignments temporary files
Traceback (most recent call last):
  File "/project/HG-CoLoR/bin/filterOutShortAlignments.py", line 32, in <module>
    out = open(sys.argv[3] + curFile, "w")
OSError: [Errno 122] Disk quota exceeded: '/ecoli_ont_2D_tmp/hg-color_tmp/HGC_29225/Alignments/15951_2783'

Serious safety problem, --tmpdir parameter is deleted with no checking

Hey,
your bash script has a major safety flaw in that it will delete the tmpdir without caring what it is.
What if someone were to supply '.' or '~' as the --tmpdir parameter?

You really need to change it so it makes a new folder inside the supplied path and only deletes that after completion.

Another thing is that as per default bash behavior, your script will continue even if intermediate steps fail, even if that makes no sense at all. It thus always arrives at the dangerous rm -Rf command...

So I would advise to
set -e
in the beginning to change bash behavior to stop on a failed command.

Cheers
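The pattern suggested in this report, creating a fresh folder inside the supplied path and deleting only that folder, can be sketched as follows. This is an illustration, not the project's implementation; `private_workdir` is a hypothetical helper, and the `HGC_` prefix is borrowed from the temporary paths that appear in other reports on this page.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def private_workdir(parent, prefix="HGC_"):
    """Create a fresh directory inside parent and remove only that directory."""
    os.makedirs(parent, exist_ok=True)
    workdir = tempfile.mkdtemp(prefix=prefix, dir=parent)
    try:
        yield workdir
    finally:
        # Only the directory created here is deleted, never `parent` itself.
        shutil.rmtree(workdir, ignore_errors=True)
```

With this approach, passing '.' or '~' as the parent is harmless, because the cleanup never touches anything outside the newly created subdirectory. The counterpart of `set -e` in a Python driver would be `subprocess.run(..., check=True)`, which stops at the first failing step.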

compile error: fatal error: test/testdata.h: No such file or directory

$ install.sh
$ In file included from src/seedsLinking.cpp:1:0:
src/seedsLinking.h:5:27: fatal error: test/testdata.h: No such file or directory
#include "test/testdata.h"
^
compilation terminated.

Why? I installed QuorUM and KMC through Anaconda, but I still get this error message.

Could you please tell me what the problem is?

Thanks

avoid reverse complement files

If you consider all k-mers by turning off the transformation of k-mers into their canonical form (switch -b of kmc), you don't need to produce a reverse complement fasta file.

blasr doesn't use multiple threads?

I have set --nproc to 14, but blasr only uses one CPU, as indicated by the top command. Is this expected, or did I do something wrong?

HG-CoLoR was installed from Anaconda2.

Error (core dumped)

Hello,

I'm trying to use HG-CoLoR to correct MinION reads with Illumina reads, but I end up with an error at the BLASR step:

[INFO] 2019-04-30T10:47:00 [blasr] started.
Warning: resetting nCandidates to nBest 50
terminate called after throwing an instance of std::invalid_argument
what(): stoi
/datas/Save/Clement/soft/HG-CoLoR/HG-CoLoR : ligne 237 : 31495 Abandon (core dumped) $hgf/bin/HG-CoLoR -t "$tmpdir" -K "$K" -d "$seedsdistance" -o "$seedsoverlap" -k "$k" -b "$branches" -s "$seedskips" -m "$mismatches" -j "$nproc" -r $tmpdir/"$formatLR" -a $tmpdir/"$aln" $tmpdir/"$K-mers.fa.pgsa" > "$out.fasta"

Thanks in advance.

Fail due to erroneous string in alignment file?

Hey,
trying to correct some nanopore reads with illumina reads of the same sample.

The script died with a Python error because of an overly long file name:

./HG-CoLoR -j 40 --maxorder 50 --longreads ../../../indelcorr/corona_only_filtlong1.fastq --shortreads ../../../nanocorr/corona_illumina.fastq --out ../corona_only_filtlong1.fa --tmpdir tmp
[Wed Feb 21 10:47:41 CET 2018] Correcting the short reads
[Wed Feb 21 10:49:48 CET 2018] Removing short reads containing weak K-mers
[Wed Feb 21 10:58:38 CET 2018] Building the graph
[Wed Feb 21 11:01:33 CET 2018] Preparing the raw long reads temporary files
[Wed Feb 21 11:01:34 CET 2018] Aligning the short reads on the long reads
[Wed Feb 21 11:29:56 CET 2018] Preparing the alignments temporary files
Traceback (most recent call last):
  File "/mnt/mahlzeitlocal/sebastian/quasispecies/hybridcorr/hg-color/HG-CoLoR/bin/filterOutShortAlignments.py", line 17, in <module>
    out = open(sys.argv[3] + curFile, "w")
OSError: [Errno 36] File name too long: 'tmp/HGC_19623/Alignments/:+04.7081GbicYQQIB+/)4/bJJ+KOTUE$-2:*D5@:E,=+".484FACA;5AKKUR0AA705;$#/*-\'\'UPGLHA.#(%*+/A]SaH [... several thousand further characters omitted ...] +&%0*-$+""""#%,(""""""'

Going through filterOutShortAlignments.py shows this is from the alignment file:

line = f.readline()
if line != "":
	t = line.split("\t");
	curFile = t[2]
	out = open(sys.argv[3] + curFile, "w")

Any idea how that comically long string might have ended up there?

Cheers
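Since the snippet quoted in this report uses the third tab-separated field of an alignment line as a file name, a defensive check on that field would turn this failure into an explicit error message. The sketch below is hypothetical and not taken from filterOutShortAlignments.py; the 255-byte limit is an assumption about common Linux filesystems.

```python
import os

MAX_NAME_BYTES = 255  # assumed per-component limit on common Linux filesystems


def safe_output_name(alignment_line, column=2):
    """Return the field used as an output file name, or raise ValueError."""
    fields = alignment_line.rstrip("\n").split("\t")
    if len(fields) <= column:
        raise ValueError("expected at least %d tab-separated fields" % (column + 1))
    name = fields[column]
    # Reject empty names, names containing a path separator, and overly long names.
    if not name or os.sep in name or len(name.encode()) > MAX_NAME_BYTES:
        raise ValueError("field %d is not usable as a file name (%d characters)"
                         % (column, len(name)))
    return name
```

A check like this does not explain how the long string ended up in the alignment file, but it reports the offending field before any file is opened.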

Error messages

Hi, I ran the following command to correct ONT data with Illumina data:
HG-CoLoR --longreads ONT.fasta --shortreads Illumina.fastq --out ONT_all.corrected.fasta --tmpdir ./tmp --solid 3 --bestn 75 --kmcmem 250 --nproc 35

According to the log files, some files that the pipeline expects are missing:

cat HG-CoLoR.stderr
----- QuorUM -----
----- revseq -----
Reverse and complement a nucleotide sequence
Error: Unable to read sequence 'tmp/long_corrected_SR.fa'
Died: revseq terminated: Bad value for '-sequence' and no prompt
----- KMC -----
----- KMC_tools -----
----- KMC_dump -----
----- PgSAgen -----

cat HG-CoLoR.stdout
----- QuorUM -----
----- revseq -----
----- KMC -----
**
Stage 1: 100%
Stage 2: 100%
1st stage: 1.52365s
2nd stage: 4.07289s
Total : 5.59654s
Tmp size : 0MB

Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 0
No. of unique counted k-mers : 0
Total no. of k-mers : 0
Total no. of reads : 0
Total no. of super-k-mers : 0
----- KMC_tools -----
Error: cannot open: tmp/mers.db by KMC API
----- KMC_dump -----
----- PgSAgen -----

cat HG.log
Correcting the short reads
Removing short reads containing weak K-mers
sed: can't read tmp/RC_long_corrected_SR.fa: No such file or directory
cat: tmp/RC_long_corrected_SR.fa: No such file or directory
rm: cannot remove 'tmp/RC_long_corrected_SR.fa': No such file or directory
Building the graph
Aligning the short reads on the long reads

Do you have an idea what could be the problem?

Thanks,
Chris

Error

I'm using the Anaconda installation of HG-CoLoR, but I am running into the following error and hoping you can point me in the right direction.

Here's the slurm output:
[Fri Nov 16 15:45:54 MST 2018] Correcting the short reads
[Fri Nov 16 18:28:37 MST 2018] Removing short reads containing weak k-mers
[Fri Nov 16 20:52:45 MST 2018] Building the graph
[Fri Nov 16 23:12:53 MST 2018] Aligning the short reads on the long reads
[Fri Nov 16 23:18:18 MST 2018] Removing short alignments
Traceback (most recent call last):
  File "/fslgroup/fslg_pws_module/compute/software/.conda/envs/hg-color_v1.0.0/bin/filterOutShortAlignments.py", line 36, in <module>
    out.write(finalString)
NameError: name 'out' is not defined
[Fri Nov 16 23:18:18 MST 2018] Generating the corrected long reads
[Fri Nov 16 23:18:19 MST 2018] Removing temporary files
[Fri Nov 16 23:19:11 MST 2018] Exiting

Here is the HG-CoLoR.stdout:

----- QuorUM -----
----- revseq -----
----- KMC -----
1st stage: 174.623s
2nd stage: 942.759s
Total : 1117.38s
Tmp size : 81150MB

Stats:
No. of k-mers below min. threshold : 635141104
No. of k-mers above max. threshold : 0
No. of unique k-mers : 2185558665
No. of unique counted k-mers : 1550417561
Total no. of k-mers : 87948736520
Total no. of reads : 542495294
Total no. of super-k-mers : 3458687932
----- KMC_tools -----
----- KMC_dump -----
----- PgSAgen_hgcolor -----
Reading reads set
reads count: 1452761050
all reads length: 92976707200
reads length is constant
maxReadLength: 64
symbolsCount: 4
symbols: ACGT

Found 0 duplicates.
Start overlapping.
6820830 reads left after 63 overlap
6080059 reads left after 62 overlap
5530567 reads left after 61 overlap
5111566 reads left after 60 overlap
4764051 reads left after 59 overlap
4458165 reads left after 58 overlap
4199852 reads left after 57 overlap
3972326 reads left after 56 overlap
3763990 reads left after 55 overlap
3584244 reads left after 54 overlap
3420892 reads left after 53 overlap
3267275 reads left after 52 overlap
3131134 reads left after 51 overlap
3004685 reads left after 50 overlap
2884016 reads left after 49 overlap
2775758 reads left after 48 overlap
2672523 reads left after 47 overlap
2574105 reads left after 46 overlap
2482159 reads left after 45 overlap
2396981 reads left after 44 overlap
2313961 reads left after 43 overlap
2236406 reads left after 42 overlap
2163594 reads left after 41 overlap
2090936 reads left after 40 overlap
2024386 reads left after 39 overlap
1961520 reads left after 38 overlap
1899004 reads left after 37 overlap
1840888 reads left after 36 overlap
1786350 reads left after 35 overlap
1731747 reads left after 34 overlap
1680180 reads left after 33 overlap
1631172 reads left after 32 overlap
1582092 reads left after 31 overlap
1535704 reads left after 30 overlap
1490125 reads left after 29 overlap
1445786 reads left after 28 overlap
1403406 reads left after 27 overlap
1361982 reads left after 26 overlap
1321812 reads left after 25 overlap
1281887 reads left after 24 overlap
1242301 reads left after 23 overlap
1205276 reads left after 22 overlap
1170604 reads left after 21 overlap
1136651 reads left after 20 overlap
1103473 reads left after 19 overlap
1071540 reads left after 18 overlap
1040447 reads left after 17 overlap
1008606 reads left after 16 overlap
975248 reads left after 15 overlap
937730 reads left after 14 overlap
889391 reads left after 13 overlap
816493 reads left after 12 overlap
701226 reads left after 11 overlap
538223 reads left after 10 overlap
372007 reads left after 9 overlap
244437 reads left after 8 overlap
159908 reads left after 7 overlap
103350 reads left after 6 overlap
66897 reads left after 5 overlap
41576 reads left after 4 overlap
28021 reads left after 3 overlap
18225 reads left after 2 overlap
14822 reads left after 1 overlap
14822 pseudo-genome components
Overlapping done in 954060 msec

1579325618 bytes after overlapping
14822 pseudo-genome components
0 single reads
Pseudogenome assembled in 1193560 msec

Found 62521749 reads containing duplicate 11-mers in 16510 msec!
SA creation start.
Written 2147483644 bytes
Written 2147483644 bytes
Written 2022335952 bytes
SAIS generation time 615460 msec!
SA generation time 1112850 msec!
4194305 elements in SA lookup
SA LUT generation time 23110 msec!
Written 1579325682 bytes
Written 13074849459 bytes
Written 7896628100 bytes
Written 16777220 bytes

Thanks in advance.

What is the proper output file

Which output file should be used for subsequent sequencing? The program only gave me output files in the temporary directory.

runtime estimation

Hi,
I am running HG-CoLoR on a local machine, and it has now been running for about a week. Is that normal?

HG-CoLoR --bestn 15 --kmcmem 90 --nproc 18 --longreads ../ONT_181026.fa --shortreads PE4702_200bp_interlaced.fq --out ONT_HC-CoLoR_PE470.fasta --tmpdir /data/dario/ONT_data/HC-CoLor_correction
[Thu Dec 20 15:47:17 CET 2018] Correcting the short reads
[Thu Dec 20 19:25:49 CET 2018] Removing short reads containing weak k-mers
[Thu Dec 20 23:41:06 CET 2018] Building the graph

It is using about 80 GB of RAM and 2 cores. The input files are 200 Gb (interlaced.fq) and 100 GB of ONT data. The genome is about 5.2 Gb, from a heterozygous plant.
I wonder whether this long running time is acceptable, or whether something is wrong and the computation is stuck.
Thanks

Error

Hi,

I am having difficulty running HG-CoLoR: it stops after a few seconds.

Here is my command (I have about 100x coverage for the short reads):

HG-CoLoR --longreads $LRFASTA --shortreads $SHORTREAD --out $CORRREADS -K 100 --nproc 8 --solid 2 --bestn 40

Here is the slurm output:

[Wed Feb 13 09:19:51 CET 2019] Correcting the short reads
[Wed Feb 13 09:19:57 CET 2019] Removing short reads containing weak K-mers
[Wed Feb 13 09:20:01 CET 2019] Building the graph
[Wed Feb 13 09:20:03 CET 2019] Preparing the raw long reads temporary files
Traceback (most recent call last):
  File "/usr/local/bioinfo/src/HG-CoLoR/HG-CoLoR-d476519/bin/prepareRawLongReads.py", line 12, in <module>
    out = open(sys.argv[2] + id[1:-1], "w")
FileNotFoundError: [Errno 2] No such file or directory: './HGC_3078/RawLongReads/m180221_013155_42237_c101409362550000001823304103141816_s1_p0/1434/783_15206_2_2104'

I looked into the ./HGC_3078/RawLongReads directory and it is empty.
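The path in the error message appears to contain '/' characters that come from the read header itself, so opening it as a file would require intermediate directories that do not exist. One possible workaround, sketched here under that assumption, is to replace path separators in the read identifiers before running the correction. `id_to_filename` is a hypothetical helper, not part of HG-CoLoR.

```python
import os


def id_to_filename(read_id, replacement="_"):
    """Replace path separators so that a read id maps to a single file name."""
    name = read_id
    for sep in ("/", os.sep, os.altsep):
        if sep:  # os.altsep is None on Linux
            name = name.replace(sep, replacement)
    return name
```

Note that renaming reads this way changes the identifiers in the corrected output as well, and two distinct ids could in principle collide after replacement.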

Here is the HG-CoLoR.stderr:

----- QuorUM -----
----- revseq -----
Reverse and complement a nucleotide sequence
----- KMC -----

Stage 1: 100%
Stage 2: 100%
----- KMC_tools -----
----- KMC_dump -----
----- PgSAgen -----

And here is the HG-CoLoR.stdout:

----- QuorUM -----
----- revseq -----
----- KMC -----
1st stage: 0.773835s
2nd stage: 0.259328s
Total : 1.03316s
Tmp size : 5MB

Stats:
No. of k-mers below min. threshold : 2211894
No. of k-mers above max. threshold : 0
No. of unique k-mers : 2770468
No. of unique counted k-mers : 558574
Total no. of k-mers : 3575842
Total no. of reads : 95036
Total no. of super-k-mers : 165740
----- KMC_tools -----
----- KMC_dump -----
----- PgSAgen -----
Reading reads set
reads count: 558574
all reads length: 55857400
reads length is constant
maxReadLength: 100
symbolsCount: 4
symbols: ACGT

Found 0 duplicates.
Start overlapping.
23649 reads left after 99 overlap
23167 reads left after 98 overlap
22775 reads left after 97 overlap
22475 reads left after 96 overlap
22173 reads left after 95 overlap
21878 reads left after 94 overlap
21614 reads left after 93 overlap
21392 reads left after 92 overlap
21130 reads left after 91 overlap
20890 reads left after 90 overlap
20674 reads left after 89 overlap
20454 reads left after 88 overlap
20266 reads left after 87 overlap
20062 reads left after 86 overlap
19870 reads left after 85 overlap
19688 reads left after 84 overlap
19538 reads left after 83 overlap
19400 reads left after 82 overlap
19232 reads left after 81 overlap
19112 reads left after 80 overlap
18982 reads left after 79 overlap
18872 reads left after 78 overlap
18752 reads left after 77 overlap
18640 reads left after 76 overlap
18536 reads left after 75 overlap
18406 reads left after 74 overlap
18292 reads left after 73 overlap
18186 reads left after 72 overlap
18112 reads left after 71 overlap
18012 reads left after 70 overlap
17922 reads left after 69 overlap
17850 reads left after 68 overlap
17776 reads left after 67 overlap
17702 reads left after 66 overlap
17604 reads left after 65 overlap
17531 reads left after 64 overlap
17447 reads left after 63 overlap
17381 reads left after 62 overlap
17311 reads left after 61 overlap
17241 reads left after 60 overlap
17147 reads left after 59 overlap
17083 reads left after 58 overlap
17018 reads left after 57 overlap
16948 reads left after 56 overlap
16872 reads left after 55 overlap
16814 reads left after 54 overlap
16750 reads left after 53 overlap
16692 reads left after 52 overlap
16630 reads left after 51 overlap
16560 reads left after 50 overlap
16518 reads left after 49 overlap
16466 reads left after 48 overlap
16402 reads left after 47 overlap
16354 reads left after 46 overlap
16296 reads left after 45 overlap
16240 reads left after 44 overlap
16190 reads left after 43 overlap
16140 reads left after 42 overlap
16093 reads left after 41 overlap
16038 reads left after 40 overlap
15982 reads left after 39 overlap
15922 reads left after 38 overlap
15874 reads left after 37 overlap
15823 reads left after 36 overlap
15779 reads left after 35 overlap
15718 reads left after 34 overlap
15663 reads left after 33 overlap
15607 reads left after 32 overlap
15557 reads left after 31 overlap
15501 reads left after 30 overlap
15442 reads left after 29 overlap
15400 reads left after 28 overlap
15322 reads left after 27 overlap
15272 reads left after 26 overlap
15204 reads left after 25 overlap
15153 reads left after 24 overlap
15097 reads left after 23 overlap
15048 reads left after 22 overlap
15002 reads left after 21 overlap
14952 reads left after 20 overlap
14908 reads left after 19 overlap
14853 reads left after 18 overlap
14805 reads left after 17 overlap
14742 reads left after 16 overlap
14687 reads left after 15 overlap
14600 reads left after 14 overlap
14515 reads left after 13 overlap
14379 reads left after 12 overlap
14079 reads left after 11 overlap
13431 reads left after 10 overlap
12091 reads left after 9 overlap
9262 reads left after 8 overlap
5797 reads left after 7 overlap
3020 reads left after 6 overlap
1761 reads left after 5 overlap
1089 reads left after 4 overlap
897 reads left after 3 overlap
596 reads left after 2 overlap
468 reads left after 1 overlap
468 pseudo-genome components
Overlapping done in 690 msec

2159117 bytes after overlapping
468 pseudo-genome components
0 single reads
Pseudogenome assembled in 120 msec

Found 41192 reads containing duplicate 11-mers in 30 msec!
SA creation start.
Written 5757912 bytes
Written 2879356 bytes
SAIS generation time 180 msec!
SA generation time 490 msec!
4194305 elements in SA lookup
SA LUT generation time 80 msec!
Written 2159217 bytes
Written 5027175 bytes
Written 8636476 bytes
Written 16777220 bytes

Could you help me please?

Thanks,

optimal short read data

Hello,

I am about to run HG-CoLoR, but first I wonder if there is a preferred format/coverage of the Illumina data.
I have a PE 470 bp library (2x260 bp) and a PE 700 bp (2x150). I am going to correct about 100 GB of PromethION data (~20x of a plant genome).
I wonder if I should trim the short reads, remove the overlapping part in the PE 470, keep sequences with the same length, optimal coverage and such.
Thanks
Dario
