
Comments (12)

morispi commented on May 25, 2024

Hi Chris,

First of all, thanks for the feedback, and sorry for not replying earlier; it seems that GitHub didn't send me an email about your issue.

From what I see here, it seems that your corrected SRs were shorter than your chosen maximum K size (--maxorder parameter), which is 100 by default. Can you tell me the length of your original SRs and their coverage depth?

Pierre

from hg-color.

ctxchris commented on May 25, 2024

Hi Pierre,

The Illumina read length is 75 bp and the coverage is probably about 200x (transcriptomic data). What's a safe value to set --maxorder to? And what do you recommend for --solid and --bestn?

Thanks,
Chris

morispi commented on May 25, 2024

For your read length, --maxorder can be set to any value lower than or equal to 75. I wouldn't set it as high as 75 though, because too many reads would be dropped after the Quorum correction, which usually removes a few bases from the short reads. I haven't tried HG-CoLoR on such short reads yet, but I think setting it somewhere between 60 and 70 should be optimal.
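One rough way to choose a value is to count how many corrected short reads are still at least K bases long, since a read shorter than K cannot contribute any K-mer. The following is only a sketch and not part of HG-CoLoR; the function names and the example lengths are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def fraction_usable(lengths: Iterable[int], k: int) -> float:
    """Return the fraction of reads that are at least k bases long."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= k) / len(lengths)


if __name__ == "__main__":
    # Hypothetical lengths of 75 bp reads after trimming by the corrector.
    corrected_lengths = [75, 74, 72, 70, 68, 66, 61, 58]
    for k in (60, 65, 70, 75):
        print(k, fraction_usable(corrected_lengths, k))
```

In practice the lengths would come from the corrected short reads file, for example `[len(seq) for _, seq in read_fasta(open(path))]`, where `path` is wherever the corrected reads were written.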

With such a high coverage, I would recommend setting --solid to 2 or 3. Setting it higher could filter out genomic k-mers, and prevent HG-CoLoR from correcting some regions of the long reads.

I would lower the --bestn parameter to 30. Keeping it at the default value or setting it higher wouldn't improve the results significantly, but would require much more time for the alignment step of the short reads to the long reads, so 30 should be a good compromise.

Pierre

ctxchris commented on May 25, 2024

Getting closer...
Now CLRgen dumps core after writing about 10 FASTA entries, with

terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi

when called like this:

CLRgen -t ./tmp -K 65 -o 64 -k 32 -b 1500 -s 5 -j 90 ./tmp/65-mers.fa.pgsa >ONT_all.corrected.fasta

morispi commented on May 25, 2024

Sounds like your SR to LR alignments are wrongly formatted. Can you send me a few lines from any file in the tmp/Alignments directory to make sure?

ctxchris commented on May 25, 2024

K00244:76:HM5YHBBXX:1:1102:26240:47841 16 fedad272-a191-44c4-984b-77d577382f93 768 254 14M1D17M1D26M1I2M1I14M * 0 0 AAGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAAC * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1103:22536:20603 16 fedad272-a191-44c4-984b-77d577382f93 768 254 14M1D17M1D26M1I2M1I14M * 0 0 AAGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAAC * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1103:31446:11390 16 fedad272-a191-44c4-984b-77d577382f93 768 254 14M1D17M1D26M1I2M1I14M * 0 0 AAGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAAC * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1105:7293:10299 16 fedad272-a191-44c4-984b-77d577382f93 768 254 14M1D17M1D26M1I2M1I14M * 0 0 AAGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAAC * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1104:12053:18159 16 fedad272-a191-44c4-984b-77d577382f93 769 254 13M1D17M1D26M1I2M1I15M * 0 0 AGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAACA * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1104:29772:37255 16 fedad272-a191-44c4-984b-77d577382f93 769 254 13M1D17M1D26M1I2M1I15M * 0 0 AGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAACA * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75
K00244:76:HM5YHBBXX:1:1105:23409:12480 16 fedad272-a191-44c4-984b-77d577382f93 769 254 13M1D17M1D26M1I2M1I15M * 0 0 AGAGCAGAATAAAGAAGCGCTGCAGGATGTGGAAGATGAAAATCAGTGAGACATAATAAAGCCAACAAGAGAACA * RG:Z:5b44d038 AS:i:-213 XS:i:1 XE:i:76 YS:i:0 YE:i:75 ZM:i:0 XL:i:75 XT:i:1 NM:i:0 FI:i:1 XQ:i:75

morispi commented on May 25, 2024

I assume "fedad272-a191-44c4-984b-77d577382f93" is the ID of one of your long reads. If so, it is indeed wrongly formatted. It should end with an underscore, followed by the length of the long read. The bin/formatLongReads.py script, included in the pipeline, is supposed to format the long reads correctly before launching the alignment step.

Has the command that launches the script been commented out in the HG-CoLoR script? Or did it fail?

Can you try to run bin/formatLongReads.py on your long reads set, and see if it indeed adds a "_lengthOfTheRead" to the end of the headers?
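The expected ID format can be checked on the alignment lines themselves. The sketch below is not CLRgen's actual parsing code; it only assumes what is stated above, namely that the long read ID (third column of each alignment line) must end with an underscore followed by the read length. The function names are hypothetical.

```python
def reference_name(alignment_line: str) -> str:
    """Return the third whitespace-separated field (the long read ID)."""
    fields = alignment_line.split()
    if len(fields) < 3:
        raise ValueError("alignment line has fewer than 3 fields")
    return fields[2]


def long_read_length_from_id(ref_name: str) -> int:
    """Extract the length suffix from an ID of the form '<name>_<length>'."""
    prefix, sep, suffix = ref_name.rpartition("_")
    if not sep or not prefix or not (suffix.isascii() and suffix.isdigit()):
        raise ValueError(f"ID does not end with '_<length>': {ref_name!r}")
    return int(suffix)


def check_alignment_line(alignment_line: str) -> bool:
    """Return True if the long read ID in this line carries a length suffix."""
    try:
        long_read_length_from_id(reference_name(alignment_line))
    except ValueError:
        return False
    return True
```

An ID such as "fedad272-a191-44c4-984b-77d577382f93" fails this check because it contains no underscore at all, which would be consistent with an integer conversion failing on the part where a length is expected.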

ctxchris commented on May 25, 2024

The read length is added to the end of the header. This seems to be a typical case of missing format standards in bioinformatics: I have spaces in the headers, so only the first part of the header ends up in the alignment file. I'll reformat them and try again.

morispi commented on May 25, 2024

Oh yes, indeed. Now that you mention it, I had the same trouble with headers containing spaces: blasr outputs only the first part of the header in the alignment file. I had totally forgotten about it. I should take that into account and update the formatLongReads.py script, thanks for the feedback. Replace spaces with underscores and you should be fine.
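A minimal sketch of that renaming step, for anyone hitting the same issue. It is not the fix that went into formatLongReads.py; the function names are made up, and it simply joins the whitespace-separated parts of each header with underscores so that a trailing "_length" suffix stays at the end.

```python
from typing import Iterable, Iterator


def sanitize_header(line: str) -> str:
    """Replace runs of whitespace in a FASTA header with single underscores.

    Lines that are not headers (no leading '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    name = line[1:].strip()
    return ">" + "_".join(name.split())


def sanitize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (without trailing newlines) with sanitized headers."""
    for line in lines:
        yield sanitize_header(line.rstrip("\n"))


if __name__ == "__main__":
    import sys

    # Reads FASTA from stdin and writes the sanitized version to stdout.
    for out_line in sanitize_fasta(sys.stdin):
        print(out_line)
```

Saved under a name of your choice, it could be used as a filter, reading the long reads on stdin and writing the renamed reads to stdout, before the alignment step is launched.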

On a side note, you should probably not run with 90 threads. The PgSA index structure is not thread-safe, and with such a high number a lot of time will probably be wasted waiting on mutexes.

Tell me how it goes.

morispi commented on May 25, 2024

Hey,

Did you face any other problem when running HG-CoLoR?

Cheers,
P

ctxchris commented on May 25, 2024

No further problems after renaming the FASTA headers. The error correction finished just yesterday; CLRgen took quite some time to run. The error rate was reduced from about 20% to about 0.38%. It's just a pity that so much sequence length is lost during error correction. But that's the same for all methods I've seen so far.

Thanks,
Chris

morispi commented on May 25, 2024

Glad to hear.

I know CLRgen is pretty slow at the moment; the fact that PgSA isn't thread-safe wastes a lot of time. I should dive into PgSA's code and try to propose a thread-safe implementation, sooner or later.

I just pushed updates that take whitespace in FASTA headers into account. I also added support for corrected / trim / split long reads. Trimmed long reads should allow you to lose a bit less sequence length, but will add a few bases from the raw long reads when the graph traversals fail, so the error rate will be a bit higher. Once again, it is the same trade-off between quality and length.

Cheers,
Pierre
