
Comments (3)

kcleal commented on May 20, 2024

Hi @agolicz,
Thanks for reporting this. That is much longer than expected, and I will try to help get this fixed. Would you mind checking a few things for me? If possible, could you check how much memory is being consumed? Also, if you have time, would you mind trying to merge just two of your samples rather than the whole cohort? This should only take a minute or so, and it would be useful to know whether it completes in a reasonable time. I have mainly tested merging on larger cohorts of short-read data, so it's possible there is a scaling issue for long-read data.

I had a quick scan of the code, and it looks like there might be a scaling issue if there is a complex region of the genome with lots of SVs that overlap each other, or lots of diversity - this situation is common near centromeric regions in humans, for example. Merging will essentially be an all-vs-all comparison in these regions, which might give rise to the high run time. However, dysgu usually gives these types of rearrangements a low probability, so it's possible you have filtered those out with the flt_vcf.py script?
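
To illustrate why dense regions could dominate the run time, here is a minimal sketch that counts how many pairs of calls fall within the merge distance of each other. It is not dysgu's merge code: the function name `count_candidate_pairs`, the use of start positions only, and the toy coordinates are all assumptions made for illustration. Only the 500 bp default is taken from the merge log.

```python
def count_candidate_pairs(positions, merge_dist=500):
    """Count pairs of calls whose start positions lie within merge_dist.

    Hypothetical helper for illustration only; dysgu's real merging
    logic is not reproduced here.
    """
    positions = sorted(positions)
    n_pairs = 0
    left = 0
    for right, pos in enumerate(positions):
        # Slide the left edge until it is within merge_dist of pos.
        while pos - positions[left] > merge_dist:
            left += 1
        # Every call between left and right is a candidate partner.
        n_pairs += right - left
    return n_pairs


# 1000 calls spread 10 kb apart: no two are close enough to compare.
sparse = [i * 10_000 for i in range(1000)]

# 1000 calls packed into a single 500 bp window: every pair is a candidate.
dense = [1_000_000 + (i % 500) for i in range(1000)]

sparse_pairs = count_candidate_pairs(sparse)
dense_pairs = count_candidate_pairs(dense)
print(sparse_pairs, dense_pairs)
```

With the same number of calls, the packed region produces n * (n - 1) / 2 candidate pairs while the spread-out calls produce none, which is the quadratic behaviour described above.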


agolicz commented on May 20, 2024

Hi,
thanks for the reply.
It actually finished:

2022-06-14 21:50:43,619 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-14 21:54:54,489 [INFO   ]  Merge distance: 500 bp
2022-06-15 08:56:46,338 [INFO   ]  SVs output to stdout
2022-06-15 08:56:46,342 [INFO   ]  Input samples: ['a702', 'a703', 'a705', 'a709', 'a711', 'a714', 'a715', 'a716', 'a717', 'a723', 'a724', 'a726', 'a727', 'a728', 'a729', 'a730', 'a731', 'a732', 'a733', 'a734', 'a735', 'a743', 'a748', 'a762', 'a764', 'a765', 'a776', 'a778', 'a779', 'a783', 'a784', 'a786', 'a790', 'a792', 'a796', 'a797', 'a802', 'a810', 'a815', 'a816', 'a817', 'a818', 'a820', 'a823', 'a824', 'a825', 'a827', 'a828', 'a830', 'a833', 'a834', 'a835', 'a836', 'a838', 'a839', 'a840', 'a841', 'a842', 'a843', 'a845']
2022-06-15 09:08:44,000 [INFO   ]  Sample rows before merge [27524, 39428, 36668, 18995, 42091, 42029, 30433, 26344, 27145, 40743, 1269, 36795, 36315, 43578, 39322, 43658, 35629, 37191, 41179, 29260, 34499, 20557, 39690, 36079, 38506, 42710, 49017, 40979, 41427, 44644, 42835, 40906, 42328, 40995, 32513, 51, 41343, 40148, 39025, 46364, 39965, 20092, 31426, 35372, 37474, 14263, 20316, 33790, 44510, 40746, 37944, 19803, 40951, 32969, 24995, 10389, 24662, 31023, 15909, 26496], rows after 375685
2022-06-15 09:08:44,009 [INFO   ]  dysgu merge complete h:m:s, 11:18:00
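
The timestamps in the log above show where the time went. The following is a small sketch, assuming the log format shown (timestamp in the first 23 characters, message after the `[INFO   ]` tag); the helper names are hypothetical and the two long list messages are abbreviated with `...`.

```python
from datetime import datetime

# Log lines copied from the run above; list contents abbreviated.
LOG_LINES = [
    "2022-06-14 21:50:43,619 [INFO   ]  [dysgu-merge] Version: 1.3.11",
    "2022-06-14 21:54:54,489 [INFO   ]  Merge distance: 500 bp",
    "2022-06-15 08:56:46,338 [INFO   ]  SVs output to stdout",
    "2022-06-15 08:56:46,342 [INFO   ]  Input samples: ...",
    "2022-06-15 09:08:44,000 [INFO   ]  Sample rows before merge ...",
    "2022-06-15 09:08:44,009 [INFO   ]  dysgu merge complete h:m:s, 11:18:00",
]


def parse_stamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def message(line):
    """Return the text after the '[INFO   ]' level tag."""
    return line.split("]", 1)[1].strip()


def stage_gaps(lines):
    """Seconds elapsed before each log line, measured from the previous line."""
    return [
        (message(cur), (parse_stamp(cur) - parse_stamp(prev)).total_seconds())
        for prev, cur in zip(lines, lines[1:])
    ]


gaps = stage_gaps(LOG_LINES)
slowest = max(gaps, key=lambda g: g[1])
total_seconds = (parse_stamp(LOG_LINES[-1]) - parse_stamp(LOG_LINES[0])).total_seconds()

for msg, secs in gaps:
    print(f"{secs:10.1f} s before: {msg}")
```

On these timestamps, almost all of the 11:18:00 falls between the "Merge distance" line and the "SVs output to stdout" line.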

 cat *pass.vcf | grep -v "^#" | wc -l
2013307
grep -v "^#" long.vcf | wc -l
375685

Yes, flt_vcf.py only keeps the variants with PASS.
Can't check exact memory usage but it had to be less than 40G which was the limit.
Just trying two files.

dysgu merge a843.pass.vcf a845.pass.vcf > dm.t.vcf
2022-06-15 11:54:15,586 [INFO   ]  [dysgu-merge] Version: 1.3.11
2022-06-15 11:54:18,394 [INFO   ]  Merge distance: 500 bp
2022-06-15 11:54:34,354 [INFO   ]  SVs output to stdout
2022-06-15 11:54:34,392 [INFO   ]  Input samples: ['a843', 'a845']
2022-06-15 11:54:47,542 [INFO   ]  Sample rows before merge [15909, 26496], rows after 36978
2022-06-15 11:54:47,543 [INFO   ]  dysgu merge complete h:m:s, 0:00:31

11 hrs is not too bad (we're used to that in plants :)). I was just surprised because merging 100 short-read samples was much quicker.

If you are interested in testing merging for long reads, I am planning to run SVJedi to genotype and can report if there are any issues, such as sites with too many missing genotypes.

minimap2+dysgu have done very well in our in-house comparisons for Brassica napus! :)


kcleal commented on May 20, 2024

Glad it finished! I think the runtime is probably caused by high genome complexity in that case. I would be very interested to hear how you get on. Feedback from users is very valuable! If you have not come across it already, Jasmine could also be a useful tool for merging: https://github.com/mkirsche/Jasmine
