Comments (3)
Hi @agolicz,
Thanks for reporting this. That is much longer than expected, and I will try to help get this fixed. Would you mind checking a few things for me? If possible, could you check how much memory is being consumed? Also, if you have time, would you mind trying to merge just two of your samples rather than the whole cohort? This should only take a minute or so, and it would be useful to know whether it completes in a reasonable time. I have mainly tested merging on larger cohorts of short-read data, so it's possible there is a scaling issue for long-read data.
I had a quick scan of the code, and it looks like there might be a scaling issue if there is a complex region of the genome with lots of SVs that overlap each other, or lots of diversity. This situation is common near centromeric regions in humans, for example. Merging will essentially be an all-vs-all comparison in these regions, which might give rise to the high run time. However, dysgu usually gives these types of rearrangements a low probability, so it's possible you have filtered those out with the flt_vcf.py script?
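To illustrate the all-vs-all concern, here is a minimal sketch. It is not dysgu's actual merge code: `count_candidate_pairs` and the toy coordinates are hypothetical, and only the 500 bp merge distance comes from the log output. It counts how many pairs of SV positions fall within the merge distance of each other, which is roughly the number of comparisons a pairwise merge would need to consider.

```python
from bisect import bisect_right


def count_candidate_pairs(positions, merge_dist=500):
    """Count pairs of positions that lie within merge_dist bp of each other.

    Hypothetical helper for illustration only; not part of dysgu.
    """
    pos = sorted(positions)
    pairs = 0
    for i, p in enumerate(pos):
        # Index one past the last position that is <= p + merge_dist
        j = bisect_right(pos, p + merge_dist)
        # Every position between i+1 and j-1 is a candidate partner for p
        pairs += j - i - 1
    return pairs


# 1,000 SVs spaced 10 kb apart: no two are within 500 bp of each other
sparse = [i * 10_000 for i in range(1000)]

# 1,000 SVs packed into a single 500 bp window (toy "complex region")
dense = [1_000_000 + i // 2 for i in range(1000)]

print(count_candidate_pairs(sparse))  # 0
print(count_candidate_pairs(dense))   # 499500, i.e. n * (n - 1) / 2
```

The same number of calls gives zero candidate pairs when spread out and about half a million when clustered, so a handful of dense regions could plausibly dominate the run time.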
Hi,
thanks for the reply.
It actually finished:
2022-06-14 21:50:43,619 [INFO ] [dysgu-merge] Version: 1.3.11
2022-06-14 21:54:54,489 [INFO ] Merge distance: 500 bp
2022-06-15 08:56:46,338 [INFO ] SVs output to stdout
2022-06-15 08:56:46,342 [INFO ] Input samples: ['a702', 'a703', 'a705', 'a709', 'a711', 'a714', 'a715', 'a716', 'a717', 'a723', 'a724', 'a726', 'a727', 'a728', 'a729', 'a730', 'a731', 'a732', 'a733', 'a734', 'a735', 'a743', 'a748', 'a762', 'a764', 'a765', 'a776', 'a778', 'a779', 'a783', 'a784', 'a786', 'a790', 'a792', 'a796', 'a797', 'a802', 'a810', 'a815', 'a816', 'a817', 'a818', 'a820', 'a823', 'a824', 'a825', 'a827', 'a828', 'a830', 'a833', 'a834', 'a835', 'a836', 'a838', 'a839', 'a840', 'a841', 'a842', 'a843', 'a845']
2022-06-15 09:08:44,000 [INFO ] Sample rows before merge [27524, 39428, 36668, 18995, 42091, 42029, 30433, 26344, 27145, 40743, 1269, 36795, 36315, 43578, 39322, 43658, 35629, 37191, 41179, 29260, 34499, 20557, 39690, 36079, 38506, 42710, 49017, 40979, 41427, 44644, 42835, 40906, 42328, 40995, 32513, 51, 41343, 40148, 39025, 46364, 39965, 20092, 31426, 35372, 37474, 14263, 20316, 33790, 44510, 40746, 37944, 19803, 40951, 32969, 24995, 10389, 24662, 31023, 15909, 26496], rows after 375685
2022-06-15 09:08:44,009 [INFO ] dysgu merge complete h:m:s, 11:18:00
cat *pass.vcf | grep -v "^#" | wc -l
2013307
grep -v "^#" long.vcf | wc -l
375685
Yes, flt_vcf.py only keeps the variants with PASS.
Can't check exact memory usage but it had to be less than 40G which was the limit.
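For a future run, one way to get a rough peak-memory figure is to launch the command from a small Python wrapper and read the child's resource usage afterwards. This is only a sketch: `peak_rss_of` is a hypothetical helper, it relies on the Unix-only `resource` module, and the reported value is the largest peak among all child processes the wrapper has waited for.

```python
import resource
import subprocess
import sys


def peak_rss_of(cmd, stdout=None):
    """Run cmd and return (return code, peak resident set size in bytes).

    Hypothetical helper; the peak covers all waited-for child processes.
    """
    proc = subprocess.run(cmd, stdout=stdout)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    scale = 1 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss * scale


# Example usage with the two-sample merge (not executed here):
# with open("dm.t.vcf", "w") as out:
#     rc, peak = peak_rss_of(
#         ["dysgu", "merge", "a843.pass.vcf", "a845.pass.vcf"], stdout=out
#     )
# print(f"exit code {rc}, peak RSS {peak / 1e9:.2f} GB")
```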
Just trying two files.
dysgu merge a843.pass.vcf a845.pass.vcf > dm.t.vcf
2022-06-15 11:54:15,586 [INFO ] [dysgu-merge] Version: 1.3.11
2022-06-15 11:54:18,394 [INFO ] Merge distance: 500 bp
2022-06-15 11:54:34,354 [INFO ] SVs output to stdout
2022-06-15 11:54:34,392 [INFO ] Input samples: ['a843', 'a845']
2022-06-15 11:54:47,542 [INFO ] Sample rows before merge [15909, 26496], rows after 36978
2022-06-15 11:54:47,543 [INFO ] dysgu merge complete h:m:s, 0:00:31
11 hrs is not too bad (we're used to that in plants :)). I was just surprised because merging 100 short-read samples was much quicker.
If you are interested in testing merging for long reads, I am planning to run SVJedi to genotype and can report if there are any issues, such as sites with too many missing genotypes.
minimap2+dysgu have done very well in our in-house comparisons for Brassica napus! :)
Glad it finished! I think the runtime is probably caused by high genome complexity in that case. I would be very interested to hear how you get on, as feedback from users is very valuable! If you have not come across it already, Jasmine could also be a useful tool for merging: https://github.com/mkirsche/Jasmine