Comments (11)
Hi
I just ran 1/4 of the poolseq dataset and everything went smoothly in less than 24 hours, generating 65 million SNPs!
So maybe it is my workstation (128 threads and 512 GB RAM) that doesn't cope with the full dataset. I am going to try with increasing amounts of data to check how far I can go.
osp
from grenepipe.
Hi @ospfsg,
interesting. What exactly do you mean by
It crashed and the terminal closed by himself.
Without any error message, the terminal just closes? That does not seem like anything that would happen from running snakemake or grenepipe, so something stranger must be going on. Could you maybe give me some steps for how I could reproduce this behavior, or describe what is happening in more detail? Or even better, use a screen recording tool to capture a video clip of what is happening?
From your log files, there is nothing that explains the behavior that you describe. I am seeing
Invalid tag name: "technology.-"
in there. Is that name something that shows up anywhere in your data, as a sample name or chromosome name, or somewhere else? Apart from that, there are no clues that would help right now.
I am traveling and won't be able to investigate this issue for another week. I'll then try to reproduce with the freebayes settings that you posted above. Let's see if I can recreate the issue as well. If not, and in the meantime: How big is your data set? Could you maybe prepare a minimal subset of your data that causes the issue, and that I can use to investigate? It seems like a rather involved and complicated issue, if the terminal just closes. I have never seen that, and tools such as freebayes, snakemake (and grenepipe) should usually not do that - hence, I don't really have a good idea what is going on there.
So long, let's see what we can figure out
Lucas
Hi @ospfsg,
how are things going? Did the pipeline work now with a reduced data set? From what you described, I can only guess what has happened - but if it works with smaller data, and fails with larger, then it is rather likely that this is some issue with memory or your workstation, instead of something related to grenepipe itself.
Let me know what you think, and if we can close this issue then.
Cheers and so long
Lucas
Hi Lucas
The problem seems not to be the amount of data, but the ploidy used in freebayes. I am currently running with a ploidy of 4, even though the pool contains 40 samples.
When I use the BAM files that come out of the pipeline (the dedup ones) and try to run them through freebayes directly, I get the same problem.
So the problem is not in your pipeline; it is the capacity of my workstation dealing with freebayes -p 40. It copes up to a ploidy of 30, but apparently not more than that!
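For illustration, the kind of commands I am testing look like this (paths are placeholders, not my actual files; the script only prints the commands instead of running them):

```shell
#!/usr/bin/env sh
# Illustrative freebayes invocations at increasing ploidy (placeholder paths).
# A higher -p makes freebayes enumerate far more candidate genotypes per site,
# which is what seems to exhaust memory somewhere above -p 30 on my machine.
REF=reference.fa
BAM=dedup/pool.bam
for PLOIDY in 4 30 40; do
  CMD="freebayes -f $REF -p $PLOIDY --pooled-discrete $BAM > pool_p${PLOIDY}.vcf"
  echo "$CMD"
done
```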
Does your pipeline natively implement parallelization in freebayes?
Thank you
osp
Hi @ospfsg,
ah okay. So, it seems you want to use freebayes for variant calling on pooled data? In that case, I'd recommend having a look at our preprint, where we show that this can lead to issues and biases down the line. That's because the typical variant callers (all three that grenepipe offers) are not meant for pooled data - and hence their statistical assumptions for detecting what is a variant (in individuals) are not applicable to pools. That's also why these tools are not optimized for the artificially high ploidy that you are using there.
Nonetheless, this type of variant calling on pooled data seems to be common practice. I'd however advise against it. Instead, what we found has been working well, and is also way faster, is to just map the reads to the reference genome you are using. This can for example be done by only running the mapping part of the pipeline, see here.
Then, you can work on the raw counts of how many nucleotides were mapped to each position (often called allele counts), which acts as a proxy for allele frequencies, and can be used for downstream analyses instead of called variants. This also offers a more sophisticated view on the data, where instead of a hard yes/no variant call, you can operate on frequencies, and hence capture more detail.
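As a sketch of what "working on raw counts" means in practice: the pileup line below is made up, and real pileup strings also carry `^`, `$`, and indel tokens that would need stripping first - this is just plain awk on one example line, not a grenepipe or grenedalf feature. The pileup itself would come from something like `samtools mpileup -f reference.fa dedup/pool.bam`.

```shell
#!/usr/bin/env sh
# Parse one (made-up) pileup line into rough allele counts.
# Columns: chrom, pos, ref base, depth, bases seen, base qualities.
LINE='chr1 1042 A 12 ....,,..AAa, IIIIIIIIIIII'
COUNTS=$(echo "$LINE" | awk '{
  ref = toupper($3); bases = toupper($5);
  n_ref = gsub(/[.,]/, "", bases);    # "." and "," encode a ref match
  n_alt = gsub(/[ACGT]/, "", bases);  # explicit bases are non-ref reads
  printf "pos=%s ref=%s ref_count=%d alt_count=%d\n", $2, ref, n_ref, n_alt;
}')
echo "$COUNTS"
```

Dividing `alt_count` by the total gives the per-position allele frequency proxy mentioned above.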
We are also currently in the very last steps of publishing our tool for working with this kind of data, called grenedalf. If you want to run some typical downstream analysis, this might help. If grenedalf is lacking some functionality that you need, let me know. However, wait a few more days before using it - by the end of the week (hopefully) I'll have a new version out, in which many things are improved and corrected compared to the current v0.3.0.
As for your question:
Does your pipeline natively implement parallelization in freebayes?
Variant calling is parallelized over chromosomes. By its very nature, however, it has to be run on all samples at the same time, so there is unfortunately not much more that can be parallelized easily.
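To illustrate what "parallelized over chromosomes" means (region names and paths are placeholders, and the script only prints the jobs rather than launching them):

```shell
#!/usr/bin/env sh
# Sketch of per-chromosome parallelization: each region is an independent
# freebayes job that could run concurrently, but note that every job still
# reads ALL samples -- that is the part that cannot be split further.
REF=reference.fa
BAM=dedup/all_samples.bam
for REGION in chr1 chr2 chr3; do
  JOB="freebayes -f $REF --region $REGION $BAM > calls.$REGION.vcf"
  echo "$JOB"
done
```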
Hope that helps, and so long
Lucas
Hi Lucas
Thanks very much for your advice. I will be happy to test grenedalf.
In your preprint you have an interesting paragraph:
These analyses resulted in prohibitively long runtimes even in cluster environments (GATK HaplotypeCaller) and large memory usage (freebayes), demonstrating these tools' limited capabilities for analyzing large datasets and large pool sizes.
We hence ran the three callers with default ploidy of 2 to study their artifacts in Pool-Seq applications, assuming that other researchers may be required to resort to these default settings.
I assume that the second sentence is about your experiment 2, but did you test the impact of a "fake" low ploidy on the results?
I used the .bam files after Picard, the ones in the dedup folder, which seem to me to be the input for the SNP callers. Can I use them as input for grenedalf?
Thank you
osp
Hi Octavio,
I assume that the second sentence is about your experiment 2, but did you test the impact of a "fake" low ploidy on the results?
Yes, if I recall correctly, that's what we did. See supplementary figures S6 onward in the manuscript :-)
I used the .bam files after Picard, the ones in the dedup folder, which seem to me to be the input for the SNP callers. Can I use them as input for grenedalf?
Yes you can! It seemed rather wasteful and annoying to me to have to convert everything to pileup files first, so in grenedalf, I implemented reading directly from sam/bam files ;-)
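As a rough idea of what that looks like - the subcommand and option names here are assumptions from memory and may differ between grenedalf versions, so please check `grenedalf --help` before relying on them:

```shell
#!/usr/bin/env sh
# Hypothetical grenedalf invocation on the dedup BAMs (printed, not executed).
# Subcommand and option names are NOT verified against a specific release.
CMD="grenedalf frequency --sam-path dedup/pool.sorted.bam --out-dir grenedalf-freq"
echo "$CMD"
```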
Cheers
Lucas
Hi Lucas
is it possible to run grenepipe starting from the BAM files (from dedup) and run the different SNP callers, instead of starting the whole process from the beginning each time we want to use a different caller?
best
osp
Hi @ospfsg,
good question, and you are not the only one with a request like that. Unfortunately, Snakemake itself is not meant for situations where changes from the outside are needed in the middle of a workflow run. Within grenepipe, I've had users who wanted similar things ("start from here", "re-use these files", etc.), but always in a slightly different way. That means I cannot easily integrate this as a feature: it would need to be so flexible to accommodate all these use cases that it would be both hard to implement and hard to use...
So, long story short, the current solution for use cases like that is to trick Snakemake itself into doing what you want. Depending on how you want to do this, it can be a bit tricky, but it is possible.
The straightforward approach is as follows: Run the pipeline with a variant caller. Then, move the directories that contain all variant-caller-related files somewhere else. These directories are `called`, `filtered`, and `genotyped`. Then, change the `config.yaml` to the next variant caller you want to try, and run the pipeline again. Snakemake will recognize that all the mapping files are there, but that the calling files are not, and hence only run those parts. This approach is the easiest one, and I'd recommend doing that.
A couple of tips for this approach: Before running Snakemake again for the other callers, I'd first do a dry run with the extra flags `-nq` to check that Snakemake actually does not want to re-run the trimming and mapping. Also, I'd copy the `config.yaml` file of each run to the same directory where you move the `called`, `filtered`, and `genotyped` directories to, so that the config file that created those variant calls is preserved as well.
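Put as commands, the sequence looks roughly like this (all paths and config contents are placeholders; the directory layout is faked with `mkdir` here so the steps can be tried end to end):

```shell
#!/usr/bin/env sh
# Sketch of the caller-swapping trick with a faked directory layout.
set -e
mkdir -p run/called run/filtered run/genotyped
printf 'calling-tool: freebayes\n' > run/config.yaml   # stand-in config content

# 1. Park the caller-specific output of the finished run, config included.
mkdir -p archive/freebayes
mv run/called run/filtered run/genotyped archive/freebayes/
cp run/config.yaml archive/freebayes/

# 2. Edit config.yaml to point at the next caller (done by hand in practice).
printf 'calling-tool: bcftools\n' > run/config.yaml

# 3. Dry-run first, to confirm only the calling steps would be re-executed:
#    snakemake -nq
ls archive/freebayes
```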
The only downside is that you can only run the pipeline with one variant caller (or one config setting in general) at a time. For the three callers implemented, I think this is good enough. If you'd need to scale this up, to test many settings of each caller, etc, a more scalable approach would be needed, where you can run everything at once, and don't need to manually copy directories each time. If you need that instead, let me know.
Hope that helps, and so long
Lucas