Comments (12)

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024 1

Ahh, Ok. That setting won't change for me, so hopefully I'll only need to rerun the variant calling this one time.

Thanks for your help!

from grenepipe.

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024

Never mind.

lczech avatar lczech commented on July 30, 2024

Hey @Brent-Saylor-Canopy,

did it work in the end? Usually, this should work - as you said, snakemake is good at this. But I never explicitly tested this, so I'd be curious to hear your feedback!

Cheers and so long
Lucas

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024

It does work. I had initially copied an input file that was older than the output file, but that was fixed with a touch command.

The only error that comes up is that filtered/all.vcf.gz is write protected, so that needs to be deleted before it can be regenerated.
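For reference, the touch step can also be done from Python. This is only a minimal sketch: the helper name `touch_inputs` is hypothetical and not part of grenepipe, and it assumes the goal is simply to make the copied input files look newer than the existing outputs.

```python
import os
import time
from pathlib import Path


def touch_inputs(paths):
    """Set access and modification time of each existing file to now, like `touch`.

    Hypothetical helper for illustration; not part of grenepipe or Snakemake.
    """
    now = time.time()
    for path in map(Path, paths):
        # Unlike `touch`, missing files are skipped instead of being created.
        if path.is_file():
            os.utime(path, (now, now))
```

Skipping missing files is a deliberate choice here, so that a typo in a path does not silently create an empty read file.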

lczech avatar lczech commented on July 30, 2024

Ah thanks, the touch makes sense :-)

As for the write protected file: I think that it is good to keep it that way, in order to avoid accidental overwriting, meaning that users need to make sure that they actually want this.
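To make concrete what "write protected" means at the file-system level, here is a small Python sketch. It assumes a POSIX-style permission model, where write protection amounts to clearing the write permission bits; the function names are made up for this example, and the way grenepipe actually marks its outputs may differ.

```python
import os
import stat
from pathlib import Path

# All three write permission bits: user, group, other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def is_write_protected(path):
    """Return True if no write permission bit is set on the file."""
    return (Path(path).stat().st_mode & WRITE_BITS) == 0


def protect(path):
    """Clear all write bits, so the file cannot be overwritten by accident."""
    mode = Path(path).stat().st_mode
    os.chmod(path, mode & ~WRITE_BITS)


def unprotect(path):
    """Give the owner write permission again (a deliberate, explicit step)."""
    mode = Path(path).stat().st_mode
    os.chmod(path, mode | stat.S_IWUSR)
```

Deleting the file, as described above, remains the more explicit way to say that a result should really be recomputed; `unprotect` only illustrates the alternative of restoring write permission.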

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024

Yes, the touch is easy enough. The other option would be to add a step where links are made to each of the reads and updated on each run. That way the timestamp on the files corresponds to when the analysis was run rather than when the file was created.
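A rough Python sketch of that linking step, under stated assumptions: `relink_reads` and the link directory are hypothetical names, and whether Snakemake would take the link's own timestamp or the target's timestamp into account is not something this sketch settles.

```python
from pathlib import Path


def relink_reads(read_files, link_dir):
    """Create fresh symlinks to each read file in link_dir and return them.

    Hypothetical helper for illustration; not part of grenepipe.
    """
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in map(Path, read_files):
        link = link_dir / src.name
        # Remove a link left over from a previous run, so the new link
        # carries a timestamp from the current run.
        if link.is_symlink():
            link.unlink()
        link.symlink_to(src.resolve())
        links.append(link)
    return links
```

The link's own modification time can be read with `os.lstat(link).st_mtime`, while `os.stat(link).st_mtime` follows the link and reports the time of the original read file.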

Yes, the file protection makes sense. I'm not sure how many people will have a similar use case to mine.

lczech avatar lczech commented on July 30, 2024

Hm, interesting idea to use symlinks. It might also solve some file naming issues with downstream tools. I am not entirely sure though that it would not also introduce new issues - I'll have to think about this. But thanks for the suggestion!

Also, here is another way to solve this: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#snakemake-does-not-trigger-re-runs-if-i-add-additional-input-files-what-can-i-do
(the "snakemake" way).

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024

I've encountered another problem while trying a larger-scale test of adding new samples and rerunning the pipeline.

I keep getting an error when the call_variants rule is launched, stating:

ProtectedOutputException in line 38 of /mnt/Data1/GBS_data/grenepipe/rules/calling-haplotypecaller.smk:
Write-protected output files for rule call_variants:
called/Sample23.10A.g.vcf.gz

This seems to happen when the rule is launched. I'm not sure why, but the pipeline seems to be re-calling the variants for every sample, not just the new ones.

lczech avatar lczech commented on July 30, 2024

Not sure that I understand your question here.

Is the issue that it fails with the exception about write protected files? Because that is intentional: I've marked these files as write-protected in order to avoid accidental re-computation. Hence, in cases where you want to compute them again (which does not seem to be what you want here...), you'd have to delete them manually first - this is meant as a protection from mistakes that could otherwise lead to expensive re-computation.

If your question however is why these files are being re-computed in the first place: As you noted before, snakemake works by comparing timestamps of files and rules, and re-runs downstream rules if their input is newer than their output. I cannot tell from the information that you provided what exact chain of updates leads snakemake to want to do this, but you can call the pipeline with the -n --reason flags: this is a dry run (-n) that prints, for each rule that would be executed, the reason why. It might be that your input sample files were somehow updated, or some intermediate files changed.
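The timestamp rule can be written down as a few lines of Python. This is a deliberately simplified model, not Snakemake's actual implementation: the function name `needs_rerun` is made up for this example, and Snakemake may consider further triggers beyond file timestamps.

```python
from pathlib import Path


def needs_rerun(inputs, output):
    """Simplified model of the timestamp rule: re-run if the output is
    missing or if any input file is newer than the output."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)
```

In this model, an input that was copied with an old timestamp never triggers a re-run, while a touched or regenerated input always does, which then propagates to everything downstream of it.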

Let me know if this helped and if you have further questions :-)

Brent-Saylor-Canopy avatar Brent-Saylor-Canopy commented on July 30, 2024

My question is why they are being recomputed at all. It is running call_variants on both the 100 samples that were run previously, and the 20 samples I added. I would expect that variants would only need to get called for the new 20 samples. The only thing I changed was to add new samples and change the "known-variants" setting in the config file. I'll have to test it out on another run. I removed the write protection on the files for now so I could get the updated results.

Would changing the config parameter "known-variants" cause each sample to get call_variants run on them again?

lczech avatar lczech commented on July 30, 2024

Ah yes, that is the reason then! The variant calling takes these known variants into account and thus produces different output, so it needs to be repeated. You can check with the --reason flag as well, as there might be additional reasons, but this is definitely one of them!

lczech avatar lczech commented on July 30, 2024

For anyone finding this in the future: In recent versions, I have removed the file write-protection, because users were confused by it. This comes with the risk of unnecessary computation, but that can easily be checked beforehand by running snakemake with -n or -nd for a dry-run to check that the rule executions are as expected.
