
hafpipe-line's People

Contributors

greensii, lczech, stilk

Forkers

lczech

hafpipe-line's Issues

Task 1 does not create SNP table unless FILTER "PASS" is present

Hi there,

I've been debugging why my SNP tables were empty (except for the header) after Task 1. It seems that here, the script requires the word PASS to be present in the FILTER column of the VCF. In practice, however, many tools produce a missing value indicated by a dot (.), which should also be accepted.

I'd hence suggest also accepting . in that line of the script. If that is not desirable, it might at least be useful to issue a warning when a VCF contains many . entries instead of PASS, so that users can more easily figure out what's wrong with their VCF.
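
For illustration, a relaxed filter could look something like this (just a sketch of the idea; the actual extraction line in make_SNPtable_from_vcf.sh does more than this):

zcat "$vcf" | awk -F'\t' '
    /^#/ { next }                        # skip VCF header lines
    $7 == "PASS" || $7 == "." { print }  # accept both PASS and missing FILTER
'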

Cheers
Lucas

Wrong file naming for `simpute` method

Hi there!

There is a mismatch between the file names of the imputed table when using Task 2 with "simpute", and what Task 3 then expects as input.

In particular, see here for the command that calls Task 2 to create a table with the suffix .simpute. However, Task 2 actually names this file with the suffix .imputed instead, see here. Then, when Task 3 is called (see here), it does not find the table, as it is looking for the file with the suffix .simpute. At the moment, HAFpipe needs to be restarted with --impmethod imputed instead of simpute to work around this, which does not seem ideal.
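
Until this is fixed, a small shim in a calling script can paper over the mismatch; this is just a sketch, and the table path is an example, not the wrapper's actual variable:

# Task 2 writes <table>.imputed, but Task 3 looks for <table>.simpute;
# rename the file if only the former exists
snptable="out/founderGenotypes.segregating.snpTable"
if [ -e "${snptable}.imputed" ] && [ ! -e "${snptable}.simpute" ]; then
    mv "${snptable}.imputed" "${snptable}.simpute"
fi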

Cheers
Lucas

Error in `rownames<-`(`*tmp*`, value = logical(0))

Hi there,

I was trying to get HAFpipe to run on my Arabidopsis data, using this script:

#!/bin/bash

cd "$(dirname $0)/"
BASEDIR=`pwd`
export PATH="${BASEDIR}/harp_linux_140925_103521/bin:$PATH"
export PATH="${BASEDIR}/HAFpipe-line-master:$PATH"
HAFPIPE="${BASEDIR}/HAFpipe-line-master/HAFpipe_wrapper.sh"

OUTDIR="out"
mkdir -p ${OUTDIR}

# cd HAFpipe-line-master

FOUNDER="/home/lucas/Projects/1001g/1001gbi.recode.vcf.gz"
SUBSET="/home/lucas/Projects/grenedalf-paper/haf-pipe/founders.txt"

REF="//home/lucas/Projects/grenedalf-paper/reference/TAIR10_chr_all.fa"
BAM="/home/lucas/Projects/grenedalf-paper/benchmark-real/mapped/S1-1.sorted.bam"

${HAFPIPE} \
    -t 1,2,3,4 \
    -v ${FOUNDER} \
    -u ${SUBSET} \
    -c 1 \
    -s ${OUTDIR}/founderGenotypes.segregating.snpTable \
    -i simpute \
    -b ${BAM} \
    -e sanger \
    -g 2     \
    -r ${OUTDIR}/dmel_ref_r5.39.fa \
    -o ${OUTDIR}

which produced this log file:

        Tue 08 Jun 2021 03:57:35 PM PDT
        ####PARAMETERS########
        --tasks: 1
        --scriptdir /home/lucas/Projects/grenedalf-paper/haf-pipe/HAFpipe-line-master
        --outdir out
        --vcf /home/lucas/Projects/1001g/1001gbi.recode.vcf.gz
        --chrom 1
        --snptable out/founderGenotypes.segregating.snpTable
        --mincalls 2
        
        --subsetlist --subsetlist /home/lucas/Projects/grenedalf-paper/haf-pipe/founders.txt
        --impmethod .simpute
        --nsites 20
        --bamfile /home/lucas/Projects/grenedalf-paper/benchmark-real/mapped/S1-1.sorted.bam
        --refseq out/dmel_ref_r5.39.fa
        --encoding sanger
        --generations 2
        --recombrate .0000000239
        --quantile 18
        --winsize 

        #####COMMANDS#######




COMMAND: /home/lucas/Projects/grenedalf-paper/haf-pipe/HAFpipe-line-master/make_SNPtable_from_vcf.sh -v /home/lucas/Projects/1001g/1001gbi.recode.vcf.gz -c 1 -s out/founderGenotypes.segregating.snpTable --mincalls 2 --subsetlist /home/lucas/Projects/grenedalf-paper/haf-pipe/founders.txt 
making snptable for chrom 1 from /home/lucas/Projects/1001g/1001gbi.recode.vcf.gz, starting from column 10
subsetting snp file out/founderGenotypes.segregating.snpTable to 50 selected founders and writing to out/founderGenotypes.segregating.snpTable.subset
50  fields extracted,  50  fields and 0 sites in new SNP file:
        out/founderGenotypes.segregating.snpTable.subset
allele counts written to:
out/founderGenotypes.segregating.snpTable.alleleCts
making numeric version of out/founderGenotypes.segregating.snpTable
bgzipping out/founderGenotypes.segregating.snpTable.numeric
SNP table written to:
out/founderGenotypes.segregating.snpTable
COMMAND: /home/lucas/Projects/grenedalf-paper/haf-pipe/HAFpipe-line-master/impute_SNPtable.sh out/founderGenotypes.segregating.snpTable
counting alleles in 

(the last line seems to have an empty file name printed!)

but then stopped with these errors:

./run.sh 
Error in `rownames<-`(`*tmp*`, value = logical(0)) : 
  attempt to set 'rownames' on an object with no dimensions
Execution halted
tail: cannot open 'out/founderGenotypes.segregating.snpTable.numeric' for reading: No such file or directory

At this point, it simply hangs and does not continue.
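
Given that the log above reports "0 sites in new SNP file", I suspect the R script chokes on an empty table. A guard along these lines before the imputation call might at least turn the hang into a clear error (a sketch; the file name is taken from the log above):

# fail fast if the subsetted SNP table contains only the header line
table="out/founderGenotypes.segregating.snpTable.subset"
if [ "$(head -2 "$table" | wc -l)" -lt 2 ]; then
    echo "ERROR: SNP table $table contains no sites" >&2
    exit 1
fi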

Here are the files produced so far: out.zip

Any idea what is going on here?
Thanks for the support and all the best
Lucas

Running on founders with VCF without ##contig

Hi there,

A little suggestion: the wrapper script line

if [ -z $chrom ] || [ ! $(zcat $vcf | head -5000 | grep ^##contig | grep "ID=${chrom},") ];  then echo "ERROR: must choose valid chromosome"; exit 1; fi

(see here) checks that the founder VCF contains ##contig header fields - which not all VCFs do. I needed to comment out this line to work on my data. Maybe it is worth considering obtaining the chromosome names by some other means that does not require the VCF to contain these headers?

Edit: I probably should have mentioned this as well: as far as I understand the code, this is possible because the chromosome is used in make_SNPtable_from_vcf.sh in a way that does not require the ##contig headers to be present.
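
Something like the following could validate the chromosome without relying on ##contig headers (a sketch; the tabix variant assumes a bgzipped, indexed VCF):

# check the data lines instead of the ##contig headers
if ! zcat "$vcf" | grep -v '^#' | cut -f1 | grep -qx "$chrom"; then
    echo "ERROR: must choose valid chromosome"; exit 1
fi

# or, much faster if the VCF is tabix-indexed:
# tabix -l "$vcf" | grep -qx "$chrom" || { echo "ERROR: must choose valid chromosome"; exit 1; }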

Cheers
Lucas

Harp installation

Hello,

I am having a hard time installing harp on my Mac and am wondering if you have any tips for installation - I figured I'd ask here first, because the harp repository is quite old and I am unsure whether its developers are still responsive.

The error I get when trying to build harp is:

In file included from preprocess_pcr_fastq.cpp:36:
./BAMFile.hpp:41:10: fatal error: 'tr1/memory' file not found
#include <tr1/memory>
         ^~~~~~~~~~~~
1 error generated.

    "g++"   -O0 -fno-inline -Wall -Werror -g     -c -o "../build/gcc-4.2.1/debug/link-static/preprocess_pcr_fastq.o" "preprocess_pcr_fastq.cpp"

...failed gcc.compile.c++ ../build/gcc-4.2.1/debug/link-static/preprocess_pcr_fastq.o...
...skipped <p../build/gcc-4.2.1/debug/link-static/runtime-link-static>preprocess_pcr_fastq for lack of <p../build/gcc-4.2.1/debug/link-static>preprocess_pcr_fastq.o...
...skipped <p../bin>preprocess_pcr_fastq for lack of <p../build/gcc-4.2.1/debug/link-static>preprocess_pcr_fastq.o...
...failed updating 48 targets...
...skipped 67 targets...

The same error occurs for all files that include tr1/memory.

Searching Stack Overflow and other places suggests that this is an incompatibility between the old code and modern compilers (link).
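
For what it's worth, the usual suggestion for this class of error is to rewrite the TR1 includes and namespaces to their C++11 equivalents before building, roughly like this (untested on harp; BSD sed syntax as used on macOS):

# rewrite TR1 includes and namespaces to their C++11 equivalents;
# the build then also needs C++11 enabled (e.g. cxxflags -std=c++11)
find . \( -name '*.hpp' -o -name '*.cpp' \) -print0 | xargs -0 sed -i '' \
    -e 's|<tr1/memory>|<memory>|g' \
    -e 's/std::tr1::/std::/g'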

There is a note at the end of the Jamroot file:

#
# static link issue: 
#   - by default, XCode gcc links to .dylib if it finds one, even if it found a static lib first in
#     a different directory (!)
#   - darwin toolset doesn't allow <link>static specification for searched libs
#   - this means that */<link>static has no effect on OSX
#
# OSX workaround: remove .dylibs from ALL searched dirs
# alternate OSX workaround: give full filename of lib*.a on g++ command line
#
# linux: boost_filesystem/<link>static does indeed link to the boost_*.a files
#

But I am unsure how to implement this suggestion.
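If I read that note correctly, the "alternate OSX workaround" amounts to naming the static archives by full path on the link line instead of using -l flags, roughly like this (the library paths are assumptions for a typical Homebrew Boost install):

# pass the .a files directly, so the linker cannot substitute a .dylib
g++ -o preprocess_pcr_fastq \
    ../build/gcc-4.2.1/debug/link-static/preprocess_pcr_fastq.o \
    /usr/local/lib/libboost_program_options.a \
    /usr/local/lib/libboost_filesystem.a \
    /usr/local/lib/libboost_system.a
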

Any help would be greatly appreciated!

Thanks

Task 4 does not allow for parallel execution when using Task 2 (imputation)

This is a tricky one!

When using Task 2 (with "simpute" or "npute"), the alleleCts and numeric files are not automatically generated from the imputed SNP table. Hence, in Task 4, when those files are needed, their existence is checked, and they are computed on the fly if missing, see here.

However, when running HAF-pipe on a cluster, or in some other form of parallel execution, this can lead to multiple attempts to create these files in parallel, producing corrupted files. Assume we have two samples (BAM files). When Task 4 runs for the first sample, it starts creating the alleleCts and numeric files. If Task 4 for the second sample starts shortly after, it finds these files already present and begins computing HAFs for that sample - but with files that are not yet fully written, which leads to errors.

One fix could be to move that check to the end of the imputation step, which is only executed once per SNP table.
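
Alternatively, the on-the-fly creation in Task 4 could be serialized with a file lock, so that concurrent jobs wait instead of racing (a sketch using flock; it assumes all jobs see the same lock file on a shared filesystem, and snptable is an example variable):

(
    flock -x 200   # exclusive lock; concurrent Task 4 runs block here
    if [ ! -e "${snptable}.alleleCts" ]; then
        : # ... create the alleleCts and numeric files as Task 4 does now ...
    fi
) 200>"${snptable}.lock"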

Cheers
Lucas

Task 1 creation of numeric SNP table needs an unreasonable amount of memory

Hi there,

the Task 1 script numeric_SNPtable.R currently reads in the whole table at once in order to create a numeric version of the ref/alt bases. For one of our datasets, this needs around 160 GB of memory per chromosome, which makes it inconvenient to work with. As far as I understand what the script does, this could easily be done by iterating over the file line by line instead, for example in Python.
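
To sketch the idea (in awk rather than Python, since it streams just as naturally): assuming a comma-separated table with position, ref allele, and one call per founder, and a simple 0/1 coding - the actual column layout and coding scheme of numeric_SNPtable.R may well differ:

# streaming conversion: constant memory per line instead of the whole table
awk -F',' '
    NR == 1 { print; next }              # pass the header line through
    {
        out = $1 "," $2
        for (i = 3; i <= NF; i++)        # code each founder call on the fly
            out = out "," (($i == $2) ? 0 : 1)
        print out
    }
' "$snptable" > "${snptable}.numeric"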

If I get to it, I might write that myself, but wanted to open this issue here first as a point of reference.

Cheers
Lucas
