angsd-wrapper / angsd-wrapper Goto Github PK

View Code? Open in Web Editor NEW

14.0 9.0 4.0 13.73 MB

Utilities for analyzing next generation sequencing data.

License: MIT License

R 31.29% Shell 68.71%

software ngs-analysis ngs visualization gui r bash shell wrapper population-genetics

angsd-wrapper's Introduction

ANGSD-wrapper

Active Development

As of October 2020, ANGSD-wrapper will be undergoing active development by Samuel Hamann to improve the project. Some areas of improvement include:

Updating underlying dependencies to work with the newest stable ANGSD version (currently 0.933 based upon the Github repo)
Environment and dependency management via bioconda
Improving performance via parallelization

ANGSD-wrapper is a utility developed to aid in the analysis of next generation sequencing data. Users can do the following with this suite:

Calculate a site frequency spectrum
Calculate a 2D site frequency spectrum with corresponding F_ST estimations
Perform ABBA/BABA tests
Extract a FASTA sequence from BAM files
Calculate genotype likelihoods
Estimate Thetas and various neutrality statistics
Calculate per-individual inbreeding coefficient
Find admixture proportions

Likelihood based approaches are used in ANGSD to calculate summary statistics from next generation sequencing data. The wrapper scripts and documentation are designed to make ANGSD user-friendly.

Installing ANGSD-wrapper

To install ANGSD-wrapper, download from GitHub

git clone https://github.com/ANGSD-wrapper/angsd-wrapper.git

Go into the ANGSD-wrapper directory

cd angsd-wrapper/

Run the setup command

./angsd-wrapper setup dependencies

Download the example dataset (optional)

./angsd-wrapper setup data

Finish the installation

source ~/.bash_profile

A note about BAM files

ANGSD requires BAM files as input, and ANGSD-wrapper passes a list of BAM files to ANGSD. These BAM files have a few requirements:

The BAM files must have an '@HD' header line
The BAM files must be indexed (.bai)

To see whether or not the BAM files have an '@HD' header line, run the following on your list of samples:

for sample in `cat ~/path/to/sample_list.txt`
do
    echo $sample
    samtools view -H $sample | head -1
done

If any samples start with '@SQ' instead of '@HD', ANGSD and ANGSD-wrapper will fail. This Gist will add an @HD header lines to your BAM files.

The index files must be generated after the BAM files. To index the BAM files using SAMTools, run the following on your sample list:

for sample in `cat ~/path/to/sample_list.txt`
do
    samtools index $sample
done

If you have GNU Parallel installed on your system, this process can be sped up:

cat ~/path/to/sample_list.txt | parallel samtools index {}

Basic usage

To run ANGSD-wrapper, run

angsd-wrapper <wrapper> <config>

Where wrapper is one of the methods that ANGSD-wrapper can run and config is the relative path to the corresponding configuration file.

To see a list of available wrappers, run

angsd-wrapper

Configuration files

There is a configuration (config) file for each method available through angsd-wrapper. The configuration files hold variables used by the wrappers. This is where you need to modify and save the variables (i.e., specify filepaths of indexed BAM files/CRAM files, FASTA files, sample lists, etc.) to suit your samples before running angsd-wrapper with a specified method.

The default config files can be found in the Configuration_Files directory. You will need to modify them to suit your samples. Please refer to the config files or the wiki to see what each variable is used for and how they should be specified. If you run angsd-wrapper without any arguments, it will return a usage message.

Example config files can be found in Example_Data/Configuration_Files upon running angsd-wrapper setup data.

Futher Information

For more information about ANGSD-wrapper, the methods availble through ANGSD-wrapper, and a comprehensive tutorial, please see the wiki.

Dependencies

This package requires the following dependencies:

These are downloaded and installed automatically when angsd-wrapper is installed

There are a few other dependencies that are not automatically downloaded during the installation:

Supported methods

Citing ANGSD-wrapper

ANGSD-wrapper was published in Molecular Ecology Resources; if you use this in your work please cite the paper. For BibTeX users, the citation is as follows:

@article {MEN:MEN12578,
author = {Durvasula, Arun and Hoffman, Paul J. and Kent, Tyler V. and Liu, Chaochih and Kono, Thomas J. Y. and Morrell, Peter L. and Ross-Ibarra, Jeffrey},
title = {angsd-wrapper: utilities for analysing next-generation sequencing data},
journal = {Molecular Ecology Resources},
issn = {1755-0998},
url = {http://dx.doi.org/10.1111/1755-0998.12578},
doi = {10.1111/1755-0998.12578},
pages = {n/a--n/a},
keywords = {domestication, population genetics, software, visualization, Zea},
year = {2016},
}

angsd-wrapper's People

Contributors

Stargazers

Watchers

Forkers

callithrix-omics editt nitinra sarangian

angsd-wrapper's Issues

How many iterations does the Inbreeding Coefficient run?

How many iterations does the Inbreeding Coefficient run? I tried reading the documentation but have not found the info. I am running it for a big dataset, it has been running for more than 5 days and is up to iteration 126. I see that its set at max_iters: 1500 min_iters: 10 and in that case it will not be able to finish before running out of time (max 7 days on our server). Is there a way to change this? Does the program need to do all 1500 iterations for the results to be usable?

.clst files for shiny graphing for PCA plots

Trying to use '.clst' clusters file to color code subgroups. I must be coding my cluster file incorrectly. I should only have one 'gross' cluster, but there are three, see image. Thanks in advance. Loving using this wrapper!

FID	IID	CLUSTER
HeliagrossG5_11_4	1	gross
HeliagrossG9_9_4	1	ang
HeliavertA5_11_7	1	vert
HeliavertE11_9_8	1	miss
HeliavertE12_9_5	1	miss
HeliavertG2_9_3	1	vert
HeliavertG3_9_7	1	vert
HeliavertG4_9_6	1	miss
HeliavertH12_11_3	1	miss

Fst wrapper - graphing issue

Just want to report what I think is a bug in the code for the Fst wrapper. It seems that the FST.sh script moves the shared.pos file one directory up. However, at the stage where the script is preparing the files for graphing, it still calls the file from ${OUT}/shared.pos. If I'm reading this right, I think shared.pos is at the same level as ${OUT} rather than within it. This triggers an error at the end of the FST wrapper.

[Problem opening file: '-fold'] --> angst-wrapper SFS ./Site_Frequency_Spectrum_Config in online tutorial

I followed the installation instructions and the tutorial instructions but am running into an error when I try to run the Site Frequency Spectrum.

angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

This is my output and it appears it's failing when trying to find the file needed to fold (or not fold) the spectrum, but it can't.

WRAPPER: Zipping advanced arguments onto basic ones

        -> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Oct 31 2021 11:04:52)
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
        -> (Using Filipe G Vieira modification of: abcSaf.cpp)
        -> Parsing 11 number of samples
        -> Region lookup 1/1

        -> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:17551496 chunknumber 1100
        -> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:19386992 chunknumber 2000 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:22395913 chunknumber 3200 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
        -> Printing at chr: 10 pos:24004662 chunknumber 3600 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:24908040 chunknumber 4000
        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads

        -> npools:26 unfreed tnodes before clean:0
        -> Output filenames:
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.arg"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.mafs.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.geno.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx"
        -> Sun Oct 31 12:08:56 2021
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  199.08 sec
        [ALL done] walltime used =  130.00 sec
        -> Version of fname:/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx is:2
        -> Assuming .saf.gz file: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz
        -> Assuming .saf.pos.gz: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz
        -> Problem opening file: '-fold'

Looking at the wrapper shell script (Site_Frequency_Spectrum.sh) it appears that is failing in the final section of the script in the middle of a series of pipes to the final file which does get output in my scratch directory, it's just empty.

#!/usr/bin/env bash

set -e
set -o pipefail

#   Load variables from supplied config file
source "$1"

#   Are we using Common_Config? If so, source it
if [[ -f "${COMMON}" ]]
then
    source "${COMMON}"
fi

#   Where is angsd-wrapper located?
SOURCE=$2

#   Where is ANGSD?
ANGSD_DIR=${SOURCE}/dependencies/angsd

#   Variables created from transforming other variables
#       The number of individuals in the taxon we are analyzing
N_IND=$(wc -l < "${SAMPLE_LIST}")
#       How many inbreeding coefficients are supplied?
N_F=$(wc -l < "${SAMPLE_INBREEDING}")
#       For ANGSD, the actual sample size is twice the number of individuals, since each individual has two chromosomes.
#       The individual inbreeding coefficents take care of the mismatch between these two numbers

#   Perform a check to see if number of individuals matches number of inbreeding coefficients
if [ "${N_IND}" -ne "${N_F}" ]
then
    echo "Mismatch between number of samples in ${SAMPLE_LIST} and ${SAMPLE_INBREEDING}"
    exit 1
fi

#   Check to see if ancestral state is supplied: If not, polarize samples using
#   the reference sequence and generate folded saf.
if [ ! -f "${ANC_SEQ}" ]
then
    echo "Ancestral state data not found, using reference sequence to polarize alignment data. BAQ will likewise not be calculated."
    if [ ! -f "${REF_SEQ}" ]
    then
        echo "No reference sequence supplied, unable to perform calculations."
        exit 2
    else
        ANC_SEQ=$REF_SEQ
        REF_SEQ=
        BAQ=0
        FOLD=1
    fi
else
    FOLD=0
fi

#   Create outdirectory
OUT="${SCRATCH}"/"${PROJECT}"/SFS
mkdir -p "${OUT}"

#   Now we actually run the command, this creates a binary file that contains the prior SFS
if [[ -f "${OUT}"/"${PROJECT}"_SFSOut.mafs.gz ]] && [ "$OVERRIDE" = "false" ]
then
    echo "WRAPPER:maf already exists and OVERRIDE=false, skipping angsd -bam..."
else
    #   Do we have a regions file?
    if [[ -f "${REGIONS}" ]]
    then
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -rf "${REGIONS}" \
            -doPost "${DO_POST}")
    #   Are we missing a definiton for regions?
    elif [[ -z "${REGIONS}" ]]
    then
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}")
    #   Assuming a single region was defined in config file
    else
	WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -folded "${FOLD}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}" \
            -r "${REGIONS}")
    fi
fi
# Check for advanced arguments, and overwrite any overlapping definitions
FINAL_ARGS=($(source "${SOURCE}/Wrappers/Arg_Zipper.sh" "${WRAPPER_ARGS}" "${ADVANCED_ARGS}"))
# DEBUGGING
# echo "Wrapper arguments: ${WRAPPER_ARGS}" 1<&2
# echo -e "Final arguments:" ${FINAL_ARGS} 1<&2

"${ANGSD_DIR}"/angsd "${FINAL_ARGS[@]}"

"${ANGSD_DIR}"/misc/realSFS \
    "${OUT}"/"${PROJECT}"_SFSOut.saf.idx \
    -P "${N_CORES}" \
    -fold "${FOLD}" \
    > "${OUT}"/"${PROJECT}"_DerivedSFS.graph.me`

I can also include my configuration file if helpful (Site_Frequency_Spectrum_Config) which also directs the script to another configuration file in the same directory (Common_Config), but I'm wondering whether anyone else has run into this error while trying to move through this tutorial before. I am trying to figure out if this is a file path issue or if the SFS is not running correctly and there is some other error in the output file I am not identifying correctly.

Adding more filters than are in the Common_config

Can I somehow add the -minIndDepth filter (Only use site if at least minInd of samples has this minimum depth) to the Common_config file or other config files? I have found it to be very important to deal with missing data when comparing ancient and modern genomes.

I would also like to add these filters which I have also found to be important for my dataset
-setMinDepth
-setMaxDepth
-rmTrans
-minMaf
-SNP_pval 1e-6

issue running shiny graphing

after I install angsd-wrapper and try to run shiny I get the following error. Any suggestions?

MBP-de-Joanna:angsd-wrapper joannamalukiewicz$ angsd-wrapper shiny graphing
angsd-wrapper running from /Applications/angsd-wrapper

Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

format.pval, units

Attaching package: ‘ape’

The following object is masked from ‘package:Hmisc’:

zoom

Attaching package: ‘DT’

The following objects are masked from ‘package:shiny’:

dataTableOutput, renderDataTable

Fehler: With R version 3.5 or greater, install Bioconductor packages using BiocManager; see https://bioconductor.org/install
Ausführung angehalten

compilation error (abcHWE.cpp:272:8: error: ‘isnan’ was not declared in this scope)

Hello,
I am trying to install this tool on Linux 4.15.0-147-generic #151-Ubuntu. The error I am getting is

abcHWE.cpp: In member function ‘void abcHWE::HWE_EM(double*, double*, int)’:
abcHWE.cpp:272:8: error: ‘isnan’ was not declared in this scope
     if(isnan(newFreq1*0.5 + newFreq2)){
        ^~~~~
abcHWE.cpp:272:8: note: suggested alternative:
In file included from analysisFunction.h:6:0,
                 from abc.h:8,
                 from abcFreq.h:2,
                 from abcHWE.cpp:17:
/usr/include/c++/7/cmath:639:5: note:   ‘std::isnan’
     isnan(_Tp __x)
     ^~~~~
Makefile:47: recipe for target 'abcHWE.o' failed
make: *** [abcHWE.o] Error 1

which was previously reported. I installed libgeos-dev, but it did not solve the problem

stelo@H4:~/sw/angsd-wrapper$ sudo apt install libgeos-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
libgeos-dev is already the newest version (3.6.2-1build2).

Install issue;gsl

Hello,
I realize that there is active development going on but wanted to make sure you were aware of an issue that is currently preventing install / compilation:

read_data.cpp:1:10: fatal error: gsl/gsl_rng.h: No such file or directory

Hopefully this will be a quick fix?
Thanks!
-A

Issue with shiny graphing (running ./angsd-wrapper)

Hi guys,

I am here to report a problem I had when running ./angsd-wrapper shiny graphing. After doing the whole tutorial (which is very helpful, thanks for that!), I tried to run shiny graphing and got this:

~/Desktop/programs/angsd-wrapper$ ./angsd-wrapper shiny graphing
angsd-wrapper running from /home/carlos/Desktop/programs/angsd-wrapper

Loading required package: methods
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, units

Error: package or namespace load failed for ‘ape’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/carlos/Desktop/programs/angsd-wrapper/.RLibs/ape/libs/ape.so':
  libRlapack.so: cannot open shared object file: No such file or directory
Execution halted

However, this directory exists:

~/Desktop/programs/angsd-wrapper$ ls /home/carlos/Desktop/programs/angsd-wrapper/.RLibs/ape/libs
ape.so

Re-installing the 'ape' package in R did not change the output. I did it and it automatically installed where I have my R-3.4:

> install.packages('ape')
Installing package into ‘/home/carlos/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
...
* DONE (ape)

The downloaded source packages are in
	‘/tmp/RtmpgvgvS3/downloaded_packages’

I tried adding /home/carlos/R/x86_64-pc-linux-gnu-library/3.4 in my PATH variable of .bashrc and the result is still the same:

export PATH="/home/carlos/R/x86_64-pc-linux-gnu-library/3.4/:$PATH"
export LD_LIBRARY_PATH=/home/carlos/Desktop/programs/angsd-wrapper/.RLibs/ape/libs/ape.so:${LD_LIBRARY_PATH}

I saw a similar issue here: mojaveazure/angsd-wrapper#47 and I realized I had also installed anaconda in my PC. I tried to deactivate it (adding # to #export PATH="/home/carlos/miniconda3/condabin/:$PATH" in .bashrc), but I still had the same issue.

What could be wrong?

Best regards and thanks

Error when setup dependencies

Hi there,
I wanted to install angsd-wrapper locally. Unfortunately, the dependencies setup of angsd-wrapper was not easy as it was described on github (https://github.com/ANGSD-wrapper/angsd-wrapper). There was an error message as following when I ran “./angsd-wrapper setup dependencies”,

############################################################
abcHWE.cpp: In member function ‘void abcHWE::HWE_EM(double*, double*, int)’:
abcHWE.cpp:272:8: error: ‘isnan’ was not declared in this scope
if(isnan(newFreq1*0.5 + newFreq2)){
^~~~~
abcHWE.cpp:272:8: note: suggested alternative:
In file included from analysisFunction.h:6:0,
from abc.h:8,
from abcFreq.h:2,
from abcHWE.cpp:17:
/usr/include/c++/7/cmath:639:5: note: ‘std::isnan’
isnan(_Tp __x)
^~~~~
Makefile:47: recipe for target 'abcHWE.o' failed
make: *** [abcHWE.o] Error 1
############################################################
OS: Windows 10 subsystem Linux
Version: Linux version 4.4.0-17134-Microsoft ([email protected]) (gcc version 5.4.0 (GCC) )
Command used: just following the setup instruction on https://github.com/ANGSD-wrapper/angsd-wrapper.

Can somebody help me to solve the problem? Thanks!