mfumagalli / ngstools Goto Github PK

Programs to analyse NGS data for population genetics purposes

License: GNU General Public License v3.0

Makefile 1.00% R 90.93% Shell 8.07%

ngstools's Introduction

ngsTools

NGS (Next-Generation Sequencing) technologies have revolutionised population genetic research by enabling unparalleled data collection from the genomes or subsets of genomes from many individuals. Current technologies produce short fragments of sequenced DNA called reads that are either de novo assembled or mapped to a pre-existing reference genome. This leads to chromosomal positions being sequenced a variable number of times across the genome. This parameter is usually referred to as the sequencing depth. Individual genotypes are then inferred from the proportion of nucleotide bases covering each site after the reads have been aligned.

Low sequencing depth and high error rates stemming from base calling and mapping errors can cause SNP (Single Nucleotide Polymorphism) and genotype calling from NGS data to be associated with considerable statistical uncertainty. Probabilistic models, which take these errors into account, have been proposed to accurately assign genotypes and estimate allele frequencies (e.g. Nielsen et al. 2012; for a review Nielsen et al. 2011).

ngsTools is a collection of programs for population genetics analyses from NGS data, taking into account data statistical uncertainty. The methods implemented in these programs do not rely on SNP or genotype calling, and are particularly suitable for low sequencing depth data. An application note illustrating its application has published (Fumagalli et al. 2014).

NOTE: this repository is intended for general use as it groups together the latest stable version of each tool. Developers (and only them) may want to check each tool's main repository.

Packages

ANGSD - Software for analyzing next generation sequencing data taking genotype uncertainty into account by working with genotype probabilities (instead of called genotypes). This is especially useful for low and medium depth data (Korneliussen et al., 2014). NOTE: this program is NOT developed by ngsTools so, if you have any questions about it or encounter any errors/bugs, please check its wiki or contact its authors.
ngsSim - Simple sequencing read simulator that can generate data for multiple populations with variable levels of depth, error rates, genetic variability, and individual inbreeding (Kim et al., 2011). It generates mapped reads and the corresponding genotype likelihoods, avoiding mapping uncertainty and being extremely fast.
ngsF - This program provides a method to estimate individual inbreeding coefficients using an EM algorithm (Vieira et al., 2013). These can provide insight into a population's mating system or demographic history and, more importantly, they can be used as a prior in ANGSD.
ngsPopGen - Several tools to perform population genetic analyses from NGS data (Fumagalli et al., 2013, Fumagalli, 2013).
- ngsFst - Quantifying population genetic differentiation
- ngsCovar - Population structure via PCA (principal components analysis)
- ngs2dSFS - Estimate 2D-SFS from posterior probabilities of sample allele frequencies
- ngsStat - Estimate number of segregating sites, expected average heterozygosity and other nucleotide diversity indexes
ngsUtils - General tools to manipulate data produced by ngsTools.
- GetMergedGeno - Merge genotype posterior probabilities files
- GetSubGeno - Select a subset of genotype posterior probabilities files
- GetSubSim - Select a subset of simulated data files
- GetSwitchedGeno - Switch major/minor in genotype posterior probabilities files
ngsDist is a program that estimates genetic distances from genotype posterior probabilities (Vieira et al., 2016).
ngsF-HMM is a program developed and written by F.G. Vieira to estimate per-individual inbreeding tracts using a two-state Hidden Markov Model (Vieira et al. 2016). It uses a probabilistic framework that takes the uncertainty of genotype's assignation into account; making it specially suited for low-quality or low-coverage datasets. It is not officially part of ngsTools so it must be installed separately.
ngsLD is a program that calculates pairwise Linkage Disequilibrium (LD) under a probabilistic framework (Fox et al., 2019).
HMMploidy is a program written by S. Soraggi to infer ploidy levels from low-coverage sequencing data. It is not officially part of ngsTools so it must be installed separately.

Installation

ngsTools can be easily installed but some packages have some external dependencies:

Mandatory:
- gcc: >= 4.9.2 tested on Debian 7.8 (wheezy)
- zlib: v1.2.7 tested on Debian 7.8 (wheezy)
- gsl : v1.15 tested on Debian 7.8 (wheezy)
- libbz2.so, required by htslib
- liblzma.so, required by htslib
- libcurl.so, required by angsd
Optional (only needed for testing or auxilliary scripts):
- md5sum
- samtools
- Perl packages: Getopt::Long, Graph::Easy, Math::BigFloat, and IO::Zlib
- R packages: optparse, tools, ggplot2, reshape2, plyr, gtools, LDheatmap, ape, grid, methods, phangorn, and plot3D

If you have issues with gsl package, then on linux make sure you have these packages installed: gsl-bin libgsl-dbg libgsl-dev libgslcblas0.

To download ngsTools and its submodules use:

% git clone --recursive https://github.com/mfumagalli/ngsTools.git

If you prefer, although it is not recomended, you can download a zipped folder on the right side of this page ("Download ZIP").

To install these tools just run:

% cd ngsTools
% make

To run the tests (currently deprecated):

% make test

Executables are built into each tool directory in the repository. If you wish to clean all binaries and intermediate files:

% make clean

To get the latest version of ngsTools package:

% git pull
% git submodule update

NOTE: for developers only: if you wish to make changes and update the whole package:

# in the modified repo
# be sure to be on the master branch: git checkout master
% git commit -a -m 'Local changes...'
% git push
# in ngsTools main repo
% git commit -a -m 'Submodules updated'
% git push
# check that everything went well: 
% git status

To build the tool set into an Apptainer or SingularityCE container you can take the following definition file as a starting point:

Apptainer/SingularityCE container definition file

Bootstrap: docker
From: ubuntu:20.04

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update -y

    apt install -y git build-essential pkg-config
    apt install -y libz-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev libgsl-dev

    git clone --recursive https://github.com/mfumagalli/ngsTools.git
    cd ngsTools

    # this will build a specific commit (in this case commit 6505f80) #
    # alternatively you can build the latest commit by commenting out the next two lines
    git checkout 6505f80
    git submodule update --init --recursive

    make

%environment
    export LC_ALL=C

%runscript
    export PATH=/ngsTools/angsd:$PATH
    export PATH=/ngsTools/ngsDist:$PATH
    export PATH=/ngsTools/ngsF:$PATH
    export PATH=/ngsTools/ngsF-HMM:$PATH
    export PATH=/ngsTools/ngsLD:$PATH
    export PATH=/ngsTools/ngsPopGen:$PATH
    export PATH=/ngsTools/ngsSim:$PATH
    export PATH=/ngsTools/ngsUtils:$PATH

    $@

%labels
    Author Radovan Bast

%help
    You can build this container with:
    $ sudo singularity build container.sif container.def
    This is how I use this container image:
    $ ./container.sif ngsDist [other arguments]
    $ ./container.sif [some other NGS tool]

Input Files

All programs receive as input files produced by ANGSD. In general, these files can contain genotype likelihoods, genotype posterior probabilities, sample allele frequency posterior probabilities or an estimate of the SFS (Site Frequency Spectrum). Please refer to each tool's repository or the comprehensive Tutorial for more explanations and examples on how these tools work.

Tutorial

A tutorial on some analyses using ngsTools/ANGSD from BAM files can be found here. In this Tutorial, you will find how to filter your data, assess population structure and estimate summary statistics using these tools for low-depth data. For most cases, you will find all the information you need here.

Additional information

Authors

Matteo Fumagalli & Filipe G. Vieira. Other programmers and developers: Tyler Lynderoth, Rasmus Nielsen. Some lines of code have been and adapted from: Thorfinn Korneliussen, Anders Albrechtsen, Jacob Crawford, Dean Ousby, Martin Sykora, Leo Diaz.

Updates

If you want to be updated about new releases and fixed bugs please follow the github repository. For specific queries on the code, please use github features to raise issues. For informal questions (better on ngsTools and not ANGSD) feel free to contact Matteo Fumagalli.

Citation

ngsTools package can be cited as:

ngsTools: methods for population genetics analyses from next-generation sequencing data.
Fumagalli M, Vieira FG, Linderoth T, Nielsen R.
Bioinformatics. 2014 May 15;30(10):1486-7

ANGSD can be cited as:

ANGSD: Analysis of Next Generation Sequencing Data.
Korneliussen T, Albrechtsen A, Nielsen R
BMC Bioinformatics. 2014 Nov 25;15(1):356

SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data.
Nielsen R, Korneliussen T, Albrechtsen A, Li Y, Wang J
PLoS One. 2012;7(7):e37558

FST and PCA methods can be cited as:

Quantifying Population Genetic Differentiation from Next-Generation Sequencing Data.
Fumagalli M, Vieira FG, Korneliussen TS, Linderoth T, Huerta-Sánchez E, Albrechtsen A, Nielsen R
Genetics. 2013 Nov;195(3):979-92

Inbreeding estimation can be cited as:

Estimating inbreeding coefficients from NGS data: impact on genotype calling and allele frequency estimation.
Vieira FG, Fumagalli M, Albrechtsen A, Nielsen R
Genome Res. 2013 Nov;23(11):1852-61

Estimating IBD tracts from low coverage NGS data
Vieira FG, Albrechtsen A and Nielsen R
Bioinformatics. 2016; 32:2096-2102

Nucleotide diversity estimates from NGS data implemented here have been proposed in:

Sequencing of 50 human exomes reveals adaptation to high altitude.
Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZX, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, Zheng H, Liu T, He W, Li K, Luo R, Nie X, Wu H, Zhao M, Cao H, Zou J, Shan Y, Li S, Yang Q, Asan, Ni P, Tian G, Xu J, Liu X, Jiang T, Wu R, Zhou G, Tang M, Qin J, Wang T, Feng S, Li G, Huasang, Luosang J, Wang W, Chen F, Wang Y, Zheng X, Li Z, Bianba Z, Yang G, Wang X, Tang S, Gao G, Chen Y, Luo Z, Gusang L, Cao Z, Zhang Q, Ouyang W, Ren X, Liang H, Zheng H, Huang Y, Li J, Bolund L, Kristiansen K, Li Y, Zhang Y, Zhang X, Li R, Li S, Yang H, Nielsen R, Wang J, Wang J
Science. 2010 Jul 2;329(5987):75-8

Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data.
Korneliussen TS, Moltke I, Albrechtsen A, Nielsen R
BMC Bioinformatics. 2013 Oct 2;14(1):289

Assessing the effect of sequencing depth and sample size in population genetics inferences.
Fumagalli M
PLoS One. 2013 Nov 18;8(11):e79667

Estimation of genetic distances have been described here:

Improving the estimation of genetic distances from Next-Generation Sequencing data.
Vieira FG, Lassalle F, Korneliussen TS, Fumagalli M
Biological Journal of the Linnean Society. Special Issue: Collections-Based Research in the Genomic Era. 117(1):139–149

LD estimation has been published here:

Fox EA, Wright AE, Fumagalli M, Vieira FG. ngsLD: evaluating linkage disequilibrium using genotype likelihoods. Bioinformatics. 2019 Oct 1;35(19):3855-3856. doi: 10.1093/bioinformatics/btz200. PMID: 30903149.

HMMploidy can be cited as:

Soraggi, Samuele; Rhodes, Johanna; Altinkaya, Isin; Tarrant, Oliver; Balloux, Francois; Fisher, Matthew C; Fumagalli, Matteo. HMMploidy: inference of ploidy levels from short-read sequencing data. Peer Community Journal, Volume 2 (2022), article no. e60. doi : 10.24072/pcjournal.178.

ngstools's People

Contributors

Stargazers

Watchers

ngstools's Issues

visualization 2DSFS. plot2DSFS.R

hi
Thank you very much for your valuable sharing.
Recently, I had a problem using plot2DSFS.R.
The command I used was as follows,
Rscript plot2DSFS.R HNHZ.YZ.sfs HNHZ-YZ 27-28
But instead of a raster, I got a strange one, as follows

Can you tell me what causes this?
I am looking forward to your reply!

How to use plotSFS.R for SFS plot?

Dear mfumagalli,

Now, I have three population SFS files, and want to use plotSFS.R to plot a figure of SFS. However, I dont't know how to do.
I tried to use command R CMD BATCH plotSFS.R pop1.sfs pop2.sfs pop3.sfs , but with false output file named NA.

The formate of input file :
1501016844.282213 305197949.866378 94107367.693417 28784361.230037 12232819.391580 8848907.018584 4452457.953427 6221102.970627 2647028.357635 3060544.718254

Hope your answer.
Thank you very much!

Sun

How to interpret the output from plotQC.R

I want to begin with thanking you for the very helpful angsd tutorial and scripts, they have been so useful.

I have run your code plotQC.R successfully on my samples but I am struggling with how to interpret the pdf output and how I can use it to make decisions about the settings for the -setMinDepth and -setMaxDepth flags as that is not really explained in the tutorial.

I have looked far and wide but have not found good explanations behind how people are choosing values for these flags and the angsd documentation is not very helpful. There it says
-setMinDepth [int]: Discard site if total sequencing depth (all individuals added together) is below [int]. Requires -doCounts
-setMaxDepth [int]: Discard site if total sequencing depth (all individuals added together) is above [int] -doCounts
But gives no info on how to figure out what settings might make sense.

Choi et al 2019 say:
-setMinDepth to be one-third the average genome-wide coverage, and–setMaxDepth to be 2.5 times the average genome-wide coverage.

I am working with WGS ancient and modern with a mammal that has a good reference genome. My sample coverage is 3-7x and I am running angsd only on chr no x, y or scaffolds.

I realize this may not strictly be catagorized as an issue but thought I would try.
Example of plot output for chr 3 with 28 ancient samples

Conda recipies?

Hi,

I wanted to install the NGSutils in my conda environment, on a external cluster. I am afraid that the "make" process did not work then. Is there any way to install it using conda? Or are there alternative ways to install in external?

Thank you,

angsd repo link is broken

Hi,

The link to angsd is broken.

I used this repo : https://github.com/ANGSD/angsd

And processed as mentionned in the README.

git clone https://github.com/samtools/htslib.git;
git clone https://github.com/angsd/angsd.git;
cd htsdir;make;cd ../angsd;make HTSSRC=../htslib

It's seem good

ngsCovar error

Hi @mfumagalli, @fgvieira

I have run doGeno 32 for each chromosome with specified high quality sites (filtered by depth, SNP_pvalue, hwe_pvalue, SB3, and so on)
angsd -bam bam.list -only_proper_pairs 1 -uniqueOnly 1 -remove_bads 1 -minQ 20 -minMapQ 30 -C 50 -ref ref.fa -r $i -out $i -doMaf 1 -doMajorMinor 1 -doGeno 32 -doPost 1 -GL 1 -doCounts 1 -P 5 -doGlf 2 -sites $i.site
but when I do ngsCovar -probfile test.geno -outfile test.covar -nind $minindividual -nsites $sites -call 0 -norm 0 for one chromosome according to the tutorial, it just give me an error like following,
-> Possible error read GENO, binary file might be broken...
I do not know why because everything proceeded very smoothly except ngsCovar. I need your help.

Many thanks.
Best

Rewriting history

Howdy, and thanks for putting this on github, I'm a fan. Appears you've been rewriting commit history, which causes merge conflicts when I try to git pull to get newest changes to code. See http://git-scm.com/book/ch6-4.html .

Cheers,

Jeff

Make test(s) failure

Hi there!

I would appreciate some help with installing ngsTools. After cloning latests version, running make and make tests, I get the following (on two completely different servers, CentOS 7.4.1708 and Ubuntu 16.04.3 LTS):

make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsSim' ngsSim: test(s) failed! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsSim'
make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsPopGen' ngsPopGen: test(s) failed! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsPopGen'
make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsUtils' ngsUtils: test(s) failed! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsUtils'
make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsDist' ngsDist: test(s) failed! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsDist'
make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsF' ngsF: All tests OK! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsF'
make[1]: Entering directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsF-HMM' ngsF-HMM: test(s) failed! make[1]: Leaving directory /storage/rafal.gutaker/bin/INSTALLS/ngsTools/ngsF-HMM'

Any help would be much welcome!

Best,
Rafal

PCA analysis with unlinked sites

I was wondering why the tutorial is not making a PCA on pruned or unlinked sites in the TUTORIAL.md. It is mentioned in the inbreeding part, but not for PCA. How would this be performed?

TESTS FAILED

Hi. I installed the ngsTools on my ubuntu machine using the make command. I run the make test and some of the tests failed. Is there anything to be concerned about ? Below is the log

make[1]: Entering directory '/home/user/ngsTools/ngsSim'
ngsSim: All tests OK!
make[1]: Leaving directory '/home/user/ngsTools/ngsSim'
make[1]: Entering directory '/home/user/ngsTools/ngsPopGen'
ngsPopGen: All tests OK!
make[1]: Leaving directory '/home/user/ngsTools/ngsPopGen'
make[1]: Entering directory '/home/user/ngsTools/ngsUtils'
ngsUtils: All tests OK!
make[1]: Leaving directory '/home/user/ngsTools/ngsUtils'
make[1]: Entering directory '/home/user/ngsTools/ngsDist'
ngsDist: All tests OK!
make[1]: Leaving directory '/home/user/ngsTools/ngsDist'
make[1]: Entering directory '/home/user/ngsTools/ngsLD'
ngsLD: test(s) failed!
make[1]: Leaving directory '/home/user/ngsTools/ngsLD'
make[1]: Entering directory '/home/user/ngsTools/ngsF'
ngsF: All tests OK!
make[1]: Leaving directory '/home/user/ngsTools/ngsF'
make[1]: Entering directory '/home/user/ngsTools/ngsF-HMM'
ngsF-HMM: test(s) failed!
make[1]: Leaving directory '/home/user/ngsTools/ngsF-HMM'

sem_t not included in shared.h for some reason

Hello,

Trying to compile your software here on an Ubuntu 16.04 box and for some reason the sem_t type isn't recognized. It's a standard library so I am not sure what the problem is. My version of g++ is: g++ (Ubuntu 7.3.0-21ubuntu1~16.04) 7.3.0.

Do you have any idea what to do? Thank you!

g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ./shared/threadpool.c
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c parse_args.cpp
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ngsDist.cpp -lgsl -lgslcblas -lm -lz -lpthread
g++ -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE *.o -lgsl -lgslcblas -lm -lz -lpthread -o ngsDist
make[1]: Leaving directory '/usr/local/bin/ngsTools/ngsDist'
make[1]: Entering directory '/usr/local/bin/ngsTools/ngsLD'
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ./shared/gen_func.cpp
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ./shared/read_data.cpp
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ./shared/threadpool.c
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c parse_args.cpp
g++ -I./shared  -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ngsLD.cpp -lgsl -lgslcblas -lm -lz -lpthread
g++ -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE *.o -lgsl -lgslcblas -lm -lz -lpthread -o ngsLD
make[1]: Leaving directory '/usr/local/bin/ngsTools/ngsLD'
make[1]: Entering directory '/usr/local/bin/ngsTools/ngsF'
g++ -O3 -Wall -I -I/usr/local/bin/ngsTools/htslib  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE  -c parse_args.cpp
In file included from parse_args.cpp:3:0:
shared.h:47:2: error: ‘sem_t’ does not name a type; did you mean ‘key_t’?
  sem_t _launch_thread_semaph;
  ^~~~~
  key_t
shared.h:48:2: error: ‘sem_t’ does not name a type; did you mean ‘key_t’?
  sem_t _running_thread_semaph;
  ^~~~~
  key_t
Makefile:28: recipe for target 'parse_args.o' failed
make[1]: *** [parse_args.o] Error 1
make[1]: Leaving directory '/usr/local/bin/ngsTools/ngsF'
Makefile:11: recipe for target 'ngsF' failed
make: *** [ngsF] Error 2

covar vs. cover

In:

Rscript scripts/plotPCA.R -i test.cover -c 1-2 -a test.pops.clst -o test.pca.pdf

test.cover should be test.covar to be consistent with previous tutorial syntax.

running ngsTools on Plasmodium falciparum data

I would like to know if its possible to run ngsTools on P. falciparum parasite ILLUMINA data. Thanks

ngsTools for polyploid

Hi, is it possible to use ngsTools for simulation of sequence data in polyploid?
Thanks

segfault with some -offset / -nsites combinations

Dear Matteo,

I am getting a segmentation fault when running ngsCovar for data subsets using -offset and -nsites options - I suspect this could be a bug.

I am running the following command with the latest github version of ngsCovar and get the following verbose output (args and dumping file lines removed):

$ ngsCovar -probfile file.geno -nind 58 -verbose 1 -offset 400000 -nsites 193 -norm 0 -call 0 -sfsfile file.saf -outfile tmp1
Block 0 out of 0 from 399999 to 400191
Getting esti...: 193 174 , 0.969347 0.059351
Getting mu...: 193, 0.286054 0.115822
Updating covar......weighting...
Got sfs: 193 117, e.g. 0.000000 0.000000
Getting pvar...: 193 , 1.000000 1.000000
Updating covar...Segmentation fault

I get the same when not using -sfsfile:

$ ngsCovar -probfile file.geno -nind 58 -verbose 1 -offset 148028 -nsites 327 -norm 0 -call 0 -outfile tmp1
Block 0 out of 0 from 148027 to 148353
Getting esti...: 327 174 , 0.984398 0.015602
Getting mu...: 327, 0.071430 0.074497
Updating covar...
I am using this minmaf 0.000000 ...no weighting...Segmentation fault

While for some other combinations of -offset and -nsites, it works:

$ ngsCovar -probfile file.geno -nind 58 -verbose 1 -offset 148027 -nsites 329 -norm 0 -call 0 -sfsfile file.saf -outfile tmp1**
Block 0 out of 0 from 148026 to 148354
Getting esti...: 329 174 , 0.000000 0.015602
Getting mu...: 329, 0.169526 0.048822
Updating covar......weighting...
Got sfs: 329 117, e.g. 0.000000 0.186886
Getting pvar...: 329 , 1.000000 0.999999
Updating covar...
Updating esite...: 309.205395 Sum:4383384258618.480957
(Exp/eff) nr sites: 309.205395

I use geno/saf files created with angsd 0.911 (newest github version) from a BEAGLE input file containing around 7M SNPs, so the geno and saf files (both binary) should contain information on 7M sites. I observed that the subsetting worked fine for low numbers supplied to -offset, but that with high numbers (= high index of starting site), the segfault happens. To test this, I ran the same data subset once using the full 7M SNP geno/saf files (-> segfault) and once using geno/saf files only containing the data subset of interest (-> worked nicely). The index number (1 vs. 27000000) supplied to -offset was thus the only difference.

Thank you very much for any ideas or input on what I could be wrong or whether this could be a bug.
Best,
David

COMPILATION ERROR

Hi , I tried installing ngsTools and I got this error. Please advice

No package 'gsl' found
make[1]: Entering directory `/home/ha/apps/bioinfo/ngsTools/ngsTools/ngsDist'
g++ -I./shared -O3 -Wall -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -c ./shared/gen_func.cpp
In file included from ./shared/gen_func.cpp:1:0:
./shared/gen_func.hpp:12:25: fatal error: gsl/gsl_rng.h: No such file or directory
#include <gsl/gsl_rng.h>

But I have gsl installed ( in a non-default linux directory)

error while running plotPCA.R

Dear Dr Fumagalli,

I'm trying to use your plotPCA.R script through:
Rscript --vanilla --slave pca.R -i ${in}.covMat -c 1-2 -a ${bamclst}.clst -o ${out}.pdf

with a .covMat file I generated from the following angsd command:

angsd -bam ${bamlist} -out ${out} -nThreads 8 \
      -minMapQ 20 -minQ 20 -minInd ${minInd} -doCounts 1 \
      -GL 1 -doMajorMinor 1 -doMaf 1 -minMaf 0.05 -skipTriallelic 1 \
      -SNP_pval 1e-6 -doHWE 1 -minHWEpval 0.05 -doPost 2 -postCutoff 0.95 \
      -doGeno 4 -geno_minDepth 5 -doGlf 2 -doIBS 1 -doCov 1

But I get the following error message :

Error in eigen(covar, symm = TRUE) : infinite or missing values in 'x'
Execution halted

I don't understand the issue because I don't have any missing data in my .covMat file.
Do you know what this message means?

Best regards,
Manon

Floating point exception problem in ngsStat

Hello,

I am running nsgStat in different populations. No matter which population I used, when I try to get the stats by doing:

/opt/ngsTools/ngsPopGen/ngsStat -npop 1 -postfiles BIA.unfolded.ref.postprob.saf -outfile BIA.stats.txt -nind 8

I got the message: "floating point exception". Output file is generated but empty. The .saf file "seems" normal when I print it. I don't really know what could be causing the problem, so sorry for not giving any hint. Any idea?

I would truly appreciate any input,
Best,
María.

cannot convert vcf file to beagle binary format using the -vcf-gl option

Hello,

I did SNP calling with GATK (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs) and as a result I have a final with all snps after applying several filters.

Now I would like to use this matrix of snps to study population structure (do a PCA with ngsCovar) and individual admixture proportions (with NGSadmix). It is my understanding that in order to do this I need to convert my vcf file into beagle binary format, which I could do in angsd using the -vcf-gl option.

I am able to run the command (see below) without getting errors but it seems it is not generating the correct output file.

ANGSD=/path to angsd

REF=genome.fasta.fai

$ANGSD -vcf-gl ./snps.vcf.gz -GL 0 -out ./snps.beagle -fai $REF -doGlf 2 -doMajorMinor 1 -doMaf 1 -nind 43

This is what I see when I check the run log file:

BAD SITE chr24:21289418. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289424. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289425. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289431. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289433. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289470. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289475. return code:-1 while fetching GL tag rec->rid:23
BAD SITE chr24:21289525. return code:-1 while fetching GL tag rec->rid:23
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"./snps.beagle.arg"
->"./snps.beagle.beagle.gz"
->"./snps.beagle.mafs.gz"
-> Mon May 15 15:43:02 2023
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 0
-> Number of sites retained after filtering: 0
[ALL done] cpu-time used = 134.35 sec
[ALL done] walltime used = 135.00 sec

I'm not really sure what the problem is and how to fix it. Any advice on this?

Thank you!!

Possible error reading SFS, binary file might be broken...

Hello,

I am trying to use ngsStat to estimate Dxy between my populations.

To achieve this I am following the ngsTools tutorial.

First, I produce saf files with ANGSD using an ancestral outgroup to unfold the spectrum.

Then i use realSFS to find the intersection between the two populations

realSFS print 5_angsd/5_2_${pop1}_unfolded.saf.idx 5_angsd/5_2_${pop2}_unfolded.saf.idx | cut -f 1-2 > 5_angsd/5_7_dxy/${pop1}vs${pop2}_intersect.txt

angsd sites index 5_angsd/5_7_dxy/${pop1}vs${pop2}_intersect.txt

Then I use angsd to produce the saf files for each population only for the overlapping sites

angsd -bam 5_angsd/5_0_popfiles/${pop}.bamlist \
	-ref ${REF} -anc ${ANC} -out 5_angsd/5_7_dxy/5_2_${pop1}vs${pop2}_${pop} \
	-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
	-minMapQ 20 -minQ 20 -minInd 3 -setMinDepth 6 -setMaxDepth ${MAXDEPTH} \
	-GL 1 -doSaf 1 -doCounts 1 -P 6 -sites 5_angsd/5_7_dxy/${pop1}vs${pop2}_intersect.txt

In this example, pop1 has 4 individuals and pop2 has 6. Both diploid.

The number of overlapping sites is 935,523,322

When I try and run ngsStats as follows (after doing zcat ...saf.gz > ...saf

ngsStat -npop 2 -postfiles 5_2_LMvsLNs_LM.saf 5_2_LMvsLNs_LNs.saf -nsites $NSITES -nind 8 12 -outfile test_LMvsLNs.txt

I get the error: Possible error reading SFS, binary file might be broken...

It is unclear what I have to put for nind, so i tried both 8 12 and 4 6.

I saw closed issue #5 in the ngsPopGen repository, but it did not help. I can provide the gzipped saf files if that can help with troubleshooting.

Cheers,
LVB

How to apply Plink format in ngsTools (ped/map data as input)

Dear all,

I use angsd and ngsTools to do PCA analysis and I found the PCA analysis (by angsd) are not similar with my previous study (by Plink). I tried lots of parameters in angsd and found the results are highly depended on the angsd parameters which are mainly about quality control. So I am wondering how I could apply plink output (ped/map data) to ngsTools as input?

Best,
Yafei

How does ANGSD handle some samples without coverage?

If we have a region in which some samples do not have coverage (e.g. due to presence/absence variation), how will ANGSD handle this? Will it calculate diversity given the samples it has coverage for? Or omit the region entirely?

Tutorial command not running

Hello @mfumagalli . I tried following the tutorial given for ngsTools and it could not run. Below is the command I used

$ANGSD -P 36 -b ALL.bamlist -ref $REF -out Results/ALL.qc -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 -minMapQ 20 -minQ 20 -minInd 15 -setMinDepth 60 -setMaxDepth 400 -doCounts 1 -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 1e-3 -doGeno 32 -doPost 1 &> /dev/null

Below is the message that was displayed
_-> angsd version: 0.930 (htslib: 1.9) build(Jan 3 2021 13:10:41)
-> angsd version: 0.930 (htslib: 1.9) build(Jan 3 2021 13:10:41)
-> Please use the website "http://www.popgen.dk/angsd" as reference
-> Use -nThreads or -P for number of threads allocated to the program
Overview of methods:_Below is the message that was displayed
-GL Estimate genotype likelihoods
-doCounts Calculate various counts statistics
-doAsso Perform association study
-doMaf Estimate allele frequencies
-doError Estimate the type specific error rates
-doAncError Estimate the errorrate based on perfect fastas
-HWE_pval Est inbreedning per site or use as filter
-doGeno Call genotypes
-doFasta Generate a fasta for a BAM file
-doAbbababa Perform an ABBA-BABA test
-sites Analyse specific sites (can force major/minor)
-doSaf Estimate the SFS and/or neutrality tests genotype calling
-doHetPlas Estimate hetplasmy by calculating a pooled haploid frequency

Below are options that can be usefull
-bam		Options relating to bam reading
-doMajorMinor	Infer the major/minor using different approaches
-ref/-anc	Read reference or ancestral genome
-doSNPstat	Calculate various SNPstat
-cigstat	Printout CIGAR stat across readlength
many others

Output files:
In general the specific analysis outputs specific files, but we support basic bcf output
-doBcf Wrapper around -dopost -domajorminor -dofreq -gl -dovcf docounts
For information of specific options type:
./angsd METHODNAME eg
./angsd -GL
./angsd -doMaf
./angsd -doAsso etc
./angsd sites for information about indexing -sites files
Examples:
Estimate MAF for bam files in 'list'
'./angsd -bam list -GL 2 -doMaf 2 -out RES -doMajorMinor 1'__

How to apply inbreeding coeffs (F) in Angsd?

Hi, if i use ngsF to calculate F for individuals in my population, where do I use it as a prior in Angsd? There seems to be no documentation on this.

cheers
DK

folded SFS

Not an issue. I was just wondering how aspects of the tutorial might change when dealing with a system in which an ancestral sequence is lacking? Could one just at -fold 1 to most commands?

ERROR DOWNLOADING DATA SET FOR THE TUTORIAL

I tried downloading the data set for the tutorial and I got some error below is the two lines of the error messages. Please advice

mkdir: cannot create directory ‘Data/LWK.BAMs’: No such file or directory
LWK
/NA19313.mapped.ILLUMINA.bwa.LWK.low_coverage.20120522.bam
./data.sh: line 48: Data/LWK.BAMs/NA19313.mapped.ILLUMINA.bwa.LWK.low_coverage.20120522.bam: No such file or directory
/NA19331.mapped.ILLUMINA.bwa.LWK.low_coverage.20120522.bam

I realized that the error was coming from the line below. please advice
$SAMTOOLS view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/$i $CHROM:$START-$END > Data/$POP.BAMs$NAME 2> /dev/null

Is there interest in adding container definition to documentation?

I am currently helping somebody to build the code and we solved some build issues by building the tool set into a Singularity/Apptainer container:

Bootstrap: docker
From: ubuntu:20.04

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update -y

    apt install -y git build-essential pkg-config
    apt install -y libz-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev libgsl-dev

    git clone --recursive https://github.com/mfumagalli/ngsTools.git
    cd ngsTools
    git checkout 6505f80
    git submodule update --init --recursive
    make

%environment
    export LC_ALL=C

%runscript
    export PATH=/ngsTools/angsd:$PATH
    export PATH=/ngsTools/ngsDist:$PATH
    export PATH=/ngsTools/ngsF:$PATH
    export PATH=/ngsTools/ngsF-HMM:$PATH
    export PATH=/ngsTools/ngsLD:$PATH
    export PATH=/ngsTools/ngsPopGen:$PATH
    export PATH=/ngsTools/ngsSim:$PATH
    export PATH=/ngsTools/ngsUtils:$PATH

    $@

%labels
    Author Radovan Bast

%help
    This is how I use this container image:
    $ ./container.sif ngsDist [other arguments]
    $ ./container.sif [some other NGS tool]

If there is interest, I could submit it to the README or somewhere else.

never ending ngsF

Hello,
I was rellay happy to see ngstool working in combnation wit ANGSD. I made my first steps with ANGSt recently , and tested ngsF. I am working n yeast and I was interested in estimating selfing rates for each strain: as well for life cycle than from genotyping.
I have downloaded ngstools from Github and compiled it on my Mac OSX 10.9.5, with some error messages but when launching the programs , it apparently works.

1- I managed to launch ngsF with my data, , and it looks like it is calculating but it does not stop after one night with only 123000 variant positions . I thought that this might come from a too big dataset, so that I filtered the SNPs for calling and min quality which turned the data set down to 53681 snps and still I do not see any change. Last I added the “quick “ or "-fast_lkl" option and it does the same.
here are the command lines and outputs

2- I looked back to install data which indicate in the case of ngsF make and then perform 'make test': this one answer test(s) failed! .
However I launched the test scripts after simulating the data with ngsSim and it started also a run … but apparently never ending….

It would be nice of you if you can tell me where the problem might come from!.
Best regards.
JL

zcat 20vintest.glf.gz | ./ngsF -n_ind 20 -n_sites 53681 -glf - -min_epsilon 0.1 -out 20vin.ttest3 -seed 1239 -init_values r -fast_lkl
==> Input Arguments:
glf file: -
init_values: r
out file: 20vin.ttest3
n_ind: 20
n_sites: 53681
chunk_size: 100000
fast_lkl: true
approx_EM: false
call_geno: false
max_iters: 1500
min_epsilon: 0.1000000000
n_threads: 1
seed: 1239
quick: false
version: 1.0.0
verbose: 1

==> Fewer sites (53681) than chunk_size (100000). Reducing chunk size to match number of sites
==> Analysis will be run in 1 chunk(s)
==> Using native I/O library
==> Setting initial values

Iteration 1:
......

realSFS on multiple regions

Can I use realSFS specifying more than one region? if the answer is yes, how?

I am looking for the equivalent to -rf in ANGSD in NGSTools

The idea would be:

Estimate the SFS around SEVERAL REGIONS:
-> ./realSFS afile.saf.idx -r $REGION_FILE

cat $REGION_FILE

chr1:1-10000
chr2:135000000-140000000
chr3:4880-194598

Thetas on a single individual BAM

After running a set of BAM files from multiple individuals through ANGSD, I'd like to take a closer look at summary statistics calculated by ANGSD for a single (inbred) individual BAM to assess some possible earlier technical issues with our mapping steps (I work with plants, so this is quite tricky). I know this is outside of what ANGSD is designed to do, but I'd like to keep our workflow the same as to avoid adding other complexities. Is this possible, or is more than one sample needed with ANGSD?

Installation of NGStools when ANGSD already installed

Hello,
Using the 'git clone' option listed, we downloaded and attempted to install NGStools on our system (Linux, RHEL 6 x86_64), but the make seem to bail when it hit ANGSD, which we already have installed. Can you direct how to install when ANGSD (and its related programs like RealSFS, etc.) are already on a system? I've pasted the error message below in case it is helpful. thanks in advance!

g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcTemplate.cpp >abcTemplate.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWriteFasta.cpp
g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWriteFasta.cpp >abcWriteFasta.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWritePlink.cpp
g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWritePlink.cpp >abcWritePlink.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWriteVcf.cpp
g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib abcWriteVcf.cpp >abcWriteVcf.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib analysisFunction.cpp
g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib analysisFunction.cpp >analysisFunction.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib angsd.cpp
g++ -MM -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib angsd.cpp >angsd.d
g++ -c -O3 -D__WITH_POOL__ -I/usr/local/work/workspace/ngsTools/htslib bam_likes.cpp
bam_likes.cpp:271: error: expected constructor, destructor, or type conversion before ‘’ token
bam_likes.cpp: In function ‘void bam_likes_init()’:
bam_likes.cpp:275: error: ‘mod’ was not declared in this scope
bam_likes.cpp:275: error: ‘errmod_init’ was not declared in this scope
bam_likes.cpp: In function ‘void bam_likes_destroy()’:
bam_likes.cpp:280: error: ‘mod’ was not declared in this scope
bam_likes.cpp:281: error: ‘errmod_destroy’ was not declared in this scope
bam_likes.cpp: In function ‘void call_bam(chunkyT, double*, int)’:
bam_likes.cpp:288: error: ‘mod’ was not declared in this scope
bam_likes.cpp:292: error: ‘mod’ was not declared in this scope
bam_likes.cpp:343: error: ‘errmod_cal’ was not declared in this scope
make[1]: ** [bam_likes.o] Error 1
make[1]: Leaving directory `/usr/local/work/workspace/ngsTools/angsd'
make: *** [angsd] Error 2

how to prepare sfs input file?

I used below code to run, it printed a error "Possible error,binaryfiles might be broken"
/home/lvfh/Tools-Reseq/ngsTools/ngsPopGen/ngsCovar -probfile test.pops.geno -outfile test.covar -nind 10 -nsites 10000 -block_size 2000 -call 0 -sfsfile test.sfs.ml.norm
So, i used an another code "/home/lvfh/Tools-Reseq/ngsTools/ngsPopGen/ngsCovar -probfile test.pops.geno -outfile test.covar -nind 10 -nsites 10000 -block_size 2000 -call 0". i got a right results.
I used the following commands to output .sfs files :
"angsd -b bam.filelist -anc chimpHg19.fa -out test -P 5 -r 1: -GL 1 -doSaf 1" "realSFS test.saf.idx -maxIter 100 -P 5 > test.sfs" .

cannot estimate fst due problem with size of dimension of 2d SFS

Hi there,

I've been following this tutorial (which is great btw!) without any problem until now :(
For some reason I am not able to estimate fst.

I have what I believe are 2 different populations (the whole point of doing these analyses is to actually see if they are different or not): pop1 (n = 10) and pop2 (n =10)

So far, this is what I've done:

1 -> calculate SAF:
(I'm using my reference genome as the ancestral sequence to polarise the alleles)
FILTERS="-minMapQ 30 -minQ 20 -uniqueOnly 1 -remove_bads 1 -baq 1 -only_proper_pairs 1 -trim 0 -C 50 -maxDepth 500 -minInd 10 -setMinDepth 20 -setMaxDepth 100"
$ANGSD -b pop1.bam.list -anc $ANC -ref $REF $FILTERS -GL 1 -doCounts 1 -doSaf 1 -out pop1
$ANGSD -b pop2.bam.list -anc $ANC -ref $REF $FILTERS -GL 1 -doCounts 1 -doSaf 1 -out pop2

2 -> calculate SFS:
$realSFS -bootstrap 100 pop1.saf.idx > pop1.sfs
$realSFS -bootstrap 100 pop2.saf.idx > pop2.sfs

Then I checked the number of values by doing:
awk -F' ' '{print NF; exit}' pop1.sfs
awk -F' ' '{print NF; exit}' pop2.sfs
awk -F' ' '{print NF; exit}' pop1.pop2.sfs

and I get 21 (2N+1), 21, and 441 respectively

3 -> calculate 2d SFS:
$realSFS -bootstrap 100 pop1.saf.idx pop2.saf.idx > pop1.pop2.sfs

now I'm trying to calculate the fst by doing:
$realSFS fst index pop1.saf.idx pop2.saf.idx -sfs pop1.pop2.sfs -fstout pop1.pop2
$realSFS fst stats pop1.po2.fst.idx

but I get this error:
-> args: tole:0.000000 nthreads:4 maxiter:100 nsites(block):0 start:pop1.pop2.sfs chr:(null) start:-1 stop:-1 fstout:pop1.pop2 oldout:0 seed:-1 bootstrap:0 resample_chr:0 whichFst:0 fold:0 ref:(null) anc:(null)
-> nSites: 100000
-> IMPORTANT: please make sure that your saf files haven't been folded with -fold 1 in -doSaf in angsd
-> [reynoldFst] sfs1:20 sfs2:20 dimspace:441
-> generating offset remapper lookup
-> isSame:1 adjusting foldfactors
-> Reading: pop1.pop2.sfs assuming counts (will normalize to probs internally)
-> Pxroblem with size of dimension of prior 441 vs 44100

I can't figure out what I've done wrong... Any ideas?

Thank you!

Marcela

mfumagalli / ngstools Goto Github PK

ngstools's Introduction

ngsTools

Packages

Installation

Input Files

Tutorial

Additional information

ngstools's People

Contributors

Stargazers

Watchers

Forkers

ngstools's Issues

Recommend Projects

Recommend Topics

Recommend Org