AIMsetfinder

Peter Pfaffelhuber, Franziska Grundner-Culemann, Veronika Lipphardt, Franz Baumdicker

Overview:

AIMsetfinder is a collection of R scripts that identify sets of Ancestry Informative Markers (AIMs) which minimize the log-loss error of a naive Bayes classifier.

It takes as input:

  • SNP data (e.g. 1000 Genomes SNP data or user's own data in vcf.gz format)
  • biogeographic information (or alternatively any discrete phenotype)

to select a set of optimal AIMs, of a specified size, for classifying the samples.

The output is:

  • a vcf.gz file with the selected AIMs
  • a list of SNP identifiers

which can be used in ancestry inference methods.

  • Furthermore, the posterior probabilities of a naive Bayes classifier based on these AIMs are given for the input data.
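The idea of choosing AIMs by the log loss of a naive Bayes classifier can be illustrated with a small, self-contained Python sketch. This is not the repository's R implementation. It assumes genotypes coded as alternate-allele counts (0, 1, 2), a binomial likelihood per SNP, a uniform prior over classes, a pseudo-count of 0.5, and a greedy forward search. All of these choices, the function names, and the toy data are hypothetical.

```python
import math


def allele_freqs(genotypes, labels, classes, pseudo=0.5):
    """Per-class alternate-allele frequency for each SNP (pseudo-count smoothed)."""
    n_snps = len(genotypes[0])
    freqs = {}
    for c in classes:
        rows = [g for g, lab in zip(genotypes, labels) if lab == c]
        freqs[c] = [
            (sum(r[j] for r in rows) + pseudo) / (2 * len(rows) + 2 * pseudo)
            for j in range(n_snps)
        ]
    return freqs


def posteriors(genotype, freqs, snps):
    """Naive Bayes posterior over classes, using only the SNPs listed in `snps`."""
    log_post = {}
    for c, p in freqs.items():
        # Binomial(2, p) log-likelihood, summed over SNPs (independence assumption).
        log_post[c] = sum(
            math.log(math.comb(2, genotype[j]))
            + genotype[j] * math.log(p[j])
            + (2 - genotype[j]) * math.log(1 - p[j])
            for j in snps
        )
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {c: math.exp(v - top) / norm for c, v in log_post.items()}


def log_loss(genotypes, labels, freqs, snps):
    """Mean negative log posterior probability of the true class."""
    total = sum(
        math.log(posteriors(g, freqs, snps)[lab]) for g, lab in zip(genotypes, labels)
    )
    return -total / len(labels)


def greedy_aims(genotypes, labels, k):
    """Greedily add the SNP that lowers the log loss most, k times."""
    classes = sorted(set(labels))
    freqs = allele_freqs(genotypes, labels, classes)
    chosen, remaining = [], set(range(len(genotypes[0])))
    for _ in range(k):
        best = min(
            sorted(remaining),
            key=lambda j: log_loss(genotypes, labels, freqs, chosen + [j]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data: SNP 0 differs between the two classes, SNPs 1 and 2 do not.
toy_genotypes = [[0, 1, 0], [0, 0, 1], [1, 1, 1], [2, 1, 0], [2, 0, 1], [1, 1, 1]]
toy_labels = ["A", "A", "A", "B", "B", "B"]
selected = greedy_aims(toy_genotypes, toy_labels, 1)
print(selected)
```

On the toy data only the first SNP carries information about the class, so it is the one the greedy search picks. Real data would additionally call for a held-out set when evaluating the log loss.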

Quick start

git clone https://github.com/fbaumdicker/AIMsetfinder.git
cd AIMsetfinder

Install dependencies and then run the test: Rscript pipeline_example.r

Installing dependencies

For data analysis as well as for our simulation studies, we rely on R scripts.

dependencies

  • For multicore computing, we require the R package parallel.
  • Since both the data from the 1000 Genomes Project analysed here and the coalescent simulations come in VCF format, we require the R package vcfR.
  • For some steps in the analysis, we use vcftools (Danecek et al. 2011) and bcftools (Li 2011); both can be installed using
sudo apt-get install vcftools bcftools

data resources (optional)

In addition, data from the 1000 Genomes Project (phase 3) and information on the sampling locations were downloaded. For this, we used

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.*
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

The latter file was renamed 1000G_SampleListWithLocations.txt, and the first row (the header) was removed.
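The renaming and header removal described above could be done in one step with a few lines of Python. This is only a sketch; the helper name strip_header is hypothetical, and it writes a new file while leaving the downloaded file untouched.

```python
from pathlib import Path


def strip_header(src, dst):
    """Copy `src` to `dst` without its first line; return the number of lines written."""
    lines = Path(src).read_text().splitlines(keepends=True)
    body = lines[1:]  # drop the header row
    Path(dst).write_text("".join(body))
    return len(body)


# Usage with the file names from this README (run after the download):
# strip_header("integrated_call_samples_v3.20130502.ALL.panel",
#              "1000G_SampleListWithLocations.txt")
```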

for simulations (optional)

The simulation studies are performed using msprime, a fast coalescent simulator that is most easily installed via pip3 install msprime. In particular, structured populations (with varying population sizes, etc.) can be simulated. We use the Python interface of msprime; see the msprime documentation for more information.

How to run

To run the test set: Rscript pipeline_example.r

In data/sim/ooa/, you will find the file ooa_chromosome_1_example.vcf.gz, a small set of 240 simulated individuals that is used in this tutorial.

Directory structure and analysis output

The analysis generates the following files:

./AIMs                list of identifiers of the chosen AIMs
./AIMs.vcf.gz         corresponding states for all individuals
./predictions.csv     table of posterior probabilities for all classes/BGAs as predicted by naive Bayes using AIMs
./classifications.tab classification into the class with the largest probability as in predictions.csv

The step in which each file is produced is described in more detail in readme.pdf.
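The relationship between predictions.csv and classifications.tab (each sample is assigned the class with the largest posterior probability) can be sketched as follows. The column layout assumed here, a header row with a sample identifier in the first column followed by one column per class, is an assumption and may not match the actual file; the sample names and values are made up.

```python
import csv
import io


def classify(predictions_csv):
    """Map each sample ID to the class with the largest posterior probability."""
    reader = csv.reader(predictions_csv)
    header = next(reader)
    classes = header[1:]  # assumed: first column holds the sample ID
    calls = {}
    for row in reader:
        if not row:
            continue
        probs = [float(x) for x in row[1:]]
        best = max(range(len(probs)), key=probs.__getitem__)
        calls[row[0]] = classes[best]
    return calls


# Hypothetical table in the assumed layout.
example = io.StringIO(
    "sample,AFR,EUR,EAS\n"
    "ind1,0.90,0.07,0.03\n"
    "ind2,0.10,0.25,0.65\n"
)
calls = classify(example)
print(calls)
```

For a real run, the in-memory example would be replaced by an open file handle, for instance open("predictions.csv", newline="").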

aimsetfinder's Issues

`getData` method of `tools.r` fails when the VCF has only a single record

For an input VCF containing a single variant and multiple samples, the getData method in tools.r fails with the error message below. I encounter this problem when trying to carry out step5 of the pipeline_1000G_AIMs_noAMR.r script applied to a reduced set of variants.

Error in rownames<-(x, value) :
  attempt to set 'rownames' on an object with no dimensions
Calls: getData -> row.names<- -> row.names<-.default -> rownames<-
Execution halted

Wrong file name in README.md

In the How to run section of the README you have example_pipeline.r where it should say pipeline_example.r. There are also a couple of LaTeX bits in the README that could do with tidying up.

Error in msprime Out-of-Africa model

We have recently learned that there was an error in the description of the Gutenkunst et al. model provided as an example in the msprime tutorial. It appears that you are using a copy of the incorrect model in this repo, so I am opening this issue to alert you.

Please see here for details on what the error is, and what actions you can take to fix it.

We have also written a short note analysing this and another related error, detailing the likely effects on downstream analysis. Thankfully, the differences between the misspecified model from msprime's documentation and the intended model are slight.

I apologise for this error and I sincerely hope that it has not affected your research.
