Pannagram

Pannagram is a package for constructing pan-genome alignments, analyzing structural variants, and translating annotations between genomes. Additionally, Pannagram contains useful functions for visualization. The manual is available in the examples folder.

Recreating working environment

Linux users

Make sure you have Conda or Mamba installed. To create and activate the package environment run:

conda env create -f environment.yml
conda activate pannagram
# OR
mamba env create -f environment.yml
mamba activate pannagram

The environment installs the required version of the R interpreter and all needed libraries, including BLAST, MAFFT, and others.
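
After activating the environment, a quick sanity check can confirm that the main tools are on the PATH. This is only a sketch: the function name `check_tools` is hypothetical, and the executable names (`Rscript`, `blastn`, `mafft`) are assumptions based on the tools named above, not taken from environment.yml.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: <name>" line per absent tool; returns non-zero if any is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Tool names are assumptions; adjust the list to what the environment provides.
check_tools Rscript blastn mafft || echo "Some tools were not found; is the environment active?"
```

If nothing is printed, all listed commands were found.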

MacOS users

macOS users should also run:

brew install coreutils

to make sure all the needed shell commands are installed.

Windows users

Windows users can try running the code from this repository under WSL, since Bash and the / path separator are used extensively in the code. However, it has never been tested in such an environment, so it is not guaranteed to work.

1. Pangenome linear alignment

1.1 Building the alignment

Pangenome alignment can be built in two modes:

  • reference-free:
./pannagram.sh -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8
  • reference-based:
./pannagram.sh -ref '<reference genome name>' \
    -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8
  • quick look: If no information on the genomes and their corresponding chromosomes is available, one can run the preparation steps:
./pannagram.sh -ref '<reference genome name>' \
    -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8 -pre

An extended description of the parameters for all three scripts is available by executing the scripts with the flag -help.
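
The reference-free and reference-based modes differ only in whether `-ref` is passed, so a small wrapper can choose between them. The following is a sketch, not part of Pannagram: the function name `run_pannagram` and the `PANNAGRAM` variable are hypothetical, and only the flags shown above are used.

```shell
# Hypothetical wrapper: run_pannagram <path_in> <path_out> <cores> [reference]
# With a fourth argument the reference-based mode is used, otherwise reference-free.
# PANNAGRAM holds the command to execute and can be overridden for a dry run.
PANNAGRAM="${PANNAGRAM:-./pannagram.sh}"

run_pannagram() {
    path_in=$1
    path_out=$2
    cores=$3
    ref=${4:-}

    if [ -n "$ref" ]; then
        "$PANNAGRAM" -ref "$ref" -path_in "$path_in" -path_out "$path_out" -cores "$cores"
    else
        "$PANNAGRAM" -path_in "$path_in" -path_out "$path_out" -cores "$cores"
    fi
}
```

Setting `PANNAGRAM=echo` before calling the function prints the arguments instead of starting an alignment, which is a cheap way to check the paths first.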

1.2 Extract information from the pangenome alignment

Synteny blocks, SNPs, and sequence consensus (for the IGV browser) can be extracted from the alignment:

# -blocks: find synteny block information for visualisation
# -seq:    create the consensus sequence of the pangenome
# -snp:    SNP calling
./analys.sh -path_msa '<output path with consensus>' \
      -path_chr '<path with chromosomes>' \
      -blocks \
      -seq \
      -snp

1.3 Calling structural variants

When the pangenome linear alignment is built, SVs can be called using the following script:

# -sv_call:  create output .gff and .fasta files with SVs
# -sv_sim:   compare with a set of sequences (e.g., TEs)
# -sv_graph: construct the graph of SVs
./analys.sh -path_msa '<output path with consensus>' \
      -sv_call \
      -sv_sim te.fasta \
      -sv_graph
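
Once `-sv_call` has written its FASTA output, counting header lines is a quick way to see how many SV sequences were produced. A minimal sketch: the function name `count_fasta_records` is hypothetical, and `svs.fasta` is a placeholder, since the actual output file names are not stated here.

```shell
# Count sequences in a FASTA file by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# "svs.fasta" is a placeholder name, not a documented Pannagram output file.
if [ -f svs.fasta ]; then
    count_fasta_records svs.fasta
fi
```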

2. Visualisation

Pannagram contains a number of useful methods for visualization in R.

2.1 Visualisation of the pangenome alignment

All genomes together:

A dotplot for a pair of genomes:

2.2 Graph of Nestedness on Structural variants

Every node is an SV:

Every node is a unique sequence; the node size reflects the amount of this sequence in SVs:

2.3 Nucleotide plot for a fragment of the alignment

  • In the ACTG-mode:

# --- Quick start code ---
source('utils/utils.R')  			# Functions to work with sequences
source('visualisation/msaplot.R')	# Visualisation
aln.seq = readFastaMy('aln.fasta')	# Vector of strings
aln.mx = aln2mx(aln.seq)			# Transform into the matrix
msaplot(aln.mx)						# ggplot object
  • In the Polymorphism mode:

# --- Quick start code ---
msadiff(aln.mx)						# ggplot object

2.4 Dotplots of Sequences

Simultaneously on forward (dark color) and reverse complement (pink color) strands:

# --- Quick start code ---
source('utils/utils.R')  			# Functions to work with sequences
source('visualisation/dotplot.R')	# Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE)
dotplot(s, s, 15, 9)				# ggplot object

2.5 ORF-finder and visualisation

# --- Quick start code ---
source('utils/utils.R')  			# Functions to work with sequences
source('visualisation/orfplot.R')	# Visualisation
str = nt2seq(s)						# s: nucleotide vector, e.g. the one created in section 2.4
orfs = orfFinder(str)
orfplot(orfs$pos)					# ggplot object

3. Additional useful tools

3.1 Search for similar sequences

...on the genome

The first approach involves searching against entire genomes or individual chromosomes. The quickstart toy-example is:

./simsearch.sh -in_seq genes.fasta -on_genome genome.fasta -out out.txt

The result is a GFF file with hits matching the similarity threshold.
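
To summarise the hits, one can count them per target sequence. The sketch below assumes the output follows the standard tab-separated GFF layout, with the sequence name in the first column and comment lines starting with `#`; this has not been checked against Pannagram's actual output, and the function name `gff_hits_per_seq` is hypothetical.

```shell
# Count GFF records per sequence ID (column 1), skipping comment and blank lines.
# Output: "<seqid><TAB><count>", sorted by seqid.
gff_hits_per_seq() {
    awk -F '\t' '!/^#/ && NF > 0 { n[$1]++ } END { for (s in n) print s "\t" n[s] }' "$1" | sort
}
```

For the toy example above, this would be called as `gff_hits_per_seq out.txt`.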

...on another set

The second approach, in contrast, is designed to search for similarities against another set of sequences. The quickstart toy-example is:

./simsearch.sh -in_seq genes.fasta -on_seq genome.fasta -out out.txt

The result is an RDS file (R's serialized data format) containing a table. This table shows the coverage of one sequence over another and includes a flag column that indicates whether the sequences meet the similarity threshold. Additionally, the second approach takes the coverage strand into account, determining not just whether a sequence is covered, but also whether it is covered in a specific orientation.

Acknowledgements

Development:

  • Anna Igolkina - Lead Developer and Project Initiator
  • Alexander Bezlepsky - Assistant

Testing:

  • Anna Igolkina: Lead Tester
  • Anna Glushkevich: Testing the alignment on A. lyrata genomes
  • Elizaveta Grigoreva: Testing the alignment on A. thaliana and A. lyrata genomes
  • Jilong Ma: Testing the SV-graph on spider genomes
  • Alexander Bezlepsky: Testing Pannagram's functionality on Rhizobial genomes
  • Gregoire Bohl-Viallefond: Testing the annotation converter on A. thaliana alignment
