aineniamh / snipit

snipit: summarise snps relative to your reference sequence

License: GNU General Public License v3.0

Python 100.00%

snipit's Introduction

snipit

Summarise snps relative to a reference sequence

Install

pip install snipit

Example Usage

Link to test data: test.fasta

Basic usage for nucleotide alignments:

snipit test.fasta \
--output-file test

The default output format is png. Specify only the output path/name, not the extension.

To change output format, use --format:

snipit test.fasta \
--output-file test \
--format pdf

Options: png, jpg, pdf, svg, tiff.

To change the colour scheme, use --colour-palette:

snipit test.fasta \
--output-file test \
--colour-palette classic_extended

Other colour schemes:

classic, classic_extended, primary, purine-pyrimidine, greyscale, wes, verity, ugene

Use ugene for protein (aa) alignments. Use classic_extended for colouring ambiguous bases.

There are multiple options to control which SNPs or indels are included/excluded:

snipit test.fasta \
--show-indels \
--include-positions '100-150' \
--exclude-positions '223 224 225'
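
The position arguments accept both closed ranges and single positions. As a rough illustration of those semantics (this is not snipit's own code; `parse_positions` and `select_positions` are hypothetical helper names), the sketch below expands specs such as '100-150' or '223 224 225' into one-indexed positions and applies the include filter before the exclude filter, as the help text describes:

```python
def parse_positions(specs):
    """Expand specs like '100-150' or '223 224 225' into a set of 1-indexed positions."""
    positions = set()
    for spec in specs:
        for token in spec.split():
            if "-" in token:
                start, end = (int(part) for part in token.split("-"))
                positions.update(range(start, end + 1))  # closed, inclusive range
            else:
                positions.add(int(token))
    return positions


def select_positions(snp_positions, include=None, exclude=None):
    """Apply the include filter first, then the exclude filter (hypothetical helper)."""
    kept = set(snp_positions)
    if include:
        kept &= parse_positions(include)
    if exclude:
        kept -= parse_positions(exclude)
    return sorted(kept)


print(select_positions([99, 100, 120, 150, 223, 300], include=["100-150"], exclude=["120"]))
```

Because the include filter runs first, an excluded position only has an effect if it falls inside the included set.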

To control how ambiguous bases are handled, use --ambig-mode:

[all] include all ambig such as N,Y,B in all positions
[snps] only include ambig if a snp is present at the same position - Default
[exclude] remove all ambig, same as deprecated --exclude-ambig-pos

Use the colour palette classic_extended when plotting with all or snps.
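
To make the three modes concrete, here is a minimal sketch of one way they could be interpreted when choosing which alignment positions to plot. It is an assumption-laden illustration, not snipit's implementation: `positions_to_plot` is a hypothetical function, and the `exclude` branch assumes the behaviour described for --exclude-ambig-pos ("Exclude positions with ambig base in any sequences").

```python
AMBIGUOUS = set("NRYSWKMBDHV")  # IUPAC nucleotide ambiguity codes


def positions_to_plot(reference, queries, mode="snps"):
    """Return 1-indexed alignment positions kept under a given mode (illustrative only)."""
    kept = []
    for i, ref_base in enumerate(reference):
        column = [query[i] for query in queries]
        # A SNP here means an unambiguous, non-gap base that differs from the reference.
        has_snp = any(b != ref_base and b not in AMBIGUOUS and b != "-" for b in column)
        has_ambig = any(b in AMBIGUOUS for b in column)
        if mode == "all":
            keep = has_snp or has_ambig
        elif mode == "snps":
            keep = has_snp  # ambiguous bases are shown only where a SNP also occurs
        elif mode == "exclude":
            keep = has_snp and not has_ambig  # assumption: drop any position with an ambiguous base
        else:
            raise ValueError(f"unknown mode: {mode}")
        if keep:
            kept.append(i + 1)
    return kept


reference = "ACGTA"
queries = ["TCNTA", "TCGCN", "ACGYA"]
for mode in ("all", "snps", "exclude"):
    print(mode, positions_to_plot(reference, queries, mode))
```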

Recombination mode is designed to assist with recombination analysis for SARS-CoV-2 (SC2). This mode colours mutations in the query sequences according to which of two references they are present in. Recombination mode requires three flags: --reference, --recombi-mode, --recombi-references.

The specified --reference must be different from the --recombi-references.

snipit test.fasta \
--reference USA_3 \
--recombi-mode \
--recombi-references "USA_1,USA_2"
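
The idea behind this mode can be sketched as a per-position comparison of the query against the reference and the two recombi-references. The function below is only a guess at that idea for illustration, not snipit's colouring logic; `classify_query_bases` and its labels ("A", "B", "both", "private") are hypothetical.

```python
def classify_query_bases(reference, parent_a, parent_b, query):
    """Label each 1-indexed position where the query differs from the reference."""
    labels = {}
    columns = zip(reference, parent_a, parent_b, query)
    for pos, (ref, a, b, q) in enumerate(columns, start=1):
        if q == ref:
            continue  # not a mutation relative to the reference
        if q == a and q == b:
            labels[pos] = "both"
        elif q == a:
            labels[pos] = "A"
        elif q == b:
            labels[pos] = "B"
        else:
            labels[pos] = "private"  # mutation seen in neither recombi-reference
    return labels


print(classify_query_bases("AAAAAA", "CACAAA", "AACGAG", "CACGTA"))
```

A run of "A" labels followed by a run of "B" labels along the genome is the kind of pattern that would suggest a breakpoint between them.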

For amino acid alignments, specify the sequence type as aa and use the colour palette ugene:

snipit test.prot.fasta \
--sequence-type aa \
--colour-palette ugene \
--output-file test.prot

There are several more options, see below for full usage.

Full Usage

snipit

optional arguments:
  -h, --help            show this help message and exit

Input options:
  alignment             Input alignment fasta file
  -t {nt,aa}, --sequence-type {nt,aa}
                        Input sequence type: aa or nt
  -r REFERENCE, --reference REFERENCE
                        Indicates which sequence in the alignment is the
                        reference (by sequence ID). Default: first sequence in
                        alignment
  -l LABELS, --labels LABELS
                        Optional csv file of labels to show in output snipit
                        plot. Default: sequence names
  --l-header LABEL_HEADERS
                        Comma separated string of column headers in label csv.
                        First field indicates sequence name column, second the
                        label column. Default: 'name,label'

Mode options:
  --recombi-mode        Allow colouring of query seqeunces by mutations
                        present in two 'recombi-references' from the input
                        alignment fasta file
  --recombi-references RECOMBI_REFERENCES
                        Specify two comma separated sequence IDs in the input
                        alignment to use as 'recombi-references'. Ex.
                        Sequence_ID_A,Sequence_ID_B
  --cds-mode            Assumes sequence supplied is a coding sequence

Output options:
  -d OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -o OUTFILE, --output-file OUTFILE
                        Output file name stem. Default: snp_plot
  -s, --write-snps      Write out the SNPs in a csv file.
  -f FORMAT, --format FORMAT
                        Format options (png, jpg, pdf, svg, tiff) Default: png

Figure options:
  --height HEIGHT       Overwrite the default figure height
  --width WIDTH         Overwrite the default figure width
  --size-option SIZE_OPTION
                        Specify options for sizing. Options: expand, scale
  --solid-background    Force the plot to have a solid background, rather than
                        a transparent one.
  -c , --colour-palette 
                        Specify colour palette. Options: [classic,
                        classic_extended, primary, purine-pyrimidine,
                        greyscale, wes, verity, ugene]. Use ugene for protein
                        alignments.
  --flip-vertical       Flip the orientation of the plot so sequences are
                        below the reference rather than above it.
  --sort-by-mutation-number
                        Render the graph with sequences sorted by the number
                        of SNPs relative to the reference (fewest to most).
                        Default: False
  --sort-by-id          Sort sequences alphabetically by sequence id. Default:
                        False
  --sort-by-mutations SORT_BY_MUTATIONS
                        Sort sequences by bases at specified positions.
                        Positions are comma separated integers. Ex. '1,2,3'
  --high-to-low         If sorted by mutation number is selected, show the
                        sequences with the fewest SNPs closest to the
                        reference. Default: False

SNP options:
  --show-indels         Include insertion and deletion mutations in snipit
                        plot.
  --include-positions INCLUDED_POSITIONS [INCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or
                        specific position only included in the output. Ex.
                        '100-150' or Ex. '100 101' Considered before '--
                        exclude-positions'.
  --exclude-positions EXCLUDED_POSITIONS [EXCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or
                        specific position to exclude in the output. Ex.
                        '100-150' or Ex. '100 101' Considered after '--
                        include-positions'.
  --ambig-mode {all,snps,exclude}
                        Controls how ambiguous bases are handled - [all]
                        include all ambig such as N,Y,B in all positions;
                        [snps] only include ambig if a snp is present at the
                        same position; [exclude] remove all ambig, same as
                        depreciated --exclude-ambig-pos

Misc options:
  -v, --version         show program's version number and exit
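
The -l/--labels and --l-header options above expect a csv whose header contains a sequence name column and a label column ('name,label' by default). A small sketch of producing such a file and checking its header follows; the helper names and the example labels are illustrative assumptions, and the check is not the one snipit performs internally.

```python
import csv
import io


def write_labels(handle, labels, headers=("name", "label")):
    """Write a two-column label csv: sequence ID, then the label to display."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(headers)
    for name, label in labels.items():
        writer.writerow([name, label])


def missing_label_headers(handle, l_header="name,label"):
    """Return expected column names absent from the csv header (empty list if none)."""
    expected = [h.strip() for h in l_header.split(",")]
    found = csv.DictReader(handle).fieldnames or []
    return [h for h in expected if h not in found]


buffer = io.StringIO()  # stands in for a labels.csv file on disk
write_labels(buffer, {"USA_1": "Sample A", "USA_2": "Sample B"})
buffer.seek(0)
print(missing_label_headers(buffer))  # an empty list means both columns were found
```

A file written this way would then be passed with, for example, `snipit test.fasta -l labels.csv --l-header 'name,label'`.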

Cite

Please cite this tool as follows:

Aine O'Toole, snipit (2024) GitHub repository, https://github.com/aineniamh/snipit

snipit's Issues

Can the graph be directly drawn through the reference sequence and VCF file?

Dear @aineniamh,
Thank you for developing such a great project.
When I analyse SNPs, my typical workflow is to align second-generation sequencing data to the reference genome, obtain a VCF file, and then visualize the SNPs.
So I'm wondering whether snipit could accept the reference genome and the VCF file as input and generate a SNP distribution plot from them. If that were possible, snipit would have even more application scenarios.
Best wishes.

BUG: Release 1.0.4 on PyPI seems to be an outdated code base

Installed v1.0.4 and it doesn't behave at all how it should. None of the new arguments appear in help. Maybe you uploaded the wrong zip folder?

Error: name not a column name in label.csv

Having trouble with renaming the samples in my plot (which has worked many times before!?) with the error

Error: name not a column name in label.csv

Input

snipit aligned.fasta -r MN908947.3 -l label.csv

csv file looks fine?

(snipit) rebee@server:~/projects/feb2022/filtered/consensus/with-ref$ head label.csv
name,label

Coloring reference bases in --recombi-mode

I'm currently working on integrating your visualization script into my recombination detection tools. I love the --recombi-mode option, but since the reference bases aren't colored, it can be difficult to see the breakpoints.

For example, this command produces the following visualization of XBF:

snipit XBF.fasta \
  --recombi-mode \
  --recombi-references BA.5.2,CJ.1 \
  --reference Reference \
  --flip-vertical \
  --format png \
  --solid-background

Alignment: XBF.fasta.zip (I've masked non-barcode positions for BA.5.2 and CJ.1)

But I'd find it more useful if:

The reference bases were also colored for each recombi-reference (ex. BA.5.2=blue, CJ.1=red)
The recombi-references maybe had different saturation for reference versus mutations (ex. reference=light color, mutation = dark color)?

This is a mock-up of a color scheme I'm imagining. Which makes the genomic composition of XBF more clear, and the breakpoints easier to see.

I'm wondering if you'd also find value (or problems?) with this style of visualization? If you think it's worthwhile, I could experiment in a pull request to see what kind of code changes would be needed?

Thanks for developing this amazing visualizer!

Plans for Bioconda?

hi @aineniamh!

This looks really cool and useful! Would it be ok with you if I submitted a recipe to Bioconda?

Cheers,
Robert

Option to remove the SNP base and coordinate text from image?

Hi,
This tool is very easy to use and works well on small alignments. However, when visualising larger alignments the image becomes cluttered by the base text in each tile and the coordinates along the top. Would it be possible to have an option to remove these elements?
Thanks!

Suggestion: show coordinates in reference sequence

Hi,

I love this program!

One suggestion: The coordinates on top show the position in the alignment, and it would be lovely if they could show the position in the reference sequence instead, maybe using an additional command line switch.
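
The difference between the two coordinate systems comes from gaps in the aligned reference. A minimal sketch of the mapping being requested, assuming "-" marks a gap (`alignment_to_reference_coords` is a hypothetical helper, not part of snipit):

```python
def alignment_to_reference_coords(aligned_reference, gap="-"):
    """Map each 1-indexed alignment column to a 1-indexed ungapped reference position.

    Columns where the reference has a gap (insertions in other sequences) map to None.
    """
    mapping = {}
    ref_pos = 0
    for column, base in enumerate(aligned_reference, start=1):
        if base == gap:
            mapping[column] = None
        else:
            ref_pos += 1
            mapping[column] = ref_pos
    return mapping


print(alignment_to_reference_coords("AC--GT"))
```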

Suggestion: reformat the code with black

I would be happy to submit some pull requests for snipit, but I find the code style extremely dense, hard to read and to imitate, and generally discouraging.... Without meaning to sound rude or disrespectful etc., I think it would be a healthy move for the project if you installed black and ran it. The result would produce Python that is much more typical and I think encourage participation. Anyway, just a suggestion.

For example, I thought of a way to preserve the original order of sequences in the input, but I don't want to work on Python that looks like this (my editor really screams at me when I load the files). So I just didn't do it, and leave this comment instead. I hope that's ok. Thanks for snipit!

Can we plot like this?

Dear @corneliusroemer @aineniamh,

I would like to plot like the figure below, is this possible using snipit?

Suggestion: ability to order sequences according to a file

Hi there,

Really love this tool, thanks for sharing it! :)

I think it would also be really handy to be able to give a specific order for the sequences to appear in (perhaps the same order as the labels csv file?) so it can be easier to compare across predefined conditions rather than closeness to reference.

Thanks :)

pip install snipit missing latest parameters displayed on github repo

Love this tool - thank you so much for sharing it!

I installed this last week and noticed the latest parameters (e.g. below) aren't available or recognised when used!

  --flip-vertical       Flip the orientation of the plot so sequences are below the reference rather than above it.
  --include-positions INCLUDED_POSITIONS [INCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or specific position only included in the output. Ex. '100-150' or Ex. '100 101' Considered before '--exclude-positions'.
  --exclude-positions EXCLUDED_POSITIONS [EXCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or specific position to exclude in the output. Ex. '100-150' or Ex. '100 101' Considered after '--include-positions'.
  --exclude-ambig-pos   Exclude positions with ambig base in any sequences. Considered after '--include-positions'
  --sort-by-mutation-number
                        Render the graph with sequences sorted by the number of SNPs relative to the reference (fewest to most). Default: False
  --high-to-low         If sorted by mutation number is selected, show the sequences with the fewest SNPs closest to the reference. Default: False

Licenses put/patch bug

I am currently working on an API to automatically check and connect licenses. The problem is that the Put and Patch commands both default the License ID to 1. This is so far not changeable. I have been able to recreate the bug on the readme and on a local project. No matter which ID you put in. It will always use 1.

Suggestion: Export option as .tsv or .csv !!

Hey @aineniamh, snipit is an excellent tool for displaying SNPs.
It is great that it outputs .png, but an option to export the SNPs as .tsv or .csv would really help.
Thanks

ModuleNotFoundError: No module named 'snp_functions'

I have been trying to run snipit from a conda environment in Windows 10 and in Linux, but I get the error in the subject line. I have checked the snipit files and "snp_functions" is there. Any light on why this is happening?
File "c:\users\xxxx\appdata\local\conda\conda\envs\snipit\lib\site-packages\snipit\command.py", line 9, in <module>
import snp_functions as sfunks

snipit 1.1.2

Hello
I installed snipit with pip but it does not seem to have all the options (notably the most recent ones, like --sequence-type).

Can you advise?
Thank you!

snipit --version
snipit 1.1.2

 snipit --help
usage: snipit <alignment> [options]

snipit

positional arguments:
  alignment             Input alignment fasta file

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Indicates which sequence in the alignment is the reference (by sequence ID). Default: first sequence in alignment
  -l LABELS, --labels LABELS
                        Optional csv file of labels to show in output snipit plot. Default: sequence names
  --l-header LABEL_HEADERS
                        Comma separated string of column headers in label csv. First field indicates sequence name column, second the label column. Default: 'name,label'
  -d OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -o OUTFILE, --output-file OUTFILE
                        Output file name stem. Default: snp_plot
  -s, --write-snps      Write out the SNPs in a csv file.
  -f FORMAT, --format FORMAT
                        Format options (png, jpg, pdf, svg, tiff) Default: png
  --height HEIGHT       Overwrite the default figure height
  --width WIDTH         Overwrite the default figure width
  --size-option SIZE_OPTION
                        Specify options for sizing. Options: expand, scale
  --solid-background    Force the plot to have a solid background, rather than a transparent one.
  --flip-vertical       Flip the orientation of the plot so sequences are below the reference rather than above it.
  --show-indels         Include insertion and deletion mutations in snipit plot.
  --include-positions INCLUDED_POSITIONS [INCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or specific position only included in the output. Ex. '100-150' or Ex. '100 101' Considered before '--exclude-positions'.
  --exclude-positions EXCLUDED_POSITIONS [EXCLUDED_POSITIONS ...]
                        One or more range (closed, inclusive; one-indexed) or specific position to exclude in the output. Ex. '100-150' or Ex. '100 101' Considered after '--include-positions'.
  --exclude-ambig-pos   Exclude positions with ambig base in any sequences. Considered after '--include-positions'
  --sort-by-mutation-number
                        Render the graph with sequences sorted by the number of SNPs relative to the reference (fewest to most). Default: False
  --sort-by-id          Sort sequences alphabetically by sequence id. Default: False
  --sort-by-mutations SORT_BY_MUTATIONS
                        Sort sequences by bases at specified positions. Positions are comma separated integers. Ex. '1,2,3'
  --high-to-low         If sorted by mutation number is selected, show the sequences with the fewest SNPs closest to the reference. Default: False
  -v, --version         show program's version number and exit
  -c COLOUR_PALETTE, --colour-palette COLOUR_PALETTE
                        Specify colour palette. Options: primary, classic, purine-pyrimidine, greyscale, wes, verity
  --recombi-mode        Allow colouring of query seqeunces by mutations present in two 'recombi-references' from the input alignment fasta file
  --recombi-references RECOMBI_REFERENCES
                        Specify two comma separated sequence IDs in the input alignment to use as 'recombi-references'. Ex. Sequence_ID_A,Sequence_ID_B

v1.1.2 issues

a couple of observations for the current version:

--sort-by-mutations is broken (fix in #24)
code in master branch has __version__ set to 1.1, but pypi version has 1.1.2 - so the repo code does not seem to be identical to the pypi one (was the pypi upload generated from the tidy-up branch?)
README still lists --snps-only instead of --show-indels

snipit for amino acid

Is there a way to use snipit for amino acid alignments?
It works great with my nt data, but I am interested in plotting AA.

thank you for this nice program!

ENH: Make background transparency an option, rather than forcing it

It appears that the background of the .png output has become transparent between 1.0.3 and new release.

Is this on purpose? Could this be due to a change in matplotlib? I personally prefer the background to be white, as a transparent one is not so readable:

Suggestion: Writing out SNPs of a specific region in CSV file

Hello @aineniamh,

I have been using this program and so far I love it. I also like the idea of exporting a CSV file with all the SNPs detected.

I'm using genomic sequences as input and the option --include-positions to plot mutations in a specific region. However, the CSV file that is created lists the mutations observed across the whole genomic sequences. Would it be possible to create a CSV file showing only the SNPs located in a specific range or set of positions?

Thanks for developing this useful tool!

Error about: "UnboundLocalError: local variable 'top_polygon' referenced before assignment"

A couple of times snipit has failed with the following error:

Traceback (most recent call last):
  File "/mnt/userapps/miniconda3/bin/snipit", line 8, in <module>
    sys.exit(main())
  File "/mnt/userapps/miniconda3/lib/python3.8/site-packages/snipit/command.py", line 72, in main
    sfunks.make_graph(num_seqs,num_snps,record_ambs,record_snps,output,label_map,colours,length,args.width,args.height,args.size_option)
  File "/mnt/userapps/miniconda3/bin/snp_functions.py", line 331, in make_graph
    rect = patches.Rectangle((0,(top_polygon)), length, y_inc ,alpha=0.2, fill=True, edgecolor='none',facecolor="dimgrey")
UnboundLocalError: local variable 'top_polygon' referenced before assignment

I think this occurs when the samples in the multi-fasta input file all have the same SNPs. I can send an example fasta file containing two sequences that produced this error. If this is the reason, then perhaps snipit could just output a "No SNPs found" message.

I'm using MAFFT v7.487 with the "linsi" option (eg: "mafft-linsi input_fasta > aligned_fasta"), and using miniconda with conda 4.10.3, and Python 3.8.10, and Snipit version 1.0.3:

$ pip show snipit
Name: snipit
Version: 1.0.3
Summary: snipit
Home-page: https://github.com/aineniamh/snipit
etc...
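
If the cause suggested in this report is right, the failure could be avoided by checking for differences before drawing anything. The following is a hedged sketch of such a pre-check on an already aligned input; `find_snps` and `has_any_snps` are hypothetical helpers and do not reflect how snipit detects SNPs internally.

```python
def find_snps(reference, queries):
    """Return {name: [(position, ref_base, query_base), ...]} for unambiguous differences."""
    snps = {}
    for name, seq in queries.items():
        snps[name] = [
            (pos, r, q)
            for pos, (r, q) in enumerate(zip(reference.upper(), seq.upper()), start=1)
            if r != q and r in "ACGT" and q in "ACGT"
        ]
    return snps


def has_any_snps(reference, queries):
    """True if at least one query differs from the reference at some position."""
    return any(find_snps(reference, queries).values())


reference = "ACGTACGT"
queries = {"sample_1": "ACGTACGT"}  # identical to the reference, so nothing to plot
if not has_any_snps(reference, queries):
    print("No SNPs found")
```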

Citation

Thank you very much for your awesome scripts. How can I cite your work?

Suggestion: enhancements to show possible effects on protein

This is a great tool for visualising nucleotide changes, thanks for making it available to the community!

A related use case would be to visualise amino acid changes vs a reference (eg if giving snipit an alignment of a specific CDS of interest), which I don't think is currently supported? Either by giving it the amino acid sequence directly as the input, or by defining the start codon in the ref and converting on the fly.

As an alternative (or complement), it would be nice to be able to highlight nucleotide changes expected to be non-synonymous in the output in some way.

Maybe these suggestions don't scale so well to whole genomes which is what I know snipit was initially designed for.

Image size of 127348x38612 pixels is too large.

ENH: Add option to sort by presence/absence of mutations at certain positions

It would be nice if one could sort sequences by presence/absence of mutations at certain positions.

e.g. --sort-by-mutation 500,2500,241

Would sort by nucleotide at 500, then resolve ties by nucleotide at 2500, then by nucleotide at 241.
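
The tie-breaking order described above corresponds to sorting on a tuple of bases. A minimal sketch, assuming one-indexed positions and sequences held as (name, sequence) pairs (`sort_by_positions` is a hypothetical helper, not snipit's code):

```python
def sort_by_positions(sequences, positions):
    """Sort (name, sequence) pairs by the bases at the given 1-indexed positions, in order."""
    return sorted(sequences, key=lambda item: tuple(item[1][p - 1] for p in positions))


sequences = [("s1", "GAT"), ("s2", "AAT"), ("s3", "GAC"), ("s4", "AGT")]
# Sort by the base at position 1, break ties at position 3, then at position 2.
for name, seq in sort_by_positions(sequences, [1, 3, 2]):
    print(name, seq)
```

Python compares tuples element by element, so later positions are consulted only when all earlier ones are equal.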