usum's Introduction

USUM: Plotting sequence similarity embeddings using USEARCH & UMAP

USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.

Installation

  1. Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
    (consider supporting the author by buying the 64-bit license)

  2. Install usum using pip:

pip install usum

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

Show all available options using usum --help

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot:

UMAP static example

An interactive Bokeh HTML plot is also created:

UMAP Bokeh example

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot:

t-SNE static example

Plotting random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
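The idea behind --limit and --seed can be sketched in plain Python. This is only an illustration of seeded sampling without replacement, not usum's actual implementation; the helper names read_fasta and random_subset are hypothetical.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def random_subset(records, limit, seed=0):
    """Return up to `limit` records, sampled without replacement."""
    records = list(records)
    if limit is None or limit >= len(records):
        return records
    # A private generator keeps the sample reproducible for a given seed
    rng = random.Random(seed)
    return rng.sample(records, limit)


fasta = """>a
ACGT
>b
ACGA
>c
TTGA
>d
TTGC
""".splitlines()

subset = random_subset(read_fasta(fasta), limit=2, seed=42)
print([name for name, _ in subset])
```

Calling random_subset twice with the same seed returns the same records, which is the property that makes a subsampled plot reproducible.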

Plotting options

See usum --help for all plotting options.

See UMAP API Guide for more info about the UMAP options.

  • Use --limit to plot a random subset of records
  • Use --width and --height to control plot size in pixels
  • Use --resume to reuse the previous distance matrix from the output folder
  • Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
  • Use --umap-spread to control how close together the embedded points are in the UMAP embedding
  • Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
  • Use --neighbors to control the number of neighbors in the UMAP graph
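As a rough guide to how the UMAP options relate to each other, the sketch below collects them into keyword arguments for umap.UMAP. The mapping from usum's flags to n_neighbors, min_dist and spread is an assumption based on the flag names, and umap_kwargs is a hypothetical helper. The defaults shown are those of umap-learn, not necessarily those of usum.

```python
def umap_kwargs(neighbors=15, min_dist=0.1, spread=1.0, seed=None):
    """Build keyword arguments for umap.UMAP (assumed flag-to-parameter mapping)."""
    if neighbors < 2:
        raise ValueError("neighbors must be greater than 1")
    if min_dist < 0:
        raise ValueError("min_dist cannot be negative")
    if min_dist > spread:
        # UMAP rejects this combination, so fail early with a clear message
        raise ValueError("min_dist must be less than or equal to spread")
    return {
        "n_neighbors": neighbors,   # assumed to correspond to --neighbors
        "min_dist": min_dist,       # assumed to correspond to --umap-min-dist
        "spread": spread,           # assumed to correspond to --umap-spread
        "metric": "precomputed",    # the input is a distance matrix
        "random_state": seed,       # assumed to correspond to --seed
    }


kwargs = umap_kwargs(neighbors=30, min_dist=0.05, spread=1.0, seed=42)
print(kwargs)
```

With umap-learn installed, such a dictionary could be passed as umap.UMAP(**kwargs).fit_transform(distance_matrix).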

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This will reuse the previous distance matrix, so changes to --limit or USEARCH arguments won't take effect.

# Reuse results from the example output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)

How it works

  • A sparse distance matrix is calculated using the USEARCH calc_distmx command.
  • The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
  • The distance matrix is embedded as a precomputed metric using UMAP.
  • The embedding is plotted using umap.plot.
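The first two steps can be illustrated with a minimal sketch. It assumes that distance is defined as 1 minus the fractional identity and that the sparse matrix is stored as tab-separated label1, label2, distance lines. Both are assumptions for illustration and are not confirmed by this README; identity_to_distance and parse_pairs are hypothetical helpers.

```python
def identity_to_distance(percent_identity):
    """Convert % identity (0-100) into a distance in [0, 1]."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    return 1.0 - percent_identity / 100.0


def parse_pairs(lines, maxdist=None):
    """Parse 'label1<TAB>label2<TAB>distance' lines into a symmetric sparse dict."""
    labels = {}  # sequence label -> row/column index
    pairs = {}   # (i, j) -> distance, stored in both directions
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        a, b, dist = line.split("\t")
        dist = float(dist)
        # Register labels first so sequences without close neighbors keep an index
        i = labels.setdefault(a, len(labels))
        j = labels.setdefault(b, len(labels))
        if maxdist is not None and dist > maxdist:
            continue  # distant pairs are left out of the sparse matrix
        pairs[(i, j)] = dist
        pairs[(j, i)] = dist
    return labels, pairs


lines = ["seq1\tseq2\t0.05", "seq1\tseq3\t0.18", "seq2\tseq3\t0.4"]
labels, pairs = parse_pairs(lines, maxdist=0.2)
print(labels)
print(sorted(pairs))
```

Because only the pairwise distance is used, the same code path works whether the labels refer to DNA or protein sequences.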

usum's People

Contributors

prihoda

usum's Issues

Save and reuse reducer

We can reuse the reducer when n_neighbors and random_state are unchanged. This can only be done once the reducer uses the sparse matrix, since saving the full reducer is not possible for inputs larger than 4 GB.

Load USEARCH matrix as scipy sparse matrix

The sparse matrix is problematic because missing values are treated as 0, which is not what we want from a distance matrix. A solution might be to use the exact UMAP metric.
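The problem can be shown with a small pure-Python sketch. Implicit entries of a scipy sparse matrix read as zeros, so a pair that was never stored looks like a pair of identical sequences. The fill value of 1.0 below is only an illustrative workaround, not the fix adopted by the project, and to_dense is a hypothetical helper. Building a dense matrix also gives up the memory savings for large inputs.

```python
def to_dense(pairs, n, fill):
    """Expand sparse (i, j) -> distance pairs into an n x n matrix."""
    matrix = [[fill] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0  # every sequence is at distance 0 from itself
    for (i, j), dist in pairs.items():
        matrix[i][j] = dist
        matrix[j][i] = dist
    return matrix


# Only sequences 0 and 1 are close enough to have a stored distance
pairs = {(0, 1): 0.05}

zero_filled = to_dense(pairs, n=3, fill=0.0)  # mimics implicit sparse zeros
max_filled = to_dense(pairs, n=3, fill=1.0)   # treats missing as maximally distant

# Sequence 2 has no stored neighbors: compare how each matrix represents it
print(zero_filled[0][2], max_filled[0][2])
```

In the zero-filled matrix, sequence 2 appears identical to every other sequence, which is the opposite of what a missing entry is meant to say.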
