


AntiRef: reference clusters of human antibody sequences

AntiRef (Antibody Reference Clusters), which was inspired by UniRef, provides clustered datasets of filtered human antibody sequences. The Jupyter notebooks in this repository contain all code necessary to recreate AntiRef entirely from scratch:

  • download: downloads raw data from the Observed Antibody Space (OAS) repository. Note that the combined total of these datasets is quite large -- nearly 4TB after decompression.
  • filter: filters the raw sequences to ensure only productive, full-length sequences are used to compile AntiRef.
  • cluster: performs a nested clustering procedure using several identity thresholds. This process is similar to that used by UniRef, although the thresholds were optimized for antibody sequence data rather than general protein sequences.

What is nested clustering?

AntiRef is a series of antibody sequence datasets, each clustered at an identity threshold of decreasing stringency. Rather than clustering the filtered input dataset using each threshold in parallel, we perform the clustering sequentially using the output from the previous round as input for the subsequent clustering iteration:

[Figure: iterative clustering schematic]

This has two primary benefits. First and most importantly, it ensures that cluster and sequence names are conserved across all AntiRef datasets. Each cluster is named after its representative sequence (as determined by mmseqs), and because the output of one clustering round is the input for the next, every representative sequence is guaranteed to be present in all previous clustering outputs. By contrast, if we separately clustered the input dataset at 99% and 98% identity, some cluster representatives in the 98% dataset might be absent from the 99% dataset because they were not selected as representatives of their respective 99% clusters.
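
The nesting property can be illustrated with a small, self-contained sketch. This is not the AntiRef pipeline: the real clustering is done by mmseqs, whereas the `identity`, `greedy_cluster` and `nested_cluster` functions below are hypothetical toy stand-ins, the sequences are invented, and thresholds are written as fractions rather than percentages. The sketch only shows why feeding each round's representatives into the next round keeps representative names nested.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the longer length."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Toy greedy clustering (an assumption, not the mmseqs algorithm).

    Each sequence joins the first existing representative it matches at
    >= threshold; otherwise it becomes the representative of a new cluster.
    Returns a mapping of representative name -> member names.
    """
    clusters: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def nested_cluster(
    seqs: Dict[str, str], thresholds: List[float]
) -> Dict[float, Dict[str, List[str]]]:
    """Cluster sequentially: the representatives of each round are the
    only input to the next, less stringent round."""
    results: Dict[float, Dict[str, List[str]]] = {}
    current = dict(seqs)
    for threshold in thresholds:
        clusters = greedy_cluster(current, threshold)
        results[threshold] = clusters
        # Keep only the representatives as input for the next round.
        current = {rep: current[rep] for rep in clusters}
    return results


# Invented example sequences, for illustration only.
seqs = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAAAA",
    "s3": "AAAAAAAAAC",
    "s4": "AAAAAAAACC",
    "s5": "CCCCCCCCCC",
}
results = nested_cluster(seqs, [1.0, 0.9, 0.8])
for threshold, clusters in results.items():
    print(threshold, sorted(clusters))
```

Because every round only sees the previous round's representatives, the set of representative names at each threshold is a subset of the set at every more stringent threshold. Note that the member lists in later rounds contain representatives from the previous round, not the original input sequences.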

Why AntiRef?

Biases in the human antibody repertoire result in publicly available antibody sequence datasets containing many duplicate or highly similar sequences. These redundant sequences are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine learning models of human antibodies. Identity-based clustering provides a solution; however, the extremely large size of available antibody repertoire datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data.

Starting from a dataset of ~335M unique, full-length, productive human antibody sequences from the Observed Antibody Space repository, several AntiRef cluster sets were generated. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef (100, 90 and 50 percent identity) to cluster general protein sequences are suboptimal for antibody clustering. AntiRef provides reference antibody sequence datasets clustered at a range of relevant identity thresholds: 100, 99, 98, 96, 94, 92 and 90 percent. AntiRef90, which uses the lowest clustering threshold of any AntiRef dataset, is roughly one-third the size of the filtered input dataset and less than half the size of the non-redundant AntiRef100.

Where can I download AntiRef datasets?

AntiRef datasets are available on Zenodo and can be downloaded at the following links:

  • AntiRef100: representative sequences resulting from clustering all filtered AntiRef input sequences at 100% identity.
  • AntiRef99: representative sequences resulting from clustering AntiRef100 at 99% identity.
  • AntiRef98: representative sequences resulting from clustering AntiRef99 at 98% identity.
  • AntiRef96: representative sequences resulting from clustering AntiRef98 at 96% identity.
  • AntiRef94: representative sequences resulting from clustering AntiRef96 at 94% identity.
  • AntiRef92: representative sequences resulting from clustering AntiRef94 at 92% identity.
  • AntiRef90: representative sequences resulting from clustering AntiRef92 at 90% identity.
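
The derivation chain in the list above can be written down programmatically, which may be convenient when iterating over the datasets. This is a minimal sketch: the dataset names and thresholds come from the list, while the `derivation_chain` helper and the `"filtered input"` label are hypothetical names introduced here for illustration.

```python
from typing import List, Tuple

# Identity thresholds (percent) used by the AntiRef datasets.
THRESHOLDS = [100, 99, 98, 96, 94, 92, 90]


def derivation_chain(
    thresholds: List[int], source: str = "filtered input"
) -> List[Tuple[str, str, int]]:
    """Return (dataset, parent, threshold) tuples, where each dataset is
    clustered from the dataset produced by the previous threshold."""
    chain = []
    parent = source
    for threshold in thresholds:
        name = f"AntiRef{threshold}"
        chain.append((name, parent, threshold))
        parent = name
    return chain


chain = derivation_chain(THRESHOLDS)
for name, parent, threshold in chain:
    print(f"{name}: clustered from {parent} at {threshold}% identity")
```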

How should I cite AntiRef?

AntiRef has been published in Bioinformatics Advances and can be cited as:

Briney B. (2023). AntiRef: reference clusters of human antibody sequences.
Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad109.

Zenodo provides a unique DOI for each version of a deposited dataset. The DOI of the current version of AntiRef (v2022.12.14) is 10.5281/zenodo.7474336, so an appropriate citation for the dataset would be:

Briney, Bryan. (2022). AntiRef: reference clusters of human antibody sequences (v2022.12.14). 
[Data set]. Zenodo. https://doi.org/10.5281/zenodo.7474336
