FCC Clustering Algorithm

Fraction of Common Contacts Clustering Algorithm for Protein Models from Structure Prediction Methods

About FCC

Structure prediction methods generate a large number of models of which only a fraction matches the biologically relevant structure. To identify this (near-)native model, we often employ clustering algorithms, based on the assumption that, in the energy landscape of every biomolecule, its native state lies in a wide basin neighboring other structurally similar states. RMSD-based clustering, the current method of choice, is inadequate for large multi-molecular complexes, particularly when their components are symmetric. We developed a novel clustering strategy that is based on a very efficient similarity measure - the fraction of common contacts. The outcome of this calculation is a number between 0 and 1, which corresponds to the fraction of residue pairs that are present in both the reference and the mobile complex.
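The measure described above can be illustrated with a minimal sketch. It assumes that each complex has already been reduced to a set of contacts and that the count of shared contacts is normalised by the size of the reference set. The function name and the tuple layout (chain, residue, chain, residue) are hypothetical and chosen only for this illustration; they are not taken from the FCC scripts.

```python
def fraction_common_contacts(reference, mobile):
    """Fraction of the contacts in `reference` that also occur in `mobile`.

    Assumption: the score is normalised by the number of reference contacts.
    """
    reference = set(reference)
    mobile = set(mobile)
    if not reference:
        return 0.0  # no contacts to compare against
    return len(reference & mobile) / float(len(reference))


# Hypothetical contacts: (chain, residue number, chain, residue number)
ref = set([("A", 10, "B", 55), ("A", 11, "B", 55),
           ("A", 12, "B", 60), ("A", 14, "B", 61)])
mob = set([("A", 10, "B", 55), ("A", 12, "B", 60), ("A", 30, "B", 80)])

print(fraction_common_contacts(ref, mob))  # 2 of the 4 reference contacts are shared
print(fraction_common_contacts(mob, ref))  # 2 of the 3 contacts when roles are swapped
```

Under this normalisation the score depends on which complex is taken as the reference, so swapping the two arguments can change the result.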

Advantages of FCC clustering vs. RMSD-based clustering:

  • About 100 times faster on average.
  • Handles symmetry by considering complexes as entities instead of collections of chains.
  • Does not require atom equivalence (clusters mutants, missing loops, etc).
  • Handles any molecule type (protein, DNA, RNA, carbohydrates, lipids, ligands, etc).
  • Allows multiple levels of "resolution": chain-chain contacts, residue-residue contacts, residue-atom contacts, etc.

How to Cite

Rodrigues JPGLM, Trellet M, Schmitz C, Kastritis P, Karaca E, Melquiond ASJ, Bonvin AMJJ. Clustering biomolecular complexes by residue contacts similarity. Proteins: Structure, Function, and Bioinformatics 2012;80(7):1810–1817.

Requirements

  • Python 2.6+
  • C/C++ Compiler

Installation

Navigate to the src/ folder and issue 'make' to compile the contact programs. Edit the Makefile if necessary (e.g. different compiler, optimization level).

Usage

All scripts print usage documentation when called without arguments. For the Python scripts, the '-h' option prints more detailed help with descriptions of all available options.

For most cases, the following setup is enough:

# Make a file list with all your PDB files
ls *pdb > pdb.list

# Ensure all PDB models have segID identifiers
# Convert chainIDs to segIDs if necessary using scripts/pdb_chainxseg.py
for pdb in $( cat pdb.list ); do pdb_chainxseg.py $pdb > temp; mv temp $pdb; done

# Generate contact files for all PDB files in pdb.list
# using 4 cores on this machine.
python2.6 make_contacts.py -f pdb.list -n 4

# Create a file listing the names of the contact files
# Derive it from pdb.list to maintain order in the cluster output
sed -e 's/pdb/contacts/' pdb.list | sed -e '/^$/d' > pdb.contacts

# Calculate the similarity matrix
python2.6 calc_fcc_matrix.py -f pdb.contacts -o fcc_matrix.out

# Cluster the similarity matrix using a threshold of 0.75 (75% contacts in common)
python2.6 cluster_fcc.py fcc_matrix.out 0.75 -o clusters_0.75.out

# Use ppretty_clusters.py to output meaningful names instead of model indexes
python2.6 ppretty_clusters.py clusters_0.75.out pdb.list
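The sed step above replaces the first occurrence of "pdb" on each line, which may not be the extension if the file or directory name itself contains "pdb". As a hedged alternative, the sketch below swaps only the extension and keeps the order of pdb.list. It assumes, as the sed command implies, that each contact file has the same name as its PDB file with a ".contacts" extension. The function name is hypothetical.

```python
import os


def contact_names(pdb_paths):
    """Map each PDB path to a .contacts path, keeping the input order."""
    names = []
    for line in pdb_paths:
        path = line.strip()
        if not path:
            continue  # skip blank lines, like the second sed expression
        root, _ext = os.path.splitext(path)
        names.append(root + ".contacts")
    return names


# Hypothetical file names; the second one has "pdb" in its directory name
print(contact_names(["complex_1.pdb\n", "\n", "pdb_run/complex_2.pdb\n"]))
```

To use it on a real list, pass the open pdb.list file object as `pdb_paths` and write the returned names, one per line, to pdb.contacts.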

Authors

João Rodrigues

Mikael Trellet

Adrien Melquiond

Christophe Schmitz

Ezgi Karaca

Panagiotis Kastritis

Alexandre Bonvin

fcc's Issues

calculation of FCC matrix fails because of large number of chains

I am trying to cluster a complex consisting of 27 chains. calc_fcc_matrix.py fails with the following error message:

rackham.abonvin.Target170-scoring/rescored> python /home/abonvin/haddock_git/fcc/scripts/calc_fcc_matrix.py -f file.contacts -o rmsds/fcc.out
+ BEGIN: Tue Aug  4 16:58:59 2020
+ Parsing 951 contact files
Traceback (most recent call last):
  File "/home/abonvin/haddock_git/fcc/scripts/calc_fcc_matrix.py", line 129, in <module>
    c = parse_contact_file(args, exclude_chains)
  File "/home/abonvin/haddock_git/fcc/scripts/calc_fcc_matrix.py", line 20, in parse_contact_file
    contacts = [ set([ int(l) for l in open(f)]) for f in f_list if f.strip()]
ValueError: invalid literal for int() with base 10: '10001-1110031-10\n'

I did a fresh pull before running it. Possibly something is wrong with the contact files.

Issue with makefile

When I run the make command, I see the following error.

cl : Command line warning D9024 : unrecognized source file type 'Makefile', object file assumed
Microsoft (R) Incremental Linker Version 14.40.33811.0
Copyright (C) Microsoft Corporation. All rights reserved.

/out:Makefile.exe
Makefile
LINK : fatal error LNK1181: cannot open input file 'Makefile.obj'

Why am I seeing this error?

Allow numbers as chainIDs/segIDs

When a PDB has numbers instead of characters as chainIDs and/or segIDs, calc_fcc_matrix returns the following error:

+ BEGIN: Mon Nov 13 01:18:23 2017
+ Parsing 202 contact files
Traceback (most recent call last):
  File "/home/enmr/services-enmr/HADDOCK2.2/server/run/userrun002472/run1/tools/calc_fcc_matrix.py", line 142, in ?
    c = parse_contact_file(args, exclude_chains)
  File "/home/enmr/services-enmr/HADDOCK2.2/server/run/userrun002472/run1/tools/calc_fcc_matrix.py", line 26, in parse_contact_file
    contacts = [ set([ int(l) for l in open(f)]) for f in f_list if f.strip()]
ValueError: invalid literal for int(): 10087-1510438-14

This is due to a wrong output of the contacts by make_contacts:

10242-1510309-14
10242-1510310-14
10244-1510308-14

Switching back to alphabetical characters in the PDB solves the problem. However, numbers are allowed in the official PDB format.
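The tracebacks in these reports show that the parser calls int() on every line of a contact file, so any line containing a "-" raises ValueError. Below is a minimal diagnostic sketch, not a fix, that lists the offending lines before the matrix calculation is run. The function name is hypothetical, and the first sample line is an invented all-digit value used only for contrast with the two lines quoted above.

```python
def find_bad_contact_lines(lines):
    """Return (line_number, text) for every non-blank line that int() rejects."""
    bad = []
    for number, line in enumerate(lines, 1):
        text = line.strip()
        if not text:
            continue  # blank lines are ignored in this sketch
        try:
            int(text)
        except ValueError:
            bad.append((number, text))
    return bad


# First entry is a made-up valid line; the other two come from the report above
sample = ["1024215103\n", "10242-1510309-14\n", "10244-1510308-14\n"]
print(find_bad_contact_lines(sample))
```

To check a real contact file, pass the open file object as `lines`; an empty result means every non-blank line can be converted by int().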
