mdlatticeanalysistool's Introduction

The MDLatticeAnalysisTool (mdl) is a toolkit designed to identify multi-contact events (mces) in PolyLibScan (pls) datasets. An mce is a snapshot of a pls run in which more than one bead is in close vicinity of the target protein (the standard threshold is 4 Å). The pipeline allows comparisons between the mdl results and the occupancy predictions made by Epitopsy for the corresponding polymer/protein interactions. The toolkit also contains a genetic algorithm that uses the mdl pipeline to optimize polymer composition with respect to the number of mces and the overlap between the predicted occupancies and the observed mce probabilities in the pls simulations.

Dependencies

  • PolyLibScan
  • Epitopsy
  • pathlib2
  • sklearn
  • yaml
  • pandas
  • matplotlib
  • Bio.PDB

Genetic algorithm specific:

  • LammpsSubmit

MDLatticeAnalysisTool (mdl)

The mdl can be used to identify and quantify the mces of pls runs. An example use case is shown below:

import MDlatticeAnalysisTool as mdl

Create an Environment to parse the pdb data and create a box around the protein

env = mdl.environment.Environment('path/to/pls/project', 'path/to/polymer_poses.pdb')

Calculate the distance to the nearest protein atom for each monomer. This is done for each snapshot of a single repetition (run) of a pls job.

ana = mdl.analysis.Analytics(env, distance_cutoff = 4)
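The contact criterion behind distance_cutoff can be illustrated with a minimal, standalone sketch. It does not use the mdl API: the function names, the toy coordinates and the choice of "less than or equal to" at the cutoff are assumptions made for illustration only.

```python
import math  # math.dist requires Python 3.8 or newer


def nearest_distance(bead, protein_atoms):
    """Distance (in Å) from one bead to its closest protein atom."""
    return min(math.dist(bead, atom) for atom in protein_atoms)


def is_multi_contact(beads, protein_atoms, distance_cutoff=4.0):
    """True if more than one bead lies within the cutoff of the protein."""
    contacts = sum(
        1 for bead in beads
        if nearest_distance(bead, protein_atoms) <= distance_cutoff
    )
    return contacts > 1


# Toy snapshot with made-up coordinates: two beads close to the protein,
# one bead far away.
protein_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
beads = [(0.0, 3.0, 0.0), (10.0, 0.0, 2.5), (30.0, 30.0, 30.0)]

print(is_multi_contact(beads, protein_atoms))      # two beads in contact
print(is_multi_contact(beads[:1], protein_atoms))  # only one bead in contact
```

A snapshot with a single bead in contact counts as ordinary binding, not as an mce, which is why the second call returns False.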

The dataframe contains, for each snapshot, the monomer coordinate, the coordinate of the closest protein atom, the distance between them (in Å) and the residue to which that protein atom belongs.

ana.dataframe

Plot the probability of each monomer to bind to the protein per snapshot (binding="single") or to be part of a "multi-contact" snapshot (binding="multi")

ana.probability_distribution_polymer(binding="multi")

Plot the probability of each protein residue to bind to a monomer per snapshot (binding="single") or to be part of a "multi-contact" snapshot (binding="multi")

ana.probability_distribution_protein(binding="multi")
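The difference between the two binding modes can be sketched in plain Python. This is one reading of the description above, not the mdl implementation: the record layout (snapshot, monomer, residue, distance), the residue labels and the function name are hypothetical stand-ins for the rows of ana.dataframe.

```python
from collections import defaultdict


def residue_probabilities(records, n_snapshots, binding="single", cutoff=4.0):
    """Fraction of snapshots in which each residue is in contact.

    records: iterable of (snapshot, monomer, residue, distance) tuples
             (hypothetical layout, see the note above).
    binding: "single" counts every snapshot with a contact,
             "multi" counts only snapshots with two or more bound monomers.
    """
    contacts = defaultdict(set)  # snapshot -> {(monomer, residue), ...}
    for snapshot, monomer, residue, distance in records:
        if distance <= cutoff:
            contacts[snapshot].add((monomer, residue))

    counts = defaultdict(int)
    for pairs in contacts.values():
        bound_monomers = {monomer for monomer, _ in pairs}
        if binding == "multi" and len(bound_monomers) < 2:
            continue  # not a multi-contact snapshot
        for residue in {residue for _, residue in pairs}:
            counts[residue] += 1
    return {residue: n / n_snapshots for residue, n in counts.items()}


# Made-up data for four snapshots.
records = [
    (0, 0, "LYS12", 3.1), (0, 1, "ASP40", 3.8), (0, 2, "GLY7", 9.0),
    (1, 0, "LYS12", 3.5), (1, 1, "ASP40", 6.2),
    (2, 0, "LYS12", 8.0), (2, 1, "ASP40", 7.5),
    (3, 0, "LYS12", 2.9), (3, 1, "LYS12", 3.3),
]

single = residue_probabilities(records, n_snapshots=4, binding="single")
multi = residue_probabilities(records, n_snapshots=4, binding="multi")
print(single)
print(multi)
```

Snapshot 1 contributes to the "single" probability of LYS12 but not to its "multi" probability, because only one monomer is bound in that snapshot.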

Visualize the mces in PyMOL (PyMOL has to be started with the -R flag)

pym = mdl.Pymol.pymol(ana)

Load the protein into PyMOL

pym.setup()

Color protein residues according to their probability to bind to a monomer (binding='single') or to be part of an mce (binding='multi') (between 1 and 5%: yellow; 5 to 10%: orange; 10 to 20%: brown; over 20%: red)

pym.show_bindings(binding = "multi")
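The color classes listed above amount to a simple binning of the probability. The following sketch shows that mapping; the function name is hypothetical, and both the handling of values exactly on a boundary and the treatment of residues below 1% are assumptions.

```python
def binding_color(probability):
    """Map a binding probability (0.0 to 1.0) to the color classes above."""
    if probability > 0.20:
        return "red"
    if probability > 0.10:
        return "brown"
    if probability > 0.05:
        return "orange"
    if probability >= 0.01:
        return "yellow"
    return None  # below 1%: assumed to stay uncolored


for p in (0.005, 0.03, 0.07, 0.15, 0.25):
    print(p, binding_color(p))
```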

Pipeline

The pipeline incorporates pls analysis, mdl analysis and (optionally, but recommended) Epitopsy analysis. It automatically creates all the files it needs from the pls project and analyzes all samples/repetitions/runs of a pls job.

Input:

  • pls project folder
  • (empty) output folder
  • job id

Output:

  • Pose pdb files of all runs
  • .pdb and .pqr file of the protein
  • .pqr file of the polymer
  • Epitopsy esp, epi and occ files
  • 'protein'_job'nr'.yml, containing the average mces for each residue
  • 'protein'_coords_job'nr'.yml, containing all coordinates where an mce took place
  • 'protein'_coords_dict_job'nr'.yml, containing all coordinates where an mce took place and how often such an event took place
  • 'protein'_box_coords_job'nr'.yml, containing all coordinates where an mce took place, translated to Epitopsy grid coordinates
  • a plot in .png format showing the probabilities of all monomers to be part of an mce

To start the pipeline the following commands are used:

from MDlatticeAnalysisTool import pipeline
pipeline.Pipeline_run('path/to/pls/project/', '/path/to/output/folder/', job=0, meshsize = [0.8, 0.8, 0.8])

job: Number of the pls job that you want to analyze

meshsize: Size of the Epitopsy mesh; smaller values give a higher resolution but a longer calculation time
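The cost of a finer mesh follows from the number of grid points, which grows with the inverse cube of the mesh size. The sketch below only illustrates this scaling; it does not call Epitopsy, and the box dimensions and function names are made up.

```python
import math


def grid_shape(box_lengths, meshsize):
    """Number of grid points along each axis for a given box and mesh size."""
    return tuple(
        math.ceil(length / step) for length, step in zip(box_lengths, meshsize)
    )


def grid_points(box_lengths, meshsize):
    """Total number of grid points in the box."""
    nx, ny, nz = grid_shape(box_lengths, meshsize)
    return nx * ny * nz


box = (80.0, 80.0, 80.0)  # hypothetical box edge lengths in Å

coarse = grid_points(box, [4, 4, 4])
fine = grid_points(box, [2, 2, 2])
print(coarse, fine, fine // coarse)
```

Halving the mesh size in all three directions multiplies the number of grid points by eight, which is why memory use and calculation time rise quickly at high resolution.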

The output files can be further analyzed with the scripts rendering.pml and mean_absolute_error.py:

rendering.pml is a PyMOL script to visualize the Epitopsy and mdl results in PyMOL. Adjust the path in it to point to your 'protein'_job'nr'.yml, then start it from the terminal:

pymol rendering.pml

mean_absolute_error.py is a Python script that calculates the mean absolute error (mae) between the occupancies predicted by Epitopsy and the mces of the pls simulation. It creates a YAML file called mean_absolute_error.yml, containing both the raw mae and the mae weighted by the sum of mce probabilities.

To use it, adjust:

line 4 to the path of the folder containing the MDlatticeAnalysisTool
line 10 to your pipeline output folder
line 11 to your pls project folder
and line 12 to the pls job nr

You can then run it in the terminal with the command:

python mean_absolute_error.py
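The two error measures written by mean_absolute_error.py can be sketched as follows. This is not the script itself: the function names and input values are made up, and the exact form of the weighting is an assumption (here the raw mae is divided by the summed mce probability; the script may combine the two quantities differently).

```python
def mean_absolute_error(predicted, observed):
    """Raw mae between predicted occupancies and observed mce probabilities."""
    if len(predicted) != len(observed):
        raise ValueError("both sequences must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def weighted_mae(predicted, observed):
    """mae divided by the summed mce probability (ASSUMED form of the weighting)."""
    total = sum(observed)
    if total == 0:
        return float("inf")  # no mces observed at all
    return mean_absolute_error(predicted, observed) / total


# Made-up values for four positions.
occupancies = [0.5, 0.25, 0.0, 0.25]        # predicted by Epitopsy
mce_probabilities = [0.25, 0.25, 0.25, 0.0]  # observed in the pls run

raw = mean_absolute_error(occupancies, mce_probabilities)
weighted = weighted_mae(occupancies, mce_probabilities)
print(raw, weighted)
```

Under this assumption, a polymer with many mces is penalized less for the same raw error than a polymer that rarely forms mces.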

Genetic algorithm

The genetic algorithm aims to improve both the overlap between mce probabilities and Epitopsy occupancies and the frequency of mces by mimicking natural selection: each monomer is seen as a 'gene', each polymer as an 'individual' and a group of polymers as a 'population'. For each 'generation', the fitness of each individual is calculated from the mean absolute error between the occupancies predicted by Epitopsy and the mces of the pls simulation, weighted by the sum of the mce probabilities. The n individuals with the highest fitness are recombined to create the next generation. In addition, a 'lucky few' individuals chosen at random from the whole population are used for recombination, so that local minima have less impact. The genes can also mutate randomly (a monomer is switched with another monomer) according to a predefined mutation rate. The algorithm alternates between running pls simulations on the cluster and analyzing the data locally. An example use case is shown below:

import MDlatticeAnalysisTool as mdl

mdl.genetic_algorithm.Genetic_algorithm('/local/project/folder/', '/cluster/project/folder/', best_sample = 4, lucky_few = 2, nr_children = 4, nr_generations = 11, mutation_rate = 0.02, meshsize = [4,4,4])

local project folder: The folder containing your initial parent generation. The hierarchy has to be project_folder/generation0/MD/jobs/''parents''. The configuration file is taken from a parent called 'abcd', which needs to be in the jobs folder. The MD folder also has to contain a static folder, which can be copied from the pls project

cluster project folder: The project folder on the cluster; it has to be empty

best_sample: The best n individuals used to create the next generation

lucky_few: n randomly chosen individuals from the whole population that will be used for recombination

nr_children: Number of children created from each parent pair

nr_generations: Number of generations the algorithm will run and analyse

mutation_rate: Probability for a gene to mutate, per gene per generation

meshsize: The size of the Epitopsy mesh; it has to be adjusted depending on the available resources
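How the parameters above interact within one generation can be sketched as follows. This is a simplified illustration, not the code in mdl.genetic_algorithm: the function names, the uniform crossover, the pairing of parents and the toy fitness function are all assumptions, and the pls simulation and analysis steps are left out entirely.

```python
import random


def select_parents(population, fitness, best_sample, lucky_few, rng):
    """Pick the fittest individuals plus a few random ones from the whole population."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:best_sample] + rng.sample(population, lucky_few)
    rng.shuffle(parents)
    return parents


def recombine(parent_a, parent_b, rng):
    """Uniform crossover (assumed): each gene is taken from one of the two parents."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def mutate(individual, monomers, mutation_rate, rng):
    """Switch each gene to a different monomer with probability mutation_rate."""
    return [
        rng.choice([m for m in monomers if m != gene])
        if rng.random() < mutation_rate else gene
        for gene in individual
    ]


def next_generation(population, fitness, monomers, best_sample=4, lucky_few=2,
                    nr_children=4, mutation_rate=0.02, rng=None):
    """Create the children of one generation from consecutive parent pairs."""
    rng = rng or random.Random()
    parents = select_parents(population, fitness, best_sample, lucky_few, rng)
    children = []
    # With an odd number of parents, the last one stays unpaired.
    for parent_a, parent_b in zip(parents[0::2], parents[1::2]):
        for _ in range(nr_children):
            child = recombine(parent_a, parent_b, rng)
            children.append(mutate(child, monomers, mutation_rate, rng))
    return children


# Toy run: polymers of length 10 built from three made-up monomer types.
rng = random.Random(0)
monomers = ["A", "B", "C"]
population = [[rng.choice(monomers) for _ in range(10)] for _ in range(12)]
children = next_generation(
    population, fitness=lambda ind: ind.count("A"), monomers=monomers, rng=rng
)
print(len(children))
```

With best_sample = 4 and lucky_few = 2 there are six parents, hence three pairs, and nr_children = 4 yields twelve children, so the population size stays constant in this toy setup.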

Output:

Each individual will have its own analysis folder, containing the same output as described in the Pipeline section. Furthermore, the local project folder will contain a 'best_hit.yml' file, containing the generation, jobname, fitness and sequence of the individual with the highest fitness. It will also contain a 'history.yml' file with the same information for all individuals, which allows, for example, plotting the mean difference in fitness between generations.
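As a rough illustration of such an evaluation, the sketch below averages the fitness per generation. The record layout is hypothetical: the keys mirror the fields named above, but the actual structure of history.yml may differ, and the values are made up. In practice the records would first be read from the file, for example with yaml.safe_load.

```python
from collections import defaultdict


def mean_fitness_per_generation(history):
    """Average the fitness of all individuals within each generation."""
    by_generation = defaultdict(list)
    for entry in history:
        by_generation[entry["generation"]].append(entry["fitness"])
    return {
        generation: sum(values) / len(values)
        for generation, values in sorted(by_generation.items())
    }


# Hypothetical records with made-up values.
history = [
    {"generation": 0, "jobname": "abcd", "fitness": 0.25, "sequence": "AABB"},
    {"generation": 0, "jobname": "abce", "fitness": 0.75, "sequence": "ABAB"},
    {"generation": 1, "jobname": "abcf", "fitness": 0.5, "sequence": "ABBB"},
    {"generation": 1, "jobname": "abcg", "fitness": 1.0, "sequence": "AAAB"},
]

means = mean_fitness_per_generation(history)
generations = sorted(means)
differences = [means[b] - means[a] for a, b in zip(generations, generations[1:])]
print(means)
print(differences)
```

The resulting dictionary and list can be passed directly to matplotlib to plot the fitness trend across generations.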

mdlatticeanalysistool's People

Contributors

mw55
