Git Product home page Git Product logo

aposteriori's Introduction


Protein Structures Voxelisation for Deep Learning


CI

aposteriori is a library for the voxelization of protein structures for protein design. It uses conventional PDB files to create fixed discretized areas of space called "frames". The atoms belonging to the side-chain of the residues are removed so to allow a Deep Learning classifier to determine the identity of the frames based solely on the protein backbone structure.

Installation

PyPI

Coming soon...

pip install aposteriori

Creating a Dataset

There are two ways to create a dataset using aposteriori: through the Python API in aposteriori.make_frame_dataset or using the command line tool make-frame-dataset that installs along side the module:

make-frame-dataset /path/to/folder

If you want to try out an example, run:

make-frame-dataset tests/testing_files/pdb_files/

Check the make-frame-dataset help page for more details on its usage:

Usage: make-frame-dataset [OPTIONS] STRUCTURE_FILE_FOLDER

  Creates a dataset of voxelized amino acid frames.

  A frame refers to a region of space around an amino acid. For every
  residue in the input structure(s), a cube of space around the region (with
  an edge length equal to `--frame_edge_length`, default 12 Å), will be
  mapped to discrete space, with a defined number of voxels per edge (equal
  to `--voxels-per-side`, default = 21).

  Basic Usage:

  `make-frame-dataset $path_to_folder_with_pdb/`

  eg. `make-frame-dataset tests/testing_files/pdb_files/`

  This command will make a tiny dataset in the current directory
  `test_dataset.hdf5`, containing all residues of the structures in the
  folder.

  Globs can be used to define the structure files to be processed. `make-
  frame-dataset pdb_files/**/*.pdb` would include all `.pdb` files in all
  subdirectories of the `pdb_files` directory.

  You can process gzipped pdb files, but the program assumes that the format
  of the file name is similar to `1mkk.pdb.gz`. If you have more complex
  requirements than this, we recommend using this library directly from
  Python rather than through this CLI.

  The hdf5 object itself is like a Python dict. The structure is simple:
  
    └─[pdb_code] Contains a number of subgroups, one for each chain.
      └─[chain_id] Contains a number of subgroups, one for each residue.
        └─[residue_id] voxels_per_side^3 array of ints, representing element number.
          └─.attrs['label'] Three-letter code for the residue.
          └─.attrs['encoded_residue'] One-hot encoding of the residue.
    └─.attrs['make_frame_dataset_ver']: str - Version used to produce the dataset.
    └─.attrs['frame_dims']: t.Tuple[int, int, int, int] - Dimentsions of the frame.
    └─.attrs['atom_encoder']: t.List[str] - Lables used for the encoding (eg, ["C", "N", "O"]).
    └─.attrs['encode_cb']: bool - Whether a Cb atom was added at the avg position of (-0.741287356, -0.53937931, -1.224287356).
    └─.attrs['atom_filter_fn']: str - Function used to filter the atoms in the frame.
    └─.attrs['residue_encoder']: t.List[str] - Ordered list of residues corresponding to the encoding used.
    └─.attrs['frame_edge_length']: float - Length of the frame in Angstroms (A)
    └─.attrs['voxels_as_gaussian']: bool - Whether the voxels are encoded as a floating point of a gaussian (True) or boolean (False)

  So hdf5['1ctf']['A']['58'] would be an array for the voxelized.

Options:
  -o, --output-folder PATH        Path to folder where output will be written.
                                  Default = `.`

  -n, --name TEXT                 Name used for the dataset file, the `.hdf5`
                                  extension does not need to be included as it
                                  will be appended. Default = `frame_dataset`

  -e, --extension TEXT            Extension of structure files to be included.
                                  Default = `.pdb`.

  --pieces-filter-file PATH       Path to a Pieces format file used to filter
                                  the dataset to specific chains inspecific
                                  files. All other PDB files included in the
                                  input will be ignored.

  --frame-edge-length FLOAT       Edge length of the cube of space around each
                                  residue that will be voxelized. Default =
                                  12.0 Angstroms.

  --voxels-per-side INTEGER       The number of voxels per side of the frame.
                                  This will give a final cube of `voxels-per-
                                  side`^3. Default = 21.

  -p, --processes INTEGER         Number of processes to be used to create the
                                  dataset. Default = 1.

  -z, --is_pdb_gzipped            If True, this flag indicates that the
                                  structure files are gzipped. Default =
                                  False.

  -r, --recursive                 If True, all files in all subfolders will be
                                  processed.

  -v, --verbose                   Sets the verbosity of the output, use `-v`
                                  for low level output or `-vv` for even more
                                  information.

  -cb, --encode_cb BOOLEAN        Encode the Cb at an average position
                                  (-0.741287356, -0.53937931, -1.224287356) in
                                  the aligned frame, even for Glycine
                                  residues. Default = True

  -ae, --atom_encoder [CNO|CNOCB|CNOCBCA]
                                  Encodes atoms in different channels,
                                  depending on atom types. Default is CNO,
                                  other options are ´CNOCB´ and `CNOCBCA` to
                                  encode the Cb or Cb and Ca in different
                                  channels respectively.  [required]

  -d, --download_file PATH        Path to csv file with PDB codes to be
                                  voxelised. The biological assembly will be
                                  used for download. PDB codes will be
                                  downloaded the /pdb/ folder.

  -g, --voxels_as_gaussian BOOLEAN
                                  Boolean - whether to encode voxels as
                                  gaussians (True) or voxels (False). The
                                  gaussian representation uses the
                                  Van der Waals radius of each atom using the
                                  formula e^(-x^2) where x is Vx - x)^2 + (Vy
                                  - y)^2) + (Vz - z)^2)/ r^2 and  (Vx, Vy, Vz)
                                  is the position of the voxel in space. (x,
                                  y, z) is the position of the atom in space,
                                  r is the Van der Waals radius of the atom.
                                  They are then normalized to add up to 1.

  -b, --blacklist_csv PATH        Path to csv file with structures to be
                                  removed.

  -comp, --compression_gzip BOOLEAN
                                  Whether to comrpess the dataset with gzip
                                  compression.

  -vas, --voxelise_all_states BOOLEAN
                                  Whether to voxelise only the first state of
                                  the NMR structure (False) or all of them
                                  (True).

  --help                          Show this message and exit.

Example 1: Create a Dataset Using Biological Units of Proteins

Ideally, if you are trying to solve the Inverse Protein Folding Problem , you should use Biological Units as they are the minimal functional part of a protein. This prevents having solvent-exposed hydrophobic residues as training data.

Download the dataset:

To read more about biological units: https://pdbj.org/help/about-aubu and https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies

Once the dataset is downloaded, you will have a directory with sub-directory containig the gzipped PDB structures (ie. your Protein Data Bank Files).

To voxelize the structures into frames, run:

make-frame-dataset /path/to/biounits/  -e .pdb1.gz 

If everything went well, you should be seeing the number of structures that will be voxelised and a list of default parameters, to which you will press "y " to proceed.

Example 2: Create a Dataset Using Biological Units of Proteins and PISCES

PISCES (Protein Sequence Culling Server) is a curated subset of protein structures. Each file contains a list of structures with parameters such as resolution, percentage identity and R-Values.

Aposteriori supports filtering with a PISCES file as such:

make-frame-dataset /path/to/biounits/  -e .pdb1.gz --pieces-filter-file
 path/to/pisces/cullpdb_pc90_res1.6_R0.25_d190114_chains8082

If everything went well, you should be seeing the number of structures that will be voxelised and a list of default parameters, to which you will press "y " to proceed.

Development

The easiest way to install a development version of aposteriori is using Poetry:

git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/
poetry install

In case you get an error about Python Versions (eg. if your python version is >3.7), we recommend you use PyEnv to install a separate version of Python alonside your main one. Then run:

pyenv install 3.7.0
pyenv local 3.7.0

If you are still having problems switching versions, (eg. you run the previous program but running python -V shows another version of Python.), try running

eval "$(pyenv init -)"

if Poetry doesn't pick up on the correct version of Python from PyEnv, run

poetry env use python3.7

You can then use either poetry shell to activate the development environment or use poetry run to execute single commands in the environment:

poetry run make-frame-dataset --help

Make sure you test your install:

poetry run pytest tests/

aposteriori's People

Contributors

universvm avatar chriswellswood avatar dependabot[bot] avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.