
biovec's Introduction

2017Bio2Vec

Protein classification over sum of protein ngrams vector representation

Biological sequences are ordinarily represented as strings of characters, but the paper linked below suggests that expressing them as vectors makes the information easier to store and analyze. Specific applications include:

  1. family classification
  2. protein visualization
  3. structure prediction
  4. disordered protein identification
  5. protein-protein interaction prediction.

Classification and prediction are the uses that are easiest to understand, but personally I felt that protein visualization would be the most useful. Unless the sequence is short or the structure is already known, there does not seem to be a widely used way of grasping a protein as a whole, so I think this representation has real value. Although the idea may seem strange at first glance, it is already established to some extent in natural language processing.
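The representation named in the title, a protein vector built as the sum of its n-gram vectors, can be sketched as follows. This is a minimal illustration, not this project's code: it assumes the shifted, non-overlapping 3-gram split described in the linked paper, and the function names (`split_ngrams`, `protein_vector`) and the 2-dimensional toy vectors are made up for the example. Real vectors would come from a trained model.

```python
from typing import Dict, List, Sequence


def split_ngrams(seq: str, n: int = 3) -> List[List[str]]:
    """Return n lists of non-overlapping n-grams, one list per reading offset."""
    return [
        [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        for offset in range(n)
    ]


def protein_vector(
    seq: str,
    ngram_vectors: Dict[str, Sequence[float]],
    dim: int,
    n: int = 3,
) -> List[float]:
    """Sum the vectors of every n-gram found in seq; unknown n-grams are skipped."""
    total = [0.0] * dim
    for ngrams in split_ngrams(seq, n):
        for gram in ngrams:
            vec = ngram_vectors.get(gram)
            if vec is None:
                continue  # n-gram not present in the (hypothetical) model
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {"MKT": [1.0, 0.0], "AYI": [0.0, 1.0], "KTA": [1.0, 1.0]}

print(split_ngrams("MKTAYIAK"))
print(protein_vector("MKTAYIAK", toy_vectors, dim=2))
```

Skipping unknown n-grams is one possible design choice; an implementation could just as well map them to a dedicated "unknown" vector.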

See other implementations at https://github.com/kyu999/biovec and https://github.com/peter-volkov/biovec

Paper : http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287

If you don't have the databases, you can download them from the links below.

Uniprot (Swiss-prot)

Disprot

If you are not working on macOS, try this

How to install and use

  1. Install python packages.
  • pip install -r requirements.txt

cf) If you use macOS and run into a problem installing matplotlib, see the following link. https://stackoverflow.com/questions/21784641/installation-issue-with-matplotlib-python

  2. Download the data file.
  3. Move the downloaded file to the project directory.
  4. Extract the downloaded file.
  • If you downloaded the small DB: tar -xzvf small_DB.tar.gz
  • If you downloaded the original DB: tar -xzvf original_DB.tar.gz
  5. Run make_data_uniprot.py
  6. You now have the n-gram corpus, n-gram vectors, protein vectors, and protein families for uniprot_sprot.fasta
  7. To see how proteins are classified into families, run bio_svm/train_svm_biovec.py
  • To see how the SVM with RBF kernels is organized, run the next command.
tensorboard --logdir=./logs
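For readers unfamiliar with the RBF kernel mentioned in the steps above, the kernel value between two vectors can be sketched in a few lines. This is only the textbook formula, K(x, y) = exp(-gamma * ||x - y||^2), and not the code in bio_svm/train_svm_biovec.py; the function name `rbf_kernel` and the `gamma` value are assumptions chosen for the example.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0;
# the value decays towards 0 as the vectors move apart.
identical = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
distant = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
print(identical, distant)
```

A larger `gamma` makes the similarity fall off faster with distance, which in an SVM tends to produce a more tightly fitted decision boundary.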

Description

  • word2vec : Generates a word2vec model from protein databases (gensim).

  • document : Protein databases (uniprot, Pfam, disprot, PDB...).

  • bio_tsne : t-SNE projection (100D to 2D) of 3-gram vectors and protein vectors.

  • trained_models : Trained data made by make_data_uniprot.py

  • bio_svm : Classifies proteins (random PDB and FG-nups).

  • processed_data : Data processing (JSON file to FASTA, select data, merge data).

  • biovisual : Visualizes protein vectors.

  • ngrams_properties : For labeling 3-gram amino acids.
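The word2vec entry above trains on a corpus derived from protein databases. A minimal sketch of how FASTA text could be turned into such a corpus is shown here, assuming one space-separated "sentence" of 3-grams per reading offset. The helper names (`read_fasta`, `corpus_sentences`) and the toy record are hypothetical and are not taken from make_data_uniprot.py.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def corpus_sentences(seq: str, n: int = 3) -> List[str]:
    """Build one space-separated sentence of n-grams for each reading offset."""
    return [
        " ".join(seq[i:i + n] for i in range(offset, len(seq) - n + 1, n))
        for offset in range(n)
    ]


# Made-up record for illustration; a real run would open the downloaded FASTA file.
example = io.StringIO(">toy_protein\nMKTAY\nIAK\n")
records = list(read_fasta(example))
sentences = [s for _, seq in records for s in corpus_sentences(seq)]
print(sentences)
```

Each sentence can then be split on whitespace and passed to a word-embedding trainer as a list of tokens.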

How to see the graphs

1 3-gram protein space

  1. Install python packages.
  • pip install -r requirements.txt
  2. Download document.
  3. Run make_data_uniprot.py
  • python make_data_uniprot.py
  4. Run visualize.py
  • python visualize.py
  5. Choose PS (protein space).
  • just type PS
  6. Finally, you can see the 3-gram protein space.

proteinspave.png

2 binary SVM with FG-nups and random PDBs

  1. Install python packages.
  • pip install -r requirements.txt
  2. Download document.
  • Unzip document, then move dis-disprot.json, disprot.json, dis-fg-nups.fasta, fg-nups.fasta, pdb_seqres.fasta, and disordered-pdb.fasta to processed_data.
  3. Run processed_sequence.py in processed_data.
  • processed_sequence.py generates the binary_svm directory.
  4. Gzip the dataset.fasta file and move binary_svm to document.
  • dataset.fasta is located in 2017Bio2Vec/processed_data/binary_svm
  5. Run make_data_uniprot.py
  6. Run visualize.py
  7. Choose BSVM (binary SVM).
  8. Run binary_svm.py
  9. Finally, you can see the binary SVM graph.

binarysvm.png
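The gzip step in the list above can also be done from Python instead of the command line. The following is a minimal sketch using only the standard library; the helper name `gzip_file` is hypothetical, the output name (`dataset.fasta.gz`) is an assumption about what the later scripts expect, and the demonstration writes a toy file to a temporary directory rather than touching the project data.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy of source next to it and return its path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the bytes, no full read into memory
    return target


# Demonstration on a throwaway file; the FASTA content is made up.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "dataset.fasta"
    fasta.write_text(">toy\nMKTAYIAK\n")
    packed = gzip_file(fasta)
    with gzip.open(packed, "rt") as handle:
        restored = handle.read()

print(packed.name, repr(restored))
```

Unlike the `gzip` command, this sketch leaves the original file in place, which may or may not be what the later steps expect.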

3 density map

  1. Install python packages.
  • pip install -r requirements.txt
  2. Download document.
  • Unzip document, then move dis-disprot.json, disprot.json, dis-fg-nups.fasta, fg-nups.fasta, pdb_seqres.fasta, and disordered-pdb.fasta to processed_data.
  3. Run processed_sequence.py in processed_data.
  • processed_sequence.py generates the binary_svm directory.
  4. Gzip all the data.
  • The data is located in 2017Bio2Vec/processed_data/binary_svm
  5. Run make_data_uniprot.py
  6. Run visualize.py
  7. Choose DM (density map).
  8. Finally, you can see the density map.

densitymap.png

Contributors

changgeonlee, haeun45, jowoojun, soonmok
