SMSNet

Sequence-Mask-Search framework and SMSNet model for de novo peptide sequencing.

The pre-trained model can be downloaded from FigShare.

SMSNet's predicted amino acid sequences for a public HLA peptidome dataset (MassIVE accession MSV000080527) and phosphoproteome dataset (PRIDE accession PXD009227) can be found on FigShare.

The preprint can be found on bioRxiv.

Dependencies

This project uses Python 3.5.2. A list of all required Python packages can be found in requirement.txt.

Instructions

Decode

python run.py --model_dir <model_directory> --inference_input_file <input_file_directory/input_file.mgf> --rescore

<model_directory> is the directory of the model (can be downloaded from the link above).

<input_file_directory/input_file.mgf> is the path to the input file.

Using the --rescore flag generates an additional probability file with the suffix “_rescore” in the same directory as the other output files. The output is written to “<input_file_directory>_output/”.

Other options can be found in "run.py".

Model parameters (including possible amino acids) can be found in "nmt/input_config.py".

Note

  • In order to generate the report file, the TITLE lines in the .mgf file must end with "scan=".
  • To switch between the m-mod and p-mod models, the following changes are needed (default: p-mod):
    • Set tgt_vocab_size (24 for m-mod / 27 for p-mod) and tgt_vocab_file in run.py, lines 61-62.
    • Comment or uncomment the corresponding entries in inverse_vocab in nmt/input_config.py ('s', 't', 'y' at lines 65, 67, 71).
    • Select the corresponding AA_weight_table in the function create_aa_tables() in nmt/input_config.py (by commenting or uncommenting lines 169-174 or 176-180).
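
The TITLE requirement in the first note can be checked before decoding. The sketch below is only a first-pass check under an assumption: it flags TITLE lines that contain no "scan=" field at all, since the exact format the report generator expects is not spelled out here. The function name titles_without_scan is hypothetical and is not part of SMSNet.

```python
def titles_without_scan(lines):
    """Return (line_number, text) for TITLE lines that lack a "scan=" field.

    Assumption: a TITLE line without any "scan=" cannot satisfy the
    requirement above. This does not validate the exact position or
    value of the field.
    """
    missing = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if line.startswith("TITLE=") and "scan=" not in line:
            missing.append((number, line))
    return missing


def check_mgf(path):
    """Run the check on an .mgf file given by path."""
    with open(path) as handle:
        return titles_without_scan(handle)
```

Calling check_mgf on an input file returns an empty list when every TITLE line carries a "scan=" field.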

Outputs

  • For each input file, three output files will be generated in the output directory: <input_file>, <input_file>_prob, and <input_file>_rescore. They contain the output sequences, the probability of each amino acid, and the probability of each amino acid after rescoring, respectively.
  • The report summarizing the outputs in .tsv format will be in the same parent directory as the input.
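
The naming scheme above can be sketched as a small helper that derives the expected output paths from an input path. This is an illustration, not code from the repository: the function name expected_outputs is hypothetical, and it assumes the input file sits inside a named directory and that the ".mgf" extension is dropped from the output names, as in the example in this README.

```python
from pathlib import Path


def expected_outputs(input_file):
    """Map an .mgf input path to the three output paths described above."""
    input_path = Path(input_file)
    # "<input_file_directory>_output/" is the input directory name plus "_output".
    out_dir = Path(str(input_path.parent) + "_output")
    stem = input_path.stem  # file name without the .mgf extension (assumption)
    return {
        "sequence": out_dir / stem,
        "prob": out_dir / (stem + "_prob"),
        "rescore": out_dir / (stem + "_rescore"),
    }
```

The "rescore" entry only exists on disk when decoding was run with --rescore.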

Example

Decoding "test_decode/test_file.mgf" with a model in "model/m_mod/":

python run.py --model_dir model/m_mod/ --inference_input_file test_decode/test_file.mgf --rescore

The report will be in test_decode_output.

The model will also produce three files:

test_decode_output/test_file (sequence)
test_decode_output/test_file_prob (score)
test_decode_output/test_file_rescore (score after rescoring with post-processing model)
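
To decode several files in one go, the command above can be wrapped in a short script. The following is a minimal sketch, assuming run.py is invoked once per .mgf file from the repository root; the helper names build_decode_command and decode_directory are hypothetical, and only the flags shown in this README are used.

```python
import subprocess
from pathlib import Path


def build_decode_command(model_dir, input_file, rescore=True):
    """Build the argument list for one decoding run."""
    command = [
        "python", "run.py",
        "--model_dir", str(model_dir),
        "--inference_input_file", str(input_file),
    ]
    if rescore:
        command.append("--rescore")
    return command


def decode_directory(model_dir, input_dir, rescore=True):
    """Decode every .mgf file in input_dir, one run.py call per file."""
    for mgf_file in sorted(Path(input_dir).glob("*.mgf")):
        # check=True stops the loop if a run exits with an error.
        subprocess.run(build_decode_command(model_dir, mgf_file, rescore), check=True)
```

For the example above, decode_directory("model/m_mod/", "test_decode") would issue the same command for each .mgf file found in test_decode.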

Database search

For database searching, change the file name in utils_masking/SMSNet_final_database_search.py, then run:

python utils_masking/SMSNet_final_database_search.py

Acknowledgement

This code is based on TensorFlow Neural Machine Translation GNMT.
