Git Product home page Git Product logo

buscador's Introduction

buscadoR

The goal of buscadoR is to find plant receptor proteins in information from from hmmer, deeptmhmm and BLAST to find signal peptides, repeats, transmembrane and ectodomains then applies a heuristic to classify into receptor classes.

Prerequisites

buscadoR requires input from hmmer, deeptmhmm and BLAST. A Google Colab notebook is available to help you run these time consuming searches easily. Go to the Notebook to run searches before proceeding. If you wish to run these locally please see the section Running buscadoR searches locally”.

This R package assumes that you have already performed those searches and have the necessary files to hand.

Installation

You can install buscadoR from GitHub with:

# install.packages("devtools")
devtools::install_github("TeamMacLean/buscadoR")

Compiling the results, classifying the proteins

The search results can be combined and the RLK classes predicted with the buscar function. Pass in all the input files, including the original FASTA file.

Note that you can pass either gzipped (.gz) files or plain text files.

library(buscadoR)

fasta_file <- "arabidopsis_proteome.fa"
hmmer_file <- "hmmer_results.txt.gz"
deeptmhmm_file <- "deeptmhmm_results.txt.gz"
blast_file <- "blast_results.txt.gz"

b <- buscar(hmmer = hmmer_file, deeptmhmm = deeptmhmm_file, blast = blast_file, fasta = fasta_file)

Examining the results

A set of helper functions to help you extract data into a useable result is provided.

A summary table of each type found can be generated

mesa(b)
b_type count
lrr_rk 196
lrr_rp 66
lrr_rp_rk_with_ecto 50
non_lrr_rk 201
non_lrr_rp 64

BuscadoR RLK Finding results

A dataframe of one row per receptor protein found (ideal for exporting) can be created

as.data.frame(b)
#> # A tibble: 577 × 10
#>    seq_name    sp_cut_site tm_start tm_end pfams_hit  pfams_acc pfams_loc b_type
#>    <chr>             <int>    <int>  <int> <chr>      <chr>     <chr>     <chr> 
#>  1 AT1G17240.1          44      708    728 LRR_8;LRR… PF13855.… 94-150;1… lrr_rp
#>  2 AT1G17250.1          50      726    738 LRR_8;LRR… PF13855.… 121-157;… lrr_rp
#>  3 AT1G25570.1          23      554    574 LRR_4;LRR… PF12799.… 443-460;… lrr_rp
#>  4 AT1G28340.1          22      550    570 LRR_8;LRR… PF13855.… 426-458;… lrr_rp
#>  5 AT1G34290.1          26      170    190 LRR_4;LRR… PF12799.… 94-130;7… lrr_rp
#>  6 AT1G45616.1          29      951    971 LRR_8;LRR… PF13855.… 117-157;… lrr_rp
#>  7 AT1G58190.2          27      985   1005 LRR_8;LRR… PF13855.… 123-162;… lrr_rp
#>  8 AT1G65380.1          25      687    707 LRR_8;LRR… PF13855.… 84-133;1… lrr_rp
#>  9 AT1G69990.1          18      214    234 LRR_8;LRR… PF13855.… 68-101;1… lrr_rp
#> 10 AT1G71390.1          25      740    760 LRR_8;LRR… PF13855.… 79-114;1… lrr_rp
#> # … with 567 more rows, and 2 more variables: ectos_hit <chr>,
#> #   ectos_coord <chr>

A tidy format dataframe of each type of receptor protein can be extracted

lrr_rp(b)
#> # A tibble: 3,244 × 14
#>    seq_name cut_site tm_start tm_end hit   acc     eval seq_from seq_to hit_from
#>    <chr>       <int>    <int>  <int> <chr> <chr>  <dbl>    <dbl>  <dbl>    <int>
#>  1 AT1G172…       44      708    728 LRR_8 PF13… 2  e-7       94    150        5
#>  2 AT1G172…       44      708    728 LRR_8 PF13… 1.5e-6      115    181        2
#>  3 AT1G172…       44      708    728 LRR_8 PF13… 2.3e-2      171    233        3
#>  4 AT1G172…       44      708    728 LRR_8 PF13… 4.6e-6      246    305        2
#>  5 AT1G172…       44      708    728 LRR_8 PF13… 9.6e-5      315    353       23
#>  6 AT1G172…       44      708    728 LRR_8 PF13… 2.1e-6      362    402       21
#>  7 AT1G172…       44      708    728 LRR_8 PF13… 1.9e-1      406    450       16
#>  8 AT1G172…       44      708    728 LRR_8 PF13… 1  e-7      470    529        2
#>  9 AT1G172…       44      708    728 LRR_8 PF13… 7.5e-7      581    636        6
#> 10 AT1G172…       44      708    728 LRR_8 PF13… 2.9e-5      602    660        3
#> # … with 3,234 more rows, and 4 more variables: hit_to <int>, base_acc <chr>,
#> #   b_type <chr>, pfam_length <dbl>

also lrr_rk(), non_lrr_rp(), non_lrr_rk(), lrr_rp_rk_with_ecto()

Raw search results from the databases can be extracted

hmmer_results(b)
#> # A tibble: 22,890 × 11
#>    seq_name   hit   acc     eval seq_from seq_to hit_from hit_to base_acc b_type
#>    <chr>      <chr> <chr>  <dbl>    <dbl>  <dbl>    <int>  <int> <chr>    <chr> 
#>  1 AT1G17600… LRR_4 PF12… 1.5e-1      607    646        5     42 PF12799  LRR_P…
#>  2 AT1G17600… LRR_4 PF12… 8.7e-2      626    662        1     35 PF12799  LRR_P…
#>  3 AT1G17600… LRR_4 PF12… 4  e+0      649    679        1     26 PF12799  LRR_P…
#>  4 AT1G17600… LRR_4 PF12… 2.4e-4      798    835        3     42 PF12799  LRR_P…
#>  5 AT1G17600… LRR_4 PF12… 7.2e-4      843    883        2     43 PF12799  LRR_P…
#>  6 AT1G17600… LRR_4 PF12… 9.9e-3      893    931        3     42 PF12799  LRR_P…
#>  7 AT1G17600… LRR_4 PF12… 2.4e-3      915    958        1     42 PF12799  LRR_P…
#>  8 AT1G17600… LRR_3 PF07… 2.5e-8      604    622        1     19 PF07725  LRR_P…
#>  9 AT1G17600… LRR_8 PF13… 5.6e-3      605    659        3     59 PF13855  LRR_P…
#> 10 AT1G17600… LRR_8 PF13… 3.5e-6      797    852        2     59 PF13855  LRR_P…
#> # … with 22,880 more rows, and 1 more variable: pfam_length <dbl>

also deeptmhmm_results(), ecto_results()

Each set of putative proteins can be rendered as a plot

dibujar(b, which = "lrr_rp")

also use “lrr_rk”, “non_lrr_rp”, “non_lrr_rk”, “lrr_rp_rk_with_ecto”

Exporting results

A dataframe of one row per receptor protein can be created and written out in the usual way

res <- as.data.frame(b)
readr::write_csv(res, "my_results.csv")

Annotated FASTA sequences can be exported

write_seqs(b, "my_seqs.fa")

Other Stuff

A dataframe compatible with the drawProteins package here on bioconductor can be created for further plotting work. Note the plot is a ggplot2 object and can be styled using that package too.

dp <- as.drawProteins(b)

The definitions used for the classification can be returned

definiciones(which="lrr_rp")
#> lrr_rp =
#>      * A signal peptide according to Phobius.
#>      * Exactly one Transmembrane Domain according to Phobius.
#>      * At least one of the `lrr_pfams` according to PFAMscan.
#>      * The Transmembrane domain should be closer to the C terminal than the end of the pfam hit.

also use “lrr_rk”, “non_lrr_rp”, “non_lrr_rk”, “lrr_rp_rk_with_ecto”, “all”

Running buscadoR searches locally

If you wish to run the searches locally the output formats for each program need to be set carefully to work with buscar().

hmmer

The PFAM domain hmms used for the hmmer search are provided here as a file buscador_pfam_hmm. Alternatively you can parse the latest versions from the PFAM-A database. The required IDs can be listed from buscadoR as they are held in the data objects lrr_pfam, non_lrr_pfam and kinase_pfams.

When running hmmscan use --domtblout as the output option.

deeptmhmm

There are no options for deeptmhmm. When running this locally, a single output file is created for each sequence. Just concatenate these into one file for input into buscar()

blastp

The ecto-domains for use as query sequence in the blastp are available here as a fasta buscador_ecto_fasta

Use blastp with these as the query and the proteins of interest as the subject, select outfmt 6. The resulting text file can be used as input to buscar()

buscador's People

Contributors

danmaclean avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.