drostlab / metablastr

Seamless Integration of BLAST Sequence Searches in R

Home Page: https://drostlab.github.io/metablastr/

License: GNU General Public License v2.0

R 100.00%

blast-searches blast blastn blasting sequence-diversity biological-sequences biomartr species nucleotide blast-hits

metablastr's Issues

blast_nt for nucleotide against NCBI database

Hello,
thank you for developing this set of tools.
I am trying to run my nucleotide sequences against the NCBI database.
From your vignette, it should be blast_n()?
But the function is not available in the version of metablastr I just installed with

devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)

Do you have an alternative to this function for nucleotides?
https://drostlab.github.io/metablastr/reference/blast_protein_to_nr_database.html

Does the database need to be stored locally? Any advice?

thank you!

Feature Request: extract the same regions from multiple samples

It would be helpful if your command extract_random_seqs_from_genome() could be modified to include multiple samples, so that the same random sequences are extracted from multiple genomes (potentially based on scaffold/chromosome ID and position).
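
A minimal base-R sketch of the idea in this request, not the package's implementation: draw the random coordinates once, then apply the same scaffold IDs and positions to every genome. The toy genomes, `draw_intervals()` and `extract_intervals()` are hypothetical names introduced for illustration, and the sketch assumes all samples share one coordinate system (for example, consensus sequences built against the same reference).

```r
# Hypothetical toy data: two "genomes", each a named character vector
# (names are scaffold IDs). Real genomes would be read from FASTA files.
genomes <- list(
  sampleA = c(chr1 = "ACGTACGTACGTACGTACGT", chr2 = "TTTTGGGGCCCCAAAA"),
  sampleB = c(chr1 = "ACGTTCGTACGAACGTACGT", chr2 = "TTTTGGGCCCCCAAAA")
)

# Draw n random intervals of a fixed width once, using one set of
# scaffold lengths as the shared coordinate system.
draw_intervals <- function(scaffold_lengths, n, interval_width) {
  usable <- scaffold_lengths[scaffold_lengths >= interval_width]
  chr <- sample(names(usable), n, replace = TRUE)
  start <- vapply(chr, function(id) {
    sample.int(usable[[id]] - interval_width + 1L, 1L)
  }, integer(1), USE.NAMES = FALSE)
  data.frame(chr = chr, start = start, end = start + interval_width - 1L,
             stringsAsFactors = FALSE)
}

# Cut the same intervals out of one genome.
extract_intervals <- function(genome, intervals) {
  substring(genome[intervals$chr], intervals$start, intervals$end)
}

set.seed(1)
intervals <- draw_intervals(nchar(genomes$sampleA), n = 3, interval_width = 5)
same_regions <- lapply(genomes, extract_intervals, intervals = intervals)
print(intervals)
print(same_regions)
```

Keeping the `intervals` table separate from the extraction step means the same loci can be reused later for additional samples.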

Feature request : add taxon id for each blast hit

The default BLAST tabular output format (outfmt 7) doesn't include a taxon ID for each BLAST hit. Taxon IDs are very important for downstream phylogenetic analysis. An indirect approach is to run blastdbcmd with the %T option once the results are obtained. This is very time-consuming, because you have to fetch the taxon IDs first and then map them back to the original BLAST results. Could metablastr provide a function that maps taxon IDs to the BLAST output?
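
The "map back" step described above can be sketched in base R as a left join between the hit table and an accession-to-taxon lookup. This is only an illustration under stated assumptions: both data frames are made-up toy data, and the column names `subject_id`, `bit_score` and `taxon_id` are assumed rather than taken from the package.

```r
# Hypothetical BLAST hits (column names are assumptions).
hits <- data.frame(
  query_id   = c("q1", "q1", "q2"),
  subject_id = c("accA", "accB", "accA"),
  bit_score  = c(200, 150, 90),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table, e.g. parsed from blastdbcmd output that was
# written with an accession (%a) and a taxon ID (%T) per line.
taxon_lookup <- data.frame(
  subject_id = c("accA", "accB"),
  taxon_id   = c(9606L, 10090L),
  stringsAsFactors = FALSE
)

# Left join: every hit is kept, and hits without a lookup entry get NA.
hits_with_taxa <- merge(hits, taxon_lookup, by = "subject_id",
                        all.x = TRUE, sort = FALSE)
print(hits_with_taxa)
```

BLAST+ itself also accepts `staxids` as an extra column specifier in its tabular output formats when the database carries taxonomy information; whether metablastr exposes that option is not stated in this thread.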

additional functionality when extracting random sequences

Hi,

I just stumbled across your package today while looking for a way to extract the same set of 10,000 1000kb loci randomly from five different genomes. Thank you so much for writing this package, I think it's going to be a huge help in this process. If you're still open to hearing requests for additional functionality/flexibility, I have two questions for you. For your function 'extract_random_seqs_from_genome', is it possible to set it so that replacement = FALSE once each random locus is selected and extracted (or is this already true)? Would it also be possible to set a minimum distance between randomly selected loci, e.g. if I wanted to specify that all loci are at least 50bp apart?

Thank you,
Amy
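
One way to picture the two requests (no locus drawn twice, and a minimum spacing between loci) is simple rejection sampling on a single scaffold. The function `sample_spaced_starts()` and its arguments are hypothetical; this is a sketch of the idea, not how `extract_random_seqs_from_genome()` behaves.

```r
# Draw n start positions for intervals of a fixed width so that any two
# intervals are separated by at least min_gap bases. Because the gap is
# at least zero, no interval overlaps or repeats another.
sample_spaced_starts <- function(seq_length, n, interval_width, min_gap,
                                 max_tries = 10000L) {
  starts <- integer(0)
  tries <- 0L
  last_start <- seq_length - interval_width + 1L
  while (length(starts) < n && tries < max_tries) {
    tries <- tries + 1L
    candidate <- sample.int(last_start, 1L)
    # Bases between two same-width intervals = |start difference| - width,
    # so the start positions must differ by at least width + min_gap.
    if (all(abs(candidate - starts) >= interval_width + min_gap)) {
      starts <- c(starts, candidate)
    }
  }
  if (length(starts) < n) {
    warning("Only ", length(starts), " of ", n, " loci could be placed.")
  }
  sort(starts)
}

set.seed(42)
starts <- sample_spaced_starts(seq_length = 100000, n = 10,
                               interval_width = 1000, min_gap = 50)
print(starts)
```

Rejection sampling is easy to reason about but slows down when the requested loci nearly fill the scaffold, which is why the sketch caps the number of attempts.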

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

Hi I finally downloaded Blast+ and created the pathway to R. Now I am having issues loading my fasta file. Thanks!

I tried to load it from my hard drive and from my desktop and get the same errors:

sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta",
+ package = "rBLAST"))
Error: unexpected ')' in:
" sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", package = "rBLAST"))"

sealionfeces <- readDNAStringSet(system.file("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta",
+ package = "rBLAST"))
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''

sealionfeces <- readDNAStringSet(system.file("C:\Users\katie\OneDrive\Desktop\R\sealionfecespooled.consensus.fasta",
Error: '\U' used without hex digits in character string starting ""C:\U"

sealionfeces <- readDNAStringSet(system.file("C:/Users/katie/OneDrive/Desktop/R/sealionfecespooled.consensus.fasta",
+ package = "rBLAST"))
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
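
The three error messages above point to three separate causes, which the sketch below illustrates in base R. `system.file()` only resolves paths inside an installed package and returns an empty string for anything else, which matches `cannot open file ''`. Backslashes in R strings must be doubled (or replaced by forward slashes). The first attempt simply has one closing parenthesis too many. The package `stats` is used here only because it is always installed, and the final read assumes Biostrings is available.

```r
# system.file() searches inside an installed package's directory, so an
# absolute path on disk is not found and an empty string comes back.
fasta_path <- "C:/Users/katie/OneDrive/Desktop/R/sealionfecespooled.consensus.fasta"
resolved <- system.file(fasta_path, package = "stats")
print(resolved)

# Backslashes must be doubled inside R strings; forward slashes also
# work on Windows and avoid the '\U' escape error.
escaped <- "C:\\Users\\katie\\OneDrive\\Desktop\\R\\sealionfecespooled.consensus.fasta"
normalised <- gsub("\\", "/", escaped, fixed = TRUE)
print(identical(normalised, fasta_path))

# Check the path first, then pass it directly to the reader
# (no system.file() and no 'package' argument).
if (file.exists(fasta_path)) {
  # Assumes the Biostrings package is installed.
  sealionfeces <- Biostrings::readDNAStringSet(fasta_path)
} else {
  message("File not found: ", fasta_path)
}
```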

Feature Request: exclude sequences with Ns

In extract_random_seqs_from_genome(), it would be helpful to have an option that lets users exclude sequences with too many Ns (e.g. N > 0 or N > 10%). For me, it would be fine for this filtering step to happen after X sequences are drawn (e.g. if 100 sequences are drawn and 10 are excluded because they have too many Ns, 90 sequences remain). It would be great to have a short printout at the end that says how many sequences were drawn and how many were filtered out because of Ns.
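
The post-hoc filter described in this request could look roughly like the base-R sketch below. The helper names `count_n_fraction()` and `filter_by_n()` and the argument `max_n_fraction` are hypothetical, and the sequences are toy data; the sketch works on plain character strings rather than on the package's own output.

```r
# Fraction of N (or n) characters in each sequence.
count_n_fraction <- function(seqs) {
  n_count <- lengths(regmatches(seqs, gregexpr("[Nn]", seqs)))
  n_count / nchar(seqs)
}

# Keep sequences whose N fraction does not exceed the threshold and
# report how many were drawn, removed and kept.
filter_by_n <- function(seqs, max_n_fraction = 0) {
  keep <- count_n_fraction(seqs) <= max_n_fraction
  message(sprintf(
    "%d sequences drawn, %d removed for exceeding the N threshold, %d kept.",
    length(seqs), sum(!keep), sum(keep)
  ))
  seqs[keep]
}

drawn <- c("ACGTACGTAC", "ACNNACGTAC", "NNNNNNNNNN", "ACGTACGTAN")
kept <- filter_by_n(drawn, max_n_fraction = 0.10)
print(kept)
```

Setting `max_n_fraction = 0` corresponds to the strict "N > 0" case from the request.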

How can I load sequences from a CSV file and do the massive Blast search locally?

Hi,

I have a CSV dataset of DNA sequences processed with the DADA2 pipeline and wonder how I can load and BLAST these sequences automatically using the metablastr package.

In the CSV file, each row represents a unique sequence and each column has a sample name (see attached image):

These COI gene sequences are clean and ready to BLAST directly on the NCBI website. Most of the sequences are from mammalian and avian blood. Since there are over 2000 sequences, it would be great if I could use this package to load and BLAST them automatically instead of manually.

Any R scripts to achieve this goal with the metablastr package would be highly appreciated. Thank you.

Best,

Gabriel
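
Since the BLAST functions shown elsewhere in these issues take FASTA file paths as `query`, a first step would be to convert the CSV into a FASTA file. The base-R sketch below assumes the sequences sit in the first column and the remaining columns hold per-sample counts; the attached image is not available, so this layout, the file contents and the `seq_0001`-style IDs are all assumptions.

```r
# Hypothetical stand-in for the DADA2 table: first column holds the
# sequences, the remaining columns hold per-sample counts.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sequence,sample1,sample2",
             "ACGTACGT,10,0",
             "TTGCAAGC,3,7"), csv_path)

asv_table <- read.csv(csv_path, stringsAsFactors = FALSE)
sequences <- asv_table[[1]]

# Give every sequence a short, unique FASTA header.
ids <- sprintf("seq_%04d", seq_along(sequences))

# Interleave ">id" and sequence lines, then write them out as FASTA.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(as.vector(rbind(paste0(">", ids), sequences)), fasta_path)
print(readLines(fasta_path))
```

The resulting file could then be passed as the `query` of a nucleotide search; which metablastr function fits best is not confirmed in this thread.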

Package release request

Hi,
would it be possible to provide a package release? It is required for creating the conda package.

Regards

blast_best_reciprocal_hit nucleotide-protein comparison task error

I ran the following code:

blast_test_reciprocal <- blast_best_reciprocal_hit(
    query   = 'A.fasta', ##protein sequence
    subject = 'B.fasta', ##nucleotide sequence
    search_type = "protein_to_nucleotide",
    task = "tblastn",
    evalue = 0.000001,
    output.path = tempdir(),
    db.import  = FALSE)

which gives the following result:


Starting 'tblastn -task tblastn' with  query: A.fasta and subject: B.fasta using 1 core(s) ...

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: C:/Users/A_B_tblastn_eval_1e-06.blast_tbl
Error: Please choose a nucleotide-protein comparison task that is supported by BLAST: task = 'blastx' or task = 'blastx-fast'.

How to specify the second blast task ('blastx') when performing tblastn?

function blast_protein_to_protein with argument is.subject.db = TRUE doesn't work with nr database

I have the nr BLAST database downloaded from NCBI, which contains the files shown in the attached snapshot. When I run the command below, it throws the error shown. What input should I give as a BLAST-able database?

blast_test <- blast_protein_to_protein(
        query   = "aa_query.fasta",
        subject = "path/to/nr/db/nr",
        is.subject.db = TRUE,
        output.path = tempdir(),
        db.import  = FALSE,
        cores = 4)

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : 
  cannot open file '/Users/chiragparsania/Documents/Database/nr_protein_db/nr'

Error: Internal error in `dict_hash_with()`: Dictionary is full.

Thank you, metablastr developers, for sharing this tool with the community. I'd like to ask for your help with an error I encountered after a blast_best_reciprocal_hit() run. Both BLASTp searches seem to have completed, but the reciprocal best hit step appears to have failed. One of the databases I'm using has around 25M records, and I'm wondering whether this could be why the reciprocal best hit step failed. For reference, here are the relevant snippets of the error:

BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: /expt/datb/data/HiC/Rp_RNA-Seq/embryo/timeseries-vs-RpedSuzhou/rna-seq_rpedszv-reannot/annot_gene_sym/metablastr_bbh/metazoa_refseq_biopython-validated_Riptortus_pedestris_SZV_blastp-fast_eval_1e-05.blast_tbl
Error: Internal error in dict_hash_with(): Dictionary is full.

rlang::last_error()
<error/rlang_error>
Internal error in dict_hash_with(): Dictionary is full.
Backtrace:
metablastr::blast_best_reciprocal_hit(...)
metablastr::blast_best_hit(...)
dplyr:::group_by.data.frame(blast_res, query_id)
dplyr::grouped_df(groups$data, groups$group_names, .drop)
dplyr:::compute_groups(data, vars, drop = drop)
dplyr:::vec_split_id_order(group_vars)
vctrs::vec_group_loc(x)
Run rlang::last_trace() to see the full context.

Is it also possible to skip the BLAST step to directly proceed with the reciprocal best hit step when re-running this procedure?

Thank you very much!

Reg: Reciprocal hits from already run blast outcomes

Hi!

I ran the following searches to obtain BLAST tabular output (format 6):

A vs B with Diamond
B vs A with Diamond

I read AvB.hits and BvA.hits using read_blast() function. Will it be possible to identify the reciprocal hits? Or should I rerun the searches with metablastr?
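
For two hit tables that are already loaded, reciprocal best hits can in principle be computed without rerunning any search. The base-R sketch below shows the logic on toy data: take the best hit per query in each direction, then keep the pairs that point at each other. The column name `query_id` appears in the backtrace of the previous issue; `subject_id` and `bit_score` are assumed names, the helper functions are hypothetical, and real Diamond or read_blast() output may need its columns renamed first.

```r
# Best-scoring hit for every query (ties resolved by row order).
best_hit_per_query <- function(hits) {
  ordered <- hits[order(hits$query_id, -hits$bit_score), ]
  ordered[!duplicated(ordered$query_id),
          c("query_id", "subject_id", "bit_score")]
}

# A pair (a, b) is reciprocal if b is the best hit of a in A vs B
# and a is the best hit of b in B vs A.
reciprocal_best_hits <- function(a_vs_b, b_vs_a) {
  best_ab <- best_hit_per_query(a_vs_b)
  best_ba <- best_hit_per_query(b_vs_a)
  merge(best_ab, best_ba,
        by.x = c("query_id", "subject_id"),
        by.y = c("subject_id", "query_id"),
        suffixes = c("_ab", "_ba"))
}

# Hypothetical toy hit tables standing in for AvB.hits and BvA.hits.
a_vs_b <- data.frame(query_id   = c("a1", "a1", "a2"),
                     subject_id = c("b1", "b2", "b2"),
                     bit_score  = c(300, 100, 250),
                     stringsAsFactors = FALSE)
b_vs_a <- data.frame(query_id   = c("b1", "b2", "b2"),
                     subject_id = c("a1", "a1", "a2"),
                     bit_score  = c(290, 120, 80),
                     stringsAsFactors = FALSE)

rbh <- reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In the toy data only the pair a1/b1 is reciprocal, because the best hit of b2 is a1 rather than a2.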