drostlab / metablastr
Seamless Integration of BLAST Sequence Searches in R
Home Page: https://drostlab.github.io/metablastr/
License: GNU General Public License v2.0
Hello
Thank you for developing this set of tools.
I am trying to run my nucleotide sequences against the NCBI database.
According to your vignette, the function for this should be blast_n(),
but that function is not available in the version of metablastr I just installed with
devtools::install_github("drostlab/metablastr", build_vignettes = TRUE, dependencies = TRUE)
Do you have an alternative to this function for nucleotides?
https://drostlab.github.io/metablastr/reference/blast_protein_to_nr_database.html
Does the database need to be stored locally? Any advice?
thank you!
It would be helpful if your command extract_random_seqs_from_genome() could be modified to include multiple samples, so that the same random sequences are extracted from multiple genomes (potentially based on scaffold/chromosome ID and position).
The default BLAST tabular output (outfmt 7) doesn't include a taxon ID for each BLAST hit. The taxon ID is very important for downstream phylogenetic analysis. An indirect way to add it is to run blastdbcmd with the %T option once the results are obtained. This is very time consuming, because you have to fetch the taxon IDs first and then map them back to the original BLAST results. Could metablastr provide a function that maps taxon IDs to the BLAST output?
Hi,
I just stumbled across your package today while looking for a way to extract the same set of 10,000 1000kb loci randomly from five different genomes. Thank you so much for writing this package, I think it's going to be a huge help in this process. If you're still open to hearing requests for additional functionality/flexibility, I have two questions for you. For your function 'extract_random_seqs_from_genome', is it possible to set it so that replacement = FALSE once each random locus is selected and extracted (or is this already true)? Would it also be possible to set a minimum distance between randomly selected loci, e.g. if I wanted to specify that all loci are at least 50bp apart?
Thank you,
Amy
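The two requests above (no locus drawn twice, and a minimum spacing between loci) can be sketched in base R. This is only an illustration of one possible sampling scheme, not how extract_random_seqs_from_genome() works internally; sample_loci() and its arguments are hypothetical names, and the sketch handles a single chromosome or scaffold of known length.

```r
# Hypothetical helper: draw n non-overlapping loci of fixed width on one
# chromosome, with at least min_gap bases between consecutive loci.
sample_loci <- function(chr_length, n, width, min_gap = 0) {
  step <- width + min_gap - 1
  # Largest value the "compressed" start coordinate may take.
  max_y <- chr_length - width + 1 - (n - 1) * step
  if (max_y < n) stop("Not enough room for the requested loci.")
  # Distinct sorted draws guarantee that no locus is selected twice.
  y <- sort(sample.int(max_y, n))
  # Stretch the draws back out so neighbours are at least min_gap apart.
  start <- y + (seq_len(n) - 1) * step
  data.frame(start = start, end = start + width - 1)
}

set.seed(1)
loci <- sample_loci(chr_length = 10000, n = 20, width = 100, min_gap = 50)
```

The resulting start/end table could be reused to cut the same intervals from several genomes, assuming those genomes share scaffold IDs and a common coordinate system.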
Hi, I finally downloaded BLAST+ and set up its path in R. Now I am having issues loading my FASTA file. Thanks!
I tried to load it from my hard drive and from my desktop and get the same errors:
sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST"))
Error: unexpected ')' in: " sealionfeces <- readDNAStringSet("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", package = "rBLAST"))"
sealionfeces <- readDNAStringSet(system.file("H:/ONRdolphinsealionpooled/sealionfecespooled/canu/medaka/consensus.fasta", + package = "rBLAST"))
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
sealionfeces <- readDNAStringSet(system.file("C:\Users\katie\OneDrive\Desktop\R\sealionfecespooled.consensus.fasta",
Error: '\U' used without hex digits in character string starting ""C:\U"
sealionfeces <- readDNAStringSet(system.file("C:/Users/katie/OneDrive/Desktop/R/sealionfecespooled.consensus.fasta", + package = "rBLAST"))
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : cannot open file ''
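The errors above are consistent with two base R behaviours: system.file() only looks for files inside an installed package and returns an empty string when nothing matches, and a single backslash inside an R string starts an escape sequence. A minimal sketch of both points follows; it assumes the Biostrings package is installed for the final read, and it uses a temporary file and a made-up Windows path in place of the real ones.

```r
# A temporary FASTA file stands in for the real consensus.fasta (toy data).
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">contig_1", "ACGTACGTNNACGT"), fasta_path)

# system.file() searches inside installed packages only, so an ordinary
# file path is not found and the result is "" (hence: cannot open file '').
pkg_lookup <- system.file(fasta_path, package = "stats")

# Backslashes must be doubled in R strings, or replaced by forward slashes.
win_path <- "C:\\Users\\katie\\Desktop\\example.fasta"  # hypothetical path
fixed <- gsub("\\\\", "/", win_path)

# Pass the plain path to readDNAStringSet(), without system.file() and
# without a package argument.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  seqs <- Biostrings::readDNAStringSet(fasta_path)
  print(seqs)
}
```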
In extract_random_seqs_from_genome(), it would be helpful to have an option that allows users to decide whether to exclude sequences with too many Ns (e.g. N > 0 or N > 10%). For me, it would be fine for this filtering step to happen after X sequences are drawn (e.g. if 100 sequences are drawn, then 10 are excluded because they have too many Ns, resulting in 90 sequences). It would be great to have a short printout at the end that says how many sequences were drawn and how many were filtered out due to an issue with Ns.
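Until such an option exists, the post-hoc filtering described above can be approximated in base R on the extracted sequences. The sketch below is an assumption-laden illustration: filter_n_rich() and max_n_frac are hypothetical names, and the sequences are taken to be a plain (optionally named) character vector.

```r
# Hypothetical helper: keep sequences whose fraction of N bases is at most
# max_n_frac, and report how many were drawn and how many were removed.
filter_n_rich <- function(seqs, max_n_frac = 0.1) {
  chars <- strsplit(toupper(seqs), "", fixed = TRUE)
  n_frac <- vapply(chars, function(x) mean(x == "N"), numeric(1))
  keep <- n_frac <= max_n_frac
  message(length(seqs), " sequences drawn, ", sum(!keep),
          " removed for exceeding the N threshold, ", sum(keep), " kept.")
  seqs[keep]
}

# Toy input: sequence b is 40% N, sequence c is 10% N.
drawn <- c(a = "ACGTACGTAC", b = "ACNNNNGTAC", c = "ACGTNCGTAC")
kept <- filter_n_rich(drawn, max_n_frac = 0.2)
```

Setting max_n_frac = 0 would correspond to the stricter "N > 0" rule mentioned in the request.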
Hi,
I have a CSV dataset of DNA sequences processed with the DADA2 pipeline and wonder how I can load and BLAST these sequences automatically using the metablastr package.
In the CSV file, each row represents a unique sequence and each column has a sample name (see attached image):
These COI gene sequences are clean and ready to BLAST directly on the NCBI website. Most of the sequences are from mammalian and avian blood. Since there are over 2000 sequences, it would be great if I could use this package to load and BLAST them automatically instead of manually.
Any R scripts to achieve this goal with the metablastr package would be highly appreciated. Thank you.
Best,
Gabriel
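BLAST searches take FASTA input, so a first step for the workflow described above is converting the CSV into a FASTA file. The sketch below uses base R only; it assumes the sequence strings sit in a column named sequence (a hypothetical name, since the attached image is not available here) and it builds a small toy CSV in place of the real file.

```r
# Toy stand-in for the DADA2 CSV (hypothetical column names and counts).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(sequence = c("ACGTAC", "TTGACA"),
                     sample_1 = c(12, 0),
                     sample_2 = c(3, 7)),
          csv_path, row.names = FALSE)

tab <- read.csv(csv_path, stringsAsFactors = FALSE)

# Give every unique sequence a short, stable identifier.
ids <- sprintf("seq_%04d", seq_len(nrow(tab)))

# Interleave header lines and sequence lines, then write the FASTA file.
fasta_lines <- as.vector(rbind(paste0(">", ids), tab$sequence))
fasta_path <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, fasta_path)
```

The resulting file should then be usable as the query of a nucleotide search, for example with blast_nucleotide_to_nucleotide() if your installed metablastr version provides it; the ids vector lets you map hits back to rows of the CSV.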
Hi,
Would it be possible to provide a package release? It is required for creating the conda package.
Regards
I ran the following code:
blast_test_reciprocal <- blast_best_reciprocal_hit(
query = 'A.fasta', ##protein sequence
subject = 'B.fasta', ##nucleotide sequence
search_type = "protein_to_nucleotide",
task = "tblastn",
evalue = 0.000001,
output.path = tempdir(),
db.import = FALSE)
which gives the following result:
Starting 'tblastn -task tblastn' with query: A.fasta and subject: B.fasta using 1 core(s) ...
BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: C:/Users/A_B_tblastn_eval_1e-06.blast_tbl
Error: Please choose a nucleotide-protein comparison task that is supported by BLAST: task = 'blastx' or task = 'blastx-fast'.
How can I specify the second BLAST task ('blastx') when performing tblastn?
I have the nr BLAST database downloaded from NCBI, which contains the files shown in the attached snapshot. When I run the command below, it throws the error shown underneath. What input should I give as a BLAST-able database?
blast_test <- blast_protein_to_protein(
query = "aa_query.fasta",
subject = "path/to/nr/db/nr",
is.subject.db = TRUE,
output.path = tempdir(),
db.import = FALSE ,cores = 4)
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") :
cannot open file '/Users/chiragparsania/Documents/Database/nr_protein_db/nr'
Thank you, metablastr developers, for sharing this tool with the community. I'd like to ask for your help with an error I encountered after a blast_best_reciprocal_hit() run. Both BLASTp searches seem to have completed, but the reciprocal best hit step appears to have failed. One of the databases I'm using has around 25M records, and I'm wondering whether this could be the reason why the reciprocal best hit step failed. For reference, here are the relevant snippets of the error:
BLAST search finished! The BLAST output file was imported into the running R session. The BLAST output file has been stored at: /expt/datb/data/HiC/Rp_RNA-Seq/embryo/timeseries-vs-RpedSuzhou/rna-seq_rpedszv-reannot/annot_gene_sym/metablastr_bbh/metazoa_refseq_biopython-validated_Riptortus_pedestris_SZV_blastp-fast_eval_1e-05.blast_tbl
Error: Internal error in dict_hash_with(): Dictionary is full.
rlang::last_error()
<error/rlang_error>
Internal error in dict_hash_with(): Dictionary is full.
Backtrace:
- metablastr::blast_best_reciprocal_hit(...)
- metablastr::blast_best_hit(...)
- dplyr:::group_by.data.frame(blast_res, query_id)
- dplyr::grouped_df(groups$data, groups$group_names, .drop)
- dplyr:::compute_groups(data, vars, drop = drop)
- dplyr:::vec_split_id_order(group_vars)
- vctrs::vec_group_loc(x)
Run rlang::last_trace() to see the full context.
Is it also possible to skip the BLAST step to directly proceed with the reciprocal best hit step when re-running this procedure?
Thank you very much!
Hi!
I have performed the following to obtain BLAST tabular output (format 6):
I read AvB.hits and BvA.hits using the read_blast() function. Is it possible to identify the reciprocal hits from these tables? Or should I rerun the searches with metablastr?
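Reciprocal best hits can in principle be computed from two already imported hit tables without rerunning BLAST. The sketch below is a base R illustration, not the package's own implementation: best_hit() and reciprocal_best_hits() are hypothetical helpers, the column names query_id, subject_id and bit_score are assumptions about the imported tables, and ties in bit score are resolved arbitrarily.

```r
# Best hit per query: the row with the highest bit score.
best_hit <- function(tbl) {
  tbl <- tbl[order(tbl$query_id, -tbl$bit_score), ]
  tbl[!duplicated(tbl$query_id), c("query_id", "subject_id", "bit_score")]
}

# A pair is reciprocal when A's best hit is B and B's best hit is A.
reciprocal_best_hits <- function(a_vs_b, b_vs_a) {
  fwd <- best_hit(a_vs_b)
  bwd <- best_hit(b_vs_a)
  merge(fwd, bwd,
        by.x = c("query_id", "subject_id"),
        by.y = c("subject_id", "query_id"),
        suffixes = c("_AvB", "_BvA"))
}

# Toy tables standing in for AvB.hits and BvA.hits.
a_vs_b <- data.frame(query_id = c("A1", "A1", "A2"),
                     subject_id = c("B1", "B2", "B2"),
                     bit_score = c(200, 50, 90),
                     stringsAsFactors = FALSE)
b_vs_a <- data.frame(query_id = c("B1", "B2", "B2"),
                     subject_id = c("A1", "A1", "A2"),
                     bit_score = c(210, 60, 40),
                     stringsAsFactors = FALSE)

rbh <- reciprocal_best_hits(a_vs_b, b_vs_a)
```

In the toy data only the pair A1/B1 is reciprocal: A2's best hit is B2, but B2's best hit is A1.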