envmetagen / metabinkit Goto Github PK
View Code? Open in Web Editor NEWSet of programs to perform taxonomic binning.
License: GNU General Public License v3.0
Set of programs to perform taxonomic binning.
License: GNU General Public License v3.0
Maybe going one step too far, but I like to run a test blast as part of the makeblastdb command. i.e. take the first sequence of the fasta and blast it against the newly created db to ensure it is created correctly. Also, to run the blastdbcheck command to check the blastdb and to further ensure all seqs have taxids.
from my own makeblastdb function:
if(do.checks){
message("Running test blast")
phylotools::dat2fasta(head(tempfasta,n=1),gsub(".fasta",".blastdbformatted.test.fasta",infasta))
h<-blast.min.bas(gsub(".fasta",".blastdbformatted.test.fasta",infasta),refdb = gsub(".fasta","",infasta))
message("Does new db have taxids - column V3?")
print(data.table::fread(gsub(".fasta",".blastdbformatted.test.blast.txt",infasta)))
system2(command = "blastdbcheck",args = c("-must_have_taxids","-db",gsub(".fasta","",infasta)))
}
When a taxonomy dump is used that is older than the BLAST performed (or whatever was used to get the taxids), then there can often be taxids not found, leading to NAs
When this happens, I think the program should STOP (or at least provide obvious warning) and report an error like:
"Some taxids were not found in the taxonomy database, consider updating NCBI taxonomy database by running ./install -i your_metabinkit_install_directory -x taxonomy_db"
example in R
a<-data.table::fread("2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",data.table = F)
b<-add.lineage.df(a,ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/") #an old taxonomy folder
#some stderr output
11:39:50.515 [WARN] taxid 1823760 was deleted
11:39:50.540 [WARN] taxid 1936990 was deleted
11:39:50.591 [WARN] taxid 2563896 was deleted
11:39:50.641 [WARN] taxid 2714934 not found
11:39:50.642 [WARN] taxid 2715212 not found
11:39:50.642 [WARN] taxid 2715678 not found
11:39:50.643 [WARN] taxid 2715735 not found
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
invalid factor level, NA generated
In metabin
metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.nopaths.csv -o 2019_UNIO.metabins.new.nopath.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE -D /home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/
#some output
11:55:47.603 [WARN] taxid 2721245 not found
11:55:47.603 [WARN] taxid 2721246 not found
11:55:47.603 [WARN] taxid 2722751 not found
11:55:47.604 [WARN] taxid 2724150 not found
11:55:47.604 [WARN] taxid 2724191 not found
11:55:47.604 [WARN] taxid 2724192 not found
#but program completes
See the readme: Example 2. Custom settings. The error is in the output:
16:29:14.561 [WARN] taxid 2746931 not found
--SpeciesBL testspecies2exclude.txt
consist of
$ cat testspecies2exclude.txt
6573
full command (using the files in metabinkit/tests/test_files)
$ metabin -i in1.blast.short.tsv -o out1.blast.short.bins -S 98 -G 94 -F 93 -A 88 --SpeciesBL testspecies2exclude.txt --FilterFile Mioduchowska2018_flaggedAccessions.txt --FilterCol saccver --TopSpecies 2 --TopGenus 2 --TopFamily 2 --TopAF 2 --sp_discard_sp --sp_discard_mt2w --sp_discard_num
Added test files to a new test_files folder. Includes one UNIO set (blast.filt+bins) and one VENE set. Also added disabled_taxa.txt (They are all tsv format)
Certain codes are used by NCBI to distinguish PREDICTED in-silico sequences. I wonder if we could add an option to remove these? The codes are XM_, XR_, XP_ so the option could be -rm_predicted [colname] something like
if(length(grep("XM_.*",btab$colname))>0) btab<-btab[-grep("XM_.*",btab$colname),]
if(length(grep("XR_.*",btab$colname))>0) btab<-btab[-grep("XR_.*",btab$colname),]
if(length(grep("XP_.*",btab$colname))>0) btab<-btab[-grep("XP_.*",btab$colname),]
Perhaps a personal wish, but I often use the -taxidlist option to limit the blast to a certain taxonomical section of the database (and we used it for mussels). I see you have the negative taxids option, but not the "positive" option. Suggest adding this functionality.
example of previous blast function, in case it helps
> blast.min.bas2
function(infasta,refdb,blast_exec="blastn",wait=T,taxidlimit=NULL,inverse=F,ncbiTaxDir=NULL,overWrite=F,out=NULL,
opts=c("-task","blastn","-outfmt", "6 qseqid evalue pident qcovs saccver staxid ssciname sseq","-num_threads", 64,
"-max_target_seqs", 100, "-max_hsps",1,"-word_size", 11, "-perc_identity", 50,
"-qcov_hsp_perc", 98, "-gapopen", 0, "-gapextend", 2, "-reward", 1, "-penalty", -1)){
t1<-Sys.time()
require(processx)
if(!is.null(taxidlimit)) if(is.null(ncbiTaxDir)) stop("to use taxidlimit, ncbiTaxDir must be supplied")
if(is.null(out)) out<-paste0(gsub(".fasta", ".blast.txt",infasta))
outdircheck<-
if(overWrite==F) if(file.exists(out)) stop("The following file already exists ", out, "Use overWrite=T to overwrite")
if(!is.null(taxidlimit)){
h<-list()
#generate children of taxids and store in file
taxid.list<-list()
taxids_fileA<-paste0("taxids",as.numeric(Sys.time()),".txt")
for(i in 1:length(taxidlimit)){
system2(command = "taxonkit",args = c("list", "--ids", taxidlimit[i], "--indent", '""',"--data-dir",ncbiTaxDir)
,wait=T,stdout = taxids_fileA)
taxid.list[i]<-read.table(taxids_fileA)
}
write.table(unlist(taxid.list),taxids_fileA,row.names = F,quote = F,col.names = F,)
#change options to include taxidlimit
opts<-c(opts,"-taxidlist",taxids_fileA)
if(inverse) opts<-gsub("-taxidlimit","-negative_taxids",opts)
}
#run BLAST
error.log.file<-paste0("blast.error.temp.processx.file",as.numeric(Sys.time()),".txt")
h<-process$new(command = blast_exec, args=c("-query", infasta, "-db",refdb,opts, "-out", out),echo_cmd = T,
stderr = error.log.file)
Sys.sleep(time = 2)
#report PID
message(paste("PID:",h$get_pid()))
#check immediate exit status
exits<-h$get_exit_status()
if(1 %in% exits){
message("************
There was a problem with ", infasta[match(1,exits)], ", aborting blast
************")
print(grep("Error",readLines(error.log.file),value = T))
h$kill()
}
if(wait==T){
h$wait()
message(readLines(error.log.file))
message("exit_status=",exits)
file.remove(error.log.file)
if(!is.null(taxidlimit)) file.remove(taxids_fileA)
}
headers<-paste0(paste("'1i",paste(unlist(strsplit(opts[match("-outfmt",opts)+1]," "))[-1],collapse = "\t"),collapse = "\t"),"'")
system2("sed",c("-i", headers, out),wait = T)
t2<-Sys.time()
t3<-round(difftime(t2,t1,units = "mins"),digits = 2)
message(c("All blasts complete in ",t3," mins."))
return(h)
}
I think it is a little dangerous to have a strict top (e.g. 1) as default. Actually would prefer 100, so that this is effectively not used, unless specified by user.
If I am seeing it properly, it appears that all blacklisting is done on the btab object before binning, rather than for each binning round as before.
I see this as an issue, for example, in the following case:
Two species belong to GenusA
On site, we know we have SpeciesA
On site, we know we do not have SpeciesB, and so it is in our species.blacklist
In our DB, we do not have SpeciesA
In our DB, we have SpeciesB
Thus, in our alignment results, we only have alignments to SpeciesB.
bin3 would do:
metabin would do:
By removing all entries matching SpeciesB, metabin is mistakenly removing it from genus level binning.
Hi Nuno,
probably a blast issue(?) but I get this error when trying to make a blast database on my laptop. I have seen it before on my laptop, but if you know what the issue might be it would be great to solve it!
I just tried on emg1 and it is fine! (fresh metabinkit installs on both machines)
$metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o SANGER_PLUS_GENBANK_CUT_BLASTDB
metabinkit 0.1.7
Building a new DB, current time: 06/12/2020 12:49:44
New DB name: /media/sf_Documents/WORK/CIBIO/AA_PROJECTS/IRANVERTS/SHEEP/SANGER_PLUS_GENBANK_CUT_BLASTDB
New DB title: SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
No volumes were created.
Error: mdb_env_open: Invalid argument
Hi,
Everything seems to work and pass the checks, but the command reports an error.
$ metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o /home/tutorial/testgenDB -c
metabinkit 0.1.7
Building a new DB, current time: 06/16/2020 20:13:25
New DB name: /home/tutorial/testgenDB
New DB title: SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 23 sequences in 0.0220871 seconds.
Checking database...
Writing messages to <stdout> at verbosity (Summary)
ISAM testing is ENABLED.
Legacy testing is DISABLED.
TaxID testing is ENABLED.
By default, testing 200 randomly sampled OIDs.
Testing 1 volume(s).
Result=SUCCESS. No errors reported for 1 volume(s).
Testing 0 alias(es).
Result=SUCCESS. No errors reported for 0 alias(es).
metabinkit version: 0.1.7
Warning: [blastn] Examining 5 or more matches is recommended
INFO: blast finished.
ERROR: query newly created database with /tmp/tmp.Uh0BHtj3NF did not work as expected
qseqid evalue pident qcovs saccver staxids ssciname sseqid
WC_KR059210.1 4.41e-27 100.000 100 WC_KR059222.1 9922 Capra WC_KR059222.1
Hi Nuno,
I was just looking at mussels results and noticed that sometimes bacteria get weird designations
e.g. taxid 1884634 has the 7-level taxonomy of:
K | P | C | O | F | G | S
Bacteria | Actinobacteria | Actinobacteria | Candidatus Nanopelagicales | Candidatus Nanopelagicaceae | Candidatus Nanopelagicus | Candidatus Nanopelagicus limnes
Note the spaces at order, family, genus level (and two spaces at species level). This definitely annoys my other scripts. Honestly, I dont know how it will affect things here in the end, but I wanted to have it here as a note.
## Bastian: not sure if the following conditions are correct
btab.f<-btab[btab$F!="unknown" || btab$G!="unknown" || btab$S!="unknown",,drop=FALSE]
Hi Nuno, tried to run installer. Worked for a while then..
[INFO] Installing blast...done.
[INFO] Installing metabinkit...
'R/lca.R' -> '/home/tutorial/R/lca.R'
'R/metabinkit.R' -> '/home/tutorial/R/metabinkit.R'
'exe' -> '/home/tutorial/exe'
'exe/metabin' -> '/home/tutorial/exe/metabin'
'exe/taxonkit_children.sh' -> '/home/tutorial/exe/taxonkit_children.sh'
[INFO] Creating /home/tutorial/metabinkit_env.sh...
You may want to consider adding the following line to your .bash_profile file.
source /home/tutorial/metabinkit_env.sh
[INFO] Creating /home/tutorial/metabinkit_env.sh...done.
[INFO] Installing metabinkit...done.
./install.sh: line 247: install_r_packages: command not found
Describe the bug
The install.sh checks for the R version and fails with R version 4.1.2.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Install without errors.
Screenshots
NA
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
(the good news: metabin is installing and runnning well, and is much faster than bin.blast3!)
Example of issue:
Input: tests/test_files/2020_01.blast.filt.txt
metabin output: tests/test_files/2020_01.metabins.txt.tsv
bin.blast3 output: tests/test_files/2020_01.bins3.txt
This OTU (AllMergedReads.Uniq.1004;size=8) has only one hit in the blast.filt.txt file.
taxids | qseqid | evalue | staxid | pident | qcovs | min_pident | old_taxids | K | P | C | O | F | G | S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1701145 | AllMergedReads.Uniq.1004;size=8 | 1.24E-45 | 1701145 | 99.057 | 100 | 98.06643 | 1701145 | Eukaryota | Chordata | Mammalia | Lagomorpha | Leporidae | Lepus | Lepus tibetanus |
bin.blast3 bins it as
Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus
metabin bins it as
Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA
bin.blast3 command (top=default of 1 for all levels)
bin.blast3(filtered_blastfile = "2020_01.blast.filt.txt",
ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/May2020/",
out = "2020_01.bins3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)
metabin command:
metabin -i 2020_01.blast.filt.txt -o 2020_01.metabins.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE
I started going through a few more differences, but so far all have the same likely reason. Note this also occurs when there are originally multiple hits, but only one hit survives -S filter:
qseqids that have different bin outcomes | possible reason | METABIN | BIN.BLAST3 |
---|---|---|---|
AllMergedReads.Uniq.1004;size=8 | one hit only | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus |
AllMergedReads.Uniq.1009;size=8 | one hit only | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus |
AllMergedReads.Uniq.1015;size=8 | only one hit remaining after S filter | EukaryotaChordataMammaliaArtiodactylaBovidaeGazellaNA | EukaryotaChordataMammaliaArtiodactylaBovidaeGazellaGazella bennettii |
AllMergedReads.Uniq.1032;size=8 | one hit only | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus |
AllMergedReads.Uniq.1034;size=8 | only one hit remaining after S filter | EukaryotaChordataMammaliaArtiodactylaCamelidaeCamelusNA | EukaryotaChordataMammaliaArtiodactylaCamelidaeCamelusCamelus dromedarius |
AllMergedReads.Uniq.1041;size=7 | only one hit remaining after S filter | EukaryotaChordataMammaliaCarnivoraFelidaeNANA | EukaryotaChordataMammaliaCarnivoraFelidaePantheraPanthera tigris |
AllMergedReads.Uniq.1046;size=7 | one hit only | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA | EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus |
Metabin info
metabinkit v0.0.1
[info] Starting Binning
[info] binning at species level
[info] excluding 32253 entries with pident below 98
[info] Not considering species with 'sp.', numbers or more than one space
[info] applying species top threshold of 1
[info] binned 0 sequences at species level
[info] binning at genus level
[info] excluding 8720 entries with pident below 95
[info] applying genus top threshold of 1
[info] binned 1529 sequences at genus level
[info] binning at family level
[info] excluding 1138 entries with pident below 92
[info] applying family top threshold of 1
[info] binned 826 sequences at family level
[info] binning at higher-than-family level
[info] excluding 36 entries with pident below 80
[info] applying htf top threshold of 1
[info] binned 473 sequences at higher than family level
[info] Total number of binned 2828 sequences
[info] not binned 0 sequences
[info] Complete. 134002 hits from 2828 queries processed in 1.54 mins.
[info] Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.
[info] binned table written to 2020_01.metabins.txt.tsv
$total_hits
[1] 134002
$total_queries
[1] 2828
$binned.species.level
[1] 0
$binned.genus.level
[1] 1529
$binned.family.level
[1] 826
$binned.htf.level
[1] 473
$not.binned
[1] 0
[info] Binning complete in 1.57 min
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.8 optparse_1.6.6
loaded via a namespace (and not attached):
[1] compiler_3.6.0 magrittr_1.5 tools_3.6.0 getopt_1.20.3 stringi_1.4.6
[6] stringr_1.4.0
WARNING! missing columns in input table with taxonomic information:K,P,C,O,F,G,S
Seems like error to user. Consider adding "Using taxid to retrieve taxonomy"
Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.
Consider revising to
Note: If none of the alignments for a query passed the binning thesholds, the results will be NA for all levels.
If the lowest common ancestor for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
Queries that do not appear in the input table (for example, they had no significant alignments) will not appear in results.
I am not sure if the distinction is clear, but originally it was like this
This is because I used NA to be an indicator that the binning thresholds were not passed. I used unknown only for above family (htf), to indicate that while a common ancestor may have been found it could not be assigned at e.g. order level (=unknown). Comapring metabin vs bin3 we get different outputs in this regard. It appears that this may only be happening at genus level - metabin is giving it "unknown"...in some cases...
example:
qseqid BIN3 METABIN IDENTICAL
011afebc-d053-40ed-aa39-764d06884887_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7 NANANANANANANA NANANANANANANA IDENTICAL
0122ab94-162e-4198-b044-782b51ae0acd_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD23-A-UNIO-RUN7 NANANANANANANA NANANANANAunknownNA NOT
023be541-e63a-44ca-ad19-375672394340_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7 NANANANANANANA NANANANANANANA IDENTICAL
023c6ba8-769d-4ecb-a10e-6ff95081e199_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7 NANANANANANANA unknownunknownunknownunknownNAunknownNA NOT
0dd84164-c359-4b25-9a3e-bfeb39ac293b_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7 BacteriaunknownunknownunknownNANANA BacteriaunknownunknownunknownNANANA IDENTICAL
code used
#bin3
bin.blast3(filtered_blastfile = "2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",
ncbiTaxDir = "//home/tutorial/TOOLS/metabinkit.install/db/",
out = "/home/tutorial/TOOLS/metabinkit/tests/test_files/2019_UNIO.bin3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)
#metabin
metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt -o 2019_UNIO.metabin.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE
Often after binning mis-classified entries in genbank become evident. Would be nice to be able to exclude these. i.e. if the user keeps track of accession numbers found to be mis-classified, then they can permanently use that list for their binning, and their list can grow. This would require sseqid to be a required column.
when applying top of 0.1 the message is
[info] applying top threshold of 0
Is it just an issue with the message or is the threshold being rounded down?
Line 69 in exe/metabinkit_blast
presume typo:
metabibkit_blast
should be
metabinkit_blast
(didnt make any changes)
With metabinkit installed and sourced, I try:
$ metabinkit_blast
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blast: /bin/env: bad interpreter: No such file or directory
similar for
$ metabinkit_blastgendb -h
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blastgendb: /bin/env: bad interpreter: No such file or directory
"tabbing" metabin I get
$ metabin
metabin metabinkit_blast metabinkit_blastgendb
contents of install folder/exe are:
tutorial@tutorial-VirtualBox:~/TOOLS/metabinkit.install/exe$ ls
metabin metabinkit_blast metabinkit_blastgendb metabinkit_shared.sh taxonkit_children.sh
See the README.md for example. Here is output again:
$ head -4 out0.bins.tsv
qseqid pident min_pident K P C O F G S
6fcff7c8-2031-4e3a-a8f0-72dc2da71c79_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7 97.015 -2.985 Eukaryota Mollusca Bivalvia Unionida Unionidae Sinanodonta Sinanodonta woodiana
d36ef3ba-f3d5-4952-b683-301f1a959cfa_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7 100 0 Eukaryota Mollusca Bivalvia Unionida Unionidae Sinanodonta Sinanodonta woodiana
9ef96e73-a5b6-4c4f-bc59-2b8238281d77_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7 97.059 -2.941 Eukaryota Mollusca Bivalvia Unionida Unionidae Sinanodonta Sinanodonta woodiana
Hi installed ok now, but got this:
tutorial@tutorial-VirtualBox:~$ metabin -h
metabinkit v0.0.1
Error in library(optparse) : there is no package called ‘optparse’
Execution halted
This was part of my filter.blast3 function. It is similar to the consider_sp. option. We often get species-level taxids in results, but the species equate to things like "uncultured nematode", "environmental eukaryote". These can cause bins to be pushed far back up the tree. option to remove these. The terms used here are just those I noted as I came across them.
#remove crappy hits
if(rm.unclassified==T){
#1. btab$S contains uncultured
message("Removing species containing the terms: uncultured, environmental,
unidentified,fungal, eukaryote or unclassified")
if(length(grep("uncultured",btab$S,ignore.case = T))>0) btab<-btab[-grep("uncultured",btab$S,ignore.case = T),]
if(length(grep("environmental",btab$S,ignore.case = T))>0) btab<-btab[-grep("environmental",btab$S,ignore.case = T),]
if(length(grep("unclassified",btab$S,ignore.case = T))>0) btab<-btab[-grep("unclassified",btab$S,ignore.case = T),]
if(length(grep("unidentified",btab$S,ignore.case = T))>0) btab<-btab[-grep("unidentified",btab$S,ignore.case = T),]
if(length(grep("fungal ",btab$S,ignore.case = T))>0) btab<-btab[-grep("fungal ",btab$S,ignore.case = T),]
if(length(grep("eukaryote",btab$S,ignore.case = T))>0) btab<-btab[-grep("eukaryote",btab$S,ignore.case = T),]
}
Sorry if I missed this, but all the taxid options in blastn are only available after 2.9.0, so perhaps this should be a requirement. Also, these options will only work with version 5 databases (which is the default for makeblastdb in 2.10.0, not sure about 2.9.0...
Current get children from taxids approach will not work if people are using their own taxonomies (by providing the KPCOFGS columns)
But, by the time metabin reaches the blacklisting step, it will always have the KPCOFGS columns.
Perhaps it would be better to use names rather than taxids for this blacklisting (and they are possibly easier for user to provide). Downside: ambiguities with user-defined taxon names vs NCBI taxon names. Very difficult. So perhaps we need both options...
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.