envmetagen / metabinkit Goto Github PK

View Code? Open in Web Editor NEW

2.0 3.0 1.0 6.53 MB

Set of programs to perform taxonomic binning.

License: GNU General Public License v3.0

Shell 42.36% R 56.91% Dockerfile 0.73%

taxonomy dna-sequences docker metabarcoding-data ncbi-taxid binning-thresholds binning

metabinkit's People

Contributors

Stargazers

Watchers

Forkers

peteradavey21

metabinkit's Issues

makeblastdb checks

Maybe going one step too far, but I like to run a test blast as part of the makeblastdb command. i.e. take the first sequence of the fasta and blast it against the newly created db to ensure it is created correctly. Also, to run the blastdbcheck command to check the blastdb and to further ensure all seqs have taxids.

from my own makeblastdb function:

if(do.checks){
  message("Running test blast")
  phylotools::dat2fasta(head(tempfasta,n=1),gsub(".fasta",".blastdbformatted.test.fasta",infasta))
  h<-blast.min.bas(gsub(".fasta",".blastdbformatted.test.fasta",infasta),refdb = gsub(".fasta","",infasta))
  message("Does new db have taxids - column V3?")
  print(data.table::fread(gsub(".fasta",".blastdbformatted.test.blast.txt",infasta)))
  system2(command = "blastdbcheck",args = c("-must_have_taxids","-db",gsub(".fasta","",infasta)))
  }

provide error when taxids not found

When a taxonomy dump is used that is older than the BLAST performed (or whatever was used to get the taxids), then there can often be taxids not found, leading to NAs

When this happens, I think the program should STOP (or at least provide obvious warning) and report an error like:

"Some taxids were not found in the taxonomy database, consider updating NCBI taxonomy database by running ./install -i your_metabinkit_install_directory -x taxonomy_db"

example in R

a<-data.table::fread("2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",data.table = F)
b<-add.lineage.df(a,ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/") #an old taxonomy folder

#some stderr output
11:39:50.515 [WARN] taxid 1823760 was deleted
11:39:50.540 [WARN] taxid 1936990 was deleted
11:39:50.591 [WARN] taxid 2563896 was deleted
11:39:50.641 [WARN] taxid 2714934 not found
11:39:50.642 [WARN] taxid 2715212 not found
11:39:50.642 [WARN] taxid 2715678 not found
11:39:50.643 [WARN] taxid 2715735 not found

Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
  invalid factor level, NA generated

In metabin

metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.nopaths.csv -o 2019_UNIO.metabins.new.nopath.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE -D /home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/


#some output

11:55:47.603 [WARN] taxid 2721245 not found
11:55:47.603 [WARN] taxid 2721246 not found
11:55:47.603 [WARN] taxid 2722751 not found
11:55:47.604 [WARN] taxid 2724150 not found
11:55:47.604 [WARN] taxid 2724191 not found
11:55:47.604 [WARN] taxid 2724192 not found

#but program completes

Odd warning about taxid missing even though taxdb up to date

See the readme: Example 2. Custom settings. The error is in the output:

16:29:14.561 [WARN] taxid 2746931 not found

--SpeciesBL testspecies2exclude.txt consist of

$ cat testspecies2exclude.txt
6573

full command (using the files in metabinkit/tests/test_files)
$ metabin -i in1.blast.short.tsv -o out1.blast.short.bins -S 98 -G 94 -F 93 -A 88 --SpeciesBL testspecies2exclude.txt --FilterFile Mioduchowska2018_flaggedAccessions.txt --FilterCol saccver --TopSpecies 2 --TopGenus 2 --TopFamily 2 --TopAF 2 --sp_discard_sp --sp_discard_mt2w --sp_discard_num

test_files

Added test files to a new test_files folder. Includes one UNIO set (blast.filt+bins) and one VENE set. Also added disabled_taxa.txt (They are all tsv format)

Add option to remove in-silico generated NCBI accessions

Certain codes are used by NCBI to distinguish PREDICTED in-silico sequences. I wonder if we could add an option to remove these? The codes are XM_, XR_, XP_ so the option could be -rm_predicted [colname] something like

 if(length(grep("XM_.*",btab$colname))>0) btab<-btab[-grep("XM_.*",btab$colname),] 
  if(length(grep("XR_.*",btab$colname))>0) btab<-btab[-grep("XR_.*",btab$colname),] 
  if(length(grep("XP_.*",btab$colname))>0) btab<-btab[-grep("XP_.*",btab$colname),]

-taxidlist option

Perhaps a personal wish, but I often use the -taxidlist option to limit the blast to a certain taxonomical section of the database (and we used it for mussels). I see you have the negative taxids option, but not the "positive" option. Suggest adding this functionality.

example of previous blast function, in case it helps

> blast.min.bas2
function(infasta,refdb,blast_exec="blastn",wait=T,taxidlimit=NULL,inverse=F,ncbiTaxDir=NULL,overWrite=F,out=NULL,
                         opts=c("-task","blastn","-outfmt", "6 qseqid evalue pident qcovs saccver staxid ssciname sseq","-num_threads", 64,
                                "-max_target_seqs", 100, "-max_hsps",1,"-word_size", 11, "-perc_identity", 50,
                                "-qcov_hsp_perc", 98, "-gapopen", 0, "-gapextend", 2, "-reward", 1, "-penalty", -1)){
  
  t1<-Sys.time()
  
  require(processx)
  
  if(!is.null(taxidlimit)) if(is.null(ncbiTaxDir)) stop("to use taxidlimit, ncbiTaxDir must be supplied")
  if(is.null(out)) out<-paste0(gsub(".fasta", ".blast.txt",infasta))
  
  outdircheck<-
  
  if(overWrite==F) if(file.exists(out)) stop("The following file already exists ", out, "Use overWrite=T to overwrite")
  
  if(!is.null(taxidlimit)){
    
    h<-list()
    
    #generate children of taxids and store in file
    taxid.list<-list()
    
    taxids_fileA<-paste0("taxids",as.numeric(Sys.time()),".txt")
    
    for(i in 1:length(taxidlimit)){
      system2(command = "taxonkit",args = c("list", "--ids", taxidlimit[i], "--indent", '""',"--data-dir",ncbiTaxDir)
              ,wait=T,stdout = taxids_fileA)
      taxid.list[i]<-read.table(taxids_fileA)
    }
    
    write.table(unlist(taxid.list),taxids_fileA,row.names = F,quote = F,col.names = F,)
    
    #change options to include taxidlimit
    opts<-c(opts,"-taxidlist",taxids_fileA)
    
    if(inverse) opts<-gsub("-taxidlimit","-negative_taxids",opts)
  }
  
  #run BLAST  
  
  error.log.file<-paste0("blast.error.temp.processx.file",as.numeric(Sys.time()),".txt")
  
  h<-process$new(command = blast_exec, args=c("-query", infasta, "-db",refdb,opts, "-out", out),echo_cmd = T,
                 stderr = error.log.file)
  
  Sys.sleep(time = 2)
  
  #report PID
  message(paste("PID:",h$get_pid()))
  
  #check immediate exit status
  exits<-h$get_exit_status()
  
  if(1 %in% exits){
    message("************
             There was a problem with ", infasta[match(1,exits)], ", aborting blast
             ************")
    print(grep("Error",readLines(error.log.file),value = T))
    
    h$kill()
  }
  
  if(wait==T){
    h$wait()
    message(readLines(error.log.file))
    message("exit_status=",exits)
    file.remove(error.log.file)
    if(!is.null(taxidlimit)) file.remove(taxids_fileA)
  }
  
  headers<-paste0(paste("'1i",paste(unlist(strsplit(opts[match("-outfmt",opts)+1]," "))[-1],collapse = "\t"),collapse = "\t"),"'")
  
  system2("sed",c("-i", headers, out),wait = T)
  
  t2<-Sys.time()
  t3<-round(difftime(t2,t1,units = "mins"),digits = 2)
  
  message(c("All blasts complete in ",t3," mins."))
  
  return(h)
}

change default tops to be "off"

I think it is a little dangerous to have a strict top (e.g. 1) as default. Actually would prefer 100, so that this is effectively not used, unless specified by user.

-s is an available flag, but I think is not catered for in code

blacklisting should be done separately for each binning level

If I am seeing it properly, it appears that all blacklisting is done on the btab object before binning, rather than for each binning round as before.

I see this as an issue, for example, in the following case:

Two species belong to GenusA
On site, we know we have SpeciesA
On site, we know we do not have SpeciesB, and so it is in our species.blacklist
In our DB, we do not have SpeciesA
In our DB, we have SpeciesB

Thus, in our alignment results, we only have alignments to SpeciesB.

bin3 would do:

create a temporary copy of the table (btab), remove SpeciesB and attempt binning at species level
create a new temporary copy of the table (btab), do not remove GenusA and attempt binning at genus level
Result would be GenusA:NA

metabin would do:

remove SpeciesB from btab
attempt binning at species level
attempt binning at genus level
Result would be NA:NA

By removing all entries matching SpeciesB, metabin is mistakenly removing it from genus level binning.

metabinkit_blastgendb not creating db

Hi Nuno,

probably a blast issue(?) but I get this error when trying to make a blast database on my laptop. I have seen it before on my laptop, but if you know what the issue might be it would be great to solve it!

I just tried on emg1 and it is fine! (fresh metabinkit installs on both machines)

$metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o SANGER_PLUS_GENBANK_CUT_BLASTDB
metabinkit 0.1.7


Building a new DB, current time: 06/12/2020 12:49:44
New DB name:   /media/sf_Documents/WORK/CIBIO/AA_PROJECTS/IRANVERTS/SHEEP/SANGER_PLUS_GENBANK_CUT_BLASTDB
New DB title:  SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B

No volumes were created.

Error: mdb_env_open: Invalid argument

ERROR checking new db using metabinkit_blastgendb

Hi,

Everything seems to work and pass the checks, but the command reports an error.

SANGER_PLUS_GENBANK_CUT.zip

$ metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o /home/tutorial/testgenDB -c
metabinkit 0.1.7


Building a new DB, current time: 06/16/2020 20:13:25
New DB name:   /home/tutorial/testgenDB
New DB title:  SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 23 sequences in 0.0220871 seconds.


Checking database...
Writing messages to <stdout> at verbosity (Summary)
ISAM testing is ENABLED.
Legacy testing is DISABLED.
TaxID testing is ENABLED.
By default, testing 200 randomly sampled OIDs.

Testing 1 volume(s).
 Result=SUCCESS. No errors reported for 1 volume(s).
Testing 0 alias(es).
 Result=SUCCESS. No errors reported for 0 alias(es).
metabinkit version: 0.1.7
Warning: [blastn] Examining 5 or more matches is recommended
INFO: blast finished.
ERROR: query newly created database with /tmp/tmp.Uh0BHtj3NF did not work as expected
qseqid	evalue	pident	qcovs	saccver	staxids	ssciname	sseqid
WC_KR059210.1	4.41e-27	100.000	100	WC_KR059222.1	9922	Capra	WC_KR059222.1

NCBI: sometimes there are spaces in order, family, genus etc.

Hi Nuno,
I was just looking at mussels results and noticed that sometimes bacteria get weird designations

e.g. taxid 1884634 has the 7-level taxonomy of:

K | P | C | O | F | G | S
Bacteria | Actinobacteria | Actinobacteria | Candidatus Nanopelagicales | Candidatus Nanopelagicaceae | Candidatus Nanopelagicus | Candidatus Nanopelagicus limnes

Note the spaces at order, family, genus level (and two spaces at species level). This definitely annoys my other scripts. Honestly, I dont know how it will affect things here in the end, but I wanted to have it here as a note.

double check the family level unknown selection

## Bastian: not sure if the following conditions are correct
    btab.f<-btab[btab$F!="unknown" || btab$G!="unknown" || btab$S!="unknown",,drop=FALSE]

installing

Hi Nuno, tried to run installer. Worked for a while then..

[INFO] Installing blast...done.
[INFO] Installing metabinkit...
'R/lca.R' -> '/home/tutorial/R/lca.R'
'R/metabinkit.R' -> '/home/tutorial/R/metabinkit.R'
'exe' -> '/home/tutorial/exe'
'exe/metabin' -> '/home/tutorial/exe/metabin'
'exe/taxonkit_children.sh' -> '/home/tutorial/exe/taxonkit_children.sh'
[INFO] Creating /home/tutorial/metabinkit_env.sh...
You may want to consider adding the following line to your .bash_profile file.

source /home/tutorial/metabinkit_env.sh
[INFO] Creating /home/tutorial/metabinkit_env.sh...done.
[INFO] Installing metabinkit...done.
./install.sh: line 247: install_r_packages: command not found

Install error with R>=4

Describe the bug

The install.sh checks for the R version and fails with R version 4.1.2.

To Reproduce
Steps to reproduce the behavior:

Run 'install.sh' with an R with version greater or equal to 4
See error

Expected behavior

Install without errors.

Screenshots

Desktop (please complete the following information):

OS:Linux
Version:latest

Additional context
Add any other context about the problem here.

When only one hit exists OR only one hit passes -S filter, metabin is not assigning at species level

(the good news: metabin is installing and runnning well, and is much faster than bin.blast3!)

Example of issue:

Input: tests/test_files/2020_01.blast.filt.txt
metabin output: tests/test_files/2020_01.metabins.txt.tsv
bin.blast3 output: tests/test_files/2020_01.bins3.txt

This OTU (AllMergedReads.Uniq.1004;size=8) has only one hit in the blast.filt.txt file.

taxids	qseqid	evalue	staxid	pident	qcovs	min_pident	old_taxids	K	P	C	O	F	G	S
1701145	AllMergedReads.Uniq.1004;size=8	1.24E-45	1701145	99.057	100	98.06643	1701145	Eukaryota	Chordata	Mammalia	Lagomorpha	Leporidae	Lepus	Lepus tibetanus

bin.blast3 bins it as

Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus

metabin bins it as

Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA

bin.blast3 command (top=default of 1 for all levels)

bin.blast3(filtered_blastfile = "2020_01.blast.filt.txt",
           ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/May2020/",
           out = "2020_01.bins3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)

metabin command:
metabin -i 2020_01.blast.filt.txt -o 2020_01.metabins.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE

I started going through a few more differences, but so far all have the same likely reason. Note this also occurs when there are originally multiple hits, but only one hit survives -S filter:

qseqids that have different bin outcomes	possible reason	METABIN	BIN.BLAST3
AllMergedReads.Uniq.1004;size=8	one hit only	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus
AllMergedReads.Uniq.1009;size=8	one hit only	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus
AllMergedReads.Uniq.1015;size=8	only one hit remaining after S filter	EukaryotaChordataMammaliaArtiodactylaBovidaeGazellaNA	EukaryotaChordataMammaliaArtiodactylaBovidaeGazellaGazella bennettii
AllMergedReads.Uniq.1032;size=8	one hit only	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus
AllMergedReads.Uniq.1034;size=8	only one hit remaining after S filter	EukaryotaChordataMammaliaArtiodactylaCamelidaeCamelusNA	EukaryotaChordataMammaliaArtiodactylaCamelidaeCamelusCamelus dromedarius
AllMergedReads.Uniq.1041;size=7	only one hit remaining after S filter	EukaryotaChordataMammaliaCarnivoraFelidaeNANA	EukaryotaChordataMammaliaCarnivoraFelidaePantheraPanthera tigris
AllMergedReads.Uniq.1046;size=7	one hit only	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusNA	EukaryotaChordataMammaliaLagomorphaLeporidaeLepusLepus tibetanus

Metabin info

metabinkit v0.0.1
[info] Starting Binning
[info] binning at species level
[info] excluding 32253 entries with pident below 98
[info] Not considering species with 'sp.', numbers or more than one space
[info] applying species top threshold of 1
[info] binned 0 sequences at species level
[info] binning at genus level
[info] excluding 8720 entries with pident below 95
[info] applying genus top threshold of 1
[info] binned 1529 sequences at genus level
[info] binning at family level
[info] excluding 1138 entries with pident below 92
[info] applying family top threshold of 1
[info] binned 826 sequences at family level
[info] binning at higher-than-family level
[info] excluding 36 entries with pident below 80
[info] applying htf top threshold of 1
[info] binned 473 sequences at higher than family level
[info] Total number of binned 2828 sequences
[info] not binned 0 sequences
[info] Complete. 134002 hits from 2828 queries processed in 1.54 mins.
[info] Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
                 If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
                 Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.  
[info] binned table written to 2020_01.metabins.txt.tsv
$total_hits
[1] 134002

$total_queries
[1] 2828

$binned.species.level
[1] 0

$binned.genus.level
[1] 1529

$binned.family.level
[1] 826

$binned.htf.level
[1] 473

$not.binned
[1] 0

[info] Binning complete in 1.57 min
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_PT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8 optparse_1.6.6   

loaded via a namespace (and not attached):
[1] compiler_3.6.0 magrittr_1.5   tools_3.6.0    getopt_1.20.3  stringi_1.4.6 
[6] stringr_1.4.0

messages

WARNING! missing columns in input table with taxonomic information:K,P,C,O,F,G,S
Seems like error to user. Consider adding "Using taxid to retrieve taxonomy"

Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
                 If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
                 Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.

Consider revising to

Note: If none of the alignments for a query passed the binning thesholds, the results will be NA for all levels. 
If the lowest common ancestor for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels. 
Queries that do not appear in the input table (for example, they had no significant alignments) will not appear in results.

unknown vs NA

I am not sure if the distinction is clear, but originally it was like this

No hits passing binning filters = "NA;NA;NA;NA;NA;NA;NA"
Hits passing filter, but common ancestor above kingdom = "unknown;unknown;unknown;unknown;NA;NA;NA" #i.e. "unknown" can never occur at family, genus or species level
Hits passing filter, but common ancestor is class level = "kingdom1;phylum1;class1;unknown;NA;NA;NA"

This is because I used NA to be an indicator that the binning thresholds were not passed. I used unknown only for above family (htf), to indicate that while a common ancestor may have been found it could not be assigned at e.g. order level (=unknown). Comapring metabin vs bin3 we get different outputs in this regard. It appears that this may only be happening at genus level - metabin is giving it "unknown"...in some cases...

example:

qseqid	BIN3	METABIN	IDENTICAL
011afebc-d053-40ed-aa39-764d06884887_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	NANANANANANANA	NANANANANANANA	IDENTICAL
0122ab94-162e-4198-b044-782b51ae0acd_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD23-A-UNIO-RUN7	NANANANANANANA	NANANANANAunknownNA	NOT
023be541-e63a-44ca-ad19-375672394340_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7	NANANANANANANA	NANANANANANANA	IDENTICAL
023c6ba8-769d-4ecb-a10e-6ff95081e199_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	NANANANANANANA	unknownunknownunknownunknownNAunknownNA	NOT
0dd84164-c359-4b25-9a3e-bfeb39ac293b_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	BacteriaunknownunknownunknownNANANA	BacteriaunknownunknownunknownNANANA	IDENTICAL

code used

#bin3
bin.blast3(filtered_blastfile = "2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",
           ncbiTaxDir = "//home/tutorial/TOOLS/metabinkit.install/db/",
             out = "/home/tutorial/TOOLS/metabinkit/tests/test_files/2019_UNIO.bin3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)

#metabin
metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt -o 2019_UNIO.metabin.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE

excluding specific hits

Often after binning mis-classified entries in genbank become evident. Would be nice to be able to exclude these. i.e. if the user keeps track of accession numbers found to be mis-classified, then they can permanently use that list for their binning, and their list can grow. This would require sseqid to be a required column.

using top of 0.1

when applying top of 0.1 the message is

[info] applying top threshold of 0

Is it just an issue with the message or is the threshold being rounded down?

Line 69 in exe/metabinkit_blast

presume typo:

metabibkit_blast

should be

metabinkit_blast

(didnt make any changes)

install problem? metabinkit_blast and metabinkit_blastgendb

With metabinkit installed and sourced, I try:

$ metabinkit_blast 
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blast: /bin/env: bad interpreter: No such file or directory

similar for

$ metabinkit_blastgendb -h
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blastgendb: /bin/env: bad interpreter: No such file or directory

"tabbing" metabin I get

$ metabin
metabin                metabinkit_blast       metabinkit_blastgendb

contents of install folder/exe are:

tutorial@tutorial-VirtualBox:~/TOOLS/metabinkit.install/exe$ ls
metabin  metabinkit_blast  metabinkit_blastgendb  metabinkit_shared.sh  taxonkit_children.sh

min_pident seems wrong (negative numbers and 0s)

See the README.md for example. Here is output again:

$ head -4 out0.bins.tsv 
qseqid	pident	min_pident	K	P	C	O	F	G	S
6fcff7c8-2031-4e3a-a8f0-72dc2da71c79_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7	97.015	-2.985	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
d36ef3ba-f3d5-4952-b683-301f1a959cfa_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7	100	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
9ef96e73-a5b6-4c4f-bc59-2b8238281d77_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7	97.059	-2.941	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana

new error

Hi installed ok now, but got this:

tutorial@tutorial-VirtualBox:~$ metabin -h
metabinkit v0.0.1
Error in library(optparse) : there is no package called ‘optparse’
Execution halted

README and documentation

Consider an additional "rm.unclassified" filter

This was part of my filter.blast3 function. It is similar to the consider_sp. option. We often get species-level taxids in results, but the species equate to things like "uncultured nematode", "environmental eukaryote". These can cause bins to be pushed far back up the tree. option to remove these. The terms used here are just those I noted as I came across them.

#remove crappy hits 
  if(rm.unclassified==T){
    #1. btab$S contains uncultured
    message("Removing species containing the terms: uncultured, environmental, 
            unidentified,fungal, eukaryote or unclassified")
    if(length(grep("uncultured",btab$S,ignore.case = T))>0) btab<-btab[-grep("uncultured",btab$S,ignore.case = T),]
    if(length(grep("environmental",btab$S,ignore.case = T))>0) btab<-btab[-grep("environmental",btab$S,ignore.case = T),]
    if(length(grep("unclassified",btab$S,ignore.case = T))>0) btab<-btab[-grep("unclassified",btab$S,ignore.case = T),]
    if(length(grep("unidentified",btab$S,ignore.case = T))>0) btab<-btab[-grep("unidentified",btab$S,ignore.case = T),]
    if(length(grep("fungal ",btab$S,ignore.case = T))>0) btab<-btab[-grep("fungal ",btab$S,ignore.case = T),]
    if(length(grep("eukaryote",btab$S,ignore.case = T))>0) btab<-btab[-grep("eukaryote",btab$S,ignore.case = T),]
  }

ensure blast 2.9.0 or above is being used

Sorry if I missed this, but all the taxid options in blastn are only available after 2.9.0, so perhaps this should be a requirement. Also, these options will only work with version 5 databases (which is the default for makeblastdb in 2.10.0, not sure about 2.9.0...

blacklisting if NCBI taxids not being used

Current get children from taxids approach will not work if people are using their own taxonomies (by providing the KPCOFGS columns)

But, by the time metabin reaches the blacklisting step, it will always have the KPCOFGS columns.

Perhaps it would be better to use names rather than taxids for this blacklisting (and they are possibly easier for user to provide). Downside: ambiguities with user-defined taxon names vs NCBI taxon names. Very difficult. So perhaps we need both options...