soedinglab / mmseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

License: GNU General Public License v3.0

CMake 0.60% Shell 1.62% C++ 26.22% C 69.57% Perl 0.01% Batchfile 0.04% Dockerfile 0.02% R 0.01% Makefile 0.57% Python 0.40% Meson 0.17% Lua 0.01% Starlark 0.07% HTML 0.54% Roff 0.14%

bioinformatics sequence-clustering profile-search sequence-search linclust mmseqs metagenomics alignment blast taxonomy

mmseqs2's Issues

Error: "invalid database read for database" for some, but not all, files.

I'm trying to cluster a number of pre-clustered databases, all stored in multi-FASTA files. The createdb command seems to run fine on all files; however, the cluster command leads to an error on some, but not all, files:

Invalid database read for database data file=tmp_clusters/DB, database index=tmp_clusters/DB.index
getData: local id (27) >= db size (27)
Rescore diagonals.
Could not open data file tmp_clusters/tmp/linclust/pref!
Could not open data file tmp_clusters/tmp/linclust/pref_rescore1!
awk: can't open file tmp_clusters/tmp/linclust/pre_clust.index
 source line number 1

This error is followed by many other errors, all related to being unable to open data files (e.g. Could not open data file tmp_clusters/tmp/linclust/pref!).

The error seems to depend on the sequences in the file. For example, if I merge two of the FASTA files, one that runs without errors and one that leads to the above error, the error is reproduced. I have not been able to identify which feature of the sequences leads to the error. Before every run of mmseqs, I delete all temp files.

Regarding my environment:

OSX 10.11.3
Git commit - 3ff5b8e
Self compiled

I've attached two files, one that leads to errors in my build, and one that does not. Hope you can help.

DB_errors.txt
DB_no_errors.txt

Get cluster sequence identities

Hi,
thank you very much for mmseqs2, it's really amazing!

We are clustering a few million sequences following an approach similar to the one described in your uniclust pipeline. Before reinventing the wheel, we are wondering whether mmseqs has an easy way to extract the sequence identity for each pair of sequences in a cluster from the alignment file. Our aim is to integrate the pairwise identities into the output of createtsv. Do you have any recommendations?

Many thanks
Antonio
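
A minimal sketch of how such a join could be done outside mmseqs, in Python. It assumes the cluster membership is available as a two-column TSV (representative, member) and the alignments as a tab-separated file whose first three columns are query, target and sequence identity. Both layouts, and the function name annotate_cluster_identities, are assumptions made for illustration and are not confirmed mmseqs output formats.

```python
import csv
import io


def annotate_cluster_identities(cluster_tsv, alignment_tsv):
    """Attach a pairwise identity to each (representative, member) row.

    Assumed layouts (hypothetical, adjust to the real files):
      cluster_tsv:   representative <TAB> member
      alignment_tsv: query <TAB> target <TAB> identity [<TAB> ...]
    Pairs without an alignment record get None.
    """
    identities = {}
    for row in csv.reader(alignment_tsv, delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or malformed lines
        identities[(row[0], row[1])] = float(row[2])

    annotated = []
    for row in csv.reader(cluster_tsv, delimiter="\t"):
        if len(row) < 2:
            continue
        rep, member = row[0], row[1]
        # The alignment may be stored in either direction, so try both.
        ident = identities.get((rep, member), identities.get((member, rep)))
        annotated.append((rep, member, ident))
    return annotated


if __name__ == "__main__":
    clusters = io.StringIO("A\tA\nA\tB\nC\tC\n")
    alignments = io.StringIO("A\tA\t1.0\nA\tB\t0.87\n")
    for rep, member, ident in annotate_cluster_identities(clusters, alignments):
        print(rep, member, ident, sep="\t")
```

The identity table is held in memory as a dictionary, which is convenient for a few million pairs but may need a streaming or sorted-merge approach for much larger alignment files.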

Floating point exception running mmseqs search with index

Expected Behavior

Obtaining similar sequences to the queries from the target database

Current Behavior

Floating point exception at the prefilter step

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
The target database is current nr (protein) from the NCBI (~120M sequences, 69GB).
Index creation runs ok:

mmseqs createindex nr

Program call:
nr 

MMseqs Version:     	a81227565da4e95d233e3bcbd5c0cdc6ada1c14a
Sub Matrix          	blosum62.out
K-mer size          	0
Alphabet size       	21
Max. sequence length	32000
Mask Residues       	1
Split DB            	0
Spaced Kmer         	1
Threads             	64
Verbosity           	3
...
Write MMSEQSFFINDEX 
Time for merging files: 0 h 0 m 0 s
Done.

ls -lrt

-rw-r--r--. 1 root root   2773738984 may 11 14:05 nr.lookup
-rw-r--r--. 1 root root  28462541941 may 11 14:07 nr_h
-rw-r--r--. 1 root root   2967783911 may 11 14:07 nr_h.index
-rw-r--r--. 1 root root  44976760168 may 11 14:10 nr
-rw-r--r--. 1 root root   3020702058 may 11 14:10 nr.index
drwxr-xr-x. 2 root root            6 may 12 12:52 tmp
-rw-r--r--. 1 root root   3020702058 may 12 13:23 nr.sk7.mmseqsindex
-rw-r--r--. 1 root root 330684926197 may 12 13:23 nr.sk7
-rw-r--r--. 1 root root          344 may 12 13:23 nr.sk7.index

When launching the search:

mmseqs search mmseq-testDB /junk/databases/mmseqs/nr test-2-mmseqsDB tmp

Program call:
mmseq-testDB /junk/databases/mmseqs/nr test-2-mmseqsDB tmp 

MMseqs Version:                    	a81227565da4e95d233e3bcbd5c0cdc6ada1c14a
Sub Matrix                         	blosum62.out
Add backtrace                      	false
Alignment mode                     	0
E-value threshold                  	0.001
Seq. Id Threshold                  	0
Coverage threshold                 	0
Target Coverage threshold          	0
Max. sequence length               	32000
Max. results per query             	300
Compositional bias                 	1
Query queryProfile                 	false
Realign hit                        	false
Max Reject                         	2147483647
Max Accept                         	2147483647
Include identical Seq. Id.         	false
Threads                            	64
Verbosity                          	3
Sensitivity                        	4
K-mer size                         	0
K-score                            	2147483647
Alphabet size                      	21
Target queryProfile                	false
Offset result                      	0
Split DB                           	0
Split mode                         	2
Diagonal Scoring                   	1
Mask Residues                      	1
Minimum Diagonal score             	15
Spaced Kmer                        	1
Profile e-value threshold          	0.001
Use global sequence weighting      	false
Maximum sequence identity threshold	0.9
Minimum seq. id.                   	0
Minimum score per column           	-20
Minimum coverage                   	0
Select n most diverse seqs         	1000
Pseudo count a                     	1
Pseudo count b                     	1.5
Number search iterations           	1
Start sensitivity                  	4
sensitivity step size              	1
Sets the MPI runner                	
Remove Temporary Files             	false

/root/tmp/blast
/root/tmp/blast
Program call:
mmseq-testDB /junk/databases/mmseqs/nr /root/tmp/blast/tmp/pref_4 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 64 -v 3 -s 4 

MMseqs Version:           	a81227565da4e95d233e3bcbd5c0cdc6ada1c14a
Sub Matrix                	blosum62.out
Sensitivity               	4
K-mer size                	0
K-score                   	2147483647
Alphabet size             	21
Max. sequence length      	32000
Query queryProfile        	false
Target queryProfile       	false
Max. results per query    	300
Offset result             	0
Split DB                  	0
Split mode                	2
Coverage threshold        	0
Compositional bias        	1
Diagonal Scoring          	1
Mask Residues             	1
Minimum Diagonal score    	15
Include identical Seq. Id.	false
Spaced Kmer               	1
Threads                   	64
Verbosity                 	3

Initialising data structures...
Using 64 threads.

Use index  /junk/databases/mmseqs/nr.sk7
Index version: 774909490
KmerSize:     7
AlphabetSize: 21
Skip:         0
Split:        0
Type:         1
Spaced:       1
Query database: mmseq-testDB(size=2467)
Target database: /junk/databases/mmseqs/nr(size=121935717)
Use kmer size 7 and split 0 using split mode 0
tmp/blastp.sh: line 60: 68389 Floating point exception   $RUNNER $MMSEQS prefilter "$INPUT" "$TARGET_DB_PREF" "$TMP_PATH/pref_$SENS" $PREFILTER_PAR -s $SENS
Program call:
mmseq-testDB /junk/databases/mmseqs/nr /root/tmp/blast/tmp/pref_4 /root/tmp/blast/tmp/aln_4 --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --target-cov 0 --max-seq-len 32000 --max-seqs 300 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 64 -v 3 

MMseqs Version:           	a81227565da4e95d233e3bcbd5c0cdc6ada1c14a
Sub Matrix                	blosum62.out
Add backtrace             	false
Alignment mode            	0
E-value threshold         	0.001
Seq. Id Threshold         	0
Coverage threshold        	0
Target Coverage threshold 	0
Max. sequence length      	32000
Max. results per query    	300
Compositional bias        	1
Query queryProfile        	false
Realign hit               	false
Max Reject                	2147483647
Max Accept                	2147483647
Include identical Seq. Id.	false
Threads                   	64
Verbosity                 	3

Init data structures...
Compute score only.
Using 64 threads.
Could not open data file /root/tmp/blast/tmp/pref_4!
mv: cannot stat '/root/tmp/blast/tmp/aln_4': No such file or directory

Index creation and search are done on the same machine.

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
MMseqs Version: a812275
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):

self-compiled

For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:

cmake --version
cmake version 2.8.12.2
cmake -DHAVE_MPI=0 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..
c++ --version
c++ (GCC) 6.2.1 20160916 (Red Hat 6.2.1-3)

Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
cat /proc/cpuinfo

processor	: 63
vendor_id	: GenuineIntel
cpu family	: 6
model		: 46
model name	: Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
stepping	: 6
microcode	: 0xb
cpu MHz		: 1064.000
cache size	: 24576 KB
physical id	: 3
siblings	: 16
core id		: 11
cpu cores	: 8
apicid		: 119
initial apicid	: 119
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm ida epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4527.83
clflush size	: 64
cache_alignment	: 64
address sizes	: 44 bits physical, 48 bits virtual
power management:

free

              total        used        free      shared  buff/cache   available
Mem:      528377212     3193792   142947764        9564   382235656   523907652
Swap:             0           0           0

Operating system and version:
cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

result2msa --compress - Segmentation fault

Dear MMseqs developers,
I have encountered a problem with converting clustering results into ca3m MSAs.
result2msa works fine but adding '--compress' option causes a segmentation fault.

Below is the GDB output:

Starting program: /usr/lusers/aivan/prog/MMseqs2/build/bin/mmseqs result2msa DB DB clu cluMsa --compress
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
MPI Init...
Rank: 0 Size: 1
Program call:
DB DB clu cluMsa --compress 

MMseqs Version:                    	31e25cb081a874f225d443eec307a6254f06a291
Sub Matrix                         	blosum62.out
Query queryProfile                 	false
Profile e-value threshold          	0.001
Allow Deletion                     	false
Add internal id                    	false
Compositional bias                 	1
Filter MSA                         	0
Maximum sequence identity threshold	0.9
Minimum seq. id.                   	0
Minimum score per column           	-20
Minimum coverage                   	0
Select n most diverse seqs         	1000
Threads                            	28
Verbosity                          	3
Compress MSA                       	true
Summarize headers                  	false
Summary prefix                     	cl
Omit Consensus                     	false
Skip Query                         	false

Compute split from 0 to 1089144
Using the first target sequence as center sequence for making each alignment.
[New Thread 0x2aaae4488700 (LWP 9149)]
[New Thread 0x2aaae4689700 (LWP 9150)]
[New Thread 0x2aaae488a700 (LWP 9151)]
[New Thread 0x2aaae4a8b700 (LWP 9152)]
[New Thread 0x2aaae4c8c700 (LWP 9153)]
[New Thread 0x2aaae4e8d700 (LWP 9154)]
[New Thread 0x2aaae508e700 (LWP 9155)]
[New Thread 0x2aaae528f700 (LWP 9156)]
[New Thread 0x2aaae5490700 (LWP 9157)]
[New Thread 0x2aaae5691700 (LWP 9158)]
[New Thread 0x2aaae5892700 (LWP 9159)]
[New Thread 0x2aaae5a93700 (LWP 9160)]
[New Thread 0x2aaae5c94700 (LWP 9161)]
[New Thread 0x2aaae5e95700 (LWP 9162)]
[New Thread 0x2aaae6096700 (LWP 9163)]
[New Thread 0x2aaae6297700 (LWP 9164)]
[New Thread 0x2aaae6498700 (LWP 9165)]
[New Thread 0x2aaae6699700 (LWP 9166)]
[New Thread 0x2aaae689a700 (LWP 9167)]
[New Thread 0x2aaae6a9b700 (LWP 9168)]
[New Thread 0x2aaae6c9c700 (LWP 9169)]
[New Thread 0x2aaae6e9d700 (LWP 9170)]
[New Thread 0x2aaae709e700 (LWP 9171)]
[New Thread 0x2aaae729f700 (LWP 9172)]
[New Thread 0x2aaae74a0700 (LWP 9173)]
[New Thread 0x2aaae76a1700 (LWP 9174)]
[New Thread 0x2aaae78a2700 (LWP 9175)]
Start computing compressed multiple sequence alignments.

Program received signal SIGSEGV, Segmentation fault.
0x00000000004e0415 in result2outputmode () at /usr/lusers/aivan/prog/MMseqs2/src/util/result2profile.cpp:349
349	                        if (res.keep[i + 1]) {
(gdb) bt
#0  0x00000000004e0415 in result2outputmode () at /usr/lusers/aivan/prog/MMseqs2/src/util/result2profile.cpp:349
#1  0x00000000004dd47f in result2outputmode (par=..., outpath="cluMsa_ca3m_tmp_0", dbFrom=0, dbSize=1089144, mode=2, referenceDBr=0x8d6ca0) at /usr/lusers/aivan/prog/MMseqs2/src/util/result2profile.cpp:166
#2  0x00000000004ded26 in result2outputmode (par=..., mode=2, mpiRank=0, mpiNumProc=1) at /usr/lusers/aivan/prog/MMseqs2/src/util/result2profile.cpp:564
#3  0x00000000004df6f5 in result2msa (argc=5, argv=0x7fffffffd8d8, command=...) at /usr/lusers/aivan/prog/MMseqs2/src/util/result2profile.cpp:657
#4  0x00000000004d5184 in runCommand (p=..., argc=5, argv=0x7fffffffd8d8) at /usr/lusers/aivan/prog/MMseqs2/src/mmseqs.cpp:361
#5  0x00000000004d54b2 in main (argc=7, argv=0x7fffffffd8c8) at /usr/lusers/aivan/prog/MMseqs2/src/mmseqs.cpp:406

Thanks!

Issue when translating nucleotide sequence DB into protein sequence DB

Expected Behavior

Create a translated db

Current Behavior

Error in argument --translation-table

Steps to Reproduce (for bugs)

mmseqs translatenucs db_nuc.db db_prot.db --translation-table 7
OR
mmseqs translatenucs db_nuc.db db_prot.db --translation-table 8
OR
mmseqs translatenucs db_nuc.db db_prot.db --translation-table 0

MMseqs Output (for bugs)

mmseqs translatenucs:
Translate nucleotide sequence DB into protein sequence DB

Please cite:
Steinegger, M. & Soding, J. Sensitive protein sequence searching for the analysis of massive data sets. bioRxiv 079681 (2016)

© Milot Mirdita

Usage: <i:sequenceDB> <o:sequenceDB> [options]

misc options         	default   	description [value range]
  --translation-table	0         	CANONICAL=0, VERT_MITOCHONDRIAL=1, YEAST_MITOCHONDRIAL=2,MOLD_MITOCHONDRIAL=3, INVERT_MITOCHONDRIAL=4, CILIATE=5, FLATWORM_MITOCHONDRIAL=6, EUPLOTID=7, PROKARYOTE=8, ALT_YEAST=9, ASCIDIAN_MITOCHONDRIAL=10, ALT_FLATWORM_MITOCHONDRIAL=11, BLEPHARISMA=12, CHLOROPHYCEAN_MITOCHONDRIAL=13, TREMATODE_MITOCHONDRIAL=14, SCENEDESMUS_MITOCHONDRIAL=15, THRAUSTOCHYTRIUM_MITOCHONDRIAL=16, PTEROBRANCHIA_MITOCHONDRIAL=17, GRACILIBACTERIA=18, PACHYSOLEN=19, KARYORELICT=20, CONDYLOSTOMA=21, MESODINIUM=22, PERTRICH=23, BLASTOCRITHIDIA=24 (Note gaps between tables)

common options       	default   	description [value range]
  -v                 	3         	verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

Error in argument --translation-table

Context

db_nuc.db contains the nucleotide sequences to be translated.
Other --translation-table values (1, 2, ...) seem to work; only the values 0, 7 and 8 produce this error: Error in argument --translation-table

Your Environment

mmseqs version 52c131b
Fedora 25

tempfile issue

I've been trying to perform some basic tests with MMseqs2 and have encountered an issue where I repeatedly get the following error message:

 Init data structures...
 Compute score only.
 Could not open data file [path_to_dir]/mmseqs_tmp/pref_4!

The "[path_to_dir]/mmseqs_tmp/" directory contains two temporary files (pref_4.index_tmp_0.0 and pref_4_tmp_0.0) along with a blastp.sh script.

I'm not using any advanced options for my search, and both input databases are (as far as I can see) formatted correctly. Maybe I'm overlooking something embarrassingly simple?

Thanks!

.index file misplaced upon createdb ?

Running ./mmseqs createdb mydb.fasta tmp/db creates a db.index file in the tmp/ folder, but adding the '--max-seq-len 14000' option results in a 14000.index file in the current folder (not the tmp/ folder) and no tmp/db.index file. Is this the expected behavior? I'm using the latest mmseqs version.

best way to extract a subset of clusters from a larger clusterDB?

Hi there,
what would be the best way to create a cluster DB containing only a subset of cluster IDs from a larger cluster DB? Can mmseqs createsubdb be used reliably with a set of human-readable cluster IDs?
thanks! CCing @KristinaKa

How does mmseqs evalue compare to blastp evalue?

Just a quick question about mmseqs evalues. In the current version of the preprint, there is a line about how the mmseqs evalue compares to blastp evalue...

E ≤ 0.1 corresponds to the same false discovery rate as E ≤ 0.01 in BLAST.

If you run mmseqs search and blastp on the same queries against the same database, is the mmseqs evalue always going to be equivalent to 10 times the blastp evalue?

ERROR: Set 0 has more elements than allocated (1)!

I was trying to use MMSEQS2 for clustering with:

mmseqs cluster DB clu tmp --cascaded -e 1e-5 --max-seqs 30000 --similarity-type 1

It was working fine (MMseqs version ad5b994), but when I try to execute it on another machine (both macOS El Capitan 10.11.6, this time MMseqs version e3ca470), I get several error messages like the following:

ERROR: Set 0 has more elements than allocated (1)!

and the clustering never finishes.

Any idea what causes this or how to solve it?

Thank you in advance

mmseqs_update: Problem with data file.

When I try to run mmseqs_update on a data set, the error below appears.
ERROR:
mmseqs_update: Problem with data file. Is it empty or is another process readning it?: Invalid argument
commons/DBReader.cpp:49 ffindex_index_parse: /anno/narsapvi/fern_clustering/run/tmp/A.index: Invalid argument
What could be a possible solution for this?

Create profiles from multiline fasta MSA

Expected Behavior

When an MSA DB is built from a FASTA MSA whose sequences span multiple lines, msa2profile should create a profile.

Current Behavior

It currently fails with the error: Member sequence 0 in entry 0 too long!

Steps to Reproduce (for bugs)

Fasta MSA with multi-line: test_msa_ml.fa.gz
Fasta MSA with single line: test_msa_sl.fa.gz

When creating the profile with the multi line MSA:

ffindex_build -s test_msa_ml_db test_msa_ml_db.index test_msa_ml.fa
mmseqs msa2profile test_msa_ml_db test_profile_ml_db --msa-type 2 --threads 1

It results in:

Program call:
test_msa_ml_db test_profile_ml_db --msa-type 2 --threads 1

MMseqs Version:                         f1307fed7fda354ba3a1d126960de17e95ef70e9
MSA type                                2
Sub Matrix                              blosum62.out
Pseudo count a                          1
Pseudo count b                          1.5
Compositional bias                      1
Use global sequence weighting           false
Filter MSA                              1
Minimum coverage                        0
Minimum seq. id.                        0
Minimum score per column                -20
Maximum sequence identity threshold     0.9
Select n most diverse seqs              1000
Threads                                 1
Verbosity                               3

Finding maximum sequence length and set size.
Compute profiles from MSAs.
Member sequence 0 in entry 0 too long!
Invalid msa 0! Skipping entry.
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for processing: 0 h 0 m 0s

But when using the fasta MSA with single lines:

ffindex_build -s test_msa_sl_db test_msa_sl_db.index test_msa_sl.fa
mmseqs msa2profile test_msa_sl_db test_profile_sl_db --msa-type 2 --threads 1

It seems to work perfectly:

Program call:
test_msa_sl_db test_profile_sl_db --msa-type 2 --threads 1

MMseqs Version:                         f1307fed7fda354ba3a1d126960de17e95ef70e9
MSA type                                2
Sub Matrix                              blosum62.out
Pseudo count a                          1
Pseudo count b                          1.5
Compositional bias                      1
Use global sequence weighting           false
Filter MSA                              1
Minimum coverage                        0
Minimum seq. id.                        0
Minimum score per column                -20
Maximum sequence identity threshold     0.9
Select n most diverse seqs              1000
Threads                                 1
Verbosity                               3

Finding maximum sequence length and set size.
Compute profiles from MSAs.
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
Time for processing: 0 h 0 m 0s
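
Since the single-line file is reported to work, one possible workaround is to unwrap the multi-line FASTA MSA before building the database. The following is a minimal Python sketch of that preprocessing step. The function name unwrap_fasta is hypothetical, and the sketch does not address the underlying parsing behavior of msa2profile.

```python
import sys


def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Header lines (starting with '>') are passed through unchanged; all
    sequence lines between two headers are concatenated. Gap characters
    are kept as they are, so aligned columns are preserved.
    """
    sequence_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if sequence_parts:
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        elif line:
            sequence_parts.append(line)
    if sequence_parts:
        yield "".join(sequence_parts)


if __name__ == "__main__":
    # Reads a FASTA file from stdin and writes the unwrapped form to stdout.
    for out_line in unwrap_fasta(sys.stdin):
        print(out_line)
```

Saved under a name of your choice (for example unwrap_fasta.py, a hypothetical file name), it could be run as `python unwrap_fasta.py < test_msa_ml.fa > unwrapped.fa` before calling ffindex_build on the result.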

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
f1307fe
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
self-compiled
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
gcc version 5.4.0 and cmake version 3.9.4
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
CPU: Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz
Memory: 2TB
Operating system and version:
Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux

uniclust database link dead

Expected Behavior

Go to uniclust website (https://uniclust.mmseqs.com/) and click "Download" to download uniclust database from "http://wwwuser.gwdg.de/~compbiol/uniclust/current_release/"

Current Behavior

The website http://wwwuser.gwdg.de/~compbiol/uniclust/current_release/ complains:
Access forbidden!
You don't have permission to access the requested directory. There is either no index document or the directory is read-protected.
If you think this is a server error, please contact the webmaster.
Error 403
wwwuser.gwdg.de

Steps to Reproduce (for bugs)

http://wwwuser.gwdg.de/~compbiol/uniclust/current_release/

Compiling on Mac Sierra gives error

I get the following when compiling on mac:

make VERBOSE=1
...
[ 69%] Linking CXX executable mmseqs
cd /Users/rfonseca/Programs/mmseqs2/build/src && /usr/local/Cellar/cmake/3.6.3/bin/cmake -E cmake_link_script CMakeFiles/mmseqs.dir/link.txt --verbose=1
/usr/local/bin/gcc-6    -DOPENMP=1 -fopenmp  -mavx2 -mfpmath=sse -Wa,-q -std=c++0x -m64 -pedantic -Wall -Wextra -Winline -Wdisabled-optimization -Wno-unused-parameter -O3 -DNDEBUG  -mavx2 -mfpmath=sse -Wa,-q -O3 -DNDEBUG -ffast-math -fno-exceptions -ftree-vectorize -fno-strict-aliasing -Wl,-search_paths_first -Wl,-headerpad_max_install_names  CMakeFiles/mmseqs.dir/mmseqs.cpp.o  -o mmseqs  util/libutil.a workflow/libworkflow.a /usr/lib/libz.dylib util/libutil.a prefiltering/libprefiltering.a alignment/libalignment.a clustering/libclustering.a commons/libcommons.a
Undefined symbols for architecture x86_64:
  "std::ctype<char>::_M_widen_init() const", referenced from:
      diffseqdbs(int, char const**, Command const&) in libutil.a(diffseqdbs.cpp.o)
      createsubdb(int, char const**, Command const&) in libutil.a(createsubdb.cpp.o)
      readAllKeysIntoMap(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) in libutil.a(swapresults.cpp.o)
      mergeClusteringResults(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) in libutil.a(mergeclusters.cpp.o)
      readLength(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) in libutil.a(summarizetabs.cpp.o)
      getEntries(unsigned int, char*, unsigned long, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned int> > > const&) in libutil.a(summarizetabs.cpp.o)
      getEntries(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) in libutil.a(extractdomains.cpp.o)
      ...

Any idea?

"there must be an error" messages while clustering

While clustering a database of about 10 million proteins, we receive lots of "there must be an error" messages, e.g.:

there must be an error: 1835775 deleted from 4496505 that now is empty, but not assigned to a cluster
there must be an error: 6964402 deleted from 7957737 that now is empty, but not assigned to a cluster
there must be an error: 2422408 deleted from 2413081 that now is empty, but not assigned to a cluster
...

This is the script running the clustering:

MMSEQS=mmseqs-static_avx2
SEQFILE=eggnog4.proteins.core_periphery.fa.gz
pigz -dc $SEQFILE >$TMPDIR/db.fasta
$MMSEQS createdb $TMPDIR/db.fasta $TMPDIR/db
mkdir $TMPDIR/clutmp
$MMSEQS cluster $TMPDIR/db $TMPDIR/db_clu $TMPDIR/clutmp --cascaded -e 1 --threads 16

The effect is independent from the mmseqs2 binary (your static avx2 binary, self-compiled sse4.1 binary). We run Fedora 25 with 4.9.5 kernel. The machine has 40 cores and 1 TB RAM.

Thanks for your help,
Thomas Rattei

--target-cov failing

Based on the latest version from Master, it appears that the --target-cov option is not currently working with mmseqs cluster. Specifically, something like:

mmseqs cluster DB DB_clu tmp --min-seq-id 0.4 --target-cov 0.8 --cluster-mode 2

returns a number of errors. I think what's happening is that internally, the options -c 0 and --target-cov 0.8 are both being passed to other commands (e.g. linclust), which fail because they do not expect both options at once. Note that I am not passing -c 0 to the main command; it is generated internally and passed to the other commands.

Out of disk space after alignment, can I pick up from here?

Hi, I've been aligning 11 million sequences for the last couple of days in an all-vs-all fashion.
I've passed the prefilter and pairwise alignment steps, but I ran out of space on the 2 TB HDD while trying to merge the final indexes and alignments.
Is there any way of completing the final step (maybe by writing to another free disk or by deleting the prefilter file), or is my only option to recalculate the all-vs-all in smaller batches?
Thanks for the help! And for making an 11Mx11M alignment feasible without a big cluster.

Andres

Support more substitution matrices

Expected Behavior

It should be possible to change the substitution matrix in MMseqs2.

Current Behavior

Currently MMseqs2 only supports matrices that are defined in BlastScoreUtils.
We have to fit lambda, k, alpha, beta. Maybe ALP (https://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/bib/alp_lib_v1_93.html) can help.

Is it possible to run the suite on DNA sequences?

As the title says, I am trying to use mmseqs cluster for DNA sequence clustering, but it did not work. How can I apply the suite to DNA sequence clustering?

[feature request] mmseqs prefilter missing option

The following options (available in mmseqs search but not in mmseqs prefilter) would be extremely useful when running a prefilter analysis:

-e                 evalue cut-off
--max-accept       max number of hits to report (currently all matches are reported creating very big files)
--start-sens       start sensitivity
--sens-step-size   sensitivity step sizes

Reading compressed fasta files? (suggested feature)

Expected Behavior

I expect mmseqs to also read files compressed with bzip2

Current Behavior

I was pleasantly surprised to find that mmseqs reads files compressed with gzip.

Context

I accidentally used a gzipped file and the database was created all right. I tried a zipped file and it did not work.

Your Environment

Git commit used: 4051933
Which MMseqs version was used: self-compiled
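
For illustration, here is a minimal Python sketch of how an input file's compression format can be recognized from its leading bytes, independent of the file extension. This is not how mmseqs itself detects gzip input (the source does not say); the names MAGIC_NUMBERS, detect_compression and detect_file_compression are hypothetical. The byte signatures are the standard ones for gzip, bzip2 and zip.

```python
# Leading "magic" bytes of common compression formats.
MAGIC_NUMBERS = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),
)


def detect_compression(header):
    """Return the format name matching the first bytes, or None."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None


def detect_file_compression(path):
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return detect_compression(handle.read(4))


if __name__ == "__main__":
    import bz2
    import gzip

    print(detect_compression(gzip.compress(b">seq\nACGT\n")))
    print(detect_compression(bz2.compress(b">seq\nACGT\n")))
    print(detect_compression(b">seq\nACGT\n"))
```

A check like this lets a wrapper script decompress the file with the right tool, or reject an unsupported format with a clear message, before createdb is called.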

Creating fasta files from clustering results

Hi
I would like to know if there is a native way in MMseqs2 to create separate files from the clustering results. Currently we use createseqfiledb to get a FASTA version of the clustering database, and we extract the cluster sequences from it using awk like this:

mawk '/^>/{close("cluster_"i".fa"); x="cluster_"++i".fa"}{gsub("\x0", "", $0); print > x}' cluster_db_fa

I looked through the documentation and I wasn't able to find anything. Do you have a better solution?

Many thanks
Antonio
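
An alternative to the awk approach, sketched in Python. It assumes that the data file written by createseqfiledb holds one entry per cluster and that entries are separated by NUL bytes, which is what the gsub("\x0", ...) in the awk command suggests but is not confirmed here. The function names and the cluster_ file prefix are illustrative only.

```python
from pathlib import Path


def split_nul_delimited(data):
    """Split a byte string into non-empty entries at NUL bytes."""
    return [entry for entry in data.split(b"\x00") if entry.strip()]


def write_cluster_files(db_path, out_dir, prefix="cluster_"):
    """Write each NUL-delimited entry of db_path to its own FASTA file.

    The whole file is read into memory, so this only suits databases
    that fit comfortably in RAM.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    entries = split_nul_delimited(Path(db_path).read_bytes())
    paths = []
    for number, entry in enumerate(entries, start=1):
        path = out / f"{prefix}{number}.fa"
        path.write_bytes(entry)
        paths.append(path)
    return paths
```

Calling write_cluster_files("cluster_db_fa", "clusters") would then produce clusters/cluster_1.fa, clusters/cluster_2.fa and so on, one file per entry, under the stated assumption about the file layout.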

Better handling of wrong input

Expected Behavior

Shouldn't segfault.
Ideally it should present the user with an error and some guidance on what to do.

Current Behavior

Segfaults during prefilter stage due to:

Query database: /share/input.fasta(size=0)

Steps to Reproduce (for bugs)

Create a FASTA file containing 1 sequence.
Run mmseqs createdb input.fasta db
rm -rf tmp && mkdir tmp
Run mmseqs search input.fasta db test tmp
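
In the steps above, the FASTA file itself is passed as the query rather than a database created from it. A wrapper script could catch this before calling mmseqs. The following Python sketch assumes that a database consists of a data file plus a companion file with the suffix .index, as the error messages elsewhere on this page suggest; the function names are hypothetical and this is not the validation mmseqs performs.

```python
import os


def looks_like_database(path):
    """Return True if `path` and a non-empty `path + '.index'` both exist."""
    index_path = path + ".index"
    return (
        os.path.isfile(path)
        and os.path.isfile(index_path)
        and os.path.getsize(index_path) > 0
    )


def check_inputs(*paths):
    """Return a list of human-readable problems, empty if none were found."""
    problems = []
    for path in paths:
        if not looks_like_database(path):
            problems.append(
                f"{path}: no usable {path}.index found; "
                "was this created with createdb?"
            )
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_inputs.py QUERY_DB TARGET_DB
    issues = check_inputs(*sys.argv[1:])
    for issue in issues:
        print(issue, file=sys.stderr)
    sys.exit(1 if issues else 0)
```

With the paths from the report, check_inputs("input.fasta", "db") would flag input.fasta, because createdb was run with db as the output name and no input.fasta.index exists.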

MMseqs Output (for bugs)

Program call:
input.fasta db test tmp 

MMseqs Version:                    	7947b0035eef9ba41b64b0c752b0432465aaeb7c
Sub Matrix                         	blosum62.out
Add backtrace                      	false
Alignment mode                     	0
E-value threshold                  	0.001
Seq. Id Threshold                  	0
Coverage threshold                 	0
Coverage Mode                      	0
Max. sequence length               	32000
Max. results per query             	300
Compositional bias                 	1
Query queryProfile                 	false
Realign hit                        	false
Max Reject                         	2147483647
Max Accept                         	2147483647
Include identical Seq. Id.         	false
No preload                         	false
Early exit                         	false
Threads                            	40
Verbosity                          	3
Sensitivity                        	5.7
K-mer size                         	0
K-score                            	2147483647
Alphabet size                      	21
Target queryProfile                	false
Offset result                      	0
Split DB                           	0
Split mode                         	2
Diagonal Scoring                   	1
Mask Residues                      	1
Minimum Diagonal score             	15
Spaced Kmer                        	1
Profile e-value threshold          	0.001
Use global sequence weighting      	false
Filter MSA                         	1
Maximum sequence identity threshold	0.9
Minimum seq. id.                   	0
Minimum score per column           	-20
Minimum coverage                   	0
Select n most diverse seqs         	1000
Pseudo count a                     	1
Pseudo count b                     	1.5
Omit Consensus                     	false
Number search iterations           	1
Start sensitivity                  	4
sensitivity step size              	1
Sets the MPI runner                	
Remove Temporary Files             	false

Program call:
/share/input.fasta db /share/tmp/pref_5 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 40 -v 3 -s 5 

MMseqs Version:           	7947b0035eef9ba41b64b0c752b0432465aaeb7c
Sub Matrix                	blosum62.out
Sensitivity               	5
K-mer size                	0
K-score                   	2147483647
Alphabet size             	21
Max. sequence length      	32000
Query queryProfile        	false
Target queryProfile       	false
Max. results per query    	300
Offset result             	0
Split DB                  	0
Split mode                	2
Coverage threshold        	0
Coverage Mode             	0
Compositional bias        	1
Diagonal Scoring          	1
Mask Residues             	1
Minimum Diagonal score    	15
Include identical Seq. Id.	false
Spaced Kmer               	1
No preload                	false
Early exit                	false
Threads                   	40
Verbosity                 	3

Initialising data structures...
Using 40 threads.
Could not find precomputed index. Compute index.
Use kmer size 6 and split 1 using Target split mode.
Needed memory (1374076390 byte) of total memory (270920568832 byte)
Target database: db(Size: 1)
Substitution matrices...
Time for init: 0 h 0 m 0s


Query database: /share/input.fasta(size=0)
Process prefiltering step 1 of 1

Index table: counting k-mers...

Index table: Masked residues: 188
Index table: fill...
Index table: removing duplicate entries...
Index table init done.

DB statistic
Entries:         16342
DB Size:         686227020 (byte)
Avg Kmer Size:   0.000190541
Top 10 Kmers
   	RHCCAA		1
	QCICAA		1
	WSQFAA		1
	WQPHAA		1
	HPKLAA		1
	GHLLAA		1
	WRPNAA		1
	PHCQAA		1
	HRCQAA		1
	FHNQAA		1
Min Kmer Size:   0
Empty list: 85749779

Time for index table init: 0 h 0 m 1s


k-mer similarity threshold: 95
k-mer match probability: 0

tmp/blastp.sh: line 77: 32467 Segmentation fault      $RUNNER $MMSEQS prefilter "$INPUT" "$TARGET_DB_PREF" "$TMP_PATH/pref_$SENS" $PREFILTER_PAR -s $SENS
Error: Prefilter died

Issue

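This was due to my not reading the manual: the query passed to search was the FASTA file rather than a database created with createdb. Still, in general a segfault is not a good way to say goodbye :)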

The source of the problem may have been:

Query database: /share/input.fasta(size=0)
                                        ^
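A minimal sketch of a pre-flight check that would have caught this before the prefilter ran. It assumes that a database written by `mmseqs createdb` always has a companion `<name>.index` file next to it (as in the `DB` / `DB.index` pairs seen in the logs on this page); the helper names are hypothetical and not part of MMseqs2.

```python
import os


def looks_like_mmseqs_db(path):
    """Return True if `path` exists and has a companion `<path>.index` file.

    Assumption: createdb writes a data file plus `<path>.index`;
    a raw FASTA file has no such index next to it.
    """
    return os.path.isfile(path) and os.path.isfile(path + ".index")


def check_search_inputs(query, target):
    """Return a list of human-readable problems; empty if both inputs look fine."""
    problems = []
    for role, path in (("query", query), ("target", target)):
        if not looks_like_mmseqs_db(path):
            problems.append(
                f"{role} '{path}' has no '{path}.index'; "
                "run 'mmseqs createdb' on it first"
            )
    return problems
```

Called as `check_search_inputs("input.fasta", "db")`, this would report the query as a problem, which matches the `size=0` query database in the log above.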

metaclust: provide an hhblits-compatible format of metaclust50/95

metaclust50 and metaclust95 were clustered with mmseqs2. They are only available as gzipped FASTA. I was wondering whether it is possible to get an hhsuite-compatible format of metaclust so that I can use hhblits to search my query against it.

"Can not fit databased into 65536 byte"

I just compiled mmseqs2 from the latest source on an AVX2 machine and got the following error:

$ grep -c '>' test.fa
50
$ mmseqs createdb test.fa test.db
Program call:
test.fa test.db

MMseqs Version:         7885b8e5b2d4cb016d1c84455ae7d35b728497bd
Max. sequence length    32000
Split Seq. by len       true
Use fasta header        false
Offset of numeric ids   0
Verbosity               3

Time for merging files: 0 h 0 m 0 s
Time for merging files: 0 h 0 m 0 s
$ mmseqs createindex test.db
Program call:
test.db

MMseqs Version:         7885b8e5b2d4cb016d1c84455ae7d35b728497bd
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
Max. sequence length    32000
Split DB                0
Spaced Kmer             1
Threads                 40
Verbosity               3

Substitution matrices...
Can not fit databased into 65536 byte. Please use a computer with more main memory.

The same happened using the Linux AVX2 binaries, but not with the SSE4.1 binaries.

MPI not working

Hi, I'm trying to run the MPI version of MMseqs2, without luck...
I've compiled the latest git version (259cecb) with OpenMPI:
ldd /ngs/software/mmseqs/mmseqs-MPI

linux-vdso.so.1 => (0x00007ffdeadb7000)
libmpi_cxx.so.1 => /usr/lib64/openmpi/lib/libmpi_cxx.so.1 (0x00007f439eaaa000)
libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f439e7c5000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f439e4ad000)
libm.so.6 => /lib64/libm.so.6 (0x00007f439e1ab000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f439df84000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f439dd6e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f439db52000)
libc.so.6 => /lib64/libc.so.6 (0x00007f439d790000)
libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f439d513000)
libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f439d26f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f439d06a000)
librt.so.1 => /lib64/librt.so.1 (0x00007f439ce62000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f439cc5f000)
libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f439ca24000)
/lib64/ld-linux-x86-64.so.2 (0x00007f439ecc6000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f439c818000)
libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f439c60d000)

When I launch:
RUNNER="/usr/lib64/openmpi/bin/mpirun -np 32 " mmseqs-MPI search mmseq-testDB /junk/databases/mmseqs/test-nr-50m test-nr50-th32-MPI tmp --threads 32

It just uses 4 CPU cores and doesn't get too far:
Program call:
mmseq-testDB /junk/databases/mmseqs/test-nr-50m test-nr50-th32-MPI tmp --threads 32


MMseqs Version:                    	259cecbe2dcb0826f139bab3787daa03e83717bc
Sub Matrix                         	blosum62.out
Add backtrace                      	false
Alignment mode                     	0
E-value threshold                  	0.001
Seq. Id Threshold                  	0
Coverage threshold                 	0
Target Coverage threshold          	0
Max. sequence length               	32000
Max. results per query             	300
Compositional bias                 	1
Query queryProfile                 	false
Realign hit                        	false
Max Reject                         	2147483647
Max Accept                         	2147483647
Include identical Seq. Id.         	false
No preload                         	false
Early exit                         	false
Threads                            	32
Verbosity                          	3
Sensitivity                        	5.7
K-mer size                         	0
K-score                            	2147483647
Alphabet size                      	21
Target queryProfile                	false
Offset result                      	0
Split DB                           	0
Split mode                         	2
Diagonal Scoring                   	1
Mask Residues                      	1
Minimum Diagonal score             	15
Spaced Kmer                        	1
Profile e-value threshold          	0.001
Use global sequence weighting      	false
Filter MSA                         	1
Maximum sequence identity threshold	0.9
Minimum seq. id.                   	0
Minimum score per column           	-20
Minimum coverage                   	0
Select n most diverse seqs         	1000
Pseudo count a                     	1
Pseudo count b                     	1.5
Number search iterations           	1
Start sensitivity                  	4
sensitivity step size              	1
Sets the MPI runner                	/usr/lib64/openmpi/bin/mpirun -np 32
Remove Temporary Files             	false

It doesn't crash, but it does not get any further.
Using additional MPI options:
RUNNER="/usr/lib64/openmpi/bin/mpirun -np 32 --report-bindings --map-by core --bind-to core" /ngs/software/mmseqs/mmseqs-MPI search mmseq-testDB /junk/databases/mmseqs/test-nr-50m test-nr50-th32-MPI tmp --threads 32

This uses 32 CPU cores, but gets stuck at the same point.
With the same query and target databases, the regular (non-MPI) parallel version runs in under a minute.

Core dumping during "createindex"

Hello,
I'm eager to try mmseqs but have been unsuccessful at building the database for UniRef90, as mmseqs seg faults and core dumps during the "createindex" phase. I tried both the precompiled mmseqs and my own compiled version, without --split or --threads, as well as with various combinations of both --split and --threads. My machine has 128 GB RAM (and same size swap space) and 6TB free space on hard drive. The process dies before ever reaching more than 40% of RAM. The output is:

mmseqs createdb  uniref90/uniref90.fasta uniref90.mms

ls -ltr uniref90/
-rw-r--r-- 1 hingamp.p MIO 19965315337 sept. 28 22:51 uniref90.fasta
-rw-r--r-- 1 hingamp.p MIO  1244574423 sept. 29 00:07 uniref90.mms.lookup
-rw-r--r-- 1 hingamp.p MIO  4652608129 sept. 29 00:07 uniref90.mms_h
-rw-r--r-- 1 hingamp.p MIO  1025056829 sept. 29 00:09 uniref90.mms_h.index
-rw-r--r-- 1 hingamp.p MIO 15172645206 sept. 29 00:09 uniref90.mms
-rw-r--r-- 1 hingamp.p MIO  1063262010 sept. 29 00:11 uniref90.mms.index

mmseqs createindex uniref90/uniref90.mms --split 10 --threads 20
Program call:
uniref90/uniref90.mms --split 10 --threads 20 

MMseqs Version:         ab6d7b3105611a0860c801603997f1721785916a
Sub Matrix              blosum62.out
K-mer size              0
Alphabet size           21
max. sequence length    32000
Split DB                10
spaced Kmer             1
Threads                 20
Verbosity               3

Substitution matrices...
Index table: counting k-mers...
.WARNING: Sequence (dbKey=10870) contains only ATGC. It might be a nucleotide sequence.
..................................................................................................  1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
...............................................................................
Index table: Masked residues: 26370434
Index table: fill...
................................................................................................... 1 Mio. sequences processed
................................................................................................... 2 Mio. sequences processed
................................................................................................... 3 Mio. sequences processed
...............................................................................
Index table: removing duplicate entries...
Index table init done.

Write 10
Write 20
Write 60
Write 70
Write 80
Write 30
Write 40
Index table: counting k-mers...
................................................................................................... 1 Mio. sequences processed
...........................................................................................WARNING: Sequence (dbKey=5712154) contains only ATGC. It might be a nucleotide sequence.
........    2 Mio. sequences processed
...........................................................WARNING: Sequence (dbKey=6387662) contains only ATGC. It might be a nucleotide sequence.
........................................    3 Mio. sequences processed
...................................................................
Index table: Masked residues: 47802947
Index table: fill...
Erreur de segmentation (core dumped)

Many thanks for any help or advice. I have watched the mmseqs demo on https://www.youtube.com/watch?v=LqiHyCLjPno and am looking forward to enjoying the huge simplification it promises (the last example in the demo is the 2bLCA we applied to metagenomics data and my workflow was much more complex and slow than with mmseqs!)...
Best,
Pascal

[feature request] support for reading gzipped FASTA files in mmseqs createdb

Creating databases directly from gzipped FASTA files would be very useful when dealing with huge datasets, which are usually compressed.
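Until such support exists, one possible workaround is to decompress the file in a streaming fashion and hand the plain FASTA to createdb. The sketch below uses only the Python standard library; the function name and the file paths are hypothetical.

```python
import gzip
import shutil


def gunzip_fasta(gz_path, out_path):
    """Decompress `gz_path` into `out_path` without loading it all into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

After `gunzip_fasta("seqs.fasta.gz", "seqs.fasta")`, the result can be passed to `mmseqs createdb seqs.fasta DB` as usual. The cost is the temporary disk space for the uncompressed copy.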

segfault in prefilter

I'm running uniclust_workflow.sh script on a custom fasta database. The first prefiltering step causes a segmentation fault. GDB output is provided below.

(gdb) run prefilter /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/tmp/pref_step_FRAG --max-seqs 4000 --min-ungapped-score 100 --comp-bias-corr 0 -s 1
Starting program: /usr/lusers/aivan/prog/MMseqs2/build/bin/mmseqs prefilter /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/tmp/pref_step_FRAG --max-seqs 4000 --min-ungapped-score 100 --comp-bias-corr 0 -s 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Program call:
/usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db /usr/lusers/aivan/work/isolates/tmp/2017_06/tmp/pref_step_FRAG --max-seqs 4000 --min-ungapped-score 100 --comp-bias-corr 0 -s 1 

MMseqs Version:           	31e25cb081a874f225d443eec307a6254f06a291
Sub Matrix                	blosum62.out
Sensitivity               	1
K-mer size                	0
K-score                   	2147483647
Alphabet size             	21
Max. sequence length      	32000
Query queryProfile        	false
Target queryProfile       	false
Max. results per query    	4000
Offset result             	0
Split DB                  	0
Split mode                	2
Coverage threshold        	0
Compositional bias        	0
Diagonal Scoring          	1
Mask Residues             	1
Minimum Diagonal score    	100
Include identical Seq. Id.	false
Spaced Kmer               	1
No preload                	false
Early exit                	false
Threads                   	28
Verbosity                 	3

Initialising data structures...
Using 28 threads.
Could not find precomputed index. Compute index.
Use kmer size 6 and split 1 using Target split mode.
Needed memory (5939332903 byte) of total memory (269768237056 byte)
Target database: /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db(Size: 1282204)
Substitution matrices...
[New Thread 0x2aaac5d87700 (LWP 8622)]
[New Thread 0x2aaac5f88700 (LWP 8623)]
[New Thread 0x2aaac6189700 (LWP 8624)]
[New Thread 0x2aaac638a700 (LWP 8625)]
[New Thread 0x2aaac658b700 (LWP 8626)]
[New Thread 0x2aaac678c700 (LWP 8627)]
[New Thread 0x2aaac698d700 (LWP 8628)]
[New Thread 0x2aaac6b8e700 (LWP 8629)]
[New Thread 0x2aaac6d8f700 (LWP 8630)]
[New Thread 0x2aaac6f90700 (LWP 8631)]
[New Thread 0x2aaac7191700 (LWP 8632)]
[New Thread 0x2aaac7392700 (LWP 8633)]
[New Thread 0x2aaac7593700 (LWP 8634)]
[New Thread 0x2aaac7794700 (LWP 8635)]
[New Thread 0x2aaac7995700 (LWP 8636)]
[New Thread 0x2aaac7b96700 (LWP 8637)]
[New Thread 0x2aaac7d97700 (LWP 8638)]
[New Thread 0x2aaac7f98700 (LWP 8639)]
[New Thread 0x2aaac8199700 (LWP 8640)]
[New Thread 0x2aaac839a700 (LWP 8641)]
[New Thread 0x2aaac859b700 (LWP 8642)]
[New Thread 0x2aaac879c700 (LWP 8643)]
[New Thread 0x2aaac899d700 (LWP 8644)]
[New Thread 0x2aaac8b9e700 (LWP 8645)]
[New Thread 0x2aaac8d9f700 (LWP 8646)]
[New Thread 0x2aaac8fa0700 (LWP 8647)]
[New Thread 0x2aaac91a1700 (LWP 8648)]
Time for init: 0 h 0 m 2s


Query database: /usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db(size=1282204)
Process prefiltering step 1 of 1

Index table: counting k-mers...
...................................................................................................	1 Mio. sequences processed
............................
Index table: Masked residues: 4639530
Index table: fill...
...................................................................................................	1 Mio. sequences processed
............................
Index table: removing duplicate entries...
Index table init done.

DB statistic
Entries:         75296699
DB Size:         1137909162 (byte)
Avg Kmer Size:   0.877931
Top 10 Kmers
   	YTGTPK		2496
	GPGGTT		2371
	HQSGQR		1210
	AGDYKP		1057
	PHFGRQ		943
	PHLGRQ		837
	DPVLEP		661
	PFADTR		653
	MVQFFP		588
	NGAAHP		585
Min Kmer Size:   0
Empty list: 66045602

Time for index table init: 0 h 0 m 29s


k-mer similarity threshold: 130
k-mer match probability: 0

Starting prefiltering scores calculation (step 1 of 1)
Query db start  1 to 1282204
Target db start  1 to 1282204
Wrong prefiltering result: Query: 1 -> 1361079913	r
Invalid database read for id=1361079913, database index=/usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db.index
Wrong prefiltering result: Query: 71 -> 1361079913	r
Invalid database read for id=getDbKey: local id (1361079913) >= db size (1282204)
1361079913, database index=/usr/lusers/aivan/work/isolates/tmp/2017_06/uniprot_db.index
getDbKey: local id (1361079913) >= db size (1282204)
Wrong prefiltering result: Query: 111 -> 1361079913	r
Invalid database read for id=
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaac5d87700 (LWP 8622)]
0x00002aaaabb70dcd in __run_exit_handlers () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 libgcc-4.8.5-4.el7.x86_64 libgomp-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x00002aaaabb70dcd in __run_exit_handlers () from /usr/lib64/libc.so.6
#1  0x00002aaaabb70eb5 in exit () from /usr/lib64/libc.so.6
#2  0x00000000005de79f in DBReader<unsigned int>::getDbKey (this=0x8b8f10, id=1361079913) at /usr/lusers/aivan/prog/MMseqs2/src/commons/DBReader.cpp:344
#3  0x0000000000548023 in Prefiltering::writePrefilterOutput (this=0x7fffffffda10, qdbr=0x8b8f10, dbWriter=0x7fffffffcc80, thread_idx=1, id=0, prefResults=..., seqIdOffset=0, diagonalScoring=true, resultOffsetPos=0, maxResults=4000)
    at /usr/lusers/aivan/prog/MMseqs2/src/prefiltering/Prefiltering.cpp:758
#4  0x00000000005494d4 in Prefiltering::runSplit () at /usr/lusers/aivan/prog/MMseqs2/src/prefiltering/Prefiltering.cpp:643
#5  0x00002aaaab4fcde5 in ?? () from /usr/lib64/libgomp.so.1
#6  0x00002aaaab923dc5 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00002aaaabc2eced in clone () from /usr/lib64/libc.so.6
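When several threads interleave their messages as in the log above, it can help to pull the out-of-range ids out mechanically. This is a small sketch that only relies on the `local id (X) >= db size (Y)` wording visible in the log; the function name is hypothetical.

```python
import re

# Matches e.g. "getDbKey: local id (1361079913) >= db size (1282204)"
OUT_OF_RANGE = re.compile(r"local id \((\d+)\) >= db size \((\d+)\)")


def out_of_range_ids(log_text):
    """Return the distinct (local_id, db_size) pairs found in the log, sorted."""
    pairs = {
        (int(m.group(1)), int(m.group(2)))
        for m in OUT_OF_RANGE.finditer(log_text)
    }
    return sorted(pairs)
```

For the log above this yields a single pair, showing that every failing lookup used the same bogus id, 1361079913, against a database of 1282204 entries.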

SegFault when trying to index NR

Hi there,
I have been trying to use mmseqs createindex to process the NR database, but it keeps segfaulting on me. The job had 750G memory available, and as far as I can tell it doesn't seem to be exceeding that (the core dump is <400G). I ran createindex in the same directory as the database, using an empty tmp dir.

This is off of a fresh build from commit c4436fb, using cmake 2.8.12.2 and gcc 4.8.5 on Linux 2.6.32-642.11.1.el6.x86_64.

GDB has this to say:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `mmseqs createindex nr nr tmp --mask 0 --threads 30'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000004fbb68 in Sequence::mapSequence(unsigned long, unsigned int, std::pair<unsigned char const*, unsigned int const>) ()
[Current thread is 1 (Thread 0x2b6aefeaf700 (LWP 63514))]
warning: File "/opt/sw/software/gcc/4.8.5/lib64/libstdc++.so.6.0.19-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/sw/software/gcc/4.8.5/lib64/libstdc++.so.6.0.19-gdb.py
line to your configuration file "/home/bondsr/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/bondsr/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
(gdb) bt
#0  0x00000000004fbb68 in Sequence::mapSequence(unsigned long, unsigned int, std::pair<unsigned char const*, unsigned int const>) ()
#1  0x000000000048f4a3 in Prefiltering::fillDatabase(DBReader<unsigned int>*, Sequence*, IndexTable*, BaseMatrix*, unsigned long, unsigned long, bool, bool, int, int) [clone ._omp_fn.4] ()
#2  0x00002b5f9ede530a in gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:115
#3  0x000000346c207aa1 in start_thread () from /lib64/libpthread.so.0
#4  0x000000346bae8aad in clone () from /lib64/libc.so.6
(gdb)

And the output from MMseqs (I've truncated progress output):

Program call:
nr nr tmp --mask 0 --threads 30

MMseqs Version:     	c4436fbec572c7e9ce02ec36af238f8b7e7f700d
Sub Matrix          	blosum62.out
K-mer size          	0
Alphabet size       	21
Max. sequence length	32000
Mask Residues       	0
Split DB            	0
Spaced Kmer         	1
Threads             	30
Verbosity           	3

Substitution matrices...
Use kmer size 7 and split 1 using split mode
Index table: counting k-mers...
......................WARNING: Sequence (dbKey=0) contains only ATGC. It might be a nucleotide sequence.
.............................................................................	1 Mio. sequences processed
..........................................WARNING: Sequence (dbKey=0) contains only ATGC. It might be a nucleotide sequence.
....................................................WARNING: Sequence (dbKey=0) contains only ATGC. It might be a nucleotide sequence.
.....	2 Mio. sequences processed

...

...................................................................................................	107 Mio. sequences processed
...................................................................................................	108 Mio. sequences processed
...................................................................................................	109 Mio. sequences processed
.................Segmentation fault (core dumped)

If it's useful, this is the mmseqs createdb command I ran first:

Program call:
nr_1-31-17.fasta nr --use-fasta-header

MMseqs Version:      	c4436fbec572c7e9ce02ec36af238f8b7e7f700d
Max. sequence length 	32000
Split Seq. by len    	true
Use fasta header     	true
Offset of numeric ids	0
Verbosity            	3

Time for merging files: 0 h 1 m 16 s
Time for merging files: 0 h 1 m 18 s

Any thoughts?
-Steve

non standard fasta file created by "mmseqs result2flat"

Expected Behavior

a fasta file with all the singleton sequences produced by a clustering analysis

Current Behavior

a non-standard fasta file where header lines are duplicated

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

$ mmseqs cluster all_proteomes.db cluster.db tmp
$ mmseqs createseqfiledb all_proteomes.db cluster.db cluster.seq --max-sequences 1
$ mmseqs result2flat all_proteomes.db all_proteomes.db cluster.seq cluster.fasta
$ head cluster.fasta
>10090.00080
>10090.00080
MKSFLLFLTIILLVVIQIQTGSLGQATTAASGTNKNSTSTKKTPLKSGASSIIDAGACSFLFFANTLMCLFYLS
>10090.00273
>10090.00273
MARARQEGSSPEPVEGLARDSPRPFPLGRLMPSAVSCSLCEPGLPAAPAAPALLPAAYLCAPTAPPAVTAALGGPRWPGGHRSRPRGPRPDGPQPSLSPAQQHLESPVPSAPEALAGGPTQAAPGVRVEEEEWAREIGAQLRRMADDLNAQYERRRQEEQHRHRPSPWRVMYNLFMGLLPLPRDPGAPEMEPN
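As a stopgap, the duplicated header lines can be collapsed after the fact. The sketch below drops a header only when it repeats the immediately preceding line exactly, which is the pattern shown in the output above; it is a post-processing workaround with a hypothetical function name, not a fix for result2flat itself.

```python
def drop_repeated_headers(lines):
    """Yield FASTA lines, skipping a header that exactly repeats the previous line."""
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and line == previous:
            continue  # same header twice in a row: keep only the first copy
        previous = line
        yield line
```

Typical use would be to iterate over `open("cluster.fasta")` and write each yielded line, followed by a newline, to a new file.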

Context

Your Environment

Using the latest GitHub version of mmseqs, compiled on an AVX system
Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 88d72a9
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): AVX
Operating system and version: Linux

Latest commits (Jan 17) broke mmseqs cluster

I just tried building the latest version and using mmseqs cluster, but it fails with this error:

Could not open data file tmp/clu_step0!
mv: cannot stat `tmp/clu': No such file or directory
mv: cannot stat `tmp/clu.index': No such file or directory
Could not move result to clu

Reverting to an older copy (latest before Jan 17) solves the problem.

Convert BLAST databases to MMseqs2 databases

Hi,
We are thinking about setting up MMseqs2 as an alternative to BLAST. I can convert BLAST databases to FASTA files and create the corresponding MMseqs2 databases from them, but I was wondering whether someone has already written a script that does this and is willing to share it here.

That would save me a lot of trouble and I believe would be helpful to others as well.
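The two-step route described above (dump the BLAST database to FASTA, then run createdb) could be scripted roughly as follows. This sketch only builds the command lines; it assumes the BLAST+ tool `blastdbcmd` with its `-db`, `-entry all`, and `-out` options is available, and the function name and paths are hypothetical.

```python
def blast_to_mmseqs_commands(blast_db, fasta_path, mmseqs_db):
    """Return the two command lines (as argument lists) for the conversion."""
    # Step 1: dump every entry of the BLAST database to a FASTA file.
    dump = ["blastdbcmd", "-db", blast_db, "-entry", "all", "-out", fasta_path]
    # Step 2: build the MMseqs2 database from that FASTA file.
    create = ["mmseqs", "createdb", fasta_path, mmseqs_db]
    return [dump, create]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failed dump stops the script before createdb runs.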

All the best

Christoph

createseqfiledb: Invalid entry in line x!

When I use createseqfiledb on a clusterDB file, as described in the readme, I get a continuous stream of "Invalid entry in line x!" messages, where x is a different line number each time.

What does this error mean? The resulting clu_seq file is empty.

Setting threads, verbosity, etc on the command line

I'm having trouble setting some command line args with the cluster command. When I run this command

time ~/projects/MMseqs2/build/bin/mmseqs cluster DB clu . --threads 3 -v 2 --min-seq-id 0.25

I get the following output showing that the threads, verbosity, and seq id are the default values rather than those specified on the command line.

Set cluster settings automatic to s=6 cascaded=1
Program call:
DB ./aln_redundancy

MMseqs Version:     	fb43334f0594c4b6a345d7419965cb0beae48430
Sub Matrix          	blosum62.out
Alphabet size       	3
Seq. Id Threshold   	0
Max. sequence length	32000
Threads             	1
Verbosity           	3

The version of the program I'm using is fb43334f0594c4b6a345d7419965cb0beae48430 installed from source on Mac 10.10.5.

Clustering with different min_seq_id levels produces the exact same clusters

I'm likely not understanding the --min-seq-id parameter. I am clustering the same database with different min_seq_id levels. The database comprises the ORFs of ~800 Vibrio genomes, 3.5M ORFs in total. I don't expect the clusters to be the same for 95% and 30% similarity thresholds, since many of the genomes in this database have less than 80% nucleotide identity.

Steps to Reproduce

output from mmseqs here: https://gist.githubusercontent.com/elsherbini/381bad1c3e340489514502f4e0f62696/raw/00d47064944d11238edb2f49ffd0af0e94e737cb/mmseqs_output

example_fastadb here: https://gist.githubusercontent.com/elsherbini/381bad1c3e340489514502f4e0f62696/raw/00d47064944d11238edb2f49ffd0af0e94e737cb/example_fastadb

>mkdir -p ./tmp30
>rm -rf ./tmp30/*
>mkdir -p ./tmp95
>rm -rf ./tmp95/*
>mmseqs createdb fastaDB DB
>mmseqs cluster DB clu_30 ./tmp30 --min-seq-id 30 --max-seqs 100000 -c 0.8
>mmseqs cluster DB clu_95 ./tmp95 --min-seq-id 95 --max-seqs 100000 -c 0.8
>mmseqs createtsv DB DB clu_30 clusters_30.tsv
>mmseqs createtsv DB DB clu_95 clusters_95.tsv
>cmp --silent clusters_30.tsv clusters_95.tsv && echo 'Files Are Identical!' || echo 'Files Are Different!'
Files Are Identical!
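One plausible reading, assuming `--min-seq-id` expects a fraction between 0 and 1 (another report on this page calls it as `--min-seq-id 0.25`, and the parameter dumps print identity thresholds such as 0.9), is that 30 and 95 are both outside the intended range. A wrapper script could normalise the value before calling mmseqs; the helper below is a hypothetical sketch of that conversion.

```python
def as_identity_fraction(value):
    """Convert a sequence-identity threshold to a fraction in [0, 1].

    Values in [0, 1] pass through unchanged; values in (1, 100] are
    treated as percentages. Anything else is rejected.
    """
    value = float(value)
    if 0.0 <= value <= 1.0:
        return value
    if 1.0 < value <= 100.0:
        return value / 100.0
    raise ValueError(f"sequence identity out of range: {value}")
```

Under that assumption the two runs above would be invoked with `--min-seq-id 0.3` and `--min-seq-id 0.95` respectively.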

Your Environment

MMseqs Version:                  	b929632f1b419373143faa2a36179c6b964bdfe3
statically compiled
centos6

Stable versions

Thanks for the awesome open-source tool.

I can't find any tags except "vNatBiotech". I was wondering whether there is interest in making a release for a stable version (with a numerical version), which may help users and developers.

Thanks!

Best wishes,

Error: Could not allocate memory by memalign.

Expected Behavior

The example file https://raw.githubusercontent.com/soedinglab/mmseqs2/master/examples/DB.fasta should be clustered

Current Behavior

After the installation of MMseqs Version: 1c45613 it fails when prefiltering with:

Error: Could not allocate memory by memalign. Please report this bug to developers

Steps to Reproduce (for bugs)

mmseqs createdb DB.fasta DB
mkdir tmp
mmseqs cluster DB cluDB tmp

MMseqs Output (for bugs)

Line 391

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Homebrew
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake version 3.8.1 and g++ (Homebrew GCC 7.1.0 --without-multilib) 7.1.0
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): MacBook Pro (Retina, 13-inch, Late 2013) with 2.6 GHz Intel Core i5 and 16 GB 1600 MHz DDR3
Operating system and version: Mac OS X 10.12.4

Could not allocate entries memory in IndexTable::initMemory

Hi,
I was trying to generate the index files of a huge database, to be precise the nr database from BLAST. createdb builds the database without problems, but when I run the createindex module of mmseqs2, it stops with this error:

Could not allocate entries memory in IndexTable::initMemory

I'm aware that this is not a bug but the computer I'm trying to run it on runs out of memory. It processes all sequences though and stops right at the end when the index files should have been written.

I can still search the resulting database; it just indexes the database on the fly. The computer has 64 GB of memory, 94% of which is used by the process before it starts swapping.

Now my questions:
Is there anything I can do, other than upgrading the hardware, to make this work?
In the future I would like to set the -s parameter to 8, as I understand this gives about the same sensitivity as BLAST.
Would this further increase the memory usage?

Thank you in advance

Christoph

MMseqs Output (for bugs)

Could not allocate entries memory in IndexTable::initMemory

Context

createindex module fails

Your Environment

64 bit latest debian linux
24 cpus
64 gb of RAM
latest git hub version self compiled with cmake.

Error when using clusterupdate

I was trying to use clusterupdate to update a clustering (DB_trimmed_clu) built from DB_trimmed (a library of proteins) into DB_clusterupdate, using an extended version of the library (DB_new) that overlaps with the old one in 2 sequences.

However, the program is not able to finish and I get the error:

mv: rename tmp/aln_* to tmp/search/aln_*: No such file or directory
mv: rename tmp/clu_* to tmp/search/clu_*: No such file or directory
mv: rename tmp/input_* to tmp/search/input_*: No such file or directory

Nevertheless, the program continues until the merging of the updated clusterings (see the log below).

I also get a random number of warnings (depending on the execution) pointing out that I am using DNA, but I am not. For instance:

WARNING: Sequence (dbKey=17) contains only ATGC. It might be a nucleotide sequence.
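The warning text suggests a simple composition check. Since A, T, G and C are also valid one-letter amino acid codes (alanine, threonine, glycine, cysteine), a short protein can consist of only those letters and trigger it. The sketch below reproduces the check as the message describes it; it is an illustration with a hypothetical function name, and the actual MMseqs2 implementation may differ.

```python
def contains_only_atgc(sequence):
    """Return True if a non-empty sequence uses only the letters A, T, G, C."""
    residues = set(sequence.upper())
    return bool(residues) and residues <= set("ATGC")
```

Running this over the input FASTA would show which protein entries are made up solely of these four residues.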

I attach the log of the cluster update call:

mmseqs clusterupdate DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp &> update_log.txt

Program call:
DB_trimmed DB_new DB_trimmed_clu DB_clusterupdate tmp

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Add backtrace false
Alignment mode 0
E-value threshold 0.001
Seq. Id Threshold 0
Coverage threshold 0
Target Coverage threshold 0
Max. sequence length 32000
Compositional bias 1
Profile false
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. false
Threads 4
Verbosity 3
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Offset result 0
Split DB 0
Split mode 2
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Spaced Kmer 1
Profile e-value threshold 0.001
Use global sequence weighting false
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 100
Pseudo count a 1
Pseudo count b 1.5
Number search iterations 1
Start sensitivity 4
sensitivity step size 1
Sets the MPI runner
Cluster mode 0
Max depth connected component 1000
Similarity type 2
Cascaded clustering false
Cluster fragments false
Remove Temporary Files false
Match sequences by their ID false

Program call:
DB_trimmed DB_new tmp/removedSeqs tmp/mappingSeqs tmp/newSeqs --threads 4 -v 3

MMseqs Version: 5ba68d8
Match sequences by their ID false
Threads 4
Verbosity 3

===================================================
====== Filter out the new from old sequences ======

Program call:
tmp/newSeqs DB_new tmp/NEWDB.newSeqs

MMseqs Version: 5ba68d8
Verbosity 3

Start writing to file tmp/NEWDB.newSeqs
Time for merging files: 0 h 0 m 0 s

=== Update the old clustering with the new keys ===

Program call:
DB_trimmed_clu tmp/OLDCLUST.mapped --mapping-file tmp/mappingSeqs

MMseqs Version: 5ba68d8
Filter column 1
Filter regex ^.*$
Positive filter true
Filter file
Mapping file tmp/mappingSeqs
Threads 4
Verbosity 3
trim the results to one column false
Extract n lines 0
Numerical comparison operator
Numerical comparison value 0
Sort (increasing:1, decreasing: 2, shuffle: 3) the entries by numerical value 0

Mapping keys by file tmp/mappingSeqs
Time for merging files: 0 h 0 m 0 s

======= Extract representative sequences ==========

Program call:
DB_new DB_new tmp/OLDCLUST.mapped tmp/OLDDB.mapped.repSeq --only-rep-seq

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Profile false
Profile e-value threshold 0.001
Allow Deletion false
Add internal id false
Compositional bias 1
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 100
Threads 4
Verbosity 3
Compress MSA false
Summarize headers false
Summary prefix cl
Representative sequence true

Start computing representative sequences.
Time for merging files: 0 h 0 m 0 s

Done.
Time for processing: 0 h 0 m 0s

======= Search the new sequences against ==========
========= previous (rep seq of) clusters ==========

Program call:
tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq tmp/newSeqsHits tmp --max-seqs 1 --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3 -s 4 -k 0 --k-score 2147483647 --alph-size 21 --offset-result 0 --split 0 --split-mode 2 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --e-profile 0.001 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 100 --pca 1 --pcb 1.5 --num-iterations 1 --start-sens 4 --sens-step-size 1

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Add backtrace false
Alignment mode 0
E-value threshold 0.001
Seq. Id Threshold 0
Coverage threshold 0
Target Coverage threshold 0
Max. sequence length 32000
Max. results per query 1
Compositional bias 1
Profile false
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. false
Threads 4
Verbosity 3
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Offset result 0
Split DB 0
Split mode 2
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Spaced Kmer 1
Profile e-value threshold 0.001
Use global sequence weighting false
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 100
Pseudo count a 1
Pseudo count b 1.5
Number search iterations 1
Start sensitivity 4
sensitivity step size 1
Sets the MPI runner

/Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp
/Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp
Program call:
tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/pref_4 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 1 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3 -s 4

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Profile false
Max. results per query 1
Offset result 0
Split DB 0
Split mode 2
Coverage threshold 0
Compositional bias 1
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. false
Spaced Kmer 1
Threads 4
Verbosity 3

Initialising data structures...
Using 4 threads.

Cound not find precomputed index. Compute index.
Query database: tmp/NEWDB.newSeqs(size=182)
Target database: tmp/OLDDB.mapped.repSeq(size=3)
Use kmer size 6 and split 1 using split mode 0
Needed memory (1381015863 byte) of total memory (25769803776 byte)
Substitution matrices...
Time for init: 0 h 0 m 1s

Process prefiltering step 0 of 1

Index table: counting k-mers...
WARNING: Sequence (dbKey=17) contains only ATGC. It might be a nucleotide sequence.
WARNING: Sequence (dbKey=21) contains only ATGC. It might be a nucleotide sequence.

Index table: Masked residues: 0
Index table: fill...
Index table: removing duplicate entries...
Index table init done.

DB statistic
Entries: 181
DB Size: 686130054 (byte)
Avg Kmer Size: 2.11039e-06
Top 10 Kmers
LRIDDA 1
YSLDDA 1
RRLGEA 1
IGEREA 1
RDKPGA 1
GFTIIA 1
WIRAKA 1
LRRDPA 1
HKRERA 1
KTEKRA 1
Min Kmer Size: 0
Empty list: 85765939

Time for index table init: 0 h 0 m 1s

k-mer similarity threshold: 103
k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 1)
Query db start 0 to 182
Target db start 0 to 3

67 k-mers per position.
0 DB matches per sequence.
0 Overflows .
0 sequences passed prefiltering per query sequence.
Median result list size: 0
176 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0 h 0 m 0s
Time for merging files: 0 h 0 m 0 s

Overall time for prefiltering run: 0 h 0 m 1s
Program call:
tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/pref_4 /Users/delia/Documents/Clustering/Conversions_updates/trial_rmtmp/tmp/aln_4 --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 1 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Add backtrace false
Alignment mode 0
E-value threshold 0.001
Seq. Id Threshold 0
Coverage threshold 0
Target Coverage threshold 0
Max. sequence length 32000
Max. results per query 1
Compositional bias 1
Profile false
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. false
Threads 4
Verbosity 3

Init data structures...
Compute score only.
Using 4 threads.
Calculation of Smith-Waterman alignments.
Time for merging files: 0 h 0 m 0 s

All sequences processed.

6 alignments calculated.
6 sequence pairs passed the thresholds (1 of overall calculated).
0.032967 hits per query sequence.
Time for alignments calculation: 0 h 0 m 0s
Program call:
tmp/NEWDB.newSeqs tmp/OLDDB.mapped.repSeq tmp/newSeqsHits tmp/newSeqsHits.swapped.all

MMseqs Version: 5ba68d8
Threads 4
Verbosity 3

Time for merging files: 0 h 0 m 0 s
Program call:
tmp/newSeqsHits.swapped.all tmp/newSeqsHits.swapped --trim-to-one-column

MMseqs Version: 5ba68d8
Filter column 1
Filter regex ^.*$
Positive filter true
Filter file
Mapping file
Threads 4
Verbosity 3
trim the results to one column true
Extract n lines 0
Numerical comparison operator
Numerical comparison value 0
Sort (increasing:1, decreasing: 2, shuffle: 3) the entries by numerical value 0

Time for merging files: 0 h 0 m 0 s

= Merge found sequences with previous clustering =

Program call:
tmp/OLDCLUST.mapped tmp/updatedClust tmp/newSeqsHits.swapped tmp/OLDCLUST.mapped

MMseqs Version: 5ba68d8
Merge prefixes
Verbosity 3

Merging the results to tmp/updatedClust
Done
Time for merging files: 0 h 0 m 0 s
Time for merging: 0 h 0 m 0s

=========== Extract unmapped sequences ============

Program call:
tmp/noHitSeqList DB_new tmp/toBeClusteredSeparately

MMseqs Version: 5ba68d8
Verbosity 3

Start writing to file tmp/toBeClusteredSeparately
Time for merging files: 0 h 0 m 0 s

===== Cluster separately the alone sequences ======

mv: rename tmp/aln_* to tmp/search/aln_: No such file or directory
mv: rename tmp/clu_ to tmp/search/clu_: No such file or directory
mv: rename tmp/input_ to tmp/search/input_*: No such file or directory
Program call:
tmp/toBeClusteredSeparately tmp/newClusters tmp --sub-mat blosum62.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3 --alignment-mode 0 -e 0.001 --min-seq-id 0 --max-rejected 2147483647 --max-accept 2147483647 --cluster-mode 0 --max-iterations 1000 --similarity-type 2

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Profile false
Max. results per query 300
Offset result 0
Split DB 0
Split mode 2
Coverage threshold 0
Compositional bias 1
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. false
Spaced Kmer 1
Threads 4
Verbosity 3
Add backtrace false
Alignment mode 0
E-value threshold 0.001
Seq. Id Threshold 0
Target Coverage threshold 0
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Cluster mode 0
Max depth connected component 1000
Similarity type 2
Cascaded clustering false
Cluster fragments false
Remove Temporary Files false
Sets the MPI runner

Program call:
tmp/toBeClusteredSeparately tmp/aln_redundancy

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Alphabet size 3
Seq. Id Threshold 0
Max. sequence length 32000
Threads 4
Verbosity 3

Y -> F
V -> I
M -> L
Q -> E
T -> S
R -> K
S -> A
N -> D
L -> I
H -> E
K -> E
P -> C
E -> D
C -> A
I -> F
G -> A
D -> A
A -> A
Reduced amino acid alphabet:
F W X
Hashing sequences ...
Done.
Compute 169 unique hashes.
Time for merging files: 0 h 0 m 0 s
Program call:
tmp/toBeClusteredSeparately tmp/aln_redundancy tmp/clu_redundancy --cluster-mode 0 --max-seqs 300 -v 3 --max-iterations 1000 --similarity-type 2 --threads 4

MMseqs Version: 5ba68d8
Cluster mode 0
Max. results per query 300
Verbosity 3
Max depth connected component 1000
Similarity type 2
Threads 4

Init...
Opening sequence database...
Opening alignment database...
done.
Clustering mode: Set Cover

Sort entries.

Find missing connections.

Found 7 new connections.

Reconstruct initial order.

Add missing connections.

Time for Read in: 0 m 0s

Writing results...
...done.
Time for clustering: 0 m 0s
Time for merging files: 0 h 0 m 0 s
Total time: 0 m 0s

Size of the sequence database: 176
Size of the alignment database: 176
Number of clusters: 169
Program call:
tmp/order_redundancy tmp/toBeClusteredSeparately tmp/input_step_redundancy

MMseqs Version: 5ba68d8
Verbosity 3

Start writing to file tmp/input_step_redundancy
Time for merging files: 0 h 0 m 0 s
Program call:
tmp/input_step_redundancy tmp/input_step_redundancy tmp/pref --sub-mat blosum62.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 -c 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --threads 4 -v 3

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Profile false
Max. results per query 300
Offset result 0
Split DB 0
Split mode 2
Coverage threshold 0
Compositional bias 1
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. false
Spaced Kmer 1
Threads 4
Verbosity 3

Initialising data structures...
Using 4 threads.

Cound not find precomputed index. Compute index.
Query database: tmp/input_step_redundancy(size=169)
Target database: tmp/input_step_redundancy(size=169)
Use kmer size 6 and split 1 using split mode 0
Needed memory (1381292076 byte) of total memory (25769803776 byte)
Substitution matrices...
Time for init: 0 h 0 m 0s

Process prefiltering step 0 of 1

Index table: counting k-mers...

Index table: Masked residues: 166
Index table: fill...
Index table: removing duplicate entries...
Index table init done.

DB statistic
Entries: 30623
DB Size: 686312706 (byte)
Avg Kmer Size: 0.000357052
Top 10 Kmers
GTKRRA 13
NTLRYA 13
RLRRLR 13
RIRRLR 12
GRRANL 11
TWYINL 11
SITLMR 11
GVITGR 10
FSWYAT 10
AELQFV 9
Min Kmer Size: 0
Empty list: 85740284

Time for index table init: 0 h 0 m 1s

k-mer similarity threshold: 103
k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 1)
Query db start 0 to 169
Target db start 0 to 169

68 k-mers per position.
375 DB matches per sequence.
0 Overflows .
25 sequences passed prefiltering per query sequence.
Median result list size: 21
0 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0 h 0 m 0s
Time for merging files: 0 h 0 m 0 s

Overall time for prefiltering run: 0 h 0 m 2s
Program call:
tmp/input_step_redundancy tmp/input_step_redundancy tmp/pref tmp/aln --sub-mat blosum62.out --alignment-mode 0 -e 0.001 --min-seq-id 0 -c 0 --max-seq-len 32000 --max-seqs 300 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --threads 4 -v 3

MMseqs Version: 5ba68d8
Sub Matrix blosum62.out
Add backtrace false
Alignment mode 0
E-value threshold 0.001
Seq. Id Threshold 0
Coverage threshold 0
Target Coverage threshold 0
Max. sequence length 32000
Max. results per query 300
Compositional bias 1
Profile false
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. false
Threads 4
Verbosity 3

Init data structures...
Compute score only.
Using 4 threads.
Calculation of Smith-Waterman alignments.
Time for merging files: 0 h 0 m 0 s

All sequences processed.

4237 alignments calculated.
4235 sequence pairs passed the thresholds (0.999528 of overall calculated).
25.0592 hits per query sequence.
Time for alignments calculation: 0 h 0 m 0s
Program call:
tmp/input_step_redundancy tmp/aln tmp/clu_step0 --cluster-mode 0 --max-seqs 300 -v 3 --max-iterations 1000 --similarity-type 2 --threads 4

MMseqs Version: 5ba68d8
Cluster mode 0
Max. results per query 300
Verbosity 3
Max depth connected component 1000
Similarity type 2
Threads 4

Init...
Opening sequence database...
Opening alignment database...
done.
Clustering mode: Set Cover

Sort entries.

Find missing connections.

Found 656 new connections.

Reconstruct initial order.

Add missing connections.

Time for Read in: 0 m 0s

Writing results...
...done.
Time for clustering: 0 m 0s
Time for merging files: 0 h 0 m 0 s
Total time: 0 m 0s

Size of the sequence database: 169
Size of the alignment database: 169
Number of clusters: 17
Program call:
tmp/toBeClusteredSeparately tmp/clu tmp/clu_redundancy tmp/clu_step0

MMseqs Version: 5ba68d8
Verbosity 3

List amount 176
Clustering step 1...
Clustering step 2...
Writing the results...
Time for merging files: 0 h 0 m 0 s
...done.

==== Merge the updated clustering together with ===
===== the new clusters ======

Program call:
tmp/updatedClust tmp/newClusters DB_clusterupdate

MMseqs Version: 5ba68d8
Verbosity 3

Time for merging files: 0 h 0 m 0 s
Time for concatenating DBs: 0 h 0 m 0s

LICENSE

First of all, congrats [from the lambda author] on the preprint of the paper 👍

Although the contribution of the SeqAn library to MMSeqs2 is rather small, I would still ask you politely to include notice of its use, according to the license:

//     * Redistributions in binary form must reproduce the above copyright
//       notice, this list of conditions and the following disclaimer in the
//       documentation and/or other materials provided with the distribution.

AFAICT the binaries don't do this and the shipped license file doesn't mention them. This likely affects other code parts used in MMSeqs as well, though I am not sure. In any case, it is not a big deal, but it would be nice if you could change this.
Also of course we depend on people citing the SeqAn library in their academic work if they use it. If this is still possible it would be great, but there is no (legal) requirement to do so.

On an unrelated note: The "non-commercial" header in src/alignment/smith_waterman_sse2.cpp is incompatible with the GPL-license. I don't know if the header is just outdated, but if not, this might cause you trouble with the original author and/or make it impossible for other people to distribute mmseqs2...

mmseqs search not returning proteins present in database as top hits

Hello,

First, thank you for the fantastic work on mmseqs2, it's super fast! I think I’ve come across an issue where exact matches aren't being detected by the mmseqs2 search algorithm.

Expected Behavior

Exact matches hit each other.

Current Behavior

When a subset of sequences from a reference database is searched against that database, a surprisingly large fraction of proteins do not return themselves as their best hit. The severity of the problem depends on the database size: in the example below, 2.8% of proteins do not hit themselves, but when I first came across this issue I was building a larger database in which, using the same test, 31% of proteins did not hit themselves at all. This result does not change if sensitivity is increased to the maximum (-s 8.5). I find this behaviour concerning, given that you’d expect exact matches to be returned in a relatively small database.

Steps to Reproduce (for bugs)

I grab 10 random genomes (genome_proteins.tar.gz) from NCBI, create a reference database from their proteins, and use one of these genomes as the query database:

cat *faa > pooled.faa

mmseqs createdb \
	pooled.faa \
	reference_DB

mmseqs createdb \
	GCF_000352185.1_protein.faa \
	query_DB

Then I search the query against the reference using default settings and convert the result to a BLAST-like output:

mkdir tmp 

mmseqs search \
	query_DB \
	reference_DB \
	result_DB \
	tmp 

mmseqs convertalis \
	query_DB \
	reference_DB \
	result_DB \
	result_DB.m8

Then I grab the top hit for each protein (i.e. the first one listed):

for i in `cat result_DB.m8 | awk '{print $1}' | uniq `
	do grep -m 1 -w ^$i result_DB.m8
done > result_DB.top_hits.m8
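
The same "first line per query" selection can be sketched in a single pass, which avoids re-scanning the file once per query and avoids treating sequence IDs as regular expressions. This is a minimal Python sketch; `first_hit_per_query` and the example rows are hypothetical and only illustrate the idea.

```python
def first_hit_per_query(lines):
    """Keep the first line seen for each query ID (column 1)."""
    seen = set()
    top_hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        query = fields[0]
        if query not in seen:
            seen.add(query)
            top_hits.append(line.rstrip("\n"))
    return top_hits


# Hypothetical rows in query/target/identity order, already sorted per query.
example = [
    "q1\tq1\t1.000\n",
    "q1\tt7\t0.812\n",
    "q2\tt3\t0.790\n",
    "q2\tq2\t0.760\n",
]
for hit in first_hit_per_query(example):
    print(hit)
```

To apply it to a real file, one would pass an open file handle for result_DB.m8 instead of the example list.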

Then, when we check the results in R:

library('data.table')
d = fread('result_DB.top_hits.m8')
# Check if query matches the reference sequence ID
d$self_match = d$V1 == d$V2

# What percentage of top hits do not hit themselves?
round(table(d$self_match) / nrow(d) * 100, 1)

FALSE  TRUE
  2.8  97.2

# What is the average %ID of each group?
mean(d[self_match == TRUE]$V3)
[1] 0.9556934
mean(d[self_match == FALSE]$V3)
[1] 0.7872742

Your Environment

I'm using mmseqs2 version 2-2.d96d2469

I'm also wondering why very few of the proteins that do hit themselves have a %ID of <100%?

Thank you for any help you can provide,

Joel

Running tests?

Hi there,

I see that there are tests in src/tests, but there is no make test target or mention of tests in the README. Is there a recommended way of running the tests?

Thanks, ben.

Regression between AVX2 and SSE4.1

Expected Behavior

Both versions should have the same ROC1 value in our benchmark.

Current Behavior

AVX2 performs worse than SSE4.1.

not really an issue, but a suggestion, not urgent

Expected Behavior

Feedback to user presents bytes, which are very difficult to interpret.

Current Behavior

On my machine (32 GB of RAM), I saw a long number starting with a "1", and the RAM, also in bytes, starting with a "3". I could not tell whether I was using 33% of my RAM (1/3) or 3.3% (1/33). The lengths of the byte counts are difficult to assess at a glance.

So, in my personal copy of MMseqs2, I changed the feedback in two files:
./src/prefiltering/Prefiltering.cpp
./src/util/kmermatcher.cpp

I divided the bytes by pow(2,30) to get GB. That's so much better! (I made the changes only in printed feedback. Everything else I left untouched. Calculations are still in bytes, only feedback is given in GB.)

I don't know if there are more files that give feedback to users in bytes, but that was a good start. Now I see my databases being created, and I see that I have more than enough RAM (it was 3.3%).
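
The conversion described above is just a division by 2^30. The following Python sketch illustrates it; it is not the change made in the C++ sources, and `format_gib`, `percent_of` and the two byte counts are hypothetical, chosen only to mimic a number starting with "1" next to a 32 GiB total.

```python
def format_gib(num_bytes: int) -> str:
    """Render a byte count in GiB (2**30 bytes) with two decimals."""
    return f"{num_bytes / 2**30:.2f} GiB"


def percent_of(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Hypothetical values: needed memory and total memory, both in bytes.
needed = 1134217728
total = 32 * 2**30

print(f"Needed memory ({format_gib(needed)}) of total memory ({format_gib(total)})")
print(f"That is {percent_of(needed, total):.1f}% of the available RAM")
```

Printing the ratio alongside the sizes removes the need to count digits at all.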

user guide error and unknown output format

I tried to follow the MMseqs2 manual to run a search, but I find that the user manual and the actual output are inconsistent.

On page 3 of the manual:
$ mmseqs search queryDB targetDB resultDB --use-index
should be
$ mmseqs search queryDB targetDB resultDB tmp --use-index

On the same page, there is no explanation of the output file format for resultDB.m8, and the file itself is inconsistent with the output format explained on page 12.
For example, if one line in resultDB.m8 reads:
d1bhea_ J0KT32 0.427 381 160 0 0 374 0 381 7.96E-77 281
I can understand that the first three columns are query, target and seqID, and that the last two columns are evalue and score. But what is the meaning of "381 160 0 0 374 0 381"?
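
One way to read such a line is to label it with the usual BLAST tabular (m8) column order. The Python sketch below does that; whether this MMseqs2 version writes exactly these columns in this order is an assumption, and the names in `M8_COLUMNS` and the helper `label_m8_line` are hypothetical.

```python
# Assumed column order, following the common BLAST tabular (m8) convention.
M8_COLUMNS = [
    "query", "target", "seq_id", "aln_len", "mismatches", "gap_opens",
    "q_start", "q_end", "t_start", "t_end", "evalue", "bit_score",
]


def label_m8_line(line: str) -> dict:
    """Pair each whitespace-separated field with its assumed column name."""
    values = line.split()
    if len(values) != len(M8_COLUMNS):
        raise ValueError(f"expected {len(M8_COLUMNS)} fields, got {len(values)}")
    return dict(zip(M8_COLUMNS, values))


row = label_m8_line("d1bhea_ J0KT32 0.427 381 160 0 0 374 0 381 7.96E-77 281")
for name, value in row.items():
    print(f"{name:>10}: {value}")
```

Under that assumption, the seven unexplained values would be the alignment length, mismatches, gap openings, and the start and end positions in the query and target.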

Finally, when I try
mmseqs formatalignment queryDB targetDB resultDB resultDB.2 --format-mode 1
the output is the same as (--format-mode 1) except for the leading '>'
when I try
mmseqs formatalignment queryDB targetDB resultDB resultDB.2 --format-mode 2
there is no output at all.

Segmentation fault during prefiltering

Expected Behavior

With heavily reduced databases smaller than 1 MB, everything works fine.
Both the complete call of the search workflow ./mmseqs search "query" "target" "result" "tmp"
and the explicit call of the prefilter module alone ./mmseqs prefilter "query" "target" "result" run as expected.

Current Behavior

With bigger files for query and target data, which are by no means unreasonably large in my opinion (5.1 MB each), the program exits with a segmentation fault during the prefiltering step. See the output below for more detailed information.

Steps to Reproduce (for bugs)

Strictly following the steps from your user guide on a freshly installed mmseqs distribution, compiled from scratch:

1) converted the plain fasta files into mmseqs-readable format
./mmseqs createdb "querypath" "query"
./mmseqs createdb "targetpath" "target"
2) created a new temp folder on the local hard drive
3) tried to run the whole search workflow
./mmseqs search "query" "target" "result" "tmp"
3b) after the search failed with the segmentation fault, cleaned the databases and the temp folder and ran the prefilter module explicitly, as this seemed to be the problem
./mmseqs prefilter "query" "target" "result"

MMseqs Output (for bugs)

Search workflow:
Program call:
/local/jelvers/Masterthesis/Testdata/querysample03 /local/jelvers/Masterthesis/Testdata/targetsample03 /local/jelvers/Masterthesis/temp/6581086409424530102/pref_5.7 --sub-mat blosum62.out -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --offset-result 0 --split 0 --split-mode 2 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --mask 1 --min-ungapped-score 15 --spaced-kmer-mode 1 --pca 1 --pcb 1.5 --threads 4 -v 3 -s 5.7

MMseqs Version: bcb164e
Sub Matrix blosum62.out
Sensitivity 5.7
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Max. results per query 300
Offset result 0
Split DB 0
Split mode 2
Coverage threshold 0
Coverage Mode 0
Compositional bias 1
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. false
Spaced Kmer 1
No preload false
Early exit false
Pseudo count a 1
Pseudo count b 1.5
Threads 4
Verbosity 3

Initialising data structures...
Using 4 threads.
Could not find precomputed index. Compute index.
Use kmer size 6 and split 1 using Target split mode.
Needed memory (1414056568 byte) of total memory (8293187584 byte)
Target database: /local/jelvers/Masterthesis/Testdata/targetsample03(Size: 14015)
Substitution matrices...
Query database type: Aminoacid
Target database type: Aminoacid
Time for init: 0 h 0 m 2s

Query database: /local/jelvers/Masterthesis/Testdata/querysample03(size=14015)
Process prefiltering step 1 of 1

Index table: counting k-mers...
/local/jelvers/Masterthesis/temp//6581086409424530102/blastp.sh: line 86: 8824 Segmentation fault (core dumped) $RUNNER $MMSEQS prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$SENS" $PREFILTER_PAR -s $SENS
Error: Prefilter died
[Inferior 1 (process 8814) exited with code 01]

Prefilter module call
Program call:
/local/jelvers/Masterthesis/Testdata/querysample03 /local/jelvers/Masterthesis/Testdata/targetsample03 /local/jelvers/Masterthesis/Mmseqs_output/

MMseqs Version: bcb164e
Sub Matrix blosum62.out
Sensitivity 4
K-mer size 0
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Max. results per query 300
Offset result 0
Split DB 0
Split mode 2
Coverage threshold 0
Coverage Mode 0
Compositional bias 1
Diagonal Scoring 1
Mask Residues 1
Minimum Diagonal score 15
Include identical Seq. Id. false
Spaced Kmer 1
No preload false
Early exit false
Pseudo count a 1
Pseudo count b 1.5
Threads 4
Verbosity 3

Initialising data structures...
Using 4 threads.
Could not find precomputed index. Compute index.
Use kmer size 6 and split 1 using Target split mode.
Needed memory (1414056568 byte) of total memory (8293187584 byte)
Target database: /local/jelvers/Masterthesis/Testdata/targetsample03(Size: 14015)
Substitution matrices...
[New Thread 0x7ffff6210700 (LWP 9087)]
[New Thread 0x7ffff5a0f700 (LWP 9088)]
[New Thread 0x7ffff520e700 (LWP 9089)]
Query database type: Aminoacid
Target database type: Aminoacid
Time for init: 0 h 0 m 2s

Query database: /local/jelvers/Masterthesis/Testdata/querysample03(size=14015)
Process prefiltering step 1 of 1

Index table: counting k-mers...

Thread 3 "mmseqs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5a0f700 (LWP 9088)]
0x0000000000542911 in tantan::getProbabilities(char const*, char const*, int, double const* const*, double, double, double, double, double, float*) ()

Context

As already stated above, everything works fine with smaller databases. Even running a different, small query set against the full target database works fine, as does the other way round (normal query, reduced target db).
Please tell me if you need the fasta files to reproduce this bug.

Your Environment

MMseqs2 Version: bcb164e
Self-compiled build, following your description in "Compile from source"
cmake version 3.5.2
gcc version 4.8.5 (SUSE Linux)
Memory: 7.7 GiB RAM
openSUSE Leap 42.2, Kernel: 4.4.92-18.36-default, 64 bit

Dots output during Smith-Waterman alignments

Expected Behavior

The align command behaves as expected. I would just like to know the quantitative meaning of the dots that are output during the alignment phase, i.e., is it one dot per 10k queries or something like that, so that I can estimate how much longer this phase will take (or whether I should kill the job and re-run with a lower number of alignments).

Current Behavior

Dots are output at an unknown rate...

Steps to Reproduce (for bugs)

mmseqs search [...] -a true

MMseqs Output (for bugs)

[...]

No preload                      false
Early exit                      false
Threads                         12
Verbosity                       3

Init data structures...
Using 12 threads.
Compute score, coverage and sequence id.
Calculation of Smith-Waterman alignments.
.............

Many thanks!

Missing Similarities and More

Hi there!

The following fasta files result in potential bugs or at least behavior that was unexpected for me. I converted the results to csv format, but what you see are basically just the first three columns of the m8 format output (convertalis).
I'm not sure whether or not those three cases are related, which is why I just opened a single issue. I could also split them into three issues if preferred.

Unexpected Behavior 1:

Fasta:

>0000_A
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>0001_A
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>0002_A
ABCDEFGHIJKLMNOPQRSTUVWXYZ

Result:

source,target,seq_id
0000_A,0000_A,1.000
0001_A,0001_A,1.000
0001_A,0000_A,1.000
0001_A,0002_A,1.000
0002_A,0002_A,1.000
0002_A,0000_A,1.000

I would expect 9 rows, 3 sequences with 3 similarities each. It seems like similarities are treated as undirected edges in this case. But when I replace "MNO" with "AAA" in the first sequence, I get the expected 9 similarities:

Fasta:

>0000_A
ABCDEFGHIJKLAAAPQRSTUVWXYZ
>0001_A
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>0002_A
ABCDEFGHIJKLMNOPQRSTUVWXYZ

Result:

source,target,seq_id
0000_A,0000_A,1.000
0000_A,0002_A,0.885
0000_A,0001_A,0.885
0001_A,0001_A,1.000
0001_A,0002_A,1.000
0001_A,0000_A,0.885
0002_A,0002_A,1.000
0002_A,0001_A,1.000
0002_A,0000_A,0.885

This doesn't happen in all cases, though. Here is another case with two exact copies and a third sequence that has a small change (QU -> AA) that does not lead to 9 similarities:

Fasta:

>0000_A
TESTSEAAENCE
>0001_A
TESTSEQUENCE
>0002_A
TESTSEQUENCE

Result:

source,target,seq_id
0000_A,0000_A,1.000
0001_A,0001_A,1.000
0001_A,0002_A,1.000
0001_A,0000_A,0.833
0002_A,0002_A,1.000
0002_A,0000_A,0.833

I wasn't able to deduce what exactly made those two cases behave differently, unfortunately.
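
To make the gaps explicit, one can enumerate all ordered (source, target) pairs and list those absent from the output. The Python sketch below does this for the first result table in this report; `missing_pairs` is a hypothetical helper, and it assumes that every ordered pair, including self-pairs, is expected to appear.

```python
from itertools import product


def missing_pairs(ids, reported):
    """Return the ordered (source, target) pairs that are not in `reported`."""
    reported_set = set(reported)
    return [pair for pair in product(ids, repeat=2) if pair not in reported_set]


ids = ["0000_A", "0001_A", "0002_A"]

# Rows of the first result table (three identical sequences).
reported = [
    ("0000_A", "0000_A"),
    ("0001_A", "0001_A"),
    ("0001_A", "0000_A"),
    ("0001_A", "0002_A"),
    ("0002_A", "0002_A"),
    ("0002_A", "0000_A"),
]

for source, target in missing_pairs(ids, reported):
    print(f"missing: {source} -> {target}")
```

The same check can be run on the other result tables by swapping in their rows.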

Unexpected Behavior 2:

When I use a very similar sequence as I used in the former example (SEQUENCETEST instead of TESTSEQUENCE), something entirely different happens. These three exact copies show up as unrelated.

Fasta:

>0000_A
SEQUENCETEST
>0001_A
SEQUENCETEST
>0002_A
SEQUENCETEST

Result:

source,target,seq_id
0000_A,0000_A,1.000
0001_A,0001_A,1.000
0002_A,0002_A,1.000

In this case it also doesn't matter if I edit (CE -> AA) the first sequence:

Fasta:

>0000_A
SEQUENAATEST
>0001_A
SEQUENCETEST
>0002_A
SEQUENCETEST

Result:

source,target,seq_id
0000_A,0000_A,1.000
0001_A,0001_A,1.000
0002_A,0002_A,1.000

Unexpected Behavior 3:

It seems like there is a minimal length for sequences, but I'm not sure this error is intended.

Fasta:

>0000_A
ABCDEFGHIJ
>0001_A
ABCDEFGHIJ
>0002_A
ABCDEFGHIJ

Error:

World Size: 4 aaSize: 3
World Size: World Size: 4 aaSize: 3
World Size: 4World Size: 4 aaSize: 3
World Size: World Size: 4 aaSize: 3
World Size: 4Could not open data file /.../tmp/pref_7!
mv: cannot stat ‘/.../tmp/aln_7’: No such file or directory

shebang line in blastp.sh

When I run mmseqs2 to search sequences, I encounter an error in blastp.sh:

$ mmseqs createdb queryDB.fasta queryDB
$ mmseqs createdb targetDB.fasta targetDB
$ mmseqs createindex targetDB
$ mkdir -p tmp/
$ mmseqs search queryDB targetDB resultDB tmp --use-index

Program call:
queryDB targetDB resultDB tmp --use-index

MMseqs Version: ef19bf4
Sub Matrix /home/zcx/Program/MMseqs/2.0/data/blosum62.out
Alignment mode 0
E-value threshold 0.001
Coverage threshold 0
Detect fragments false
Compositional bias 1
Seq. Id Threshold 0
Max. sequence length 32000
Max. results per query 300
Max Reject 2147483647
Include identical Seq. Id. false
Nucleotide false
Profile false
Add backtrace false
Realign hit false
Threads 32
Verbosity 3
Sensitivity 4
K-mer size 7
K-score 2147483647
Alphabet size 21
Split DB 0
Split mode 2
Search mode 2
Diagonal Scoring 1
Minimum Diagonal score 30
Spaced Kmer 1
Profile e-value threshold 0.001
Use global sequence weighting false
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 100
Pseudo count a 1
Pseudo count b 1.5
First sequence as respresentative false
Number search iterations 1
Start sensitivity 4
Sensitivity step size 1
Use index true
Sets the MPI runner

[ -z /home/zcx/Program/MMseqs/2.0 ]
[ 4 -ne 4 ]
[ ! -f queryDB ]
[ ! -f targetDB ]
[ -f resultDB ]
[ ! -d tmp ]
export OMP_PROC_BIND=TRUE
dirname queryDB
cd .
basename queryDB
QUERY_FILE=queryDB
pwd
ABS_QUERY=/home/zcx/Program/MMseqs/2.0/test/queryDB
cd -
/home/zcx/Program/MMseqs/2.0/test
cd tmp
pwd
TMP_PATH=/home/zcx/Program/MMseqs/2.0/test/tmp
cd -
/home/zcx/Program/MMseqs/2.0/test
INPUT=queryDB
TARGET=targetDB
SENS=4
[ 4 -le 4 ]
notExists /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4
[ ! -f /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4 ]
mmseqs prefilter queryDB targetDB.sk7 /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4 --sub-mat /home/zcx/Program/MMseqs/2.0/data/blosum62.out -k 7 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --split 0 --split-mode 2 --search-mode 2 --comp-bias-corr 1 --diag-score 1 --min-diag-score 30 --spaced-kmer-mode 1 --threads 32 -v 3 -s 4
Program call:
queryDB targetDB.sk7 /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4 --sub-mat /home/zcx/Program/MMseqs/2.0/data/blosum62.out -k 7 --k-score 2147483647 --alph-size 21 --max-seq-len 32000 --max-seqs 300 --split 0 --split-mode 2 --search-mode 2 --comp-bias-corr 1 --diag-score 1 --min-diag-score 30 --spaced-kmer-mode 1 --threads 32 -v 3 -s 4

MMseqs Version: ef19bf4
Sub Matrix /home/zcx/Program/MMseqs/2.0/data/blosum62.out
Sensitivity 4
K-mer size 7
K-score 2147483647
Alphabet size 21
Max. sequence length 32000
Profile false
Nucleotide false
Max. results per query 300
Split DB 0
Split mode 2
Search mode 2
Compositional bias 1
Diagonal Scoring 1
Minimum Diagonal score 30
Include identical Seq. Id. false
Spaced Kmer 1
Threads 32
Verbosity 3

Initialising data structures...
Using 32 threads.

Index version: 774909490
KmerSize: 7
AlphabetSize: 21
Skip: 0
Split: 1
Type: 1
Spaced: 1
Query database: queryDB(size=246)
Target database: targetDB.sk7(size=10000)
Needed memory (14434761936 byte) of total memory (270462795776 byte)
Substitution matrices...
Time for init: 0 h 0 m 3s

Process prefiltering step 0 of 1

Index version: 774909490
KmerSize: 7
AlphabetSize: 21
Skip: 0
Split: 1
Type: 1
Spaced: 1
Copy 1650981 Entries (9905886 byte)
Setup Sizes
Read IndexTable ... Done
k-mer similarity threshold: 115
k-mer match probability: 0

Starting prefiltering scores calculation (step 0 of 1)
Query db start 0 to 246
Target db start 0 to 10000

736 k-mers per position.
448 DB matches per sequence.
553 Double diagonal matches per sequence.
0 Overflows .
25 sequences passed prefiltering per query sequence.
Median result list size: 21
5 sequences with 0 size result lists.

Time for prefiltering scores calculation: 0 h 2 m 18s
Time for merging files: 0 h 0 m 0 s

Overall time for prefiltering run: 0 h 2 m 32s

checkReturnCode Prefilter died
[ 0 -ne 0 ]
notExists /home/zcx/Program/MMseqs/2.0/test/tmp/aln_4
[ ! -f /home/zcx/Program/MMseqs/2.0/test/tmp/aln_4 ]
mmseqs alignment queryDB targetDB /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4 /home/zcx/Program/MMseqs/2.0/test/tmp/aln_4 --sub-mat /home/zcx/Program/MMseqs/2.0/data/blosum62.out --alignment-mode 0 -e 0.001 -c 0 --comp-bias-corr 1 --min-seq-id 0 --max-seq-len 32000 --max-seqs 300 --max-rejected 2147483647 --threads 32 -v 3
Program call:
queryDB targetDB /home/zcx/Program/MMseqs/2.0/test/tmp/pref_4 /home/zcx/Program/MMseqs/2.0/test/tmp/aln_4 --sub-mat /home/zcx/Program/MMseqs/2.0/data/blosum62.out --alignment-mode 0 -e 0.001 -c 0 --comp-bias-corr 1 --min-seq-id 0 --max-seq-len 32000 --max-seqs 300 --max-rejected 2147483647 --threads 32 -v 3

Init data structures...
Compute score only.
Using 32 threads.
Calculation of Smith-Waterman alignments.
Time for merging files: 0 h 0 m 0 s

All sequences processed.

6287 alignments calculated.
6203 sequence pairs passed the thresholds (0.986639 of overall calculated).
25.2154 hits per query sequence.
Time for alignments calculation: 0 h 0 m 1s

checkReturnCode Alignment died
[ 0 -ne 0 ]
[ 4 -gt 4 ]
NEXTINPUT=/home/zcx/Program/MMseqs/2.0/test/tmp/input_step4
[ 4 -lt 4 ]
let SENS=SENS+SENS_STEP_SIZE
/home/zcx/Program/MMseqs/2.0/bin/blastp.sh: 57: /home/zcx/Program/MMseqs/2.0/bin/blastp.sh: let: not found

I am running MMseqs2 on Ubuntu 14.04 (trusty) x86-64. On Ubuntu and Debian, the default /bin/sh is dash, not bash, and dash does not support the "let" builtin. I recommend changing the first line of "blastp.sh" from "#!/bin/sh -ex" to "#!/bin/bash -ex".
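
A possible alternative to changing the interpreter, sketched under the assumption that the only bash-specific construct is the `let SENS=SENS+SENS_STEP_SIZE` line shown in the trace: POSIX arithmetic expansion does the same increment and runs under both dash and bash. The starting values below are copied from the trace (sensitivity 4, step size 1). The rest of blastp.sh is not reproduced here.

```shell
#!/bin/sh
# Values taken from the trace above; in blastp.sh they come from the caller.
SENS=4
SENS_STEP_SIZE=1

# bash-only form that fails under dash:
#   let SENS=SENS+SENS_STEP_SIZE
# POSIX arithmetic expansion, accepted by dash and bash alike.
# The explicit "$" prefixes keep it working on older shells too.
SENS=$(($SENS + $SENS_STEP_SIZE))

echo "$SENS"
```

This keeps the script runnable on systems where bash is not installed, at the cost of having to check the rest of the script for other bash-only syntax.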

===================================================
====== Filter out the new from old sequences ======

Start writing to file tmp/NEWDB.newSeqs
Time for merging files: 0 h 0 m 0 s

=== Update the old clustering with the new keys ===

Mapping keys by file tmp/mappingSeqs
Time for merging files: 0 h 0 m 0 s

======= Extract representative sequences ==========

Done.
Time for processing: 0 h 0 m 0s

======= Search the new sequences against ==========
========= previous (rep seq of) clusters ==========

Time for merging files: 0 h 0 m 0 s

= Merge found sequences with previous clustering =

Merging the results to tmp/updatedClust
Done
Time for merging files: 0 h 0 m 0 s
Time for merging: 0 h 0 m 0s

=========== Extract unmapped sequences ============

Start writing to file tmp/toBeClusteredSeparately
Time for merging files: 0 h 0 m 0 s

===== Cluster separately the alone sequences ======

List amount 176
Clustering step 1...
Clustering step 2...
Writing the results...
Time for merging files: 0 h 0 m 0 s
...done.

==== Merge the updated clustering together with ===
===== the new clusters ======
