ariesliuh / montezuma
Automatically exported from code.google.com/p/montezuma
What steps will reproduce the problem?
1. Add :store-term-vector :with-positions-offsets to `make-field'
2. Index a corpus
Depending on your input, you'll get: "Illegal :UTF-8 character starting at byte position ..."
Original issue reported on code.google.com by [email protected]
on 5 Sep 2009 at 8:04
Edi Weitz reports that Montezuma hangs when indexing one of his documents.
I /think/, not having looked at Montezuma in detail, that the culprit is the complex regular expression in standard-tokenizer.lisp. I don't know what its purpose is but I'm pretty sure there are ways to fine-tune it. For example, as you seemingly don't use the register groups anywhere, you should replace every "(" with "(?:". Let me know if you want me to spend some time on optimizing this regex (/if/ that's the cause for the problem).
I can reproduce the problem using a document consisting of the string
"john.______________________________________________________".
07/27/06 12:18:17: Modified by wiseman
Processing time seems exponential with respect to the number of underscores in the test string.
07/27/06 12:42:16: Modified by wiseman
Edi's recommended (?:...) optimization didn't make a significant difference with regard to this issue, at least.
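One way to check the exponential growth noted above is to time the tokenizing step against test strings with an increasing number of underscores. The following is a minimal sketch in plain Common Lisp, not code from Montezuma: the names `make-test-document`, `time-call`, and `measure-growth` are hypothetical helpers, and the function being timed is whatever tokenizer entry point you pass in.

```lisp
;; Hypothetical helpers for reproducing the slowdown; none of these
;; names come from Montezuma itself.

(defun make-test-document (underscore-count)
  "Return the string \"john.\" followed by UNDERSCORE-COUNT underscores."
  (concatenate 'string
               "john."
               (make-string underscore-count :initial-element #\_)))

(defun time-call (function argument)
  "Call FUNCTION on ARGUMENT and return the elapsed real time in seconds."
  (let ((start (get-internal-real-time)))
    (funcall function argument)
    (/ (- (get-internal-real-time) start)
       (float internal-time-units-per-second))))

(defun measure-growth (function &key (min-count 1) (max-count 20))
  "Return a list of (UNDERSCORE-COUNT . SECONDS) pairs, one per test string."
  (loop for n from min-count to max-count
        collect (cons n (time-call function (make-test-document n)))))
```

For example, `(measure-growth #'my-tokenize :max-count 25)`, where `my-tokenize` stands in for a function that runs the tokenizer over a string. If the elapsed time roughly doubles with each extra underscore, that is consistent with the exponential behaviour described above.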
Original issue reported on code.google.com by [email protected]
on 13 Jun 2008 at 10:47
Using ASDF-BINARY-LOCATIONS will mess up the current logic by setting it to the FASL directory.
The attached patch fixes it at the cost of hardcoding a bit more of the path.
Original issue reported on code.google.com by [email protected]
on 12 Jan 2009 at 9:11
Attachments:
jsnell on #lisp found that after adding 2000 documents to an index he was unable to search without getting an error. If he optimized the index, the error went away. So at least there's a workaround.
His test code:
(defparameter *index* (make-instance 'montezuma:index
                                     :path "/scratch/src/cl-list/db/index.mz"))
(defun build-index ()
  (time
   (dolist (file (subseq
                  (sort (directory "/scratch/src/cl-list/db/split/*")
                        #'<
                        :key #'file-index)
                  0))
     (format t "handling ~A~%" file)
     (force-output)
     (montezuma:search-each *index* "body:foo"
                            (lambda (doc score)
                              (print doc)))
     (let ((content (slurp file)))
       (destructuring-bind (headers body)
           (cl-ppcre:split "(?sm:^$)" content :limit 2)
         (flet ((header (regex)
                  (car (cl-ppcre:all-matches-as-strings regex headers))))
           (montezuma:add-document-to-index
            *index*
            `(("id" . ,(file-index file))
              ("headers" . ,headers)
              ("subject" . ,(header "(?<=(?mi:^Subject: ))(.*)"))
              ("from" . ,(header "(?<=(?mi:^From: ))(.*)"))
              ("date" . ,(header "(?<=(?mi:^Date: ))(.*)"))
              ("body" . ,body)))))))))
The stack trace:
End of file error while reading #<SB-SYS:FD-STREAM for "file /scratch/src/cl-list/db/index.mz/_33p.cfs" {1003A9AFA1}>
[Condition of type SIMPLE-ERROR]
Restarts:
0: [ABORT-REQUEST] Abort handling SLIME request.
1: [ABORT] Exit debugger, returning to top level.
Backtrace:
0: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-INTERNAL (MONTEZUMA::FS-INDEX-INPUT T T
T)) (#(NIL 5) . #()) #<unavailable argument> #<MONTEZUMA::FS-INDEX-INPUT
{1003A99001}>
#(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
Locals:
SB-DEBUG::ARG-0 = (#(NIL 5) . #())
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::FS-INDEX-INPUT {1003A99001}>
SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
SB-DEBUG::ARG-4 = 1000
SB-DEBUG::ARG-5 = 1000
1: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-BYTES (MONTEZUMA::BUFFERED-INDEX-INPUT
T T T)) (#(NIL 3 4 1 2) . #()) #<unavailable argument>
#<MONTEZUMA::FS-INDEX-INPUT
{1003A99001}> #(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
Locals:
SB-DEBUG::ARG-0 = (#(NIL 3 4 1 2) . #())
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::FS-INDEX-INPUT {1003A99001}>
SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
SB-DEBUG::ARG-4 = 1000
SB-DEBUG::ARG-5 = 1000
2: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-BYTES (MONTEZUMA::BUFFERED-INDEX-INPUT
T T T)) (#(NIL 3 4 1 2) . #()) #<unavailable argument>
#<MONTEZUMA::CS-INDEX-INPUT
{1003BF8BD1}> #(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
Locals:
SB-DEBUG::ARG-0 = (#(NIL 3 4 1 2) . #())
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::CS-INDEX-INPUT {1003BF8BD1}>
SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
SB-DEBUG::ARG-4 = 1000
SB-DEBUG::ARG-5 = 1000
3: ((SB-PCL::FAST-METHOD MONTEZUMA::GET-NORMS-INTO (MONTEZUMA::SEGMENT-READER
T T T)) (#(NIL 16) . #()) #<unavailable argument> #<MONTEZUMA::SEGMENT-READER
"_33p"
(1000 docs, 0 deleted docs, 6 field infos) {1003A96E81}> "body" #(116 113 110
115 113 112
104 108 105 109 ...) 1000)
Locals:
SB-DEBUG::ARG-0 = (#(NIL 16) . #())
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs,
6 field infos) {1003A96E81}>
SB-DEBUG::ARG-3 = "body"
SB-DEBUG::ARG-4 = #(116 113 110 115 113 112 104 108 105 109 ...)
SB-DEBUG::ARG-5 = 1000
4: ((LAMBDA (MONTEZUMA::SUB-READER)) #<MONTEZUMA::SEGMENT-READER "_33p" (1000
docs, 0 deleted docs, 6 field infos) {1003A96E81}>)
Locals:
SB-DEBUG::ARG-0 = #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs,
6 field infos) {1003A96E81}>
5: (SB-IMPL::%MAP-FOR-EFFECT-ARITY-1 #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER))
{1003BF8659}> #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted docs,
6 field
infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted
docs, 6
field infos) {1003A96E81}>))
Locals:
SB-DEBUG::ARG-0 = 2
SB-DEBUG::ARG-1 = #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}>
SB-DEBUG::ARG-2 = #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted
docs, 6 field infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000
docs, 0
deleted docs, 6 field infos) {1003A96E81}>)
6: (SB-KERNEL:%MAP NIL #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}>
#(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted docs, 6 field infos)
{1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs, 6
field
infos) {1003A96E81}>))
Locals:
SB-DEBUG::ARG-0 = 3
SB-DEBUG::ARG-1 = NIL
SB-DEBUG::ARG-2 = #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}>
SB-DEBUG::ARG-3 = #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted
docs, 6 field infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000
docs, 0
deleted docs, 6 field infos) {1003A96E81}>)
7: ((SB-PCL::FAST-METHOD MONTEZUMA::GET-NORMS (MONTEZUMA::MULTI-READER T))
(#(NIL 11 9 6) . #()) #<unavailable argument> #<MONTEZUMA::MULTI-READER
{1003AB80C1}>
"body")
Locals:
SB-DEBUG::ARG-0 = (#(NIL 11 9 6) . #())
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::MULTI-READER {1003AB80C1}>
SB-DEBUG::ARG-3 = "body"
8: ((SB-PCL::FAST-METHOD MONTEZUMA:SCORER (MONTEZUMA::TERM-WEIGHT T))
#<unavailable argument> #<unavailable argument> #<MONTEZUMA::TERM-WEIGHT query:
#<MONTEZUMA:TERM-QUERY "body":"foo"^1.0 {1003ABAEC1}> {1003ABB141}>
#<MONTEZUMA::MULTI-READER {1003AB80C1}>)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA::TERM-WEIGHT query: #<MONTEZUMA:TERM-QUERY
"body":"foo"^1.0 {1003ABAEC1}> {1003ABB141}>
SB-DEBUG::ARG-3 = #<MONTEZUMA::MULTI-READER {1003AB80C1}>
9: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH (MONTEZUMA:INDEX-SEARCHER T))
#<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX-SEARCHER
{1003AB83A1}> #<MONTEZUMA:BOOLEAN-QUERY with 1 clauses:
#<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:TERM-QUERY "body":"foo"^1.0 {1003ABAEC1}>>>
NIL)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX-SEARCHER {1003AB83A1}>
SB-DEBUG::ARG-3 = #<MONTEZUMA:BOOLEAN-QUERY with 1 clauses:
#<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:TERM-QUERY
"body":"foo"^1.0 {1003ABAEC1}>>>
SB-DEBUG::ARG-4 = NIL
10: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH-EACH (MONTEZUMA:INDEX T T))
#<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX {1002200721}>
"body:foo" #<FUNCTION (LAMBDA (DOC SCORE)) {1003E3DAE9}> NIL)
Locals:
SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX {1002200721}>
SB-DEBUG::ARG-3 = "body:foo"
SB-DEBUG::ARG-4 = #<FUNCTION (LAMBDA (DOC SCORE)) {1003E3DAE9}>
SB-DEBUG::ARG-5 = NIL
07/14/06 09:35:38: Modified by wiseman
I am able to reproduce this.
Original issue reported on code.google.com by [email protected]
on 13 Jun 2008 at 10:45
Which github repository is the "main" montezuma one?
Is it this one?
https://github.com/skypher/montezuma
Original issue reported on code.google.com by alex.rudnick
on 21 Apr 2012 at 8:36
As of revision 342 at least (and probably before that), the following query fails on the Planet Lisp corpus: html-template !"edi weitz".
The error:
{{{
Argument Y is not a NUMBER: NIL
[Condition of type SIMPLE-TYPE-ERROR]
Restarts:
0: [ABORT-REQUEST] Abort handling SLIME request.
1: [ABORT] Exit debugger, returning to top level.
Backtrace:
0: (SB-KERNEL:TWO-ARG-* 1.0 NIL)
1: ((SB-PCL::FAST-METHOD MONTEZUMA:SCORE (MONTEZUMA::PHRASE-SCORER)) (#(NIL 3) .
#()) #<unavailable argument> #<MONTEZUMA::EXACT-PHRASE-SCORER weight:
#<MONTEZUMA::PHRASE-WEIGHT query: #<MONTEZUMA::PHRASE-QUERY field: "text"
terms:
[edi:0 weitz:1]> {12040969}> {120A8AD1}>)
2: ((SB-PCL::FAST-METHOD MONTEZUMA::ADVANCE-AFTER-CURRENT
(MONTEZUMA::DISJUNCTION-SUM-SCORER)) (#(NIL 2 3 5 4 7) . #()) #<unused
argument>
#<MONTEZUMA::COUNTING-DISJUNCTION-SUM-SCORER {120AC491}>)
3: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::DISJUNCTION-SUM-SCORER))
(#(NIL 5 7) . #()) #<unused argument>
#<MONTEZUMA::COUNTING-DISJUNCTION-SUM-SCORER
{120AC491}>)
4: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::BOOLEAN-SCORER)) (#(NIL 1) .
#()) #<unused argument> #<MONTEZUMA::BOOLEAN-SCORER {12079611}>)
5: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::REQ-EXCL-SCORER)) (#(NIL 2 3
1) . #()) #<unused argument> #<MONTEZUMA::REQ-EXCL-SCORER {120AB261}>)
6: ((SB-PCL::FAST-METHOD MONTEZUMA:EACH-HIT (MONTEZUMA::BOOLEAN-SCORER T))
(#(NIL 1) . #()) #<unused argument> #<MONTEZUMA::BOOLEAN-SCORER {12047BF1}>
#<CLOSURE (LAMBDA (MONTEZUMA:DOC MONTEZUMA:SCORE)) {120A9035}>)
7: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH (MONTEZUMA:INDEX-SEARCHER T))
#<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX-SEARCHER
{11C8E3E9}> #<MONTEZUMA:BOOLEAN-QUERY with 2 clauses:
#<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:BOOLEAN-QUERY with 3 clauses:
#<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1=field:
"title" terms: [#]>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR
#<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR
#<MONTEZUMA::PHRASE-QUERY #1#>>>>
#<MONTEZUMA:BOOLEAN-CLAUSE :MUST-NOT-OCCUR #<MONTEZUMA:BOOLEAN-QUERY with 3 clauses: #<MONTEZUMA:BOOLEAN-CLAUSE
:SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE
:SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE
:SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>>>>> NIL)
8: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH-EACH (MONTEZUMA:INDEX T T))
#<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX {11C6D4F9}>
"html-template !\"edi weitz\"" #<CLOSURE (LAMBDA (PLANET-LISP-SEARCH::DOC
PLANET-LISP-SEARCH::SCORE)) {11FD0D7D}> (NIL))
9: (PLANET-LISP-SEARCH::TIME-THUNK #<CLOSURE (LAMBDA NIL) {11FD0D65}>)
10: (PLANET-LISP-SEARCH:SEARCH-POSTS "html-template !\"edi weitz\"" NIL NIL)
11: (SB-INT:EVAL-IN-LEXENV (PLANET-LISP-SEARCH:SEARCH-POSTS "html-template !\"edi
weitz\"") #<NULL-LEXENV>)
--more--
}}}
And the result:
{{{
1) Error:
test_boolean_query_with_must_not_and_phrase(IndexSearcherTest):
TypeError: nil can't be coerced into Float
./test/../lib/ferret/search/phrase_scorer.rb:72:in `*'
./test/../lib/ferret/search/phrase_scorer.rb:72:in `score'
./test/../lib/ferret/search/disjunction_sum_scorer.rb:130:in `advance_after_current'
./test/../lib/ferret/search/disjunction_sum_scorer.rb:182:in `skip_to'
./test/../lib/ferret/search/boolean_scorer.rb:281:in `skip_to'
./test/../lib/ferret/search/req_excl_scorer.rb:58:in `to_non_excluded'
./test/../lib/ferret/search/req_excl_scorer.rb:37:in `next?'
./test/../lib/ferret/search/boolean_scorer.rb:228:in `each_hit'
./test/../lib/ferret/search/index_searcher.rb:118:in `search'
./test/unit/../unit/analysis/../../unit/document/../../unit/index/../../unit/query_parser/../../unit/search/tc_index_searcher.rb:33:in `check_hits'
./test/unit/../unit/analysis/../../unit/document/../../unit/index/../../unit/query_parser/../../unit/search/tc_index_searcher.rb:154:in `test_boolean_query_with_must_not_and_phrase'
}}}
I will now see if the latest version of Ferret still has the bug.
Original issue reported on code.google.com by [email protected]
on 13 Jun 2008 at 10:41