montezuma's People

Contributors

wiseman

Watchers

James Cloos

montezuma's Issues

:store-term-vector isn't char-safe

What steps will reproduce the problem?

1. Add :store-term-vector :with-positions-offsets to `make-field'
2. Index a corpus

Depending on your input, you'll get: "Illegal :UTF-8 character starting at byte position ..."
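
The report does not say where the bad position comes from. One plausible reading of "isn't char-safe", and it is only an assumption here, is that an offset counted in characters ends up being used as a byte position. The sketch below shows how the two diverge once the text contains a non-ASCII character. The helper names are hypothetical and not part of Montezuma, and the code assumes an implementation whose character codes are Unicode code points (SBCL, for example).

```lisp
;; Illustrative sketch only; these helpers are hypothetical, not Montezuma code.
(defun utf8-char-length (char)
  "Return the number of bytes CHAR occupies when encoded as UTF-8."
  (let ((code (char-code char)))
    (cond ((< code #x80) 1)
          ((< code #x800) 2)
          ((< code #x10000) 3)
          (t 4))))

(defun char-offset->byte-offset (string char-offset)
  "Return the UTF-8 byte offset that corresponds to CHAR-OFFSET in STRING."
  (loop for i from 0 below char-offset
        sum (utf8-char-length (char string i))))

;; Pure ASCII: character offset and byte offset agree.
(char-offset->byte-offset "cafe latte" 5)                              ; => 5

;; Code point 233 (e with acute accent) takes two bytes in UTF-8,
;; so the byte offset runs ahead of the character offset.
(char-offset->byte-offset (format nil "caf~C latte" (code-char 233)) 5) ; => 6
```

If a character offset of 5 were used as a byte position in the second string, it would land one byte short of the intended spot; with longer non-ASCII text the same drift can put a read in the middle of a multi-byte sequence, which is the kind of situation a decoder reports as an illegal character.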

Original issue reported on code.google.com by [email protected] on 5 Sep 2009 at 8:04

standard tokenizer hangs on some input

Edi Weitz reports that Montezuma hangs when indexing one of his documents.

I /think/, not having looked at Montezuma in detail, that the culprit is the complex regular expression in standard-tokenizer.lisp. I don't know what its purpose is but I'm pretty sure there are ways to fine-tune it. For example, as you seemingly don't use the register groups anywhere, you should replace every "(" with "(?:". Let me know if you want me to spend some time on optimizing this regex (/if/ that's the cause for the problem).

I can reproduce the problem using a document consisting of the string 
"john.______________________________________________________".


07/27/06 12:18:17: Modified by wiseman

Processing time seems exponential with respect to the number of underscores in the test string.

07/27/06 12:42:16: Modified by wiseman

Edi's recommended (?:...) optimization didn't make a significant difference with regard to this issue, at least.
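
Exponential growth of this kind is typical of a backtracking matcher that meets nested quantifiers on input that cannot match. The sketch below is an illustration under stated assumptions, not an analysis of the actual tokenizer: the pattern ^(?:_+)+$ is a stand-in picked for the demo (the real expression in standard-tokenizer.lisp is not reproduced here), and `count-backtracking-steps` is a hypothetical hand-written matcher that counts how many positions it has to try.

```lisp
;; Illustrative sketch only: ^(?:_+)+$ is a stand-in pattern for the demo,
;; not the regular expression used in standard-tokenizer.lisp.
(defun count-backtracking-steps (input)
  "Count the attempts a naive backtracking matcher makes for ^(?:_+)+$ on INPUT."
  (let ((steps 0)
        (end (length input)))
    (labels ((underscore-run (pos)
               ;; Length of the run of underscores starting at POS.
               (loop for i from pos below end
                     while (char= (char input i) #\_)
                     count t))
             (match-groups (pos)
               ;; Try every way of splitting the rest of INPUT into "_+" groups,
               ;; longest group first, backtracking to shorter ones on failure.
               (incf steps)
               (loop for len from (underscore-run pos) downto 1
                     for next = (+ pos len)
                     thereis (or (= next end)
                                 (match-groups next)))))
      (match-groups 0)
      steps)))

;; A run of N underscores followed by a character that forces the match to fail.
(loop for n in '(5 10 15 20)
      collect (count-backtracking-steps
               (concatenate 'string
                            (make-string n :initial-element #\_)
                            "x")))
;; => (32 1024 32768 1048576)
```

Each extra underscore doubles the work, because the matcher tries every way of cutting the run into groups before giving up. Non-capturing groups leave that search space unchanged, which is consistent with the observation that the (?:...) change made no significant difference.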


Original issue reported on code.google.com by [email protected] on 13 Jun 2008 at 10:47

Index broken after 2000 documents?

jsnell on #lisp found that after adding 2000 documents to an index he was unable to search without getting an error. If he optimized the index, the error went away. So at least there's a workaround.

His test code:

;; FILE-INDEX and SLURP are the reporter's own helpers; their definitions
;; were not included in the report.
(defparameter *index* (make-instance 'montezuma:index
                                     :path "/scratch/src/cl-list/db/index.mz"))

(defun build-index ()
  (time
   (dolist (file (subseq
                  (sort (directory "/scratch/src/cl-list/db/split/*")
                        #'<
                        :key #'file-index)
                  0))
     (format t "handling ~A~%" file)
     (force-output)
     ;; Search before every add, so the error surfaces as soon as the
     ;; index can no longer be searched.
     (montezuma:search-each *index* "body:foo"
                            (lambda (doc score)
                              (print doc)))
     (let ((content (slurp file)))
       ;; Split the message at the first empty line into headers and body.
       (destructuring-bind (headers body)
           (cl-ppcre:split "(?sm:^$)" content :limit 2)
         (flet ((header (regex)
                  (car (cl-ppcre:all-matches-as-strings regex headers))))
           (montezuma:add-document-to-index
            *index*
            `(("id" . ,(file-index file))
              ("headers" . ,headers)
              ("subject" . ,(header "(?<=(?mi:^Subject: ))(.*)"))
              ("from" . ,(header "(?<=(?mi:^From: ))(.*)"))
              ("date" . ,(header "(?<=(?mi:^Date: ))(.*)"))
              ("body" . ,body)))))))))

The stack trace:

End of file error while reading #<SB-SYS:FD-STREAM for "file /scratch/src/cl-list/db/index.mz/_33p.cfs" {1003A9AFA1}>
   [Condition of type SIMPLE-ERROR]

Restarts:
  0: [ABORT-REQUEST] Abort handling SLIME request.
  1: [ABORT] Exit debugger, returning to top level.

Backtrace:
  0: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-INTERNAL (MONTEZUMA::FS-INDEX-INPUT T T 
T)) (#(NIL 5) . #()) #<unavailable argument> #<MONTEZUMA::FS-INDEX-INPUT 
{1003A99001}> 
#(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
      Locals:
        SB-DEBUG::ARG-0 = (#(NIL 5) . #())
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::FS-INDEX-INPUT {1003A99001}>
        SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
        SB-DEBUG::ARG-4 = 1000
        SB-DEBUG::ARG-5 = 1000
  1: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-BYTES (MONTEZUMA::BUFFERED-INDEX-INPUT 
T T T)) (#(NIL 3 4 1 2) . #()) #<unavailable argument> 
#<MONTEZUMA::FS-INDEX-INPUT 
{1003A99001}> #(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
      Locals:
        SB-DEBUG::ARG-0 = (#(NIL 3 4 1 2) . #())
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::FS-INDEX-INPUT {1003A99001}>
        SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
        SB-DEBUG::ARG-4 = 1000
        SB-DEBUG::ARG-5 = 1000
  2: ((SB-PCL::FAST-METHOD MONTEZUMA::READ-BYTES (MONTEZUMA::BUFFERED-INDEX-INPUT 
T T T)) (#(NIL 3 4 1 2) . #()) #<unavailable argument> 
#<MONTEZUMA::CS-INDEX-INPUT 
{1003BF8BD1}> #(116 113 110 115 113 112 104 108 105 109 ...) 1000 1000)
      Locals:
        SB-DEBUG::ARG-0 = (#(NIL 3 4 1 2) . #())
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::CS-INDEX-INPUT {1003BF8BD1}>
        SB-DEBUG::ARG-3 = #(116 113 110 115 113 112 104 108 105 109 ...)
        SB-DEBUG::ARG-4 = 1000
        SB-DEBUG::ARG-5 = 1000
  3: ((SB-PCL::FAST-METHOD MONTEZUMA::GET-NORMS-INTO (MONTEZUMA::SEGMENT-READER 
T T T)) (#(NIL 16) . #()) #<unavailable argument> #<MONTEZUMA::SEGMENT-READER 
"_33p" 
(1000 docs, 0 deleted docs, 6 field infos) {1003A96E81}> "body" #(116 113 110 
115 113 112 
104 108 105 109 ...) 1000)
      Locals:
        SB-DEBUG::ARG-0 = (#(NIL 16) . #())
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs, 
6 field infos) {1003A96E81}>
        SB-DEBUG::ARG-3 = "body"
        SB-DEBUG::ARG-4 = #(116 113 110 115 113 112 104 108 105 109 ...)
        SB-DEBUG::ARG-5 = 1000
  4: ((LAMBDA (MONTEZUMA::SUB-READER)) #<MONTEZUMA::SEGMENT-READER "_33p" (1000 
docs, 0 deleted docs, 6 field infos) {1003A96E81}>)
      Locals:
        SB-DEBUG::ARG-0 = #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs, 
6 field infos) {1003A96E81}>
  5: (SB-IMPL::%MAP-FOR-EFFECT-ARITY-1 #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) 
{1003BF8659}> #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted docs, 
6 field 
infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted 
docs, 6 
field infos) {1003A96E81}>))
      Locals:
        SB-DEBUG::ARG-0 = 2
        SB-DEBUG::ARG-1 = #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}>
        SB-DEBUG::ARG-2 = #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted 
docs, 6 field infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 
docs, 0 
deleted docs, 6 field infos) {1003A96E81}>)
  6: (SB-KERNEL:%MAP NIL #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}> 
#(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted docs, 6 field infos) 
{1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 docs, 0 deleted docs, 6 
field 
infos) {1003A96E81}>))
      Locals:
        SB-DEBUG::ARG-0 = 3
        SB-DEBUG::ARG-1 = NIL
        SB-DEBUG::ARG-2 = #<CLOSURE (LAMBDA (MONTEZUMA::SUB-READER)) {1003BF8659}>
        SB-DEBUG::ARG-3 = #(#<MONTEZUMA::SEGMENT-READER "_1ju" (1000 docs, 0 deleted 
docs, 6 field infos) {1003A79981}> #<MONTEZUMA::SEGMENT-READER "_33p" (1000 
docs, 0 
deleted docs, 6 field infos) {1003A96E81}>)
  7: ((SB-PCL::FAST-METHOD MONTEZUMA::GET-NORMS (MONTEZUMA::MULTI-READER T)) 
(#(NIL 11 9 6) . #()) #<unavailable argument> #<MONTEZUMA::MULTI-READER 
{1003AB80C1}> 
"body")
      Locals:
        SB-DEBUG::ARG-0 = (#(NIL 11 9 6) . #())
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::MULTI-READER {1003AB80C1}>
        SB-DEBUG::ARG-3 = "body"
  8: ((SB-PCL::FAST-METHOD MONTEZUMA:SCORER (MONTEZUMA::TERM-WEIGHT T)) 
#<unavailable argument> #<unavailable argument> #<MONTEZUMA::TERM-WEIGHT query: 
#<MONTEZUMA:TERM-QUERY "body":"foo"^1.0 {1003ABAEC1}> {1003ABB141}> 
#<MONTEZUMA::MULTI-READER {1003AB80C1}>)
      Locals:
        SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA::TERM-WEIGHT query: #<MONTEZUMA:TERM-QUERY 
"body":"foo"^1.0 {1003ABAEC1}> {1003ABB141}>
        SB-DEBUG::ARG-3 = #<MONTEZUMA::MULTI-READER {1003AB80C1}>
  9: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH (MONTEZUMA:INDEX-SEARCHER T)) #<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX-SEARCHER {1003AB83A1}> #<MONTEZUMA:BOOLEAN-QUERY with 1 clauses: #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:TERM-QUERY "body":"foo"^1.0 {1003ABAEC1}>>> NIL)
      Locals:
        SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX-SEARCHER {1003AB83A1}>
        SB-DEBUG::ARG-3 = #<MONTEZUMA:BOOLEAN-QUERY with 1 clauses: 
#<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:TERM-QUERY 
"body":"foo"^1.0 {1003ABAEC1}>>>
        SB-DEBUG::ARG-4 = NIL
 10: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH-EACH (MONTEZUMA:INDEX T T)) 
#<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX {1002200721}> 
"body:foo" #<FUNCTION (LAMBDA (DOC SCORE)) {1003E3DAE9}> NIL)
      Locals:
        SB-DEBUG::ARG-0 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-1 = :<NOT-AVAILABLE>
        SB-DEBUG::ARG-2 = #<MONTEZUMA:INDEX {1002200721}>
        SB-DEBUG::ARG-3 = "body:foo"
        SB-DEBUG::ARG-4 = #<FUNCTION (LAMBDA (DOC SCORE)) {1003E3DAE9}>
        SB-DEBUG::ARG-5 = NIL

07/14/06 09:35:38: Modified by wiseman

I am able to reproduce this.


Original issue reported on code.google.com by [email protected] on 13 Jun 2008 at 10:45

Point to the code on GitHub?

Which GitHub repository is the "main" Montezuma one?

Is it this one?
https://github.com/skypher/montezuma

Original issue reported on code.google.com by alex.rudnick on 21 Apr 2012 at 8:36

broken :must-not-occur or phrase query

As of revision 342 at least (and probably before that), the following query fails on the Planet Lisp corpus: [html-template !"edi weitz"].

The error:

{{{
Argument Y is not a NUMBER: NIL
   [Condition of type SIMPLE-TYPE-ERROR]

Restarts:
  0: [ABORT-REQUEST] Abort handling SLIME request.
  1: [ABORT] Exit debugger, returning to top level.

Backtrace:
  0: (SB-KERNEL:TWO-ARG-* 1.0 NIL)
  1: ((SB-PCL::FAST-METHOD MONTEZUMA:SCORE (MONTEZUMA::PHRASE-SCORER)) (#(NIL 3) . 
#()) #<unavailable argument> #<MONTEZUMA::EXACT-PHRASE-SCORER weight: 
#<MONTEZUMA::PHRASE-WEIGHT query: #<MONTEZUMA::PHRASE-QUERY field: "text" 
terms: 
[edi:0 weitz:1]> {12040969}> {120A8AD1}>)
  2: ((SB-PCL::FAST-METHOD MONTEZUMA::ADVANCE-AFTER-CURRENT 
(MONTEZUMA::DISJUNCTION-SUM-SCORER)) (#(NIL 2 3 5 4 7) . #()) #<unused 
argument> 
#<MONTEZUMA::COUNTING-DISJUNCTION-SUM-SCORER {120AC491}>)
  3: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::DISJUNCTION-SUM-SCORER)) 
(#(NIL 5 7) . #()) #<unused argument> 
#<MONTEZUMA::COUNTING-DISJUNCTION-SUM-SCORER 
{120AC491}>)
  4: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::BOOLEAN-SCORER)) (#(NIL 1) . 
#()) #<unused argument> #<MONTEZUMA::BOOLEAN-SCORER {12079611}>)
  5: ((SB-PCL::FAST-METHOD MONTEZUMA::NEXT? (MONTEZUMA::REQ-EXCL-SCORER)) (#(NIL 2 3 
1) . #()) #<unused argument> #<MONTEZUMA::REQ-EXCL-SCORER {120AB261}>)
  6: ((SB-PCL::FAST-METHOD MONTEZUMA:EACH-HIT (MONTEZUMA::BOOLEAN-SCORER T)) 
(#(NIL 1) . #()) #<unused argument> #<MONTEZUMA::BOOLEAN-SCORER {12047BF1}> 
#<CLOSURE (LAMBDA (MONTEZUMA:DOC MONTEZUMA:SCORE)) {120A9035}>)
  7: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH (MONTEZUMA:INDEX-SEARCHER T)) #<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX-SEARCHER {11C8E3E9}> #<MONTEZUMA:BOOLEAN-QUERY with 2 clauses: #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA:BOOLEAN-QUERY with 3 clauses: #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1=field: "title" terms: [#]>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>>>> #<MONTEZUMA:BOOLEAN-CLAUSE :MUST-NOT-OCCUR #<MONTEZUMA:BOOLEAN-QUERY with 3 clauses: #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>> #<MONTEZUMA:BOOLEAN-CLAUSE :SHOULD-OCCUR #<MONTEZUMA::PHRASE-QUERY #1#>>>>> NIL)
  8: ((SB-PCL::FAST-METHOD MONTEZUMA:SEARCH-EACH (MONTEZUMA:INDEX T T)) #<unavailable argument> #<unavailable argument> #<MONTEZUMA:INDEX {11C6D4F9}> "html-template !\"edi weitz\"" #<CLOSURE (LAMBDA (PLANET-LISP-SEARCH::DOC PLANET-LISP-SEARCH::SCORE)) {11FD0D7D}> (NIL))
  9: (PLANET-LISP-SEARCH::TIME-THUNK #<CLOSURE (LAMBDA NIL) {11FD0D65}>)
 10: (PLANET-LISP-SEARCH:SEARCH-POSTS "html-template !\"edi weitz\"" NIL NIL)
 11: (SB-INT:EVAL-IN-LEXENV (PLANET-LISP-SEARCH:SEARCH-POSTS "html-template !\"edi weitz\"") #<NULL-LEXENV>)
 --more--
}}}

And the result of the corresponding test in Ferret:

{{{
1) Error:
test_boolean_query_with_must_not_and_phrase(IndexSearcherTest):
TypeError: nil can't be coerced into Float
    ./test/../lib/ferret/search/phrase_scorer.rb:72:in `*'
    ./test/../lib/ferret/search/phrase_scorer.rb:72:in `score'
    ./test/../lib/ferret/search/disjunction_sum_scorer.rb:130:in `advance_after_current'
    ./test/../lib/ferret/search/disjunction_sum_scorer.rb:182:in `skip_to'
    ./test/../lib/ferret/search/boolean_scorer.rb:281:in `skip_to'
    ./test/../lib/ferret/search/req_excl_scorer.rb:58:in `to_non_excluded'
    ./test/../lib/ferret/search/req_excl_scorer.rb:37:in `next?'
    ./test/../lib/ferret/search/boolean_scorer.rb:228:in `each_hit'
    ./test/../lib/ferret/search/index_searcher.rb:118:in `search'

    ./test/unit/../unit/analysis/../../unit/document/../../unit/index/../../unit/query_parser/../../unit/search/tc_index_searcher.rb:33:in `check_hits'
    ./test/unit/../unit/analysis/../../unit/document/../../unit/index/../../unit/query_parser/../../unit/search/tc_index_searcher.rb:154:in `test_boolean_query_with_must_not_and_phrase'
}}}

I will now see if the latest version of Ferret still has the bug.


Original issue reported on code.google.com by [email protected] on 13 Jun 2008 at 10:41
