whoosh-community / whoosh Goto Github PK

View Code? Open in Web Editor NEW

235.0 235.0 36.0 28.49 MB

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

License: Other

Python 100.00%

whoosh's People

Contributors

Stargazers

Watchers

whoosh's Issues

Spans raise spans not implemented in when using rebuilded index

Original report by Marcin Kuzminski (Bitbucket: marcinkuzminski, GitHub: marcinkuzminski).

Ability to use spans is a great future, when I build an Index from scratch it works fine,
doing reindexing that builds second index and not merging it with the default raises this exception.
NotImplementedError: spans not implemented in <class 'whoosh.matching.MultiMatcher'>

Is it possible to implement spans in MultiMatcher ?

API to customize tempfile creation in pools.

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

maximum recursion with many OR to QueryParser

Original report by Chris Dent (Bitbucket: cdent, GitHub: cdent).

Parsing a query with a large number of 'OR' statements in it, with the default QueryParser will cause

RuntimeError: maximum recursion depth exceeded while calling a Python object

Here's a basic test case

>>> from whoosh.fields import Schema, TEXT
>>> from whoosh.qparser import QueryParser                                        
>>> schema = Schema(content=TEXT())
>>> parser = QueryParser("content", schema=schema)                   
>>> x = parser.parse(u"1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1 OR 1")

What matters here is not the values being queried between the ORs but the number of ORs. Take one away and recursion depth is not hit. The above number of ORs may seem odd but comes about as the result of automatic query generation (there are many layers of processing in this situation). The actual query is more than this, but the above is the minimum to tickle the problem.

Is this a fundamental limitation in the parser or a fixable bug?

Temp files probably aren't deleted if the user calls writer.cancel() or if there's an exception

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Lazy load values in FilePostingWriter

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Don't call _read_values() for a block until someone asks for a value().

long/int conversion error in post reading

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

aback42 on the forum:

In Whoosh 1.0.0b2,
when searching on 8.2G index this error occurs:

Traceback (most recent call last):
  File "main.py", line 241, in miadvsearch_activate_cb
  File "main.py", line 259, in search_stmt
  File "cireader.pyc", line 332, in search
  File "searchengine\searching.pyc", line 29, in search
  File "whoosh\searching.pyc", line 351, in search
  File "whoosh\query.pyc", line 466, in matcher
  File "whoosh\searching.pyc", line 119, in postings
  File "whoosh\reading.pyc", line 465, in postings
  File "whoosh\filedb\filereading.pyc", line 238, in postings
  File "whoosh\filedb\filepostings.pyc", line 252, in __init__
  File "whoosh\filedb\filepostings.pyc", line 396, in _next_block
  File "whoosh\filedb\filepostings.pyc", line 380, in _consume_block
  File "whoosh\filedb\filepostings.pyc", line 353, in _read_values
  File "whoosh\filedb\structfile.pyc", line 110, in __getitem__
OverflowError: long int too large to convert to int

Temp directories are not deleted when indexing is done

Original report by Anonymous.

Improve speed of text highlight.

Original report by Marcin Kuzminski (Bitbucket: marcinkuzminski, GitHub: marcinkuzminski).

I'm using whoosh as an indexer in my app, it scanns over a mercurial repositories. When marking search results using highlight module sometime in can kill my server. For example building an index on few dictionary files and searching one word from this index makes my app freze for 2-4 minutes off 100% cpu usage to highlight the one word from content.

This is how i do it it's not a complicated analyzer or formatter.

#!python
analyzer = RegexTokenizer(expression=r"\w+") | LowercaseFilter()
formatter = HtmlFormatter('span',
	between='\n<span class="break">...</span>\n') 

#how the parts are splitted within the same text part
fragmenter = SimpleFragmenter(200)
#fragmenter = ContextFragmenter(search_items)

for res in results:
	d = {}
	d.update(res)
	hl = highlight(escape(res['content']), search_items,
	    analyzer=analyzer,
            fragmenter=fragmenter,
            formatter=formatter,
            top=5)

It would be awesome if it could be improved.

Open Numeric Range Tests

Original report by Daniel Lindsley (Bitbucket: toastdriven, GitHub: toastdriven).

While struggling with a bug of my own, I fleshed out some open range tests for the NUMERIC type. They pass for me, which means a more complete test suite for you, but sadly don't reveal my bug. Patch included.

Variations's existing_terms(..., reverse=True) returns the empty set.

Original report by Adam Blinkinsop (Bitbucket: blinks, GitHub: blinks).

Looks like it's caused by line 908 of whoosh.query:

#!python
return [word for word in self.words if (fieldname, word) in ixreader]

This won't return anything if the words don't exist in the reader, and if I'm looking for the reversed existing terms (non-existant terms?), I get zip.

simple typo in whoosh.query.Phrase.eq()

Original report by jdubery (Bitbucket: jdubery, GitHub: Unknown).

Function whoosh.query.Phrase.eq() has [and self.words == other.word and] in its return statement; it should have [and self.words == other.words and]

cheers, John

encode_termkey crashes with a unicode field name

Original report by eevee (Bitbucket: eevee, GitHub: eevee).

I ran into this using searcher.search(sorted_by=(u'foo', u'bar')), done mostly out of habit. The actual field names don't contain any non-ASCII characters, but merely naming them with unicodes instead of strs probably shouldn't crash.

Actual problem is that encode_termkey does "%s %s" % (fieldname, encoded_text); having the fieldname be a unicode forces the returned string to also be a unicode, which tries to decode the encoded text as ascii, which generally fails since it's actually encoded UTF-8.

Attached patch fixes this and adds a test. Also fixes the test_random_termkeys test; it would intermittently bomb because it can pick characters from the D800–DFFF range, which are all used for surrogate pairs and aren't real characters, so they don't roundtrip when encoded+decoded as utf8.

Add Reader.docnums_and_fields() method

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Add a method for iterating through all documents that yields both docnums and fields. This would make e.g. finding out-of-date documents and deleting them easier.

TypeError when calling AsyncWriter().delete_document.

Original report by Collin Anderson (Bitbucket: collinmanderson, GitHub: Unknown).

I want to use AsyncWriter to solve my LockError problems, but I get this error when calling delete_document:

_record() takes exactly 4 arguments (3 given)}}}
{{{_record() takes exactly 4 arguments (3 given)

The problem is that delete_document passes docnum, instead of passing *args and **kwargs like others.

filelock: try harder not to leak fds (patch)

Original report by Stéphane Démurget (Bitbucket: zzrough, GitHub: zzrough).

(I hope creating new issues at bitbucket instead of the old Trac system is okay)

I've encountered http://whoosh.ca/ticket/70 (file lock fd not closed, that you fixed when importing whoosh into bitbucket -- this was the original point of my patch).

I'm opening a writer for each write in a FastCGI application (writes are not so frequent, but still, the system is exhausted of fds as the processes are reused).

Here's a small patch that tries to close the fd even if the flock fail. This also ensures the object state stays in sync (self.locked and self.fd). I'm silencing out eventual close errors not to mask the real exception raised until logging is available (I saw #6 here). I also fixed a missing self.locked update in the msvcrt case.

I do not have access to the box leaking fds on my side ATM but I should get a lsof or /proc/xxx/fd listing soon to ensure it's only the index not properly closed. Still, your fix might only fixes a part of #70 since the lsof output shows the fds of the opened segments (might only be the reporter's program indexing when lsof is launched, thus there is already no more leak).

Unable to search the Tamil (Unicode)

Original report by Anonymous.

When I type the search text (TAMIL) "அம்மா" and do a search the query
search was happening only for a portion of the text "அம" and I am
leading in to a wrong search.

What I could see from the library was from the parser class the data
send where correct. When it call the Term class in the query.py, the
data that got assigned in the self.text has only "அம" instead of
"அம்மா". But in the calling place I can see the full text. Only when
the data is assigned at self.text I am getting this issue.

But when I do a search in English, I am getting the correct result.

How do I rectify this issue, any updated would be appreciated.

Thanks,
Veera

Deleted documents still appear in wildcard searches.

Original report by Stavros Korokithakis (Bitbucket: Stavros, GitHub: Stavros).

Deleted documents are still returned when the query is just "*". Is this by design? I couldn't find anything in the docs. How can I get rid of these results?

File descriptors being never closed

Original report by Nicolas Vandamme (Bitbucket: nvandamme, GitHub: nvandamme).

Hi,

Running in a Pylons environment, I open the indexer on every search requests on my controller. However, on each request, the indexer never close its file descriptors, whenever i call ix.close() or not, leading to an IOError: [Errno 24] Too many open files. Is there a way to close manually all index files ?

data file should cross-platform

Original report by Anonymous.

I create some data file on OS X 64bit, and use them on WinXP 32bit, then get following error.

ERROR:root:Uncaught exception GET /search?query=tcp&index_name=seed (127.0.0.1)
HTTPRequest(protocol='http', host='127.0.0.1:10000', method='GET', uri='/search?
query=tcp&index_name=seed', version='HTTP/1.1', remote_ip='127.0.0.1', remote_ip
='127.0.0.1', body='', headers={'Accept-Language': 'zh-CN,zh;q=0.8', 'Accept-Enc
oding': 'gzip,deflate,sdch', 'Host': '127.0.0.1:10000', 'Accept': 'application/x
ml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3
 (KHTML, like Gecko) Chrome/6.0.472.59 Safari/534.3 ChromePlus/1.4.2.0alpha1', '
Accept-Charset': 'GBK,utf-8;q=0.7,*;q=0.3', 'Connection': 'keep-alive', 'Referer
': 'http://127.0.0.1:10000/'})
Traceback (most recent call last):
  File "C:\Python26\lib\site-packages\tornado-1.1-py2.6.egg\tornado\web.py", lin
e 810, in _stack_context
    yield
  File "C:\Python26\lib\site-packages\tornado-1.1-py2.6.egg\tornado\stack_contex
t.py", line 77, in StackContext
    yield
  File "C:\Python26\lib\site-packages\tornado-1.1-py2.6.egg\tornado\web.py", lin
e 827, in _execute
    getattr(self, self.request.method.lower())(*args, **kwargs)
  File "E:\lee\luna\aio\core\search_manager.py", line 57, in get
    page_length = page_length)
  File "E:\lee\luna\aio\core\indexer_searcher.py", line 55, in esearch_seed
    page_length = page_length)
  File "E:\lee\luna\aio\core\_search_whoosh_backend.py", line 80, in esearch
    ix = _index.open_dir(datapath, indexname)
  File "C:\Python26\lib\site-packages\whoosh-1.0.0-py2.6.egg\whoosh\index.py", l
ine 97, in open_dir
    return storage.open_index(indexname)
  File "C:\Python26\lib\site-packages\whoosh-1.0.0-py2.6.egg\whoosh\filedb\files
tore.py", line 49, in open_index
    return FileIndex(self, schema=schema, indexname=indexname)
  File "C:\Python26\lib\site-packages\whoosh-1.0.0-py2.6.egg\whoosh\filedb\filei
ndex.py", line 218, in __init__
    _read_toc(self.storage, self._schema, self.indexname)
  File "C:\Python26\lib\site-packages\whoosh-1.0.0-py2.6.egg\whoosh\filedb\filei
ndex.py", line 126, in _read_toc
    check_size("long", _LONG_SIZE)
  File "C:\Python26\lib\site-packages\whoosh-1.0.0-py2.6.egg\whoosh\filedb\filei
ndex.py", line 123, in check_size
    raise IndexError("Index was created on different architecture: saved %s = %s
, this computer = %s" % (name, sz, target))
IndexError: Index was created on different architecture: saved long = 4, this co
mputer = 8
ERROR:root:500 GET /search?query=tcp&index_name=seed (127.0.0.1) 62.00ms

whoosh b13 query.py import problem

Original report by Marcin Kuzminski (Bitbucket: marcinkuzminski, GitHub: marcinkuzminski).

On whoosh/query.py on b13 version

#!python
from whoosh.matching import (AndMaybeMatcher, DisjunctionMaxMatcher,
                             ListMatcher, IntersectionMatcher, InverseMatcher,
                             NullMatcher, PhraseMatcher, RequireMatcher,
                             UnionMatcher, WrappingMatcher)

PhraseMatcher is imported, and commented later on in matching.py
It raises ImportError

Make query.Phrase use position boosts

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

query.Phrase should use "position boost" instead of "position" and use the boosts to calculate the score of the phrase.

Right now the scorer's _poses() method returns a list of positions using scorer.value_as("positions"), so this would have to be changed to use a list of (position, score) tuples.

AsyncWriter fails when writer is locked

Original report by Olexiy Strashko (Bitbucket: olexiy_strashko, GitHub: Unknown).

Hi, I will describe problem and fix here:

Environment: haystack 1.1.0a and whoosh 0.3.18.

Problem: when writer is locked - AsyncWriter raise exception:

#!python
  File "/home/bogushtime/prod/third_party_apps/haystack/indexes.py", line 152, in update_object
    self.backend.update(self, [instance])

  File "/home/bogushtime/prod/third_party_apps/haystack/backends/whoosh_backend.py", line 161, in update
    writer.update_document(**doc)

  File "/home/bogushtime/prod/third_party_apps/whoosh/writing.py", line 215, in update_document
    self._record("update_document", *args, **kwargs)

  File "/home/bogushtime/prod/third_party_apps/whoosh/writing.py", line 194, in _record
    self.events.add(method, args, kwargs)

AttributeError: 'list' object has no attribute 'add'

I've investigated the code, and I found the problem:

Code from AsyncWriter:

self.events is initialized as list: self.events = []

when writer is locked, it is initialized as None

in the _record method there is a code:

#!python
self.events.add(method, args, kwargs)

That's the problem. List has no add method and it should be **tuple **added.

**
So, the fix should be this change in the _record method: **

#!python
self.events.append((method, args, kwargs))

When no locking appear - everything works great.
Thanks for whoosh!

With respect, Olexiy.

Cannot import FileStorage

Original report by Anonymous.

Tried both easy_install and trunk install. Same result:

In [2]: from whoosh.store import FileStorage

ImportError Traceback (most recent call last)

/home/martin/Downloads/ in ()

ImportError: cannot import name FileStorage

Thanks!

exception in positings.py

Original report by jdubery (Bitbucket: jdubery, GitHub: Unknown).

Hi,
just had the following exception generated in whoosh code (whoosh 0.3.18) ...

File "c:\python25\lib\site-packages\whoosh\searching.py", line 266, in search
    scored_list = sorter.order(self, query.docs(self), reverse=reverse)
  File "C:\Python25\lib\site-packages\whoosh\query.py", line 184, in docs
    return self.scorer(searcher).all_ids()
  File "C:\Python25\lib\site-packages\whoosh\query.py", line 665, in scorer
    return InverseScorer(scorer, reader.doc_count_all(), reader.is_deleted)
  File "C:\Python25\lib\site-packages\whoosh\postings.py", line 871, in __init__
    self._find_next()
  File "C:\Python25\lib\site-packages\whoosh\postings.py", line 879, in _find_next
    while self.id == self.scorer.id and not self.is_deleted(self.id):
TypeError: is_deleted() takes exactly 1 argument (2 given)

... looks like a whoosh typo
John

Spell-checking is now considerably dumber

Original report by eevee (Bitbucket: eevee, GitHub: eevee).

SpellChecker.suggest() runs a query, then sorts the results and returns the top suggestions.

But now searcher.search() defaults limit to 10, and suggest()'s query doesn't sort in any particularly useful order. So it's getting back 10 more-or-less arbitrary results, then sorting //those// in spell-check order. For words with a lot of common n-grams and big dictionaries, this gives total garbage results, and suggestions_and_scores() is seriously gimped.

Adding limit=5000 to the s.search(q) call would restore 0.3's behavior. A better solution might be to also rewrite suggest()'s sorting as a custom weighter and do the querying and sorting at the same time.

Bug in whoosh.searching.ResultsPage

Original report by Paul Davis (Bitbucket: davisp, GitHub: davisp).

The math for calculating the total number of pages is off by one if the number of results is exactly divisible by the page length.

http://bitbucket.org/mchaput/whoosh/src/tip/src/whoosh/searching.py#cl-714

Should be something like:

#!python

self.pagecount = self.total / pagelen
if self.total % pagelen == 0:
    self.pagecount += 1

Bug in SimpleParser

Original report by ollyc (Bitbucket: ollyc, GitHub: ollyc).

I'm using the SimpleParser in Whoosh 0.3.18 and I get the following exception. It seems to occur whenever there is a stop word in a query.

#!python


>>> from whoosh.fields import Schema, TEXT
>>> from whoosh.qparser import SimpleParser
>>> schema = Schema(content=TEXT())
>>> parser = SimpleParser("content", schema=schema)
>>> parser.parse(u"sound the trumpets")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/ve/lib/python2.6/site-packages/whoosh/qparser/simple.py", line 118, in parse
    opts = [make_clause(text) for text in opts]
  File "/tmp/ve/lib/python2.6/site-packages/whoosh/qparser/simple.py", line 107, in make_clause
    return self.make_basic_clause(self.fieldname, text, boost=boost)
  File "/tmp/ve/lib/python2.6/site-packages/whoosh/qparser/simple.py", line 104, in make_basic_clause
    return self.termclass(fieldname, parts[0], boost=boost)
IndexError: list index out of range

Traceback when no segments exist in index

Original report by Peter Hansen (Bitbucket: microcode, GitHub: microcode).

The following code demonstrates a failure with 1.0.0b9 (and perhaps earlier) when you create an index but don't add any documents and then try to search it. The problem seems to come from code in fileindex.SegmentSet.reader() which doesn't handle the no-segment case gracefully.

from whoosh import index
from whoosh.fields import Schema, TEXT

schema = Schema(text=TEXT())

ix = index.create_in('.', schema)
search = ix.searcher()
search.find('text', u'foo')

This gives a traceback ending with this:

  File ".....\whoosh\searching.py", line 264, in find
    qp = QueryParser(defaultfield, schema=self.ixreader.schema)
AttributeError: 'MultiReader' object has no attribute 'schema'

CompatibilityScorer init sets self.method instead of self.scoremethod

Original report by encukou (Bitbucket: encukou, GitHub: encukou).

CompatibilityScorer was introduced here: http://bitbucket.org/mchaput/whoosh/changeset/cc65bdd6d89d#chg-src/whoosh/scoring.py_newline315

In __init__(), "self.method" is set, but in score(), "self.scoremethod" is used.

Whoosh crashes on me with the obvious AttributeError: 'CompatibilityScorer' object has no attribute 'scoremethod'.

When I change the line in __init__ line to "self.scoremethod = scoremethod", Whoosh starts working again for me.

Float division error on WOLWeighting function

Original report by Marcin Kuzminski (Bitbucket: marcinkuzminski, GitHub: marcinkuzminski).

It happend to me several times that the
dfl function returned 0 when searching and caused a FloatDivision error on returning the value

The thing is i can't reproduce this bug, but it happened when i started searching during my daemon was indexing, and also when i renamed index dir, and renamed it back to original name.

I think a simple try except will fix that issue or rewrite the default on doc_field_length to always return the default.

#!python

class WOLWeighting(Weighting):
    """Abstract middleware class for weightings that can use
    "weight-over-length" (WOL) as an approximate quality rating.
    """
    
    def quality_fn(self, searcher, fieldname, text):
        dfl = searcher.doc_field_length
        def fn(m):
            return m.weight() / dfl(m.id(), fieldname, 1) #here
        return fn
    
    def block_quality_fn(self, searcher, fieldname, text):
        def fn(m):
            return m.blockinfo.maxwol
        return fn

Update:
since it happens to me more often i rewrite the doc_field_length to sth like this:

#!python

    @protected

    def doc_field_length(self, docnum, fieldname, default=0):

        if self.fieldlengths is None: return default

        fl = self.fieldlengths.get(docnum, fieldname, default=default)

        if fl == 0:return default

        return fl

There does not seem to be a way to retrieve a document's ID from the results.

Original report by Stavros Korokithakis (Bitbucket: Stavros, GitHub: Stavros).

I would like to delete a document according to search results, but there does not appear to be a way to get some sort of immutable ID so I can delete it with delete_document().

Do documents have some sort of immutable ID (perhaps a hash)? Otherwise, the simple use case of deleting documents by search results is impossible...

Write tutorial on using Weighting.final()

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

When you finally get around to writing tutorials, include one about using the Weighting.final() method to influence document scores based on a document popularity factor stored somewhere else.

#!python

class MyWeighting(scoring.BM25F):
 def final(searcher, docnum, score):
   # Let's say your model associates a document ID with the hit count
   # for each document, and the document ID is in the "id" stored field.

   # First, get the contents of the "id" field for this document
   docid = searcher.stored_fields(docnum)["id"]

   # Look up the document's hit count in my model
   maxhits = mymodel.max_hits()
   hitcount = mymodel.get_hits(docid)

   # Multiply the computed score for this document by the popularity
   return score * (hitcount / maxhits)

test_multipool fails on build chroots

Original report by Anonymous.

Hello,

I am packaging python-whoosh for Debian. Since, build chroots usually do not mount /dev/shm as tmpfs, that results in one of the tests (test_multipool in
s/test_indexing.py) to fail, I added a patch for that test to check if Queue() can be run before attempting the test. Please find the patch attached.

Add support for "reverted index"

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

This blog post describes an interesting "champion list"-like extension to the inverted index, called a "reverted index".

http://palblog.fxpal.com/?p=4550

This would be very straightforward to implement in Whoosh as a one-off, but it would be better to look for a way to generalize the creation of the inverted index, term vectors, and reverted index, which are essentially variations on the same storage mechanisms.

BOOLEAN field type initialization

Original report by Stéphane Démurget (Bitbucket: zzrough, GitHub: zzrough).

When I try to use a BOOLEAN field, the initializer fails when invoking self.format = Existence(), as existing at least needs an analyzer parameter.

You did not specify the BOOLEAN field in the documentation so maybe you think it's not ready to be used yet? Please close if that's the case.

Whoosh dies when including a space after a semicolon.

Original report by Stavros Korokithakis (Bitbucket: Stavros, GitHub: Stavros).

If I query, say, for "url: something", whoosh dies with an exception.

Query parser: don't always simplify 1-word phrases

Original report by jdubery (Bitbucket: jdubery, GitHub: Unknown).

The present query parsers simplify a 1-word phrase to a term in all cases.

I propose that this should not occur when a query is parsed with normalize=False. The effect of this is that the original query structure is retained, and the parsed query can be postprocessed in a way that differentiates between word and "word".

This can be done simply by removing the "if len(texts) == 1:" clause from qparser\default.py, and the corresponding clause from qparser\simple.py. That changes the behaviour when parsing with normalize=False. The behaviour with normalize=True will remain unchanged because the normalize operation will simplify any 1-word phrase to a term.

I am currently using a patched whoosh to provide this facility; I would like to remove the need for patching.

NotImplementedError: spans not implemented in <class 'whoosh.matching.ExcludeMatcher'>

Original report by Collin Anderson (Bitbucket: collinmanderson, GitHub: Unknown).

Same as #40, except ExcludeMatcher. Would a django traceback help?

Phrase search results in "List index out of range" error

Original report by tcrombez (Bitbucket: tcrombez, GitHub: tcrombez).

My index and search are working great, except for phrase search, which is vital to my project.
This is the output:

#!python

>>> s.search(u'"nous avons"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "whoosh_indexer.py", line 77, in search
    results = searcher.search(query, sortedby=sortedby)
  File "build/bdist.macosx-10.3-i386/egg/whoosh/searching.py", line 369, in search
  File "build/bdist.macosx-10.3-i386/egg/whoosh/searching.py", line 310, in sort_query
  File "build/bdist.macosx-10.3-i386/egg/whoosh/scoring.py", line 491, in order
  File "build/bdist.macosx-10.3-i386/egg/whoosh/spans.py", line 194, in all_ids
  File "build/bdist.macosx-10.3-i386/egg/whoosh/spans.py", line 184, in next
  File "build/bdist.macosx-10.3-i386/egg/whoosh/spans.py", line 169, in _find_next
  File "build/bdist.macosx-10.3-i386/egg/whoosh/spans.py", line 343, in _get_spans
  File "build/bdist.macosx-10.3-i386/egg/whoosh/matching.py", line 404, in spans
IndexError: list index out of range

SpellChecker ignores minscore

Original report by eevee (Bitbucket: eevee, GitHub: eevee).

SpellChecker has accepted a minscore param for as long as I can remember, but the code doesn't actually do anything with it.

Patch is fairly trivial and is attached, with a test that fails against trunk.

Normalizing should take into account subquery boosts

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Consider this query:

#!python

And([Term(), And([Term(), Term()], boost=2.0)])

Normalizing will Merge the two Ands, but that changes the meaning of the query.

Normalize should either not merge CompoundQueries with different boosts, or should apply the subquery's boost to its members before hoisting them into the parent query.

bug in filestore?

Original report by Alexander Clausen (Bitbucket: alexc, GitHub: alexc).

Using whoosh 0.3.18 together with django-haystack, deployed on mod_wsgi. I'm getting errors when searching that look suspiciously like those when Whoosh was not thread safe:

Traceback (most recent call last):

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/django/core/handlers/base.py", line 101, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/views.py", line 131, in search_view
    return view_class(*args, **kwargs)(request)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/views.py", line 45, in __call__
    return self.create_response()

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/views.py", line 117, in create_response
    (paginator, page) = self.build_page()

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/views.py", line 99, in build_page
    page = paginator.page(self.request.GET.get('page', 1))

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/django/core/paginator.py", line 37, in page
    number = self.validate_number(number)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/django/core/paginator.py", line 28, in validate_number
    if number > self.num_pages:

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/django/core/paginator.py", line 60, in _get_num_pages
    if self.count == 0 and not self.allow_empty_first_page:

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/django/core/paginator.py", line 48, in _get_count
    self._count = self.object_list.count()

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/query.py", line 377, in count
    return len(clone)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/query.py", line 53, in __len__
    self._result_count = self.query.get_count()

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/backends/__init__.py", line 408, in get_count
    self.run()

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/backends/__init__.py", line 363, in run
    results = self.backend.search(final_query, **kwargs)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/backends/__init__.py", line 52, in wrapper
    return func(obj, query_string, *args, **kwargs)

  File "/usr/local/pythonenv/flussinfo/lib/python2.5/site-packages/haystack/backends/whoosh_backend.py", line 298, in search
    narrow_searcher = self.index.searcher()

  File "build/bdist.linux-x86_64/egg/whoosh/index.py", line 329, in searcher
    return Searcher(self.reader(), **kwargs)

  File "build/bdist.linux-x86_64/egg/whoosh/filedb/fileindex.py", line 291, in reader
    return self.segments.reader(self.storage, self.schema)

  File "build/bdist.linux-x86_64/egg/whoosh/filedb/fileindex.py", line 422, in reader
    for segment in segments]

  File "build/bdist.linux-x86_64/egg/whoosh/filedb/filereading.py", line 73, in __init__
    self.termtable = open_terms(storage, segment)

  File "build/bdist.linux-x86_64/egg/whoosh/filedb/filereading.py", line 34, in open_terms
    termfile = storage.open_file(segment.term_filename)

  File "build/bdist.linux-x86_64/egg/whoosh/filedb/filestore.py", line 56, in open_file
    f = StructFile(open(self._fpath(name), "rb"), *args, **kwargs)

IOError: [Errno 2] No such file or directory: u'/usr/local/pythonenv/flussinfo/share/flussinfo/whoosh_index/_MAIN_7.tiz'

and yes, they seem to go away when switching to threads=1 in the WSGIDaemonProcess. Strangely the site worked fine for almost a month with threads enabled.

MultiMatcher.spans() IndexError

Original report by Collin Anderson (Bitbucket: collinmanderson, GitHub: Unknown).

Attached is a django traceback

Add logging to Whoosh

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Outside of tight loops, Whoosh should using logging to log useful information. By default, the log would be thrown away, but the user could redirect the log to see the information.

For example, when the posting pool writes out a run.

Mixed up paranthesis in IntersectionMatcher.spans()

Original report by Marcin Kuzminski (Bitbucket: marcinkuzminski, GitHub: marcinkuzminski).

There's a typo in spans implementation.
currently it looks like that.

#!python

    def spans(self):
        return sorted(set(self.a.spans() | set(self.b.spans())))

as You can see the first set is not closed, this is causing TypeError when using queries with two or more search terms

Regards

QueryParser should raise ParseException for unknown field name

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Apparently it currently raises a KeyError.

Temp files are not deleted when indexing is done

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Pagination is broken

Original report by Sardar Yumatov (Bitbucket: sardarnl, GitHub: sardarnl).

If no sorting is used, then search will always return results in reverse order. This that means current page will be the first one in the resultset (this is strange, highest scoring results should go first...). The ResultsPage doesn't know about reverse sorting, so it always fetches the tail of resultset, which is the first page actually. So ResultsPage is always the first page.

Steps to reproduce:

#!python
page = searcher.search_page(query, 2, pagelen = 5)

for r in page:
   print r

Try different page numbers, the result is always the first page.

Work around:

#!python
results = searcher.search(query, limit=page * pagelen)
page = results[0:pagelen]

Reduce index size

Original report by Matt Chaput (Bitbucket: mchaput, GitHub: mchaput).

Compress ids, weights, values in posting block using zlib <<changeset 715eb93e2663>>.
Encode term keys using a short code for the field name, saving the fieldname-code map with the term file.
Encode term info using bytes for tf and df and int for offset when possible. When tf == df == 1 just store the offset.
Encode stored fields as a list instead of a dict, storing the fieldname-pos map with the file.

Syntax error on 1.0.0b9 query.py in python 2.5

Original report by jdubery (Bitbucket: jdubery, GitHub: Unknown).

Using python 2.5 (the most recent on my machine) I get a syntax error on line 1218 of revision 1.0.0b9 query.py: return self.__class__(*self.subqueries, boost=self.boost). This can be cleared by changing to the following code: return self.__class__(*self.subqueries, **{"boost":self.boost}).

whoosh-community / whoosh Goto Github PK

whoosh's People

Contributors

Stargazers

Watchers

Forkers

whoosh's Issues

self.events is initialized as list: self.events = []

when writer is locked, it is initialized as None

in the _record method there is a code:

In [2]: from whoosh.store import FileStorage

ImportError: cannot import name FileStorage

The present query parsers simplify a 1-word phrase to a term in all cases.

I propose that this should not occur when a query is parsed with normalize=False. The effect of this is that the original query structure is retained, and the parsed query can be postprocessed in a way that differentiates between word and "word".

Recommend Projects

Recommend Topics

Recommend Org