
tantivy-py's Introduction


tantivy-py

Python bindings for Tantivy, the full-text search engine library written in Rust.

Installation

The bindings can be installed from PyPI using pip:

pip install tantivy

If no binary wheel is available for your operating system, the bindings will be built from source. This means that Rust needs to be installed before the build can succeed.

Documentation

Please see the documentation for more information.


tantivy-py's Issues

Document.from_dict doesn't have type info.

Test with the code below:

from tantivy import Document, Index, SchemaBuilder


class TestUnsignedField:
    def test_query_from_unsigned_field(self):
        schema = SchemaBuilder().add_unsigned_field("order", indexed=True).build()

        index = Index(schema)

        writer = index.writer()

        doc = Document.from_dict({"order": 1})

        writer.add_document(doc)
        writer.commit()
        index.reload()

        query = index.parse_query("order:1", ["order"])
        result = index.searcher().search(query, 1)
        assert len(result.hits) == 1

The extract_value function in document.rs will try to interpret the value as i64 first, while it should be interpreted as u64. We may need to pass the schema into from_dict.

I also ran into this issue when trying to support the JSON field type, where the value was interpreted as a String first.
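
For illustration, the proposed fix could look like the sketch below; the extra schema argument to from_dict is the suggestion from this issue, not the current signature.

# Hypothetical sketch: let from_dict consult the schema so that numeric fields
# resolve to their declared type (u64 here) instead of defaulting to i64.
schema = SchemaBuilder().add_unsigned_field("order", indexed=True).build()
doc = Document.from_dict({"order": 1}, schema)  # passing `schema` is the proposal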

[Doc Update] Tutorial about Building and Executing Queries with Objects

Currently we have a lot of methods implemented for issue #20, and I think we should have proper documentation for the accepted parameters and use cases.

yep agree 💯

Should we add a new section under Building and Executing Queries?

Your suggestions lead me in the direction of:

  • we rename the existing Building and Executing Queries to Building and Executing Queries with the Query Parser
  • add a new tutorial called Building and Executing Queries with the Query Objects, where we can add tutorial content using the new objects.

What do you think?

We can merge this PR first anyway, and do such docs in a different PR.

Originally posted by @cjrh in #250 (comment)

Support ordering search results by a field.

Tantivy supports ordering the search results from a TopCollector by a field.
This used to be done with the TopDocsByField collector, and a patch was written to partially make use of it: matrix-org/tantivy@85b8d0c.

The latest Tantivy release dropped TopDocsByField and uses the generic Collector trait instead.

This sadly makes it impossible to adapt the mentioned patch, since it relies on the Any type and down-casting, and down-casting trait objects is impossible:

 error: the `downcast_ref` method cannot be invoked on a trait object

There are tricks to work around this, but support for this would need to land in Tantivy itself.
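
For what it's worth, the Python-facing goal could look something like the sketch below; the order_by_field keyword is only an illustration of the requested feature, not an existing tantivy-py parameter.

# Hypothetical sketch: retrieve the top 10 hits ordered by a fast field.
searcher = index.searcher()
results = searcher.search(query, 10, order_by_field="timestamp")
for score, doc_address in results.hits:
    print(searcher.doc(doc_address))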

Question: how can you escape double quotes in search queries?

I have a query containing double quotes, e.g. a measurement in inches: '36"'. Right now I'm using tantivy.Index.parse_query(query) to parse the query. I have two questions:

  1. What is the default behavior of the analyzer in this case for double quotes? (can probably answer myself with a bit more digging)
  2. I haven't been able to find a way to parse this without hitting a SyntaxError; is this a bug?
---> 21     query = self._index.parse_query(q.query)  # Search all fields
     22     search_results = self._tantivy_results_to_docs(
     23         self._searcher.search(query, q.limit).hits
     24     )
     25     return [
     26         SearchResultFromSystem(
     27             result=SearchResult(query=q, result=self._tantivy_doc_to_dict(doc)),
   (...)
     32         for idx, (score, doc) in enumerate(search_results)
     33     ]

ValueError: Syntax Error: fawkes 36" blue vanity

The approaches I've tried, all of which result in the same error above:

  • Escape the double quote: 36\"
  • Double or triple escape the quote: 36\\\" or 36\\"
  • Encode: '36"'.encode('utf-8')
  • Unicode from here: "36\\u{FF02}"
  • Raw Python string: r'36"'
  • Raw Python string with the escapes from above: r'36\"'...

What should I do to get Tantivy to interpret this not as a field search, but as a literal double quote?

Release

Milestone to track issues that should be resolved before our initial release.

Failing to open test_index dir when using tantivy/master IndexReader

@karlicoss told me he tried to install tantivy-py on his machine last night and got this error, so I decided to investigate tonight and found a repro.

petr_tik@merluza:~/Coding/tantivy-py$ RUST_BACKTRACE=1 python3 -m pytest --capture=no
================================================= test session starts =================================================
platform linux -- Python 3.6.8, pytest-5.0.1, py-1.7.0, pluggy-0.12.0
rootdir: /home/petr_tik/Coding/tantivy-py
collected 15 items

tests/tantivy_test.py .......thread '<unnamed>' panicked at 'attempt to subtract with overflow', /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/directory/footer.rs:73:55
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:211
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::panicking::panic
             at src/libcore/panicking.rs:49
   9: tantivy::directory::footer::Footer::from_bytes
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/directory/footer.rs:73
  10: tantivy::directory::footer::Footer::extract_footer
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/directory/footer.rs:85
  11: <tantivy::directory::managed_directory::ManagedDirectory as tantivy::directory::directory::Directory>::open_read
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/directory/managed_directory.rs:250
  12: tantivy::core::segment::Segment::open_read
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/core/segment.rs:80
  13: tantivy::core::segment_reader::SegmentReader::open
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/core/segment_reader.rs:149
  14: core::ops::function::FnMut::call_mut
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/ops/function.rs:148
  15: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/ops/function.rs:279
  16: core::option::Option<T>::map
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/option.rs:416
  17: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/iter/adapters/mod.rs:575
  18: <<core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::Adapter<Iter,E> as core::iter::traits::iterator::Iterator>::next
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/result.rs:1250
  19: <&mut I as core::iter::traits::iterator::Iterator>::next
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/iter/traits/iterator.rs:2607
  20: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/liballoc/vec.rs:1819
  21: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/liballoc/vec.rs:1731
  22: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/result.rs:1267
  23: core::iter::traits::iterator::Iterator::collect
             at /rustc/50a0defd5a93523067ef239936cc2e0755220904/src/libcore/iter/traits/iterator.rs:1465
  24: tantivy::reader::InnerIndexReader::reload
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/reader/mod.rs:128
  25: tantivy::reader::IndexReaderBuilder::try_into
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/reader/mod.rs:71
  26: tantivy::core::index::Index::reader
             at /home/petr_tik/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/4c39417/src/core/index.rs:201
  27: tantivy::index::Index::new
             at src/index.rs:185
  28: tantivy::index::__init15304180475000063864::__init15304180475000063864::__wrap::{{closure}}
             at src/index.rs:155
  29: tantivy::index::__init15304180475000063864::__init15304180475000063864::__wrap
             at src/index.rs:155
  30: <unknown>
  31: _PyObject_FastCallKeywords
  32: <unknown>
  33: _PyEval_EvalFrameDefault
  34: _PyFunction_FastCallDict
  35: <unknown>
  36: PyObject_Call
  37: _PyEval_EvalFrameDefault
  38: <unknown>
  39: <unknown>
  40: PyObject_Call
  41: _PyEval_EvalFrameDefault
  42: <unknown>
  43: <unknown>
  44: <unknown>
  45: _PyEval_EvalFrameDefault
  46: <unknown>
  47: <unknown>
  48: <unknown>
  49: _PyEval_EvalFrameDefault
  50: <unknown>
  51: <unknown>
  52: _PyEval_EvalFrameDefault
  53: <unknown>
  54: _PyFunction_FastCallDict
  55: <unknown>
  56: <unknown>
  57: _PyObject_FastCallKeywords
  58: <unknown>
  59: _PyEval_EvalFrameDefault
  60: <unknown>
  61: <unknown>
  62: _PyEval_EvalFrameDefault
  63: <unknown>
  64: <unknown>
  65: PyObject_Call
  66: _PyEval_EvalFrameDefault
  67: <unknown>
  68: <unknown>
  69: <unknown>
  70: _PyEval_EvalFrameDefault
  71: <unknown>
  72: <unknown>
  73: <unknown>
  74: _PyEval_EvalFrameDefault
  75: <unknown>
  76: <unknown>
  77: _PyEval_EvalFrameDefault
  78: <unknown>
  79: _PyFunction_FastCallDict
  80: <unknown>
  81: <unknown>
  82: PyObject_Call
  83: _PyEval_EvalFrameDefault
  84: <unknown>
  85: <unknown>
  86: <unknown>
  87: _PyEval_EvalFrameDefault
  88: <unknown>
  89: <unknown>
  90: <unknown>
  91: _PyEval_EvalFrameDefault
  92: <unknown>
  93: <unknown>
  94: PyObject_Call
  95: _PyEval_EvalFrameDefault
  96: <unknown>
  97: <unknown>
  98: <unknown>
  99: _PyEval_EvalFrameDefault
fatal runtime error: failed to initiate panic, error 5
Fatal Python error: Aborted

Current thread 0x00007fea645c6740 (most recent call first):
  File "/home/petr_tik/Coding/tantivy-py/tests/tantivy_test.py", line 148 in test_opens_from_dir

line 148 points to:

        index = Index(schema(), PATH_TO_INDEX, reuse=True)

which relies on IndexReader opening the directory. Our test_index was built with tantivy 0.10, before the introduction of the directory footer. Since then, tantivy-py has switched to tracking the master branch of tantivy.

This shows the value of having integration tests with a serialised index.

I suggested keeping the version numbers in sync with core tantivy for ease of administration, but that must have been dropped since.

Field Boost

I can see a contributor has already done work on exposing the field boost features. Is there a plan to add that to a release?

Add API reference to documentation site

Now that we have a .pyi file (#167), we can generate API docs with pdoc.

For example, the following commands generate static HTML files in the apidocs dir if you have the tantivy module on your Python path.

pip install pdoc
pdoc tantivy -o ./apidocs

The generated HTML looks very neat because PyO3 exposes the comments in the .rs files as docstrings of the tantivy module, and pdoc collects all of them together with the stubs and type hints!

[screenshot: generated API docs for the tantivy module]

Perhaps we can publish the API docs to https://tantivy-py.readthedocs.io/en/latest/.
I'm not familiar with the CI pipeline for publishing the docs site. Are there any suggestions about that?

Date Range query produces "ValueError: Syntax Error".

Hi,
While trying to filter using date ranges, I get a Syntax Error. I have gone through all the QueryParser docs in tantivy to see if I had a formatting issue. The following code demonstrates the problem.
Simply copy and paste the Python code to reproduce:

from datetime import datetime
import os
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_date_field("date_published", stored=True, indexed=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Creating our index (in current working directory)
index = tantivy.Index(schema, path=os.getcwd() + '/index')

# Adding all the data.
writer = index.writer()
date = datetime(2022, 8, 2)
writer.add_document(tantivy.Document(
    date_published = date,
    title="The Old Man and the Sea",
    body="He was an old man who fished alone in a skiff in \
    the Gulf Stream and he had gone eighty-four days \
    now without taking a fish."
))
writer.commit()

index.reload()
searcher = index.searcher()

query = index.parse_query('date_published:[2002-10-02T15:00:00Z TO 2023-10-02T18:00:00Z]', ['date_published'])
print(query)
result = searcher.search(query, count=True, limit=5)

This produces the following error:
[screenshot: ValueError: Syntax Error traceback]

To confirm that I didn't have a formatting issue in the query string, I recreated the code in Rust, and it worked fine.

use tantivy::schema::*;
use tantivy::collector::TopDocs;
use tantivy::doc;
use tantivy::Index;
use tantivy::query::QueryParser;
use tantivy::Score;
use tantivy::{DocAddress, DateTime};
use std::env::current_dir;

fn main() {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let num_options: NumericOptions = NumericOptions::default();
    let date = schema_builder.add_date_field("date_created", num_options | STORED | INDEXED);
    let schema = schema_builder.build();
    let mut index_path = current_dir().unwrap();
    index_path.push("index");
    let index = Index::create_in_dir(&index_path, schema.clone()).unwrap();

    let mut index_writer = index.writer(100_000_000).unwrap();

    let mut doc = doc!(
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone in a skiff in \
                the Gulf Stream and he had gone eighty-four days \
                now without taking a fish.",
    );

    doc.add_date(date, DateTime::from_unix_timestamp(1659423709));
    index_writer.add_document(doc).unwrap();

    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![title, body]);

    let query = query_parser.parse_query("date_created:[2002-10-02T15:00:00Z TO 2023-10-02T18:00:00Z]").unwrap();
    let top_docs: Vec<(Score, DocAddress)> =
    searcher.search(&query, &TopDocs::with_limit(10)).unwrap();

    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address).unwrap();
        println!("{}", schema.to_json(&retrieved_doc));
    }
}

This produces the following output:
[screenshot: the retrieved document printed as JSON]

I'm not sure if this is a bug or an error on my side.
Would really appreciate some help here.

replace setuptools-rust with pyo3-pack

According to the pyo3-pack README:

This project is meant as a zero configuration replacement for setuptools-rust and milksnake. It supports building wheels for python 3.5+ on windows, linux and mac, can upload them to pypi and has basic pypy support.

Replacing setuptools-rust should make testing, deploying, and installing easier.

[feature request] Adding boolean query method

The existing boolean query feature can already be used via index.parse_query, as long as we type the correct operators, such as + and - for must and must_not respectively.

However, there could be cases where users would like to build their inner queries dynamically, or, for the sake of readability, would like a container for their other query types such as FuzzyTermQuery and PhraseQuery.

Currently the Rust tantivy crate allows creating a boolean query via the struct tantivy::query::BooleanQuery. Will tantivy-py also get a boolean_query staticmethod on the Query class?
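
A hypothetical sketch of what that could look like from Python, reusing the schema and index from the README example; Occur, Query.boolean_query, and the constructors shown here illustrate the request and are not a confirmed tantivy-py API.

# Sketch: combine sub-queries dynamically instead of encoding +/- in a query string.
subqueries = [
    (Occur.Must, Query.fuzzy_term_query(schema, "title", "sae")),
    (Occur.MustNot, Query.phrase_query(schema, "body", ["old", "man"])),
]
query = Query.boolean_query(subqueries)
result = index.searcher().search(query, 10)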

API simplification

The current API sticks very closely to the tantivy API.

Some of the choices in the current API are strongly Rust-specific and do not necessarily make sense in Python. Also, some of the current choices are arguably not very smart even in the context of Rust.

I suspect we could simplify our API a bit.
I explored different changes in #3, but I would like us to discuss them somewhere.

Fields

Getting Field objects, either at the creation of the Schema or by querying the Schema, can be cumbersome. I suggest we let the user identify fields by string all the way through.

Note that this bit may change in Rust's tantivy as well.

Following the README:

builder = tantivy.SchemaBuilder()
title = builder.add_text_field("title", stored=True)
body = builder.add_text_field("body")
schema = builder.build()

We could stop saving title and body, and even introduce a fluent interface. (Note: I do not know how to return self with PyO3. It might be impossible.)

schema = (
    tantivy.SchemaBuilder()
    .add_text_field("title", stored=True)
    .add_text_field("body")
    .build()
)

When referring to fields, users could then always pass in strings directly instead of field objects.

Documents

Conceptually, documents are just maps from a field to a list of values.
We could remove the notion of document entirely and use dictionaries instead.

It might make onboarding much more straightforward, and leave us with an arguably more appealing README.

That being said, as stated above, tantivy allows more than one value per field.
Within tantivy's JSON serializer and deserializer, the choice was made to accept the single-value format as an input, so that both

{ "title": "The old man and the sea"}

and

{ "title": ["The old man and the sea"]}

are valid documents.

On deserialization, we always deserialize as {"title": ["The old man and the sea"]}.
This could lead to confusion among users, so there could be value in keeping a structured Document object.

We could define __getitem__ and __len__ to make the user's life easier while keeping them aware that a field can be multi-valued.
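
As an illustration of the proposed behaviour (a sketch only; the exact semantics of the dunder methods are up for discussion):

# A Document would keep the list-per-field representation while staying convenient.
doc = tantivy.Document(title=["The Old Man and the Sea"])
assert doc["title"] == ["The Old Man and the Sea"]  # __getitem__ returns the full list of values
assert len(doc) == 1                                # __len__ counts the populated fields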

Reader

We could have the reader be part of the Index.
The searcher would then be acquired directly from the index.

In a similar spirit, query parsing could be a method directly on Index.
We could also have a helper to perform a vanilla top-K search, with or without a count of documents, directly from the index.

The README sample could become...

searcher = index.searcher()

query = index.parse_query("sea whale", default_fields=["title", "body"])

search_results = searcher.search(query, nhits=10, count=True)
print(search_results.count)
(_score, doc_address) = search_results.hits()[0]
searched_doc = searcher.doc(doc_address)
assert searched_doc["title"] == ["The Old Man and the Sea"]

Outdated PyPI package

Hi,

The package on PyPI (0.13.2) seems to be outdated (created in 2020).
I was wondering if you are planning on creating a new release that works with tantivy 0.17.

Thanks

Different versions for nox and maturin build

I have found that the Python interpreter versions used for the maturin build and for nox differ:

# Makefile

build:
	maturin build --interpreter python3.7 python3.8 python3.9 python3.10 python3.11
#noxfile.py

@nox.session(python=["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"])
def test(session):
    session.install("-rrequirements-dev.txt")
    session.install("-e", ".", "--no-build-isolation")
    session.run("pytest", *session.posargs)

Will this lead to unexpected behavior?

tantivy-py throws syntax error if query has `:`

Steps to replicate:

import tantivy

# Declaring our schema.
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_text_field("body", stored=True)
schema_builder.add_integer_field("doc_id", stored=True)
schema = schema_builder.build()

# Creating our index (in memory)
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(
    doc_id=1,
    title=["The Old Man and the Sea"],
    body=["""He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days now without taking a fish."""],
))
# ... and committing
writer.commit()

# Reload the index to ensure it points to the last commit.
index.reload()
searcher = index.searcher()

query = index.parse_query("fish days:", ["title", "body"])
(best_score, best_doc_address) = searcher.search(query, 3).hits[0]
best_doc = searcher.doc(best_doc_address)
assert best_doc["title"] == ["The Old Man and the Sea"]
print(best_doc)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-3495552511963062>:25
     22 index.reload()
     23 searcher = index.searcher()
---> 25 query = index.parse_query("fish days:", ["title", "body"])
     26 (best_score, best_doc_address) = searcher.search(query, 3).hits[0]
     27 best_doc = searcher.doc(best_doc_address)

Support lenient parser in bindings

Now that quickwit-oss/tantivy#2129 has landed, it'd be nice to start thinking about lenient parsing support in the bindings. Obviously, a new version of the crate has to be released first, but it's worth starting to think about this.

I've started working on this, but there are a few points I'd like to iron out, such as how we'd expose the QueryParserError enum. It'd be nice to have types for the errors, but QueryParserError is currently a Rust enum.

A (strawman) idea I had was to have separate classes for each error type. Then, optionally, organize them in a submodule underneath tantivy, i.e. tantivy.query_parser_error or something. The interface could then look something like this from the Python side:

from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ParseQueryLenientResult:
    query: Query
    errors: List[Union[query_parser_error.SyntaxError, query_parser_error.UnsupportedQuery, ...]]

def parse_query_lenient(query: str, default_field_names: Optional[List[str]]) -> ParseQueryLenientResult:
    ...

cc @adamreichold @cjrh if y'all have thoughts

switch to GitHub actions for ci

Travis CI has been dead to OSS since June 15th, 2021, which means CI hasn't been working here for half a year.

Use GitHub Actions instead.

Addressed in #38.

Get a list of indexed documents.

Hi! Is there a plan to add a feature to list all documents that are indexed?

It would be great to have this in order to implement incremental indexing of new documents:
we could check whether a document is already present in the index before passing its contents for indexing.

Support IndexWriter.delete_term()

Currently, there is no way to delete a document from the index except by calling delete_all_documents() and re-creating the entire index from scratch with the deleted document omitted.
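
A hypothetical sketch of the requested API, mirroring tantivy's IndexWriter::delete_term; the method name and signature are assumptions, not an existing tantivy-py call.

# Delete every document whose "doc_id" field contains the term 42, then make the change visible.
writer.delete_term("doc_id", 42)
writer.commit()
index.reload()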

Some documentation issues

I've encountered several documentation issues.

In tantivy.py, the comment documentation for the Snippet class is incorrect.

Additionally, after implementing a piece of code for adding to the index based on the example, I hit "ValueError: An error occurred in a thread: 'An index writer was killed. A worker thread encountered an error (io::Error most likely) or panicked.'" after calling writer.commit() tens of thousands of times. However, the issue was resolved after following the example in the tests directory, where writer.wait_merging_threads() is called after multiple commits. I am not sure if this is an issue on my end, but if it's not, the documentation should be updated to clarify this process. Thank you for your work!
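
For reference, the pattern from the tests directory that resolved it looks roughly like the sketch below (the batching loop is illustrative; the point is the final wait_merging_threads() call):

# After a long run of commits, join the background merge threads once at the end
# instead of letting them pile up behind tens of thousands of commits.
writer = index.writer()
for batch in batches:  # `batches` is a placeholder for your own data
    for doc in batch:
        writer.add_document(doc)
    writer.commit()
writer.wait_merging_threads()  # joins the merge threads; the writer is consumed afterwards
index.reload()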

Support pickling of certain types

I've found the need for this feature at times for things like Document.

Maybe using something like bincode and doing:

pub fn __setstate__(&mut self, state: &PyBytes) -> PyResult<()> {
    *self = bincode::deserialize(state.as_bytes()).map_err(...)?;
    Ok(())
}

pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
    Ok(PyBytes::new(py, &bincode::serialize(&self).map_err(...)?))
}
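
On the Python side, pickling support would enable plain round-trips like this (a sketch of the intended usage, assuming the above lands):

import pickle
import tantivy

doc = tantivy.Document(title=["The Old Man and the Sea"])
blob = pickle.dumps(doc)       # would go through __getstate__
restored = pickle.loads(blob)  # would go through __setstate__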

Thoughts?

Panic when there is CJK character in the document

When trying the demo code with CJK characters,

writer.add_document(tantivy.Document(
    title=["老人与海"],
    body=[""" ..."""],
))

the thread panics with:

thread '' panicked at 'assertion failed: self.is_char_boundary(new_len)', /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9\src\libcore\macros\mod.rs:10:9

Build fails on Windows

When installing the pip package on Windows, the following error is raised:

C:\Users\username>pip3 --no-cache install tantivy
Defaulting to user installation because normal site-packages is not writeable
Collecting tantivy
  Downloading tantivy-0.13.2.tar.gz (25 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\program files\python39\python.exe' 'c:\program files\python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\username\AppData\Local\Temp\tmp010ay0hm'
         cwd: C:\Users\username\AppData\Local\Temp\pip-install-bp6ntpre\tantivy_d322f29400f14ea9996d158ee0138aa3
    Complete output (6 lines):
    💥 maturin failed
      Caused by: Failed to parse Cargo.toml at Cargo.toml
      Caused by: invalid type: sequence, expected a map for key `package.metadata.maturin.project-url` at line 23 column 1
    Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', 'C:\\Users\\username\\AppData\\Local\\Temp\\pip-modern-metadata-zoq_clff', '--interpreter', 'c:\\program files\\python39\\python.exe']' returned non-zero exit status 1.
    Checking for Rust toolchain....
    Running `maturin pep517 write-dist-info --metadata-directory C:\Users\username\AppData\Local\Temp\pip-modern-metadata-zoq_clff --interpreter c:\program files\python39\python.exe`
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/db/9d/84e1888f90680da4e0279dc1df4113287e60d2dd58901bc29fcbe623eed2/tantivy-0.13.2.tar.gz#sha256=5cd63f5664e2df431a43a4c439d6af7d3fd7c8a760dd7303486333d94f6a20c8 (from https://pypi.org/simple/tantivy/) (requires-python:>=3.7). Command errored out with exit status 1: 'c:\program files\python39\python.exe' 'c:\program files\python39\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\username\AppData\Local\Temp\tmp010ay0hm' Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement tantivy (from versions: 0.11.0rc7, 0.11.0rc8, 0.12.0rc1, 0.12.0rc2, 0.13.1rc1, 0.13.2)
ERROR: No matching distribution found for tantivy

The core of the issue seems to be this line:

Caused by: invalid type: sequence, expected a map for key `package.metadata.maturin.project-url` at line 23 column 1

Maturin, which is throwing the error, has had many new releases since the last release of this package, so my hunch is that the format it expects has changed, and that the .toml file needs updating to be compatible with newer versions. But I don't know enough about Rust or Cargo to really know where to go from there...

Accessing QueryParser

For fuzzy search or boosting fields, I need to access QueryParser.

Is this possible with tantivy-py? This article seems to think so, but it doesn't work (ImportError; I also checked and see that QueryParser isn't available at the top level anyway).

If not, how can I do fuzzy searching?

Term Query is not tokenized (?)

I'm testing tantivy-py, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: it seems that term queries are not tokenized when using the searcher.search(query, ..) method, so I can't really use the en_stem tokenizer (since it's not exposed for me to tokenize the query, only the indexing of documents).

I'm testing tantivy-py with the Simple Wikipedia Example Set from Cohere, and here's what I see with a few sample queries:

  • Australia monarchy --> no good hits unless I change it to Australia monarch
  • Titanic sink --> no good hits unless I change it to Titan sink

Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the search method, but tokenizers are not exposed in the Python bindings.

Any suggestions?

How to use the slop and prefix operators in tantivy-py

Hi all,

Thanks first of all for the amazing repo!

I have been searching for additional valid query formats in the documentation (https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) and I was wondering how I can use the slop operator (e.g., "big wolf"~1 will return documents containing the phrase "big bad wolf") and the prefix operator (e.g., "big bad wo"* will match "big bad wolf") in the tantivy-py implementation.

Are these operators already implemented? If yes, could you provide an example for both cases?
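
To make the question concrete, this is roughly what I would expect to be able to write; whether the parser accepts these forms depends on the underlying tantivy version.

# Slop: match "big" and "wolf" within one position of each other.
query = index.parse_query('"big wolf"~1', ["body"])
# Phrase prefix: match phrases starting with "big bad wo".
query = index.parse_query('"big bad wo"*', ["body"])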

Thanks so much!

Index on Filesystem

The only doc available is the README.md, if I'm not wrong. Quoting from there:

# Creating our index (in memory, but filesystem is available too)
index = tantivy.Index(schema)

But I don't see any docs for storing the index on the filesystem. Any help? 🙏
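
For what it's worth, the Index constructor does accept a path, as used in other issues in this tracker; a minimal sketch, assuming a schema built as in the README:

import os
import tantivy

# Persist the index under ./index instead of keeping it in memory.
index = tantivy.Index(schema, path=os.path.join(os.getcwd(), "index"))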

Adding boost/weight to fields

Hi,

Is there a way to add a custom weight to a specific field in tantivy-py? For example, if our schema has both title and description fields and we want items that match on title to be prioritized in the list, we could add a weight to title.

In whoosh, we can achieve this via field_boost:

title = whoosh.fields.TEXT(stored=True, field_boost=5.0)
description = whoosh.fields.TEXT(stored=True)

Tantivy natively supports boosting:
https://github.com/quickwit-oss/tantivy/blob/db1836691ef9b4f963070bfd9ef13c6d44d2a074/src/query/query_parser/query_parser.rs#L164
https://docs.rs/tantivy/0.16.0/tantivy/query/struct.QueryParser.html#method.set_field_boost
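
If this were exposed, a Python-side API might look like the sketch below; the field_boosts keyword is purely illustrative and not an existing tantivy-py parameter.

# Hypothetical: boost title matches 5x relative to description matches.
query = index.parse_query(
    "old man",
    default_field_names=["title", "description"],
    field_boosts={"title": 5.0},
)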

compilation fails due to missing tantivy crate feature

error: enum variants on type aliases are experimental
--> /home/maciej/.cargo/git/checkouts/tantivy-cd4bdf03c5df22c3/cde9b78/src/directory/footer.rs:104:13
|
104 | Self::V0(crc) => {
| ^^^^^^^^^^^^^
|
= help: add #![feature(type_alias_enum_variants)] to the crate attributes to enable

I am not sure if I should report this to tantivy.
Following the advice works.

add_bytes

I am trying to add a binary file to the index, but I'm having a few issues:

    schema_builder = tantivy.SchemaBuilder()
    schema_builder.add_bytes_field("embedding")
    schema = schema_builder.build()

    tdoc = tantivy.Document(id=idx,
                                vector=np.frombuffer(emb, dtype=np.uint8).tolist(),
                            )

error:

 894/3555 [00:03<00:11, 235.92it/s]thread 'thrd-tantivy-index11' panicked at 'Bytes field contained non-Bytes Value!. Field Field(3) = FieldValue { field: Field(3), value: I64(162) }', /Users/travis/.cargo/registry/src/github.com-1ecc6299db9ec823/tantivy-0.13.2/src/fastfield/bytes/writer.rs:57:21

Am I missing something?

[feature request] Expose means of configuring/providing a custom Object Store extension

Hi there, I'm a user of lancedb, which leverages tantivy-py for full text search indices (see https://lancedb.github.io/lancedb/fts). A current shortcoming of the lancedb FTS support is that it's only supported for local file paths (something like S3 is not supported).

The lance maintainers have authored an Object Store extension, but as I understand it, there's no means of specifying/providing this extension within tantivy-py. Would love it if this could be supported!

Slow writer performance with the current default heap size

@adamreichold Circling back to this discussion.

While upgrading another application to use the current head of tantivy-py, I am finding that the default heap limit of 3000000 seems to cause very frequent commits while adding documents. It just doesn't seem large enough. I can improve performance by increasing the heap size, but I'm thinking the current default is going to cause surprisingly poor performance for a lot of people once they upgrade.

What are your thoughts on this? Is there a more typical "good" value to use as a default? I am not familiar with the tantivy work between 0.19.2 and 0.20.1 that led to this apparent change in behaviour.
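
For context, the application-side workaround is a one-liner; the heap_size keyword name is my assumption about the writer signature.

# Give the writer ~128 MB instead of the 3 MB default to avoid the very frequent flushing described above.
writer = index.writer(heap_size=128_000_000)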

Consider building abi3 wheels

To reduce build effort and improve portability, PyO3 supports building in "abi3" mode where Python's stable ABI/API is used which makes the resulting binary wheels forward-compatible beginning at a specified Python version, e.g. wheels using the abi3-py38 feature are compatible with Python 3.8 and later.

This can cost some performance if there is a lot of back and forth between Rust and Python code, but whether that cost is significant for this extension is hard to say without measuring it.

Can't Install

I failed to install while pip installing into a venv on:

os: MacOS 12.3
rustc: 1.59.0 (9d1b2106e 2022-02-23)
python: 3.9.10
pip: 21.3.1
$ pip install tantivy
Collecting tantivy
  Downloading tantivy-0.13.2.tar.gz (25 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  ERROR: Command errored out with exit status 1:
   command: ///app/py/venv/bin/python3.9 ///app/py/venv/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/kp/wqrkz9nj2b72v_j3nn_l1zvr0000gn/T/tmpil48bddf
       cwd: /private/var/folders/kp/wqrkz9nj2b72v_j3nn_l1zvr0000gn/T/pip-install-5kyggg9_/tantivy_6f911278379c4eebbc4dd392080549dd
  Complete output (6 lines):
  💥 maturin failed
    Caused by: Failed to parse Cargo.toml at Cargo.toml
    Caused by: invalid type: sequence, expected a map for key `package.metadata.maturin.project-url`
  Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', '/private/var/folders/kp/wqrkz9nj2b72v_j3nn_l1zvr0000gn/T/pip-modern-metadata-kep_vd8f', '--interpreter', '///app/py/venv/bin/python3.9']' returned non-zero exit status 1.
  Checking for Rust toolchain....
  Running `maturin pep517 write-dist-info --metadata-directory /private/var/folders/kp/wqrkz9nj2b72v_j3nn_l1zvr0000gn/T/pip-modern-metadata-kep_vd8f --interpreter ///app/py/venv/bin/python3.9`
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/db/9d/84e1888f90680da4e0279dc1df4113287e60d2dd58901bc29fcbe623eed2/tantivy-0.13.2.tar.gz#sha256=5cd63f5664e2df431a43a4c439d6af7d3fd7c8a760dd7303486333d94f6a20c8 (from https://pypi.org/simple/tantivy/) (requires-python:>=3.7). Command errored out with exit status 1: ///app/py/venv/bin/python3.9 ///app/py/venv/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/kp/wqrkz9nj2b72v_j3nn_l1zvr0000gn/T/tmpil48bddf Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement tantivy (from versions: 0.11.0rc7, 0.11.0rc8, 0.12.0rc1, 0.12.0rc2, 0.13.1rc1, 0.13.2)
ERROR: No matching distribution found for tantivy
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '///app/py/venv/bin/python3.9 -m pip install --upgrade pip' command.

Support Python stub files (.pyi)

Currently tantivy-py does not include a .pyi file, so developers have to investigate the source code to figure out what classes/methods are supported in this package.

The PyO3 documentation says:

Yet, for a better user experience, Python libraries should provide typing hints and documentation for all public entities, so that IDEs can show them during development and type analyzing tools such as mypy can use them to properly verify the code.

Currently the best solution for the problem is to manually maintain *.pyi files and ship them along with the package.

It would be great to have tantivy.pyi and py.typed in the Python source dir so that developers get the benefit of powerful IDE suggestions and can validate their code with type checkers such as mypy.
Does this make sense?

Bonus: We could generate API documentation using pdoc by providing .pyi with type stubs.
PyO3/pyo3#2330

Does someone want to maintain tantivy-py?

I first of all, and Quickwit in general, have been very bad stewards of the tantivy-py project. Does someone with better Python skills want to take over maintaining the project?

Add a CI pipeline

Happy to take this. Please assign me to the ticket.

  • Add a Travis script
  • Add a badge to the README to establish trust with potential users

Adding document based scoring

Hi,

Is there a way to score some documents higher than others? Something like weight in whoosh?

For example:
In an ecommerce site, we can have a popularity field for each item, and when a user searches for something, the popular documents appear on top.

This is already implemented in tantivy core repo:
https://docs.rs/tantivy/0.10.3/tantivy/collector/struct.TopDocs.html#example

I understand that bringing the callback from Rust to Python may make it slower. Therefore the Python interface should be pretty simple, e.g. searcher.search(query, weight_by='popularity').

Cannot find the merge policy.

The Rust documentation of tantivy describes a merge policy, which I cannot find in Python. The issue is that when we keep committing, there comes a time when the number of files created by tantivy becomes too large, which I want to restrict. Is there any way in Python to merge the segments when they reach a certain number? TIA.
