moonlitesolutions / solrclient

SolrClient is a simple python library for Solr; built in python3 with support for latest features of Solr.

License: Apache License 2.0

solrclient's Introduction

SolrClient

SolrClient 0.2.2

SolrClient is a simple Python library for Solr, written in Python 3, with support for the latest features of Solr 5 and 6. Development focuses on indexing and on parsing the various query responses into native Python data structures. Several helper classes are planned to automate querying and management of Solr clusters.

Enhancements in version 0.2.0:

Basic parsing for json.facet output
Better support for grouped results (SolrResponse)
Other minor enhancements to SolrClient
Fixed SolrClient.index method

Planned enhancements in version 0.3.0:

Solr node query routing (by @ddorian)
Streaming Expressions Support

Requirements

python 3.3+
requests library (http://docs.python-requests.org/en/latest/)
Solr
kazoo for working with zookeeper (optional)

Features

Flexible and simple query mechanism
Response Object to easily extract data from Solr Response
Cursor Mark support
Indexing (raw JSON, JSON Files, gzipped JSON)
Specify multiple hosts/IPs for SolrCloud for redundancy
Basic Managed Schema field management
IndexManager for storing indexing documents off-line and batch indexing them

Getting Started

Installation:

pip install SolrClient

Basic usage:

>>> from SolrClient import SolrClient
>>> solr = SolrClient('http://localhost:8983/solr')
>>> res = solr.query('SolrClient_unittest',{
            'q':'product_name:Lorem',
            'facet':True,
            'facet.field':'facet_test',
    })
>>> res.get_results_count()
4
>>> res.get_facets()
{'facet_test': {'ipsum': 0, 'sit': 0, 'dolor': 2, 'amet,': 1, 'Lorem': 1}}
>>> res.get_facet_keys_as_list('facet_test')
['ipsum', 'sit', 'dolor', 'amet,', 'Lorem']
>>> res.docs
[{'product_name_exact': 'orci. Morbi ipsum
..... all the docs ....
 'consectetur Mauris dolor Lorem adipiscing'}]

See, easy... you just need to know the Solr query syntax.

Roadmap

Better test coverage
Solr Streaming

Contributing

I realized that there isn't really a well-maintained Solr Python library I liked, so I put this together. Contributions (code, tests, documentation) are definitely welcome; if you have a question about development, please open an issue on the GitHub page. If you have a pull request, please make sure to add tests and that all of them pass before submitting. See the tests README for testing resources.

Documentation: http://solrclient.readthedocs.org/en/latest/

solrclient's Issues

start, rows in Solr.query

How can I use 'start' in a query? I've already tried:
res = solr.query('library',{ 'q':'*:*', 'start':9, 'rows':15, })
but when I then called res.get_results_count() it returned 15 instead of 6.
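
For reference, here is the arithmetic behind the expected value of 6. In Solr, start skips that many matching documents and rows caps the page size, so a page should hold min(rows, numFound - start) documents. The sketch below assumes the collection has 15 matching documents (not stated in the issue, only inferred from the expected result), and expected_page_size is a hypothetical helper, not part of SolrClient.

```python
def expected_page_size(num_found, start, rows):
    """Number of documents one page should contain.

    num_found: total matches reported by Solr (numFound)
    start:     offset of the first document to return
    rows:      maximum number of documents per page
    """
    remaining = max(0, num_found - start)
    return min(rows, remaining)


# Assumed scenario from the issue: 15 matches, start=9, rows=15
print(expected_page_size(15, 9, 15))
```

If the count method reports the total number of matches rather than the size of the returned page, comparing it against len(res.docs) would show which of the two numbers is being returned.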

Create a helper class for re-indexing collections

Test performance is very (extremely) slow

I have an i3 laptop with 8GB RAM and an SSD, and the tests were running very slowly (taking up all RAM & probably ). Any idea how we can:

make it faster
or
separate, so don't run tests for everything ?
or both ?

Support for updating documents in SolrClient

I have reviewed SolrClient and found that it does not have an option to update only part of a document in Solr. It would be great if this feature could be added.
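
As a rough illustration of what such a feature would send: Solr itself supports atomic updates, where a field value is replaced by a small object naming a modifier such as "set", "add", "remove" or "inc". The helper below is hypothetical and only builds the JSON payload. Passing the resulting string to an existing indexing method (for example the index_json method mentioned in another issue on this page) is an assumption that has not been verified against the library.

```python
import json


def atomic_update(doc_id, changes, id_field="id"):
    """Build one atomic-update document for Solr.

    changes maps a field name to a (modifier, value) pair, where modifier is
    one of Solr's atomic update operations, e.g. "set", "add", "remove", "inc".
    """
    doc = {id_field: doc_id}
    for field, (modifier, value) in changes.items():
        doc[field] = {modifier: value}
    return doc


# Hypothetical document ID and field names, for illustration only
payload = json.dumps([
    atomic_update("book-1", {"price": ("set", 9.99), "tags": ("add", "sale")}),
])
print(payload)
```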

Command to run single test

example:
python run_tests.py -py 3.5 -solr 6.3.0 -test test_client.ClientTestIndexing.test_down_solr_exception

Should this go somewhere in the docs? Maybe in SolrVagrant?

Make IndexQ add method thread safe

multiple facets

Hi,

How can I query for multiple facets? It looks like SolrClient doesn't know how to do that, but it should be possible.

Regards,
Hans
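
For context, Solr expects one facet.field parameter per faceted field, repeated in the query string. The sketch below uses only the standard library to show that encoding. The field names are placeholders, and whether SolrClient forwards a list value unchanged to the HTTP layer is an assumption that would need checking.

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "facet": "true",
    # Placeholder field names; a list value stands for a repeated parameter
    "facet.field": ["genres", "title_year"],
}

# doseq=True expands the list into facet.field=...&facet.field=...
query_string = urlencode(params, doseq=True)
print(query_string)
```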

Using IndexQ with finalize=True can "overwhelm" todo directory

Been using SolrClient in my latest project and it is by far the cleanest SOLR api for python out there, so kudos for that! Hoping it will keep being developed and supported for the foreseeable future.

One thing I see that could be improved: indexing a multi-million-document set using IndexQ with the finalize=True argument leads to the todo/ directory being overwhelmed (impossible to 'ls' normally or 'rm -rf' due to the size of the directory listing metadata). I was thinking about creating subfolders under todo/ and putting the data there instead, capping each one at a certain number of files (0.5M files, for example).

I understand that it is bad to proceed like this when you understand how it all works, but for first time users and non-experts this could actually prevent this unfortunate case.
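
A minimal sketch of the subfolder idea, assuming each queued file gets a running index. The function name, the zero-padded folder naming and the default cap are all made up for illustration and are not how IndexQ currently lays out its files.

```python
import os


def bucketed_path(todo_dir, file_index, filename, cap=500000):
    """Return a path under todo_dir so that no subfolder exceeds `cap` files.

    file_index is the zero-based position of the file in the queue.
    """
    bucket = file_index // cap
    return os.path.join(todo_dir, "{:05d}".format(bucket), filename)


# With cap=2, the third file (index 2) goes into the second subfolder
print(bucketed_path("todo", 2, "batch_2.json", cap=2))
```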

Add more tests for Solr Response

Grouped Queries
Facets

Please update the PyPI package to reflect the latest changes

Hi,

My application code depends on the latest commit (a75912f73730d3aaaf64ca5684ab035c2666eedc), where the integer conversion step is removed. However, I see that the PyPI package of SolrClient doesn't include that fix. Could you please update the PyPI package to the latest version?

Thanks,
Abhay

Need ~nice way to specify dynamic 'params' that works on every request/function

Currently all params (the query string) must go in the "params={}" keyword argument.
But this sometimes works and sometimes doesn't. One place where it doesn't is "query()": you currently can't set a "route='ay'" there unless you put it in the query, which sucks.

So, how to do that?
My idea would be to accept things like route, min_rf, etc. as kwargs, which are then merged into "params" right before making a request. Does that make sense?

And if some kwargs keys are needed elsewhere (e.g. a router may need "prefer_leader"), that code can pop them from the kwargs?

Does that make sense? Or is there maybe a nicer way?
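
The proposal could look roughly like the sketch below. build_params and its reserved argument are hypothetical names, not existing SolrClient API; the point is only to show extra kwargs being merged into params while keys meant for other components are popped first.

```python
def build_params(params=None, reserved=("prefer_leader",), **kwargs):
    """Merge extra keyword arguments into the request params.

    Keys listed in `reserved` are removed from kwargs and returned separately
    (e.g. for a router) instead of being sent to Solr.
    """
    merged = dict(params or {})
    extras = {key: kwargs.pop(key) for key in reserved if key in kwargs}
    merged.update(kwargs)
    return merged, extras


request_params, router_options = build_params(
    {"q": "*:*"}, min_rf=2, prefer_leader=True
)
print(request_params)
print(router_options)
```

One open design question is precedence: in this sketch a kwarg overrides a key of the same name already present in params.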

Multiple faceted fields?

Hello, I am using SolrClient in my project.
I want to facet on multiple fields using res.get_facet_keys_as_list; that is, I want to see more fields.
However, this does not match the Solr syntax.
In Solr, it is written like this:
"fl":"genres,movie_title,title_year"
Is there any way to do this?
Thanks

Add more unit tests to reindexer

Fix travis-ci, seems broken?

read title

Add HTTPS support for requests

Add Methods for more Collections API Actions

cursor Mark query in Client

Add a method for easy cursor mark paging through a collection.
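
For anyone picking this up, the cursor mark protocol itself is small: start with cursorMark=*, sort on a field list that includes the unique key, and keep sending the nextCursorMark from each response until it stops changing. The sketch below shows that loop. The fetch callable is a hypothetical stand-in for whatever performs the request and returns the parsed JSON response; it is not a SolrClient method.

```python
def iter_cursor_pages(fetch, query, sort="id asc", rows=100):
    """Yield pages of documents by following Solr's cursorMark protocol.

    fetch: hypothetical callable taking a params dict and returning the
           parsed JSON response as a dict.
    sort:  must include the collection's unique key field ("id" is assumed).
    """
    cursor = "*"
    while True:
        params = dict(query, sort=sort, rows=rows, cursorMark=cursor)
        response = fetch(params)
        docs = response["response"]["docs"]
        if docs:
            yield docs
        next_cursor = response.get("nextCursorMark", cursor)
        if next_cursor == cursor:
            # Solr returns the same mark once there is nothing left to read
            break
        cursor = next_cursor
```

Writing it as a generator lets callers stop early without fetching the rest of the collection.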

Automatic Paging

Add a method in main SolrClient that easily pages through the data in Solr.

Deleting document by IDs that have a space results in an error

{
  "responseHeader":{
    "status":400,
    "QTime":2},
  "error":{
    "msg":"undefined field text",
    "code":400}}
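
One possible explanation, offered as a guess since the issue does not show the request: if the delete is sent as a query such as id:some value, Solr parses the part after the space as a separate term against the default field, which would produce "undefined field text" when no such field exists. The sketch below shows two payloads that avoid this, a delete by ID list and a delete by query with the ID quoted. The helper names are hypothetical and "id" is assumed to be the unique key field.

```python
import json


def delete_by_ids(ids):
    """Build a JSON delete-by-ID command; IDs are sent as plain strings."""
    return json.dumps({"delete": list(ids)})


def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped)


def delete_by_query_for_id(doc_id, id_field="id"):
    """Build a JSON delete-by-query command matching exactly one ID."""
    query = "{}:{}".format(id_field, quote_term(doc_id))
    return json.dumps({"delete": {"query": query}})


print(delete_by_ids(["doc 1", "doc 2"]))
print(delete_by_query_for_id("doc 1"))
```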

Support Timeout on requests to solr

Hello community, first of all thank you for creating this wonderful SolrClient library. Lately I have noticed that when I run queries or high-volume commits, the client sometimes hangs indefinitely, and I think I should at least be able to handle this from the client side with a wait time (timeout). I see that the connection to Solr is made with requests, so I think the timeout parameter could be added right there. That is, you could call something like: data_solr = conectionSolr.query (collection, query, timeout = 80) and inside the _send function it would be something like: res = self.session.request (method, url, params = params, data = data , headers = headers, verify = False, timeout = param_timeOut). How feasible would it be to integrate this? For my part, I still feel too much of a novice to make a change like this. Thank you very much for everything; I look forward to your contributions.

PS: In the screenshot I have marked the line where I think the timeout option could be added.
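
A minimal sketch of the suggestion, assuming the client keeps a requests.Session-like object. send_with_timeout is a hypothetical helper rather than the library's _send, and the RecordingSession stub exists only so the example runs without a Solr server.

```python
def send_with_timeout(session, method, url, timeout=80, **request_kwargs):
    """Forward a request through `session`, always passing a timeout.

    session is expected to behave like requests.Session; timeout is in
    seconds, as in requests.
    """
    return session.request(method, url, timeout=timeout, **request_kwargs)


class RecordingSession:
    """Stand-in for requests.Session that records what it was asked to send."""

    def __init__(self):
        self.calls = []

    def request(self, method, url, **kwargs):
        self.calls.append((method, url, kwargs))
        return "ok"


session = RecordingSession()
send_with_timeout(session, "GET", "http://localhost:8983/solr/select",
                  params={"q": "*:*"})
print(session.calls[0][2]["timeout"])
```

With a real requests.Session, exceeding the timeout raises requests.exceptions.Timeout, which the caller can catch and retry or report.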

Tests failing

Some errors in the log. I stopped it after some time.

solrtestslog.txt

What's needed for new release

Can you please make a list so that I can create pull requests and we can get a new release on PyPI?

Change on index() ?

I need the .index() command to return the full JSON response rather than True/False.

Should I create a new index_raw(?) or just edit .index() (if it's not in any release yet)? This changes the return type.

The reason is that I need to know the "rf" parameter when I set "min_rf".

Tag version 0.2.1

SolrClient is at version 0.2.1 on PyPi, but the GitHub repository is stuck at version 0.2.0. It would be nice to have a tagged version here that corresponded with what's available on PyPi. Thanks!

Index calls itself recursively

https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/solrclient.py#L130-L142.

My guess is that the code is intended to be:

data = json.dumps(docs)
return self.index_json(collection, data, params, **kwargs)
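
To make the suggested fix concrete without copying the library source, here is a stand-in class showing only the delegation: index serializes the documents once and hands them to index_json, instead of calling itself. ClientSketch and the echoing index_json body are illustrative assumptions, not the real implementation.

```python
import json


class ClientSketch:
    """Stand-in for the real client; only the delegation is illustrated."""

    def index_json(self, collection, data, params=None, **kwargs):
        # The real method would send `data` to Solr; this stub just echoes it.
        return collection, data, params

    def index(self, collection, docs, params=None, **kwargs):
        # Serialize once, then delegate to index_json rather than to index.
        data = json.dumps(docs)
        return self.index_json(collection, data, params, **kwargs)


print(ClientSketch().index("SolrClient_unittest", [{"id": 1}]))
```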

Add Methods for Collection Alias Management

get_field_values_as_list for empty fields

Hello,
I recently worked with your package and noticed somewhat strange behaviour for fields that are present in only some of the documents.
In my example I had, let's say, 2500 documents with a unique "id" field, but only 2000 or so had another field called "links".
Iterating over the "id" list and the "links" list together is not possible because they have different lengths: get_field_values_as_list("links") returns a list of only 2000 items, and I cannot tell which documents came back empty.
To work around this, I simply changed the function to:
return [doc[field] if field in doc else [] for doc in self.docs]
which solved the problem in my case. It might not always be needed, but perhaps it could be included as get_field_values_as_full_list, or something similar.

Best regards,
Dennis
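
The same idea as a standalone function, so the effect is easy to try on plain dictionaries. field_values_as_full_list is the hypothetical name suggested in the issue, and the sample documents are made up.

```python
def field_values_as_full_list(docs, field):
    """Return one entry per document, using [] where the field is missing."""
    return [doc.get(field, []) for doc in docs]


docs = [
    {"id": "1", "links": ["http://example.com/a"]},
    {"id": "2"},  # no "links" field
    {"id": "3", "links": ["http://example.com/b", "http://example.com/c"]},
]

ids = field_values_as_full_list(docs, "id")
links = field_values_as_full_list(docs, "links")

# Both lists now have the same length, so they can be iterated together
for doc_id, doc_links in zip(ids, links):
    print(doc_id, len(doc_links))
```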

_route_ to correct shard ?

For example, I was thinking of also adding:

a method for getting the cluster state
keeping track of the hash-->shard-->host linking
when route is specified, routing directly to the correct node (with failover to replicas)
3.1. route directly to the leader for create, update, delete
prefer_leader=True will always contact the leader of the shard first and then the replicas.
prefer_leader=False will contact the leader last
prefer_leader=None will random.choice(replicas)
when multiple routes are given ("tenant1,tenant2,tenant3"), pick a random replica to contact

Does that make sense?

The whole point is to contact the minimum number of nodes necessary, reducing network hops/traffic and lowering CPU usage.
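
The three prefer_leader modes could be sketched as an ordering function like the one below. contact_order is a hypothetical name, and two details are assumptions rather than part of the proposal: replicas is taken to exclude the leader, and in the None case the random pick is made among all nodes of the shard, with the rest kept as failover targets.

```python
import random


def contact_order(leader, replicas, prefer_leader=None, rng=random):
    """Return the hosts of one shard in the order they should be tried.

    replicas is assumed to exclude the leader.
    """
    if prefer_leader is True:
        return [leader] + list(replicas)
    if prefer_leader is False:
        return list(replicas) + [leader]
    # prefer_leader=None: start from a random node, keep the others as failover
    candidates = list(replicas) + [leader]
    first = rng.choice(candidates)
    return [first] + [host for host in candidates if host != first]


print(contact_order("node1", ["node2", "node3"], prefer_leader=True))
print(contact_order("node1", ["node2", "node3"], prefer_leader=False))
```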

"Invalid syntax" on Python 3.7.0

SolrClient 0.2.1 fails to run on Python 3.7.0:

$ python test.py 
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from SolrClient import SolrClient
  File "/home/aorth/.local/share/virtualenvs/solr-pipenv-nV7koRrN/lib/python3.7/site-packages/SolrClient/__init__.py", line 1, in <module>
    from .solrclient import SolrClient
  File "/home/aorth/.local/share/virtualenvs/solr-pipenv-nV7koRrN/lib/python3.7/site-packages/SolrClient/solrclient.py", line 10, in <module>
    from .zk import ZK
  File "/home/aorth/.local/share/virtualenvs/solr-pipenv-nV7koRrN/lib/python3.7/site-packages/SolrClient/zk.py", line 9, in <module>
    from kazoo.client import KazooClient
  File "/home/aorth/.local/share/virtualenvs/solr-pipenv-nV7koRrN/lib/python3.7/site-packages/kazoo/client.py", line 62, in <module>
    from kazoo.recipe.partitioner import SetPartitioner
  File "/home/aorth/.local/share/virtualenvs/solr-pipenv-nV7koRrN/lib/python3.7/site-packages/kazoo/recipe/partitioner.py", line 193
    self._child_watching(self._allocate_transition, async=True)
                                                        ^
SyntaxError: invalid syntax

Virtual environment was created with pipenv.
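
Note that the traceback ends inside kazoo (kazoo/recipe/partitioner.py), not in SolrClient's own code. The failing line passes async=True as a keyword argument, and async became a reserved keyword in Python 3.7, so it can no longer be used as an argument name. The check below demonstrates this with made-up call snippets (they are not kazoo's code). Whether a newer kazoo release resolves the problem is not verified here.

```python
import keyword
import sys


def async_is_reserved():
    """True when this interpreter treats `async` as a reserved keyword."""
    return keyword.iskeyword("async")


def compiles(source):
    """True when `source` is syntactically valid on this interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(sys.version_info[:2], async_is_reserved())
print(compiles("watch(callback, async=True)"))
print(compiles("watch(callback, async_=True)"))
```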

Facets and facet ranges don't preserve order

In the SolrResponse class, get_facets and get_facets_ranges return ordinary dict objects, so the ordering returned by Solr is lost. They should just need to be returned as an OrderedDict instead.

This looks easy enough to fix, I'll see if I can create a pull request for it.
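
A minimal sketch of the idea, assuming Solr's default JSON facet format, in which each facet field is a flat list alternating term and count. facet_counts_in_order is a hypothetical helper and the sample values are made up.

```python
from collections import OrderedDict


def facet_counts_in_order(flat):
    """Convert a flat [term, count, term, count, ...] list to an OrderedDict.

    The order of the terms in the input list is preserved.
    """
    return OrderedDict(zip(flat[0::2], flat[1::2]))


# Made-up facet list, already sorted by count as Solr would return it
raw = ["dolor", 2, "Lorem", 1, "amet,", 1, "ipsum", 0]
ordered = facet_counts_in_order(raw)
print(list(ordered.items()))
```

On Python 3.7 and later a plain dict also keeps insertion order, but OrderedDict keeps the behaviour explicit on the older Python 3 versions the library lists as supported.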

Facet query with "facet.sort": "count" not returning sorted facets

Hello,
I'm trying to submit this query to a Solr core with SolrClient, but the result object doesn't match the query: the keys are not sorted by facet count.
Here is the query passed to SolrClient:

onto_suggest = solr.query('merged', {
    "q": "*:*",
    "rows": 0,
    "facet": True,
    "facet.field": "onto_suggest",
    "facet.sort": "count",
    "facet.limit": -1,
    "facet.mincount": 2
})

I've also tried using "facet.count" to sort, with the same result.
Here is an excerpt of the result:

{
  "onto_suggest": {
    "Arthrobacter siderocapsulatus": 433,
    "Enterobacter aglomerans": 39,
    "Genetic endocrine tumor": 32,
    "ETOP": 15,
    "lymphoid lineage restricted progenitor cell": 47,
    "Escherichia coli O157:H7 EDL933": 127,
    "deferoxaminum": 2,
    "Bacillus francki": 472,
    "srf": 9,
    "Malignant Peripheral Nerve Sheath Tumors": 9,
    "Pancreas Cancer": 122,