chairdb's People

Contributors

marten-de-vries

chairdb's Issues

sql.py: parallelize attachment reading in _create_doc_ptr

Doing it serially (as is currently the case) might block, as only a single attachment is guaranteed to be readable at any time. (That's to make an HTTPDatabase implementation possible.) See the in-memory database for an example of how to handle this.
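A minimal sketch of the concurrent approach, assuming each attachment exposes an async-iterable stream; the names here (`_read_one`, `read_attachments_concurrently`) are placeholders, not chairdb's actual API:

```python
import asyncio


async def _read_one(attachment):
    """Drain a single attachment stream into memory (placeholder logic)."""
    chunks = []
    async for chunk in attachment:  # assumes attachments are async-iterable
        chunks.append(chunk)
    return b''.join(chunks)


async def read_attachments_concurrently(attachments):
    """Start reading every attachment of a document at once, so whichever
    stream happens to be readable can make progress instead of blocking
    the others."""
    names = list(attachments)
    data = await asyncio.gather(*(_read_one(attachments[name]) for name in names))
    return dict(zip(names, data))
```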

Support non-inline attachments in the HTTP API

The first step would be to extend doc_to_couchdb_json to handle this. Using inline attachments makes sense when the total attachment size for a document is not too big (<10 kB? <100 kB? It might be worth looking up what CouchDB does), but otherwise it should send all attachments using multipart with follows=true. My suggestion would be to make _single_response work first, and then to make multi_response call _single_response internally multiple times and wrap the output in a multipart response.

Bonus points if it doesn't require multipart and can also return a JSON array like CouchDB when the requested content type is application/json.
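A rough sketch of the size-threshold decision; the function name, signature and limit are assumptions, not doc_to_couchdb_json's real interface:

```python
import base64

INLINE_LIMIT = 64 * 1024  # assumption; check what CouchDB actually uses


def attachments_to_json(attachments, content_types):
    """Sketch only: inline small attachments as base64, otherwise emit a stub
    with 'follows': true so the body can be sent as a multipart/related part."""
    total = sum(len(data) for data in attachments.values())
    result = {}
    for name, data in attachments.items():
        stub = {'content_type': content_types[name], 'length': len(data)}
        if total <= INLINE_LIMIT:
            stub['data'] = base64.b64encode(data).decode('ascii')
        else:
            stub['follows'] = True  # actual bytes follow in a multipart part
        result[name] = stub
    return result
```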

Implement efficient range queries

There's basic map/reduce support, but it implements aggregation using a table scan. That doesn't scale.

It would be nice to optimize it. The 'overlay indexes' described by Pennino, Pizzonia and Papi (2019) seem like a nice fit. See https://github.com/kdbtree/kdbtree/ for example code.

A 'simple' skiplist might be easier, but it's less elegant, as it requires more round trips to the backend database.
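To make the gap with a table scan concrete: any prefix-aggregate structure answers range aggregates in O(log n). The in-memory Fenwick tree below is purely illustrative (it is not the overlay index from the paper, and a real implementation would live in the backend database):

```python
class FenwickTree:
    """Illustrative in-memory prefix-aggregate index: answers range sums in
    O(log n) instead of scanning every row."""

    def __init__(self, size):
        self.tree = [0] * (size + 1)

    def add(self, index, delta):
        i = index + 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, index):
        i, total = index + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def range_sum(self, lo, hi):  # inclusive bounds
        return self.prefix_sum(hi) - (self.prefix_sum(lo - 1) if lo > 0 else 0)
```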

Diego Pennino, Maurizio Pizzonia, Alessio Papi. Overlay Indexes: Efficiently Supporting Aggregate Range Queries and Authenticated Data Structures in Off-the-Shelf Databases. IEEE Access 7:175642–175670, 2019.

Follow-up issue of #8.

HTTP API: attachment range requests

This would be nice to support. As a first step, it could simply not send data that the client didn't request. It would be nice to extend that later so such data isn't even read from disk, but that might require a bit more work.
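A minimal sketch of that first step, assuming the attachment body has already been read in full; only single-range `bytes=` requests are handled, and the server integration is left out:

```python
def parse_byte_range(range_header, length):
    """Parse a single 'bytes=start-end' Range header into (start, stop);
    returns None for anything it doesn't understand (no multi-range support)."""
    if not range_header or not range_header.startswith('bytes='):
        return None
    start_s, _, end_s = range_header[len('bytes='):].partition('-')
    if not start_s and not end_s:
        return None
    if start_s:
        start = int(start_s)
        stop = int(end_s) + 1 if end_s else length
    else:  # suffix range, e.g. 'bytes=-500'
        start, stop = max(length - int(end_s), 0), length
    return start, min(stop, length)


def slice_attachment(body, range_header):
    """First step only: the whole attachment is still read, but only the
    requested slice is sent back (with a 206 status)."""
    parsed = parse_byte_range(range_header, len(body))
    if parsed is None:
        return body, 200
    start, stop = parsed
    return body[start:stop], 206
```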

HTTP 2/3: PUT instead of _bulk_docs?

It might be interesting to investigate whether the performance is similar. If so, replacing the _bulk_docs calls in HTTPDatabase with individual PUTs would probably simplify things, and implementing _bulk_get at some point would no longer be necessary.
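A rough way to compare the two, assuming a CouchDB-compatible server at BASE and docs that already carry _id/_rev (the URL and doc shape are assumptions; httpx needs the http2 extra installed):

```python
import asyncio
import time

import httpx  # pip install httpx[http2]

BASE = 'http://localhost:5984/test'  # assumption: a CouchDB-compatible server


async def time_puts(docs):
    async with httpx.AsyncClient(http2=True) as client:
        start = time.perf_counter()
        await asyncio.gather(*(
            client.put(f"{BASE}/{doc['_id']}", json=doc,
                       params={'new_edits': 'false'})
            for doc in docs))
        return time.perf_counter() - start


async def time_bulk_docs(docs):
    async with httpx.AsyncClient(http2=True) as client:
        start = time.perf_counter()
        await client.post(f"{BASE}/_bulk_docs",
                          json={'docs': docs, 'new_edits': False})
        return time.perf_counter() - start
```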

Better batching & using _global_changes

For the 'local' databases, rows are processed as they arrive as write() input, so batching is irrelevant. But for HTTP APIs, the single monster call to _bulk_docs needs to be replaced with multiple smaller ones during a big replication, to prevent the request size from growing too much. Also, we don't want _bulk_docs to wait on _changes() when the latter is hanging just because continuous=true.

Solving the former issue is relatively simple: just break up the input stream based on some size criterion. The latter issue is more interesting. A nice approach might be to remove the continuous parameter from the changes() function, and instead provide the user with an event that notifies them of a database change. That is a little less user-friendly for users other than the replicator, but it could easily be wrapped in a higher-level API and it would add flexibility.
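A minimal sketch of the 'break up the input stream' part, assuming the docs arrive as an async iterator; the size criterion here is a plain doc count, but it could just as well be a byte limit:

```python
async def batched(docs, max_batch=500):
    """Re-chunk an async stream of docs into lists of at most max_batch items,
    so each _bulk_docs request stays a reasonable size."""
    batch = []
    async for doc in docs:
        batch.append(doc)
        if len(batch) >= max_batch:
            yield batch
            batch = []
    if batch:
        yield batch
```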

Additionally, it would allow the HTTPDatabase to listen to the _global_changes endpoint instead of the _changes endpoint (when available). When lots of HTTPDatabases are opened, they could share a single changes listener. It should scale much better. (It's the main trick of spiegel.)
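A sketch of how such sharing could look, using the per-database change event proposed above; fetch_global_changes is a placeholder for the actual _global_changes request, not an existing chairdb function:

```python
import asyncio
from collections import defaultdict


class GlobalChangesListener:
    """One background task watches _global_changes and sets a per-database
    asyncio.Event, so any number of HTTPDatabase objects for the same server
    can share a single listener."""

    def __init__(self, fetch_global_changes):
        self._fetch = fetch_global_changes  # placeholder: yields changed db names
        self._events = defaultdict(asyncio.Event)
        self._task = None

    def event_for(self, db_name):
        if self._task is None:
            self._task = asyncio.ensure_future(self._run())
        return self._events[db_name]

    async def _run(self):
        async for db_name in self._fetch():
            self._events[db_name].set()  # waiters clear it after reading changes
```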

Finally, it might be good to check whether batching is required for _revs_diff too, and to handle timeouts for _changes()/_global_changes().

Implement _purge

Surprisingly easy to do with the current rev tree implementation, and probably useful: it's probably the most requested open feature for PouchDB, so I think it's worth doing.

Better error handling

Make HTTPDatabase retry failed requests. Make sure the replicator bubbles up any persistent errors, while still handling forbidden/unauthorized and similar errors correctly.
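A sketch of that retry policy using httpx; the backoff schedule and the choice of which responses count as retryable are assumptions:

```python
import asyncio

import httpx


async def request_with_retries(client, method, url, attempts=5, **kwargs):
    """Retry transient failures (network errors, 5xx) with exponential backoff;
    return 4xx responses such as unauthorized/forbidden right away so the
    replicator can handle them instead of pointlessly retrying."""
    for attempt in range(attempts):
        try:
            response = await client.request(method, url, **kwargs)
        except httpx.TransportError:
            if attempt == attempts - 1:
                raise
        else:
            if response.status_code < 500:
                return response  # includes 401/403: retrying won't help
            if attempt == attempts - 1:
                return response
        await asyncio.sleep(2 ** attempt)
```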

Make sure SQLDatabase, InMemoryDatabase and HTTPRemote return errors in the same style.

Convert errors to the right JSON errors in chairdb.server.

Finally, count_docs in the replicator currently has a weird check required for skimdb. That's worth investigating further.

sqlite: context manager for DB creation?

Think about when to set up the database schema: in a create() method similar to remote()? Whenever a method is first called? Or legalize the current decision to (mis)use context managers?
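For comparison, a sketch of the context-manager option; the aiosqlite driver and the schema are placeholders, not what sql.py actually uses:

```python
import contextlib

import aiosqlite  # assumption: any async sqlite driver would do

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    body TEXT NOT NULL
)
"""  # placeholder schema, not chairdb's


@contextlib.asynccontextmanager
async def create(path):
    """Set up the schema on entry, hand out the database, clean up on exit."""
    db = await aiosqlite.connect(path)
    try:
        await db.executescript(SCHEMA)
        yield db
    finally:
        await db.close()
```

Usage would then be `async with create('db.sqlite3') as db: ...`, which makes the moment the schema is created explicit.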
