
Comments (12)

ShaneHarvey commented on June 14, 2024

Hi @CarstVaartjes, thanks for your interest in using python-bsonjs! I have two questions about your use-case before I give you some advice on using it.

So normally if I put something into gridfs, i use bson.json_util.dumps to do a conversion:
dict -> json string -> bson binary.

  1. So you're using GridFS to store arbitrarily large BSON documents to work around the 16MB document size limit?

  2. If you want to encode a Python dict to BSON, I don't think you want to go from dict -> JSON -> BSON. PyMongo lets you encode a dict, or any mapping type, directly to BSON using bson.BSON.encode:

    >>> from bson import BSON
    >>> raw_bson = BSON.encode({'my': 'dict'})
    >>> raw_bson
    b'\x12\x00\x00\x00\x02my\x00\x05\x00\x00\x00dict\x00\x00'

from python-bsonjs.

CarstVaartjes commented on June 14, 2024

Thanks for your answer!

1: Yes, basically we have large, complex nested dicts that can run over 16MB. We move the bulk of the dict into GridFS to work around the limit, and keep the header information in the normal collection to do lookups/filters, etc.
2A: For the collection part we just use basic pymongo (inserting the dicts). It seems we could use python-bsonjs there to speed things up, but next to insert_one and find_one we also use update_one, update_many and delete_one, and I'm not sure how to use those with python-bsonjs.
2B: For the GridFS part we used the bson.json_util loads/dumps, but we are now switching to ujson with manual conversion of ObjectIds to strings to make sure we don't run into issues; this saves a lot of time (still not super-fast, but quite a bit faster than before). I'm not sure whether the BSON conversion is actually a significant performance factor here, as GridFS stores a plain string.
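The manual ObjectId-to-string replacement described in 2B can be sketched with a default hook on the JSON encoder. This is a dependency-free sketch where uuid.UUID stands in for bson.ObjectId (newer ujson releases also accept a similar default parameter):

```python
import json
import uuid

def stringify_unknown(obj):
    """Fallback for types the encoder can't handle natively.

    In the real pipeline this would catch bson.ObjectId; here
    uuid.UUID stands in so the sketch needs no third-party imports.
    """
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError('cannot serialize %r' % (obj,))

doc = {'_id': uuid.uuid4(), 'header': {'nested': [1, 2, 3]}}
json_str = json.dumps(doc, default=stringify_unknown)

round_tripped = json.loads(json_str)
assert round_tripped['header'] == {'nested': [1, 2, 3]}
assert isinstance(round_tripped['_id'], str)
```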

Thank you so much for your answer!


ShaneHarvey commented on June 14, 2024

2B: For the GridFS part we used the bson.json_util loads/dumps, but we are now switching to ujson with manual conversion of ObjectIds to strings to make sure we don't run into issues; this saves a lot of time (still not super-fast, but quite a bit faster than before). I'm not sure whether the BSON conversion is actually a significant performance factor here, as GridFS stores a plain string.

So it sounds like your data is represented in memory as a Python dict and you're converting that into JSON strings to store in GridFS. Is this roughly the process?

from bson import json_util

# Load JSON document from GridFS
json_str = gridfs_lookup_doc()
large_dict = json_util.loads(json_str)
# Update large dict...

# Store JSON document into GridFS
json_str = json_util.dumps(large_dict)
gridfs_insert_doc(json_str)


CarstVaartjes commented on June 14, 2024

Hi, I just realized that I never answered. In the end we used a conversion to a JSON string with ujson and manual replacement of ObjectIds (we know where to find them); that was really fast in the end.
However, I just also saw this in pymongo 3.6: http://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=find_raw#pymongo.collection.Collection.find_raw_batches

I can use the raw batches to escape the overhead of the pymongo cursor (no kidding, around 50% of the time spent in pymongo goes to the cursor itself), but that also piqued my interest in this project again. I see you stopped updating it, but is it still alive? If not, do you know of alternatives?


behackett commented on June 14, 2024

It's still alive. We just haven't had time to work on it compared to other priorities. We at least want to update it to the latest version of libbson, to support the final version of the extended JSON spec:

https://github.com/mongodb/specifications/blob/master/source/extended-json.rst

The raw batches methods were added for use in the bson-numpy project, to avoid needing to decode BSON to Python dicts before building an array.


CarstVaartjes commented on June 14, 2024

Thanks! It would be really interesting; next to the BSON translation, the cursor itself is a major bottleneck in pymongo (which is a bit odd, as I would expect a generator to be faster than a list operation).


ShaneHarvey commented on June 14, 2024

I can use the raw batches to escape the overhead of the pymongo cursor (no kidding, around 50% of the time spent in pymongo goes to the cursor itself)

next to the bson translation the cursor itself is a major bottleneck in pymongo

Can you expand on this a bit more? Are you saying that the Cursor class is spending a lot of time doing something other than network I/O and BSON decoding? That would be surprising.


CarstVaartjes commented on June 14, 2024

Hi,

{edited it with a nicer example}

It's not as bad as it used to be with older pymongo versions, but it's still significant. Python 2.7, MongoDB 3.4 (non-sharded, non-replicated) and pymongo 3.6.0. My example code:

from bson import decode_all

def normal_example(db_table, qc=None, qf=None, skip=0, limit=0):
    if not qf:
        qf = {}
    cursor = db_table.find(filter=qf, projection=qc, skip=skip,
                           limit=limit, batch_size=999999999)
    return list(cursor)

def raw_example(db_table, qf=None, qc=None, skip=0, limit=0):
    if not qf:
        qf = {}
    cursor = db_table.find_raw_batches(filter=qf, projection=qc, skip=skip,
                                       limit=limit, batch_size=999999999)
    output_list = []
    # Each batch is a blob of raw BSON; decode_all turns it into a list of dicts
    for batch in cursor:
        output_list.extend(decode_all(batch))
    return output_list

print(db_table.count())
%timeit normal_example(db_table, qc=qc, skip=0, limit=20000)
%timeit raw_example(db_table, qc=qc, skip=0, limit=20000)

which gives, for a small table with large documents (between 1MB and 10MB each) and a projection that fetches a single key/value from deep inside the nested dicts (the qc argument):

66284
1 loop, best of 3: 3.66 s per loop
1 loop, best of 3: 3.1 s per loop

Using prun on the 'normal' loop:

100280 function calls (100278 primitive calls) in 6.336 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   16    6.166    0.385    6.166    0.385 {method 'recv' of '_socket.socket' objects}
    1    0.093    0.093    0.093    0.093 {bson._cbson.decode_all}
 20001    0.033    0.000    6.318    0.000 cursor.py:1172(next)
 20000    0.012    0.000    0.012    0.000 database.py:402(_fix_outgoing)
    1    0.010    0.010    6.336    6.336 <string>:1(<module>)
    1    0.008    0.008    6.326    6.326 <ipython-input-13-d31c1c9e48e4>:3(normal_example)
 20004    0.005    0.000    0.005    0.000 collection.py:305(database)
 20026    0.004    0.000    0.004    0.000 {len}
 20000    0.003    0.000    0.003    0.000 {method 'popleft' of 'collections.deque' objects}
    1    0.001    0.001    6.261    6.261 cursor.py:897(__send_message)
    2    0.000    0.000    6.166    3.083 network.py:166(_receive_data_on_socket)
    1    0.000    0.000    0.000    0.000 message.py:953(unpack)
    2    0.000    0.000    6.261    3.131 cursor.py:1059(_refresh)
    1    0.000    0.000    0.000    0.000 cursor.py:112(__init__)

As we saw before, cursor.py adds significant overhead compared to find_raw_batches with decode_all into a list (the third entry in the prun output!). This especially happens when we read larger tables with smaller documents: the BSON decoding becomes less of an issue, but the relative performance impact of the cursor can become high. We have seen examples ranging from 0% (no difference) to 50% slower. It's only relevant for larger collections, though.


behackett commented on June 14, 2024

Hi. You might want to try again with PyMongo 3.7.0 (just released last week). We made some changes to the networking code that may result in a large performance increase for you.


CarstVaartjes commented on June 14, 2024

Thanks @behackett!! Is this about find_raw_batches, or about the general find() with the cursor bottleneck?


ShaneHarvey commented on June 14, 2024

The issue Bernie mentioned is https://jira.mongodb.org/browse/PYTHON-1513. The fix improves PyMongo's performance of reading large messages off of sockets, including find and find_raw_batches. The fix was primarily for Python 3 but Python 2 performance should be better as well. Would you be able to run the benchmark again and post the results comparing PyMongo 3.6.1 and 3.7.0?


ShaneHarvey commented on June 14, 2024

There hasn't been any recent activity so I'm closing this. Thanks for reaching out! Please feel free to reopen this if we've missed something.

