Comments (13)

kdubb commented on May 24, 2024

You can control the buffer size yourself when working with blobs. Through Blob.setBytes/getBytes you provide your own buffer. The same goes for InputStream/OutputStream: whatever size buffer you pass to read/write is sent straight to the server. In both cases loread/lowrite are used, which I am pretty sure are the same functions used by psql (lo_import/lo_export is a special case in the libpq client that ends up using loread/lowrite).

Try sending really large buffers, see how that affects performance and report back.
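
For illustration, a minimal sketch of what "really large buffers" looks like through the standard JDBC Blob API, assuming the driver supports Connection.createBlob(); the table and column names are hypothetical:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;

class BlobWriteSketch {
    // Sketch: stream a file into a Blob with a 1MB buffer. Each write of the buffer
    // becomes one lowrite call, so the buffer size sets the chunk size on the wire.
    // The "images" table and "data" column are hypothetical.
    static void writeBlob(Connection conn, String path) throws Exception {
        conn.setAutoCommit(false);                 // large objects must be used inside a transaction
        Blob blob = conn.createBlob();
        try (InputStream in = new FileInputStream(path);
             OutputStream out = blob.setBinaryStream(1)) {
            byte[] buffer = new byte[1024 * 1024]; // try different sizes and compare timings
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        try (PreparedStatement ps = conn.prepareStatement("insert into images (data) values (?)")) {
            ps.setBlob(1, blob);
            ps.executeUpdate();
        }
        conn.commit();
    }
}
```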


andreak commented on May 24, 2024

I'm finding that writing data to a Blob takes much longer than uploading the same data to a web form via POST, and I wonder if there is something obvious I can do to optimize it.
I will test different buffer sizes. Thanks.


andreak commented on May 24, 2024

For reference: copying a 410MB file went from 76,000 ms to 51,000 ms by using a 1024*1024-byte byte[] instead of the default (1024*4). I'm using org.apache.commons.io.IOUtils.copyLarge to copy the stream, so this has nothing to do with the byte-buffer settings in the PGBlob-related classes.
Still, 51s is quite long for writing 410MB I think, but maybe it's out of the driver writers' hands to fix?
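
For anyone following along, the difference here is just which copyLarge overload is used; a sketch, assuming commons-io 2.x:

```java
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

class CopySketch {
    // copyLarge(in, out) uses the default 4KB (1024*4) buffer; the overload that takes
    // a byte[] lets you choose the chunk size, and with a Blob OutputStream each chunk
    // becomes one lowrite round trip.
    static long copyWithBigBuffer(InputStream in, OutputStream out) throws Exception {
        byte[] buffer = new byte[1024 * 1024];
        return IOUtils.copyLarge(in, out, buffer);
    }
}
```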


brettwooldridge commented on May 24, 2024

Pardon me butting in. I'm not fully up to speed on the pgjdbc-ng codebase, but it looks to me like each write() on the Blob calls connection.executeForFirstResultValue(), which calls executeForFirstResult(), which calls executeForFirstResultBatch(), which calls BasicContext.queryBatch(), which ultimately prepares the "select lowrite($1,$2)" call, binds the parameters, and then executes.

Does it really need to be prepared on every write? I would imagine this is causing a large amount of overhead that can be avoided if the already-created PrepareCommand object is reused on every write (of the same blob) -- simply binding new parameters before executing.

Maybe cache PrepareCommand objects keyed on the queryTxt in BasicContext.queryResults()? I remember contributing a similar optimization to the jTDS driver for MS SQL about 9 years ago, where it was continually preparing statements to get the metadata for binding. Caching those objects on a short-term basis provided huge speed-ups (several hundred percent).
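
To make the suggestion concrete, a sketch of the caching idea only; PreparedCommandCache and the Q type parameter are hypothetical stand-ins, not pgjdbc-ng API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: cache prepared commands keyed by their SQL text so that repeated
// "select lowrite($1,$2)" calls reuse one prepare instead of re-preparing.
// Q stands in for whatever pgjdbc-ng's prepare step produces (e.g. a PrepareCommand result).
class PreparedCommandCache<Q> {
    private final Map<String, Q> cache = new ConcurrentHashMap<>();

    // prepare is invoked at most once per distinct SQL text; later callers get the cached value
    Q getOrPrepare(String sqlText, Function<String, Q> prepare) {
        return cache.computeIfAbsent(sqlText, prepare);
    }
}
```

A real cache would also need bounded size, eviction, and invalidation; this only illustrates the shape of the idea.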


kdubb commented on May 24, 2024

@brettwooldridge This is good insight, and these queries can definitely be cached. However, he tested using a 1MB buffer, which means there are only about 410 prepare calls. Preparing a simple query takes milliseconds, so this shouldn't add more than a second or two at most (unless there is something very weird with these queries).

@andreak To save me from debugging a non-issue... can you try it with a 50 or 100MB buffer? Obviously only for testing purposes but this should narrow down the problem area.


andreak commented on May 24, 2024

Hm, with a 100MB buffer the time spent goes up to 73s, and back down to 51s when set to 1MB.


kdubb commented on May 24, 2024

I did some tests on my own...

Using localhost for both client and server, my psql times were very comparable with my driver times. I then ran a "select lo_import('...')" query, which runs the import entirely inside the server (the file must be accessible by the server process). I got nearly the exact same times as with psql and the driver.

I encourage you to try this yourself and confirm it, but it seems this is an issue with how fast Postgres updates blobs.

Note: I forgot to mention... I was using a 1MB buffer for the driver test. Also, my driver times were a bit slower at first (by only a couple of seconds); I did switch it to using prepared queries to get the times to match psql.
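
For anyone wanting to reproduce the baseline, a hedged sketch of timing the server-side import from JDBC; the file path is hypothetical and must be readable by the server process:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class LoImportBaseline {
    // Sketch: lo_import runs entirely inside the server, so its timing excludes driver
    // and network overhead and gives a floor for how fast large object writes can be.
    static void timeServerSideImport(Connection conn) throws Exception {
        long start = System.nanoTime();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select lo_import('/tmp/bigfile.bin')")) {
            rs.next();
            long oid = rs.getLong(1);  // OID of the newly created large object
            System.out.printf("lo_import -> oid %d in %d ms%n",
                    oid, (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```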


brettwooldridge commented on May 24, 2024

As a hack, I put caching in the PrepareCommand class, to avoid the overhead of the additional prepare calls, and the difference was basically nil.

(Not that this invalidates the idea of a cache in PrepareCommand in general; some extremely complex SQL can take minutes for a database to prepare. But it does rule out prepare overhead as the cause for simple cases like "select lowrite($1,$2)".)


kdubb commented on May 24, 2024

In 6e71e22, I changed LargeObject to use prepared queries. This was facilitated by a core change that allows any code to prepare a "utility" query and use that later.

As @brettwooldridge mentioned, I don't believe this will have any real net effect on loading large objects, but it was low-hanging fruit so I added it.


kdubb commented on May 24, 2024

So... I did more testing...

I am now positive the problem you are experiencing is with Postgres itself. Blobs get split up and stored in separate rows of the pg_largeobject table, using a block size of 2KB (read about it here: http://www.postgresql.org/docs/9.3/static/catalog-pg-largeobject.html). That means 230,400 rows (450MB / 2KB) are inserted into the database for a 450MB blob; this is the issue!

I then tried inserting a 450MB file into a single "bytea" field of a row. It took less than half the time! 22 seconds down from around 47 seconds.

It looks like a daunting task, but you could try to increase the "LOBLKSIZE" of your Postgres installation. Even then it might not speed things up that much; Postgres's scheme optimizes for random read/write at the expense of initial storage speed.

TOAST fields can be ~4GB I believe, but you have to read/write them as one atomic item. The driver streams the data it sends, so updating from a file is not a memory concern. Reading the data back will be an issue, though: the way the server sends data requires it to be in memory all at once.
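
For comparison, a sketch of the bytea variant described above; the "files" table and "content" column are hypothetical:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

class ByteaInsertSketch {
    // Sketch: stream a file into a single bytea column. The value lands in one row
    // (TOASTed), bypassing pg_largeobject's 2KB-per-row layout, but it can only be
    // read back as one atomic value rather than streamed.
    static void insertAsBytea(Connection conn, File file) throws Exception {
        try (InputStream in = new FileInputStream(file);
             PreparedStatement ps = conn.prepareStatement(
                     "insert into files (content) values (?)")) {
            ps.setBinaryStream(1, in, file.length());  // sends the contents as one bytea value
            ps.executeUpdate();
        }
    }
}
```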


brettwooldridge commented on May 24, 2024

Wow, surprising and somewhat disappointing. More disappointing is that LOBLKSIZE can only be changed by recompiling PostgreSQL, which makes deploying a product with a simple 'yum' dependency impossible. And equally disappointing is that LOBLKSIZE is not independent, but is calculated from the block size BLCKSZ (as BLCKSZ / 4), which affects all tables in the system. So if you want a target LOBLKSIZE of 16KB, you're looking at a block size of 64KB for all tables.


kdubb commented on May 24, 2024

I agree somewhat, but really, for Blobs of this size, save a URL and stick the file on the file system, S3, or someplace similar. Better yet... get creative and use an FDW to store it in one of those places.

Either way... I am closing this for now, as I think the PGBlob implementation is as fast as it can be at the moment.


andreak commented on May 24, 2024

Nice analysis, thanks. I need my Blobs to participate in transactions and to stream both in and out (so no bytea), so storing them in external files is not an option. Hopefully PG will improve in this area over time.

