Git Product home page Git Product logo

Comments (3)

sbrandtb avatar sbrandtb commented on May 24, 2024 1

I stumbled across this issue several times with PyAthena and I believe it should not have been closed.

No offense, but I suspect that there is a fundamental misunderstanding here:

If you do not acquire a lock, you should not share the cursor object in multithreaded.

However, @mateush states:

[...] despite the fact that they have separate Cursor instances

The thing is that the lock of synchronized() is shared globally across all cursor instances, so only one execute() or any other decorated method can be called at a time in a single Python process, independent from cursor instance. Here is an almost minimal example using your decorator:

import functools
import threading
import time
from multiprocessing.pool import ThreadPool
from time import sleep


def synchronized(wrapped):
    _lock = threading.RLock()

    @functools.wraps(wrapped)
    def _wrapper(*args, **kwargs):
        with _lock:
            return wrapped(*args, **kwargs)

    # Added for demonstration purposes. Does not change runtime behavior
    _wrapper.__lock = _lock

    return _wrapper


class A:
    @synchronized
    def do_stuff(self, n):
        print(time.time(), 'starting to do stuff', n)
        sleep(1)
        print(time.time(), 'finished doing stuff', n)

    @synchronized
    def do_other_stuff(self):
        pass


def do_a_stuff(n):
    A().do_stuff(n)


def demo_concurrency():
    p = ThreadPool(processes=10)
    p.map(do_a_stuff, range(10))


def demo_references():
    a1 = A()
    a2 = A()
    print('id a1 do_stuff', id(a1.do_stuff.__lock))
    print('id a2 do_stuff', id(a2.do_stuff.__lock))
    print('id a1 do_other_stuff', id(a1.do_other_stuff.__lock))
    print('id a2 do_other_stuff', id(a2.do_other_stuff.__lock))


def run_all():
    demo_references()
    print('---')
    demo_concurrency()


if __name__ == '__main__':
    run_all()

Output:

id a1 do_stuff 140487096646080
id a2 do_stuff 140487096646080
id a1 do_other_stuff 140487096535360
id a2 do_other_stuff 140487096535360
---
1668505773.485564 starting to do stuff 0
1668505774.4865115 finished doing stuff 0
1668505774.486622 starting to do stuff 1
1668505775.4867237 finished doing stuff 1
1668505775.4868734 starting to do stuff 2
1668505776.4879422 finished doing stuff 2
1668505776.4880366 starting to do stuff 3
1668505777.4890823 finished doing stuff 3
1668505777.4891577 starting to do stuff 4
1668505778.4903514 finished doing stuff 4
1668505778.4906769 starting to do stuff 5
1668505779.4918444 finished doing stuff 5
1668505779.4919217 starting to do stuff 6
1668505780.4932797 finished doing stuff 6
1668505780.493365 starting to do stuff 7
1668505781.49372 finished doing stuff 7
1668505781.4939387 starting to do stuff 8
1668505782.4944823 finished doing stuff 8
1668505782.4945972 starting to do stuff 9
1668505783.4958422 finished doing stuff 9

Although the example code is creating a new instance of A with every call to do_a_stuff() in threads, the method lock is global and each new start of method waits for the previous run to finish.

Moreover, we see that the code creates a lock per method, not per class or object, which means that while only one instance of execute() can be called at any time on any object on any thread in a process, execute() and any other method decorated could be called in parallel. Normally, this is what we would want to prevent when locking an object. I can not easily make up an example where this could go wrong, but classic example is when one thread in one method changes some internal variable of the object while another thread calls another method on the same object which modifies the same internal variables.

On top of that, making cursors thread safe barely makes sense at all. In DBAPI2.0, you call execute() on a cursor and then reuse the same cursor instance to fetch results via i.e. fetchall(). Locking methods does not prevent:

  • Thread1 calls execute()
  • Thread2 is queuing up execute() with different query
  • Thread1 finishes, Thread2 runs its execute

Now, if we were using a lock per cursor object:

  • Thread1 needs to wait for Thread2's execute()
  • Thread2's execute() finishes
  • Thread1 fetches Thread2's results, not its own

In PyAthena's case, since different methods lock on different locks, anything wild could happen.

Please also refer to psycopg2's explanation of thread safety, which is IMO, as often with Postgres, from a software engineering perspective the most desirable.

Proposal

  • Remove synchronized decorator completely
  • Document thread safety like Postgres does
  • Optional: Make sure that connections actually are thread safe

I am planning to open a merge request that fixes at least the first two, if not all three points, soon.

Mitigation

Everyone stumbling upon this can temporarily fix it by running following code before your first import of pyathena, i.e. in your program's root module's __init__.py or before main's imports:

import pyathena.util

pyathena.util.synchronized = lambda x: x

It effectively makes the synchronisation decorator a no-op. Python caches the imported module and will re-use the monkey patched version.

from pyathena.

laughingman7743 avatar laughingman7743 commented on May 24, 2024

It does not matter that boto3 is thread safe and that the cursor object is thread safe.
If you do not acquire a lock, you should not share the cursor object in multithreaded.

Since the original implementation is JDBC version, accessing the cursor object with multithreading will cause the JVM to crash, so I am implementing it with multithreaded access control.

Perhaps cursor objects will behave strangely unless you acquire locks in Boto3 too.

I think that you should use AsyncCursor.
https://github.com/laughingman7743/PyAthena#asynchronouscursor

from pyathena.

laughingman7743 avatar laughingman7743 commented on May 24, 2024

I see, the problem is that the Synchronized decorator takes a global lock, so it is not possible to send multiple requests to Athena with multiple cursor objects at the same time.

I'll reopen this issue for now.

from pyathena.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.