Comments (3)
I stumbled across this issue several times with PyAthena and I believe it should not have been closed.
No offense, but I suspect that there is a fundamental misunderstanding here:
If you do not acquire a lock, you should not share the cursor object in multithreaded.
However, @mateush states:
[...] despite the fact that they have separate Cursor instances
The thing is that the lock of synchronized()
is shared globally across all cursor instances, so only one execute()
or any other decorated method can be called at a time in a single Python process, independent from cursor instance. Here is an almost minimal example using your decorator:
import functools
import threading
import time
from multiprocessing.pool import ThreadPool
from time import sleep
def synchronized(wrapped):
_lock = threading.RLock()
@functools.wraps(wrapped)
def _wrapper(*args, **kwargs):
with _lock:
return wrapped(*args, **kwargs)
# Added for demonstration purposes. Does not change runtime behavior
_wrapper.__lock = _lock
return _wrapper
class A:
@synchronized
def do_stuff(self, n):
print(time.time(), 'starting to do stuff', n)
sleep(1)
print(time.time(), 'finished doing stuff', n)
@synchronized
def do_other_stuff(self):
pass
def do_a_stuff(n):
A().do_stuff(n)
def demo_concurrency():
p = ThreadPool(processes=10)
p.map(do_a_stuff, range(10))
def demo_references():
a1 = A()
a2 = A()
print('id a1 do_stuff', id(a1.do_stuff.__lock))
print('id a2 do_stuff', id(a2.do_stuff.__lock))
print('id a1 do_other_stuff', id(a1.do_other_stuff.__lock))
print('id a2 do_other_stuff', id(a2.do_other_stuff.__lock))
def run_all():
demo_references()
print('---')
demo_concurrency()
if __name__ == '__main__':
run_all()
Output:
id a1 do_stuff 140487096646080
id a2 do_stuff 140487096646080
id a1 do_other_stuff 140487096535360
id a2 do_other_stuff 140487096535360
---
1668505773.485564 starting to do stuff 0
1668505774.4865115 finished doing stuff 0
1668505774.486622 starting to do stuff 1
1668505775.4867237 finished doing stuff 1
1668505775.4868734 starting to do stuff 2
1668505776.4879422 finished doing stuff 2
1668505776.4880366 starting to do stuff 3
1668505777.4890823 finished doing stuff 3
1668505777.4891577 starting to do stuff 4
1668505778.4903514 finished doing stuff 4
1668505778.4906769 starting to do stuff 5
1668505779.4918444 finished doing stuff 5
1668505779.4919217 starting to do stuff 6
1668505780.4932797 finished doing stuff 6
1668505780.493365 starting to do stuff 7
1668505781.49372 finished doing stuff 7
1668505781.4939387 starting to do stuff 8
1668505782.4944823 finished doing stuff 8
1668505782.4945972 starting to do stuff 9
1668505783.4958422 finished doing stuff 9
Although the example code is creating a new instance of A
with every call to do_a_stuff()
in threads, the method lock is global and each new start of method waits for the previous run to finish.
Moreover, we see that the code creates a lock per method, not per class or object, which means that while only one instance of execute()
can be called at any time on any object on any thread in a process, execute()
and any other method decorated could be called in parallel. Normally, this is what we would want to prevent when locking an object. I can not easily make up an example where this could go wrong, but classic example is when one thread in one method changes some internal variable of the object while another thread calls another method on the same object which modifies the same internal variables.
On top of that, making cursors thread safe barely makes sense at all. In DBAPI2.0, you call execute()
on a cursor and then reuse the same cursor instance to fetch results via i.e. fetchall()
. Locking methods does not prevent:
- Thread1 calls
execute()
- Thread2 is queuing up
execute()
with different query - Thread1 finishes, Thread2 runs its execute
Now, if we were using a lock per cursor object:
- Thread1 needs to wait for Thread2's
execute()
- Thread2's
execute()
finishes - Thread1 fetches Thread2's results, not its own
In PyAthena's case, since different methods lock on different locks, anything wild could happen.
Please also refer to psycopg2's explanation of thread safety, which is IMO, as often with Postgres, from a software engineering perspective the most desirable.
Proposal
- Remove
synchronized
decorator completely - Document thread safety like Postgres does
- Optional: Make sure that connections actually are thread safe
I am planning to open a merge request that fixes at least the first two, if not all three points, soon.
Mitigation
Everyone stumbling upon this can temporarily fix it by running following code before your first import of pyathena, i.e. in your program's root module's __init__.py
or before main
's imports:
import pyathena.util
pyathena.util.synchronized = lambda x: x
It effectively makes the synchronisation decorator a no-op. Python caches the imported module and will re-use the monkey patched version.
from pyathena.
It does not matter that boto3 is thread safe and that the cursor object is thread safe.
If you do not acquire a lock, you should not share the cursor object in multithreaded.
Since the original implementation is JDBC version, accessing the cursor object with multithreading will cause the JVM to crash, so I am implementing it with multithreaded access control.
Perhaps cursor objects will behave strangely unless you acquire locks in Boto3 too.
I think that you should use AsyncCursor.
https://github.com/laughingman7743/PyAthena#asynchronouscursor
from pyathena.
I see, the problem is that the Synchronized decorator takes a global lock, so it is not possible to send multiple requests to Athena with multiple cursor objects at the same time.
I'll reopen this issue for now.
from pyathena.
Related Issues (20)
- Support for Python 3.12
- Implement all fsspec specs in the s3 file system HOT 2
- Mypy Error When using Connection.cursor method to instantiate cursor HOT 2
- Add custom filesystem object to arrow engine HOT 2
- Compatibility issue with SQLAlchemy<1.4 HOT 2
- `UUID` in a query gets garbled HOT 3
- Add support for Spark calculations HOT 8
- Add Endpoint_URL param to SQLAlchemy HOT 2
- SQLAlchemy dialect uses deprecated dbapi() method HOT 1
- Create documents in Sphinx and publish them on GitHub Pages HOT 1
- Breaking change in the release between 3.0.10 and 3.1.0 HOT 6
- Okta authentication support HOT 1
- Integer variant types incorrectly rendered in DDL HOT 5
- Support for Iceberg FOR SYSTEM_VERSION AS OF HOT 7
- AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached HOT 2
- FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas.
- Latest PyAthena no longer compatible with SQLAlchemy 1.4
- [Warning] SADeprecationWarning: The dbapi() classmethod on dialect classes has been renamed to import_dbapi(). HOT 1
- pyathena.error.DatabaseError: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:3242: mismatched input 'OFFSET'. Expecting: <EOF> HOT 2
- If a value for a partition key is None, to_sql doesn't warn you and no data is written
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pyathena.