Comments (8)
@mckeown12 Have a look at the examples here; a typical example is:
CREATE TABLE new_table
WITH (
    format = 'PARQUET',
    external_location = 's3://my_bucket/',
    partitioned_by = ARRAY['key1']
) AS
SELECT ...
I might need to do something in this direction myself. Maybe we can add it as a feature to this project.
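The CTAS statement above can be generated programmatically before handing it to a pyathena cursor. A minimal sketch, assuming a hypothetical helper (`build_ctas` is not part of pyathena; the table and bucket names are placeholders):

```python
def build_ctas(select_sql, table, s3_location, partition_keys=None, fmt="PARQUET"):
    """Wrap a SELECT statement in an Athena CTAS query (hypothetical helper)."""
    props = [f"format = '{fmt}'", f"external_location = '{s3_location}'"]
    if partition_keys:
        keys = ", ".join(f"'{k}'" for k in partition_keys)
        props.append(f"partitioned_by = ARRAY[{keys}]")
    return (
        f"CREATE TABLE {table}\n"
        "WITH (\n    " + ",\n    ".join(props) + "\n) AS\n"
        + select_sql
    )

sql = build_ctas("SELECT * FROM src", "new_table", "s3://my_bucket/", ["key1"])
print(sql)
```

The resulting string can be passed to `cursor.execute(...)` like any other query; Athena then writes the partitioned Parquet files to the given S3 location.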
from pyathena.
@rogerganga can you elaborate on that? I am running into the same issue as @mckeown12: I am using PandasCursor, and the memory usage keeps growing as it reads the result set from S3, until it raises a MemoryError. So your solution is to create a table in Athena by saving the result as parquet in S3, and then read it from there using Dask?
Hey @Lucas-Cerqueira. You are right (I am unsure about Dask). A SELECT query run through Athena will retrieve the data from S3 and store it via the pandas cursor on your local machine, so you can view it later in a pandas dataframe. This is not a good idea for large datasets, as your local machine might not have enough memory/storage to hold all the data.
On the other hand, a CTAS query runs the operation in the cloud and stores the result in S3, where it can later be queried via Athena. No data is stored on your local machine.
You can check @rpanai's code above or click this link to learn more about how to create a CTAS query :)
I think it would be nice to create a DaskCursor, similar to PandasCursor: a cursor that returns a Dask DataFrame instead of a pandas DataFrame.
https://docs.dask.org/en/latest/dataframe.html
Since I have not set up any other forum, discussing this in a GitHub issue is no problem.
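A rough sketch of what such a cursor could look like (this class is not part of pyathena; it assumes the wrapped cursor exposes `execute` and an `output_location` pointing at the CSV Athena wrote to S3, and defers the Dask import so it is only needed when actually used):

```python
class DaskCursorSketch:
    """Hypothetical DaskCursor: run the query through an ordinary pyathena
    cursor, then point Dask at the result file in S3 instead of pulling
    all rows into local memory."""

    def __init__(self, cursor):
        self._cursor = cursor  # an ordinary pyathena cursor (assumed API)

    def execute(self, sql):
        self._cursor.execute(sql)
        return self

    @property
    def output_location(self):
        # Assumes the wrapped cursor exposes the S3 path of the result file.
        return self._cursor.output_location

    def as_dask(self):
        import dask.dataframe as dd  # deferred: no hard dependency on Dask
        return dd.read_csv(self.output_location)
```

The point of the design is that only the S3 path crosses into the client; the actual data stays remote until Dask lazily reads it.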
@mckeown12 Isn't it better to save the query results to parquet in S3 (possibly nicely partitioned) and then read them from S3 with Dask?
@rpanai I was unaware that there was a way to save query results as parquet, or partition them. Do you have a reference/code example? If that is possible, I would be very interested!
I work on large datasets with Athena. CTAS (CREATE TABLE AS SELECT) queries are the way to go, as you can store the results in S3 in the right format.
There were dependency conflicts in my environment that did not allow me to use Dask. Hence, as an additional note, it is possible to use a combination of CTAS, AVRO, and boto:
- Embed the SELECT query in a CTAS query using AVRO format.
- Retrieve the temporary Athena table S3 location.
- Download all the AVRO partitions in parallel (boto).
- Read each partition sequentially (fastavro) and trigger the garbage collector.
- Use pandas to concatenate each read partition to a CSV.
This process is meant to optimise for memory but not storage, and the output goes directly to a CSV. It was not necessary to load the whole result set in memory to perform additional calculations.
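The last two steps can be sketched as follows. This is a simplified stand-in, not the author's actual code: the AVRO/fastavro reading is replaced by a generator yielding small DataFrames, and the file name is a placeholder. It shows the memory-bounded pattern of appending one partition at a time to a CSV (header written only once) and collecting garbage between partitions:

```python
import gc
import pandas as pd

def partitions():
    """Stand-in for reading each downloaded AVRO file with fastavro;
    here we just yield small DataFrames."""
    for i in range(3):
        yield pd.DataFrame({"id": [i * 2, i * 2 + 1], "part": i})

out = "result.csv"
first = True
for df in partitions():
    # Append each partition; write the header only once, then free the frame.
    df.to_csv(out, mode="w" if first else "a", header=first, index=False)
    first = False
    del df
    gc.collect()

print(pd.read_csv(out).shape)  # (6, 2): 6 rows across 3 partitions, 2 columns
```

Peak memory stays at roughly one partition, regardless of the total result size, which is the behaviour the comment describes.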
I found this process interesting because, for a 25 GB CSV, the resulting memory utilisation looked as follows: [memory-usage chart not reproduced here]
Why AVRO?
Parquet does not accept empty columns, so a result set with an empty column will trigger an exception when executing the CTAS. AVRO does not have such a limitation.
Why not use the S3 stream with fastavro?
The connection could fail, and recovering the state is another problem to solve.
References:
- https://skeptric.com/export-athena/
- https://stackoverflow.com/a/49214514/1705127
- https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3