
Comments (8)

rpanai commented on May 24, 2024

@mckeown12 Have a look at the examples here

A typical example is:

CREATE TABLE new_table
WITH (
 format='PARQUET',
 external_location='s3://my_bucket/',
 partitioned_by = ARRAY['key1']
) AS
SELECT ...

I might need to do something in this direction. Maybe we can add it as a feature to this project.
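As a concrete sketch, the CTAS above can be composed and issued from Python through pyathena's connect/cursor API. The bucket, region, and source query below are placeholders, and build_ctas is a helper of my own, not part of pyathena:

```python
import textwrap

def build_ctas(table, location, partition_keys, select_sql):
    """Compose a CTAS statement that writes partitioned Parquet to S3."""
    keys = ", ".join(f"'{k}'" for k in partition_keys)
    return textwrap.dedent(f"""\
        CREATE TABLE {table}
        WITH (
          format='PARQUET',
          external_location='{location}',
          partitioned_by = ARRAY[{keys}]
        ) AS
        {select_sql}""")

def run_ctas(sql):
    """Execute the statement with pyathena (needs AWS credentials)."""
    from pyathena import connect  # third-party: pyathena
    cursor = connect(
        s3_staging_dir="s3://my_bucket/staging/",  # placeholder bucket
        region_name="us-east-1",                   # placeholder region
    ).cursor()
    cursor.execute(sql)
```

run_ctas(build_ctas("new_table", "s3://my_bucket/", ["key1"], "SELECT ...")) would then reproduce the statement above.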

from pyathena.

rogerganga commented on May 24, 2024

> @rogerganga can you elaborate more on that? I am running into the same issue as @mckeown12 because I am using PandasCursor, and the memory usage keeps growing as it reads the result set from S3 until it raises a MemoryError. So your solution is to create a table in Athena by saving the result as Parquet in S3 and then read it from there using Dask?

Hey @Lucas-Cerqueira. You are right (I am unsure about Dask). A SELECT query in Athena will retrieve the data from S3 and store it via the pandas cursor on your local machine so it can be viewed later as a pandas DataFrame. This is not a good idea for large datasets, as your local machine might not have enough memory/storage to hold all of it.

On the other hand, running a CTAS query will run the query operation in the cloud and store the result in S3, where it can later be viewed via Athena. No data is stored on your local machine.

You can check @rpanai's code above or click this link to learn more about creating a CTAS query :)


laughingman7743 commented on May 24, 2024

I think it would be nice to create a DaskCursor similar to PandasCursor:
a cursor that returns a Dask DataFrame instead of a pandas DataFrame.
https://docs.dask.org/en/latest/dataframe.html
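A minimal sketch of that idea: execute the query with a regular pyathena cursor, then hand the result CSV to Dask instead of pandas. Athena writes each result to the staging directory as <query_id>.csv; the helper names below are mine, and I am assuming the cursor exposes the query execution id as query_id:

```python
def result_csv_uri(staging_dir, query_id):
    """S3 URI of the result CSV Athena writes for a query execution id."""
    return f"{staging_dir.rstrip('/')}/{query_id}.csv"

def read_sql_dask(sql, staging_dir, region):
    """Run a query through pyathena and load the result lazily with Dask."""
    import dask.dataframe as dd   # third-party: dask
    from pyathena import connect  # third-party: pyathena

    cursor = connect(s3_staging_dir=staging_dir, region_name=region).cursor()
    cursor.execute(sql)
    # Nothing is pulled into local memory here; Dask reads the CSV from S3
    # in chunks only when the resulting dataframe is actually computed.
    return dd.read_csv(result_csv_uri(staging_dir, cursor.query_id))
```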

Since I have not set up any other forum, discussing this in a GitHub issue is no problem.


rpanai commented on May 24, 2024

@mckeown12 Isn't it better to save the query results as Parquet in S3 (ideally nicely partitioned) and then read them from S3 with Dask?
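Reading the partitioned output back might then look like the sketch below; the bucket and partition column are the placeholders from the CTAS example above, and partition_prefix is a hypothetical helper:

```python
def partition_prefix(location, key, value):
    """S3 prefix of one Hive-style partition, e.g. s3://bucket/key1=a/."""
    return f"{location.rstrip('/')}/{key}={value}/"

def read_partitioned(location, key=None, value=None):
    """Load the whole table, or a single partition, lazily with Dask."""
    import dask.dataframe as dd  # third-party: dask
    if key is not None:
        location = partition_prefix(location, key, value)
    # Dask discovers the key1=... directories and reads them lazily, so
    # nothing larger than one partition must fit in memory at once.
    return dd.read_parquet(location)
```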


mckeown12 commented on May 24, 2024

@rpanai I was unaware that there was a way to save query results as Parquet, or to partition them. Do you have a reference/code example? If that is possible, I would be very interested!


rogerganga commented on May 24, 2024

I work with large datasets in Athena. CTAS (CREATE TABLE AS SELECT) queries are the way to go, as you can store the results in S3 in the right format.


Lucas-Cerqueira commented on May 24, 2024

@rogerganga can you elaborate more on that? I am running into the same issue as @mckeown12 because I am using PandasCursor, and the memory usage keeps growing as it reads the result set from S3 until it raises a MemoryError. So your solution is to create a table in Athena by saving the result as Parquet in S3 and then read it from there using Dask?


ericson-cepeda commented on May 24, 2024

There were dependency conflicts in my environment that did not allow me to use Dask. Hence, as an additional note, it is possible to use a combination of CTAS, AVRO, and boto:

  1. Embed the SELECT query in a CTAS query using AVRO format.
  2. Retrieve the temporary Athena table S3 location.
  3. Download all the AVRO partitions in parallel (boto).
  4. Read each partition sequentially (fastavro) and trigger the garbage collector.
  5. Use pandas to append each decoded partition to a CSV.

This process is meant to optimise for memory, not storage, and the output goes directly to a CSV. It was not necessary to load the whole result set into memory to perform additional calculations.
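The steps above could be sketched roughly as follows. Bucket, paths, and helper names are placeholders; S3 listing pagination and the parallel download of step 3 are left out for brevity:

```python
import gc
import os
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

def avro_partitions_to_csv(table_location, out_csv):
    """Download each AVRO partition, append its rows to a CSV, free memory."""
    import boto3     # third-party: AWS SDK
    import fastavro  # third-party: AVRO reader
    import pandas as pd

    bucket, prefix = split_s3_uri(table_location)
    s3 = boto3.client("s3")
    first = True
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        local = os.path.basename(obj["Key"])
        s3.download_file(bucket, obj["Key"], local)
        with open(local, "rb") as fh:
            records = list(fastavro.reader(fh))  # step 4: read one partition
        pd.DataFrame.from_records(records).to_csv(
            out_csv, mode="w" if first else "a", header=first, index=False
        )
        first = False
        del records
        gc.collect()      # step 4: trigger the garbage collector explicitly
        os.remove(local)  # keep local storage bounded as well
```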

I found this process interesting because, for a 25 GB CSV, the memory utilisation over time looked as follows:
[figure: memory utilisation stays roughly flat while the 25 GB CSV is produced]

Why AVRO?
Parquet does not accept empty columns, so a result set with an empty column will trigger an exception when executing the CTAS. AVRO does not have this limitation.

Why not use the S3 stream with fastavro?
The connection could fail, and recovering the state is another problem to solve.

References:
https://skeptric.com/export-athena/
https://stackoverflow.com/a/49214514/1705127
https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

