
Comments (8)

rpanai commented on May 24, 2024

@mckeown12 Have a look at the examples here

A typical example is:

CREATE TABLE new_table
WITH (
 format='PARQUET',
 external_location='s3://my_bucket/',
 partitioned_by = ARRAY['key1']
) AS
SELECT ...

I might need to do something in this direction. Maybe we can add it as a feature to this project.
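As a concrete sketch, the CTAS above can be composed and issued from Python through pyathena's connect/cursor API. The bucket, region, and source query below are placeholders, and build_ctas is a helper of my own, not part of pyathena:

```python
import textwrap

def build_ctas(table, location, partition_keys, select_sql):
    """Compose a CTAS statement that writes partitioned Parquet to S3."""
    keys = ", ".join(f"'{k}'" for k in partition_keys)
    return textwrap.dedent(f"""\
        CREATE TABLE {table}
        WITH (
          format='PARQUET',
          external_location='{location}',
          partitioned_by = ARRAY[{keys}]
        ) AS
        {select_sql}""")

def run_ctas(sql):
    """Execute the statement with pyathena (needs AWS credentials)."""
    from pyathena import connect  # third-party: pyathena
    cursor = connect(
        s3_staging_dir="s3://my_bucket/staging/",  # placeholder bucket
        region_name="us-east-1",                   # placeholder region
    ).cursor()
    cursor.execute(sql)
```

run_ctas(build_ctas("new_table", "s3://my_bucket/", ["key1"], "SELECT ...")) would then reproduce the statement above.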

from pyathena.

rogerganga commented on May 24, 2024

> @rogerganga can you elaborate more on that? I am running into the same issue as @mckeown12 because I am using PandasCursor, and the memory usage keeps growing as it reads the result set from S3 until it raises a MemoryError. So your solution is to create a table in Athena by saving the result as Parquet in S3 and then read it from there using Dask?

Hey @Lucas-Cerqueira. You are right (I am unsure about Dask). A SELECT query in Athena will retrieve the data from S3 and store it via the pandas cursor on your local machine so it can be viewed later as a pandas DataFrame. This is not a good idea for large datasets, as your local machine might not have enough memory/storage to hold all of it.

On the other hand, running a CTAS query will run the query operation in the cloud and store the result in S3, where it can later be viewed via Athena. No data is stored on your local machine.

You can check @rpanai's code above or click this link to learn more about creating a CTAS query :)


laughingman7743 commented on May 24, 2024

I think it would be nice to create a DaskCursor similar to PandasCursor:
a cursor that returns a Dask DataFrame instead of a pandas DataFrame.
https://docs.dask.org/en/latest/dataframe.html
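A minimal sketch of that idea: execute the query with a regular pyathena cursor, then hand the result CSV to Dask instead of pandas. Athena writes each result to the staging directory as <query_id>.csv; the helper names below are mine, and I am assuming the cursor exposes the query execution id as query_id:

```python
def result_csv_uri(staging_dir, query_id):
    """S3 URI of the result CSV Athena writes for a query execution id."""
    return f"{staging_dir.rstrip('/')}/{query_id}.csv"

def read_sql_dask(sql, staging_dir, region):
    """Run a query through pyathena and load the result lazily with Dask."""
    import dask.dataframe as dd   # third-party: dask
    from pyathena import connect  # third-party: pyathena

    cursor = connect(s3_staging_dir=staging_dir, region_name=region).cursor()
    cursor.execute(sql)
    # Nothing is pulled into local memory here; Dask reads the CSV from S3
    # in chunks only when the resulting dataframe is actually computed.
    return dd.read_csv(result_csv_uri(staging_dir, cursor.query_id))
```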

Since I have not set up any other forum, discussing this in a GitHub issue is no problem.


rpanai commented on May 24, 2024

@mckeown12 Isn't it better to save the query results as Parquet in S3 (ideally nicely partitioned) and then read them from S3 with Dask?
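Reading the partitioned output back might then look like the sketch below; the bucket and partition column are the placeholders from the CTAS example above, and partition_prefix is a hypothetical helper:

```python
def partition_prefix(location, key, value):
    """S3 prefix of one Hive-style partition, e.g. s3://bucket/key1=a/."""
    return f"{location.rstrip('/')}/{key}={value}/"

def read_partitioned(location, key=None, value=None):
    """Load the whole table, or a single partition, lazily with Dask."""
    import dask.dataframe as dd  # third-party: dask
    if key is not None:
        location = partition_prefix(location, key, value)
    # Dask discovers the key1=... directories and reads them lazily, so
    # nothing larger than one partition must fit in memory at once.
    return dd.read_parquet(location)
```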


mckeown12 commented on May 24, 2024

@rpanai I was unaware that there was a way to save query results as Parquet, or to partition them. Do you have a reference/code example? If that is possible, I would be very interested!


rogerganga commented on May 24, 2024

I work with large datasets in Athena. CTAS (CREATE TABLE AS SELECT) queries are the way to go, as you can store the results in S3 in the right format.


Lucas-Cerqueira commented on May 24, 2024

@rogerganga can you elaborate more on that? I am running into the same issue as @mckeown12 because I am using PandasCursor, and the memory usage keeps growing as it reads the result set from S3 until it raises a MemoryError. So your solution is to create a table in Athena by saving the result as Parquet in S3 and then read it from there using Dask?


ericson-cepeda commented on May 24, 2024

There were dependency conflicts in my environment that did not allow me to use Dask. Hence, as an additional note, it is possible to use a combination of CTAS, AVRO, and boto:

  1. Embed the SELECT query in a CTAS query using AVRO format.
  2. Retrieve the temporary Athena table S3 location.
  3. Download all the AVRO partitions in parallel (boto).
  4. Read each partition sequentially (fastavro) and trigger the garbage collector.
  5. Use pandas to append each decoded partition to a CSV.

This process is meant to optimise for memory, not storage, and the output goes directly to a CSV. It was not necessary to load the whole result set into memory to perform additional calculations.
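The steps above could be sketched roughly as follows. Bucket, paths, and helper names are placeholders; S3 listing pagination and the parallel download of step 3 are left out for brevity:

```python
import gc
import os
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

def avro_partitions_to_csv(table_location, out_csv):
    """Download each AVRO partition, append its rows to a CSV, free memory."""
    import boto3     # third-party: AWS SDK
    import fastavro  # third-party: AVRO reader
    import pandas as pd

    bucket, prefix = split_s3_uri(table_location)
    s3 = boto3.client("s3")
    first = True
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        local = os.path.basename(obj["Key"])
        s3.download_file(bucket, obj["Key"], local)
        with open(local, "rb") as fh:
            records = list(fastavro.reader(fh))  # step 4: read one partition
        pd.DataFrame.from_records(records).to_csv(
            out_csv, mode="w" if first else "a", header=first, index=False
        )
        first = False
        del records
        gc.collect()      # step 4: trigger the garbage collector explicitly
        os.remove(local)  # keep local storage bounded as well
```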

I found this process interesting because, for a 25 GB CSV, the memory utilisation over time looked as follows:
[figure: memory utilisation stays roughly flat while the 25 GB CSV is produced]

Why AVRO?
Parquet does not accept empty columns, so a result set with an empty column will trigger an exception when executing the CTAS. AVRO does not have this limitation.

Why not use the S3 stream with fastavro?
The connection could fail, and recovering the state is another problem to solve.

References:
https://skeptric.com/export-athena/
https://stackoverflow.com/a/49214514/1705127
https://towardsdatascience.com/python-garbage-collection-article-4a530b0992e3

