Comments (11)

hakenmt commented on July 22, 2024 3

I have this problem as well. My source bucket has path-style partitions, and I want to keep using those properties as partitions after an ApplyMapping, but I end up with __HIVE_DEFAULT_PARTITION__ as the value in the S3 path.

from aws-glue-samples.

rkarato commented on July 22, 2024 3

Guys, have you tried with this one https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? Section "Writing out partitioned data" seems to have what you're looking for - I haven't tried it myself yet, though

I got the partitions to be created successfully using:

glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path/", "partitionKeys": ["year","month","day"]}, format = "parquet")
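
For anyone reusing this, the call can be wrapped in a small helper. This is only a sketch, not code from the blog post: the names build_s3_sink_options and write_partitioned are hypothetical, and glue_context is assumed to be an already-initialized GlueContext inside a Glue job. The partition columns still have to be present in the frame being written.

```python
def build_s3_sink_options(path, partition_keys):
    """Build the connection_options dict for a partitioned S3 write."""
    if not partition_keys:
        raise ValueError("partition_keys must name at least one column")
    return {"path": path, "partitionKeys": list(partition_keys)}


def write_partitioned(glue_context, frame, path, partition_keys, fmt="parquet"):
    """Write `frame` to S3 under key=value directories.

    Hypothetical helper: `glue_context` is assumed to be an existing
    awsglue GlueContext; this only forwards to from_options.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options=build_s3_sink_options(path, partition_keys),
        format=fmt,
    )
```

Inside a job this would be called as write_partitioned(glueContext, dropnullfields3, "s3://path/", ["year", "month", "day"]).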

wilywily commented on July 22, 2024 1

Could you try modifying two parts of the default ETL script?
1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})

The first change makes the DynamicFrame carry the partition column through the mapping.
The second uses an option that was newly added to the AWS documentation:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
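
The first step can be sketched in plain Python. This is a rough illustration under assumptions: add_partition_mappings is a hypothetical helper, the "id" column is made up, and the partition columns are assumed to be named partition_0 and partition_1 as in the steps above. The resulting list would then be passed as the mappings argument of ApplyMapping.apply.

```python
def add_partition_mappings(mappings, partition_columns, col_type="string"):
    """Return a new mapping list with a pass-through entry per partition column.

    Each ApplyMapping entry is (source_name, source_type, target_name, target_type).
    Columns that are already mapped are left untouched.
    """
    already_mapped = {source for source, _, _, _ in mappings}
    extra = [
        (col, col_type, col, col_type)
        for col in partition_columns
        if col not in already_mapped
    ]
    return list(mappings) + extra


# Hypothetical generated mapping with a single data column.
mappings = [("id", "long", "id", "long")]
mappings = add_partition_mappings(mappings, ["partition_0", "partition_1"])
```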

dogenius01 commented on July 22, 2024

I have the same situation, but I want Parquet output.
My Glue job does not recognize the partitioned table.
It shows all columns except the partitions.
Did you find a solution?

mohitsax commented on July 22, 2024

For a more detailed example of how to work with Hive-style S3 partitions in AWS Glue, please refer to:

https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

ujjwalit commented on July 22, 2024

@wilywily I am having the same problem. I tried the steps mentioned above: I added the partition column names to the mapping and to partitionKeys. But when I run the job, it stays in the running state for a long time. Without partitions the job completes in 2-3 minutes; with partitions it has been running for around 30 minutes without giving any error or log output.

nicolasdij commented on July 22, 2024

Guys, have you tried with this one https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? Section "Writing out partitioned data" seems to have what you're looking for - I haven't tried it myself yet, though

aahmed-vzw commented on July 22, 2024

I tried the above link, but it's not generating the partitions correctly. What I got:

year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__/day=__HIVE_DEFAULT_PARTITION__/part-00069-5f49a46d-cc10-424c-a95a-a4a40ad3252a.c000.snappy.orc

Before pushing the data to the sink, I used the printSchema and show methods to make sure the data was coming through properly from the data store.

I tried two approaches to write data into partitions.

  • glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "<S3_PATH>", "partitionKeys": ["year","month","day"]}"""), transformationContext = "datasink4", format = "orc").writeDynamicFrame(dropnullfields3)

The second approach was converting the DynamicFrame into a DataFrame:

val df = applymapping1.toDF()
df.write.mode(SaveMode.Append).format("orc").partitionBy("year", "month", "day").save("<S3_PATH>")
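
A note on the symptom: Spark uses __HIVE_DEFAULT_PARTITION__ as the directory value when a partition column is null for a row, so a path like the one above suggests that year, month and day were null in the frame at write time. The pure-Python sketch below only mimics that naming rule to illustrate it; it is not Spark's or Glue's implementation, and partition_path and missing_partition_keys are hypothetical names.

```python
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_path(record, partition_keys):
    """Mimic a Hive-style key=value path for one record (illustration only)."""
    parts = []
    for key in partition_keys:
        value = record.get(key)
        # A missing or None value falls back to the default partition name.
        parts.append(f"{key}={DEFAULT_PARTITION if value is None else value}")
    return "/".join(parts)


def missing_partition_keys(records, partition_keys):
    """Return the keys that are None or absent in at least one record."""
    return [k for k in partition_keys if any(r.get(k) is None for r in records)]
```

Running a check like missing_partition_keys on a few sampled rows (for example, rows collected from the frame just before the sink) would show whether the partition columns actually survived the earlier transforms.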

adarmanto commented on July 22, 2024

I had the same issue and fixed it by changing the type of the partition field to string.
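
One way this could look, assuming the type change is made in the ApplyMapping step (the comment does not say where it was done): rewrite the target type of each partition column to "string" before passing the list to ApplyMapping.apply. The helper name partition_targets_as_string and the sample columns are hypothetical.

```python
def partition_targets_as_string(mappings, partition_columns):
    """Rewrite ApplyMapping entries so partition columns are output as strings.

    Each entry is (source_name, source_type, target_name, target_type);
    only the target type of the listed partition columns is changed.
    """
    wanted = set(partition_columns)
    return [
        (src, src_type, dst, "string") if dst in wanted else (src, src_type, dst, dst_type)
        for src, src_type, dst, dst_type in mappings
    ]


# Hypothetical mapping where the partition column was inferred as int.
original = [("year", "int", "year", "int"), ("id", "long", "id", "long")]
fixed = partition_targets_as_string(original, ["year"])
```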

moomindani commented on July 22, 2024

As already explained, write_dynamic_frame.from_options has a partitionKeys option that keeps the partition structure at the output location.
See the details in this blog post: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Resolving.

ffgarciam commented on July 22, 2024

Could you try modifying two parts of the default ETL script?
1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})

The first change makes the DynamicFrame carry the partition column through the mapping. The second uses an option that was newly added to the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

Your post saved my day. Thanks.

ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node,
    mappings=[
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
....
