I have original files in S3 with folder structure like: /data/year=2017/month=1/da

How to keep the partition structure of the folder after ETL? about aws-glue-samples HOT 11 CLOSED

aws-samples commented on July 22, 2024 7

How to keep the partition structure of the folder after ETL?

from aws-glue-samples.

Comments (11)

hakenmt commented on July 22, 2024 3

I have this problem as well. My source bucket has path-style partitions, and I want to keep using those properties as partitions after an ApplyMapping, but I end up with __HIVE_DEFAULT_PARTITION__ as the value in the S3 path.
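To illustrate the symptom, here is a minimal plain-Python sketch of how a Hive-style writer turns partition column values into an output prefix. It is not Glue's or Spark's actual implementation, and `partition_prefix` is a hypothetical helper. It only models the fact that Spark writes `__HIVE_DEFAULT_PARTITION__` when a partition column is null, which is presumably what happens when the column is missing from the frame at write time.

```python
# Illustrative only: real writers also escape special characters in values.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_prefix(record, partition_keys):
    """Build 'key=value/...' for one record.

    A missing or None value falls back to the Hive default partition name.
    """
    parts = []
    for key in partition_keys:
        value = record.get(key)
        shown = HIVE_DEFAULT_PARTITION if value is None else value
        parts.append(f"{key}={shown}")
    return "/".join(parts)


# A record that still carries its partition columns.
kept = {"id": 1, "year": "2017", "month": "1", "day": "5"}
# The same record after the partition columns were dropped by a mapping.
dropped = {"id": 1}

print(partition_prefix(kept, ["year", "month", "day"]))
print(partition_prefix(dropped, ["year", "month", "day"]))
```

The second call yields the default partition name at every level, matching the paths reported in this thread.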

rkarato commented on July 22, 2024 3

Guys, have you tried with this one https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? Section "Writing out partitioned data" seems to have what you're looking for - I haven't tried it myself yet, though

I got the partitions to be created successfully using:

glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path/", "partitionKeys": ["year","month","day"]}, format = "parquet")
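The same call can be wrapped in a small function so the partition keys are passed in one place. This is a sketch, assuming it runs inside a Glue job where a GlueContext and a DynamicFrame already exist; `write_partitioned` is a hypothetical helper name, not part of the Glue library.

```python
def write_partitioned(glue_context, frame, path, partition_keys, fmt="parquet"):
    """Write `frame` under `path`, one key=value folder level per partition key.

    `glue_context` is expected to be an awsglue GlueContext and `frame` a
    DynamicFrame; both are created elsewhere in the job script.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={
            "path": path,
            # Columns listed here become folder levels in the output path.
            "partitionKeys": list(partition_keys),
        },
        format=fmt,
    )
```

In the job above it would be called as write_partitioned(glueContext, dropnullfields3, "s3://path/", ["year", "month", "day"]).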

wilywily commented on July 22, 2024 1

Could you try modifying two parts of the default ETL script?
1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})

The first change makes the DynamicFrame carry the partition columns through the mapping.
The second is a newly added option described in the AWS documentation:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
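The first step can be sketched as a small helper that appends pass-through entries for the partition columns to an existing mapping list. This is plain Python operating on the (source, source type, target, target type) tuples that ApplyMapping takes; `with_partition_mappings` and the `id` column are hypothetical names used only for illustration.

```python
def with_partition_mappings(mappings, partition_columns, col_type="string"):
    """Return a new mapping list that also passes each partition column through.

    Columns that already appear as a mapping source are left untouched.
    """
    mapped_sources = {source for source, _, _, _ in mappings}
    extra = [
        (col, col_type, col, col_type)
        for col in partition_columns
        if col not in mapped_sources
    ]
    return list(mappings) + extra


# Hypothetical data column, followed by the crawler-named partition columns.
mappings = [("id", "long", "id", "long")]
mappings = with_partition_mappings(mappings, ["partition_0", "partition_1"])
print(mappings)
```

The resulting list can be passed as the mappings argument of ApplyMapping.apply, and the same column names then go into partitionKeys for the write.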

dogenius01 commented on July 22, 2024

I have the same situation, but I want Parquet output.
My Glue job does not recognize the partitioned table.
It shows all columns except the partition columns.
Did you find a solution?

mohitsax commented on July 22, 2024

For a more detailed example of how to work with Hive-style S3 partitions in AWS Glue, please refer to:

https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

ujjwalit commented on July 22, 2024

@wilywily I am also having the same problem. I tried the steps you mentioned: I added the partition column names to the mapping and to partitionKeys. When I run the job, however, it stays in the running state for a long time. Without partitions the job completes in 2-3 minutes, but with partitions it has been running for around 30 minutes without producing any error or log output.

nicolasdij commented on July 22, 2024

Guys, have you tried with this one https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? Section "Writing out partitioned data" seems to have what you're looking for - I haven't tried it myself yet, though

aahmed-vzw commented on July 22, 2024

I tried the link above, but it is not generating the partitions correctly. This is what I got:

year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__/day=__HIVE_DEFAULT_PARTITION__/part-00069-5f49a46d-cc10-424c-a95a-a4a40ad3252a.c000.snappy.orc

Before pushing the data to the sink, I used the printSchema and show methods to make sure the data was coming through properly from the data store.

I tried two approaches to write the data into partitions. The first uses the Glue sink directly:

glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "<S3_PATH>", "partitionKeys": ["year","month","day"]}"""), transformationContext = "datasink4", format = "orc").writeDynamicFrame(dropnullfields3)

The second converts the DynamicFrame into a DataFrame:

val df = applymapping1.toDF()
df.write.mode(SaveMode.Append).format("orc").partitionBy("year", "month", "day").save("<S3_PATH>")

adarmanto commented on July 22, 2024

I had the same issue and fixed it by changing the type of the partition fields to string.
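One way to apply that fix, sketched in plain Python: rewrite the ApplyMapping tuples so that every partition column has "string" as its target type. `partitions_as_string` is a hypothetical helper, and the source types shown ("int", "long") are assumptions about what a crawler might have inferred, not values taken from this thread.

```python
def partitions_as_string(mappings, partition_columns):
    """Set the target type of each partition column to 'string'.

    Each mapping is (source, source type, target, target type); only the
    target type of partition columns is changed.
    """
    wanted = set(partition_columns)
    return [
        (src, src_type, dst, "string" if dst in wanted else dst_type)
        for src, src_type, dst, dst_type in mappings
    ]


# Assumed example: the partition columns were inferred as integers.
original = [
    ("id", "long", "id", "long"),
    ("year", "int", "year", "int"),
    ("month", "int", "month", "int"),
    ("day", "int", "day", "int"),
]
fixed = partitions_as_string(original, ["year", "month", "day"])
print(fixed)
```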

moomindani commented on July 22, 2024

As already explained, write_dynamic_frame.from_options has a partitionKeys option that keeps the partition structure at the output location.
See the details in this blog post: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Resolving.

ffgarciam commented on July 22, 2024

Could you try modifying two parts of the default ETL script? 1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the column mappings. 2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})

The first change makes the DynamicFrame carry the partition columns through the mapping. The second is a newly added option described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

Your post saved my day. Thanks.

ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node,
    mappings=[
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
....
