Comments (11)
I have this problem as well. My source bucket has path-style partitions, and I want to keep using those properties as partitions after an ApplyMapping, but I end up with __HIVE_DEFAULT_PARTITION__ as the value in the S3 path.
from aws-glue-samples.
Guys, have you tried this one: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/ ? The section "Writing out partitioned data" seems to have what you're looking for. I haven't tried it myself yet, though.
I got the partitions to be created successfully using:
glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://path/", "partitionKeys": ["year","month","day"]}, format = "parquet")
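For anyone copying the call above, here is a minimal sketch of the same write wrapped in a helper. The names `s3_sink_options` and `write_partitioned` are made up for illustration. `glue_context` is assumed to be the job's GlueContext and `frame` a DynamicFrame that still contains the year/month/day columns.

```python
def s3_sink_options(path, partition_keys):
    """Build the connection_options dict for a partitioned S3 write."""
    if not partition_keys:
        raise ValueError("partition_keys must name at least one column")
    return {"path": path, "partitionKeys": list(partition_keys)}


def write_partitioned(glue_context, frame, path, partition_keys, fmt="parquet"):
    """Write `frame` under `path`, one folder level per partition key.

    Assumption: `glue_context` is an awsglue.context.GlueContext and `frame`
    is a DynamicFrame. Nothing from awsglue is imported here, so the options
    can be inspected outside a Glue job.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options=s3_sink_options(path, partition_keys),
        format=fmt,
    )


# The options dict can be checked locally before running the job.
options = s3_sink_options("s3://path/", ["year", "month", "day"])
print(options)
```

Inside a job this would be called as `write_partitioned(glueContext, dropnullfields3, "s3://path/", ["year", "month", "day"])`.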
from aws-glue-samples.
Would you try modifying two parts of the default ETL script?
1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the mappings.
2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})
The first change makes the DynamicFrame carry the partition column through the mapping.
The second uses an option that was newly added to the AWS documentation:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
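Step 1 above can be sketched as a small helper that appends a pass-through entry for each partition column to an existing mappings list. The helper name `with_partition_mappings` and the `id` column are hypothetical. The tuple layout (source name, source type, target name, target type) is the one used in the comment above.

```python
def with_partition_mappings(mappings, partition_columns, type_name="string"):
    """Return `mappings` plus one pass-through entry per partition column.

    Columns that already appear as a mapping target are left untouched,
    so calling this twice does not add duplicates.
    """
    existing_targets = {target for (_, _, target, _) in mappings}
    extra = [
        (column, type_name, column, type_name)
        for column in partition_columns
        if column not in existing_targets
    ]
    return list(mappings) + extra


# Hypothetical data column, followed by the two partition columns.
base = [("id", "long", "id", "long")]
mappings = with_partition_mappings(base, ["partition_0", "partition_1"])
for entry in mappings:
    print(entry)
```

The resulting list is what would be passed as `mappings=` to ApplyMapping.apply, and the same column names then go into partitionKeys in step 2.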
from aws-glue-samples.
I have the same situation, but I want Parquet output.
My Glue job does not recognize the partitioned table.
It shows the columns but not the partitions.
Did you find a solution?
from aws-glue-samples.
For a more detailed example of how to work with Hive-style S3 partitions in AWS Glue, please refer to:
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
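As a quick illustration of what "Hive-style" means here: each folder level is written as name=value, and the names become the partition columns. The sketch below parses such an object key in plain Python. The function name and the example keys are made up. My understanding is that folders without the name= prefix are the ones that show up in the catalog as partition_0, partition_1 and so on.

```python
def parse_hive_partitions(key):
    """Extract name=value folder segments from an S3 object key."""
    partitions = {}
    for segment in key.split("/"):
        name, separator, value = segment.partition("=")
        if separator and name:
            partitions[name] = value
    return partitions


# Hypothetical keys: the first is Hive-style, the second is not.
hive_key = "year=2019/month=01/day=15/part-00000.parquet"
plain_key = "2019/01/15/part-00000.parquet"

print(parse_hive_partitions(hive_key))
print(parse_hive_partitions(plain_key))
```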
from aws-glue-samples.
@wilywily I am also having the same problem. I tried the steps mentioned in part 2 of the ETL script: I added the partition column names to the mapping and to partitionKeys. But when I run the job it stays in the running state for a long time. Without partitions the job completes in 2-3 minutes, but with partitions it has been running for around 30 minutes without giving any error or log.
from aws-glue-samples.
I tried the link above, but it's not generating the partitions correctly. What I got:
year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__/day=__HIVE_DEFAULT_PARTITION__/part-00069-5f49a46d-cc10-424c-a95a-a4a40ad3252a.c000.snappy.orc
Before pushing the data to the sink I used the printSchema and show methods just to make sure the data is coming through properly from the data store.
I tried two approaches to write the data into partitions. The first uses the Glue sink:
- glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "<S3_PATH>", "partitionKeys": ["year","month","day"]}"""), transformationContext = "datasink4", format = "orc").writeDynamicFrame(dropnullfields3)
The second converts the DynamicFrame into a DataFrame:
val df = applymapping1.toDF()
df.write.mode(SaveMode.Append).format("orc").partitionBy("year", "month", "day").save("<S3_PATH>")
from aws-glue-samples.
I had the same issue and fixed it by changing the type of the partition field to string.
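Some background that may explain why this helps: Spark writes __HIVE_DEFAULT_PARTITION__ as the folder value when a partition column is null in a row. So a path where every level is the default suggests the partition values were already null when the frame reached the sink. One possible cause (an assumption, not confirmed in this thread) is a type in the mapping that does not match the actual partition values. The plain-Python sketch below only imitates the path layout to show the effect. The function names and records are made up.

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_path(record, partition_keys):
    """Imitate the folder layout of a partitioned write for one record."""
    segments = []
    for key in partition_keys:
        value = record.get(key)
        shown = HIVE_DEFAULT_PARTITION if value is None else value
        segments.append(f"{key}={shown}")
    return "/".join(segments)


def null_partition_keys(records, partition_keys):
    """List the partition keys that are None in at least one record."""
    return [
        key
        for key in partition_keys
        if any(record.get(key) is None for record in records)
    ]


keys = ["year", "month", "day"]
good = {"year": "2019", "month": "01", "day": "15"}
bad = {"year": None, "month": None, "day": None}

print(partition_path(good, keys))
print(partition_path(bad, keys))
print(null_partition_keys([good, bad], keys))
```

In a real job the equivalent check would be to look for nulls in the partition columns after ApplyMapping and before the write.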
from aws-glue-samples.
As already explained, write_dynamic_frame.from_options has a partitionKeys option that keeps the partition structure at the output location.
See details in this blog post: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Resolving.
from aws-glue-samples.
Would you try modifying two parts of the default ETL script? 1. ApplyMapping.apply: add ("partition_0", "string", "partition_0", "string") to the mappings. 2. glueContext.write_dynamic_frame.from_options: add partitionKeys with your partition columns, such as (frame = dropnullfields3, connection_type = "s3", connection_options = {"path": $path, "partitionKeys": ["partition_0","partition_1"]})
The first change makes the DynamicFrame carry the partition column through the mapping. The second uses an option that was newly added to the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
Your post saved my day. Thanks.
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node,
    mappings=[
        ("partition_0", "string", "partition_0", "string"),
        ("partition_1", "string", "partition_1", "string"),
        ....
from aws-glue-samples.