
aws-glue-samples's Introduction

AWS Glue Samples

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. This repository has samples that demonstrate various aspects of the AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Getting Started

Workshops

  • AWS Glue Learning Series

    In this comprehensive series, you'll learn everything from the basics of Glue to advanced optimization techniques.

Tutorials

General

Data migration

Open Table Format

Development, Test, and CI/CD

Cost and Performance

Glue for Ray

Glue Data Catalog

Glue Crawler

Glue Data Quality

Glue ETL Code Examples

You can run these sample job scripts on AWS Glue ETL jobs, in a container, or in a local environment.

  • Join and Relationalize Data in S3

    This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in Amazon S3 so that it can be queried and analyzed easily and efficiently (a minimal sketch of this pattern follows after this list).

  • Clean and Process

    This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.

  • The resolveChoice Method

    This sample explores all four ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method (see the sketch after this list).

  • Converting character encoding

    This sample ETL script shows you how to use an AWS Glue job to convert character encoding (see the sketch after this list).

  • Notebook using open data lake formats

    The sample notebook files show you how to use open data lake formats (Apache Hudi, Delta Lake, and Apache Iceberg) on AWS Glue Interactive Sessions and AWS Glue Studio Notebook.

  • Blueprint examples

    The sample Glue Blueprints show you how to implement blueprints addressing common ETL use cases. The samples are located in the aws-glue-blueprint-libs repository.
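
To make the first few of these concrete, here are minimal, hedged sketches of the patterns involved. They are illustrations only, not the samples themselves: the database, table, bucket, and column names are placeholders, and details may differ from the actual scripts.

Join and Relationalize, assuming a Data Catalog database named "legislators" populated by a crawler and a writable S3 bucket:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Join

glue_context = GlueContext(SparkContext.getOrCreate())

# Load two crawled tables from the Data Catalog as DynamicFrames.
persons = glue_context.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json")
memberships = glue_context.create_dynamic_frame.from_catalog(
    database="legislators", table_name="memberships_json")

# Join on the person id, then flatten nested and array fields into a
# collection of relational tables rooted at "l_history".
history = Join.apply(persons, memberships, "id", "person_id")
tables = history.relationalize("l_history", "s3://my-bucket/temp-dir/")

# Write each resulting table out to S3 as Parquet.
for name in tables.keys():
    glue_context.write_dynamic_frame.from_options(
        frame=tables.select(name),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output-dir/" + name},
        format="parquet")

The four resolveChoice strategies, assuming a DynamicFrame dyf whose "provider_id" column has a choice type (it appears as both long and string in the data):

# 1. cast: force every value to a single type.
casted = dyf.resolveChoice(specs=[("provider_id", "cast:long")])

# 2. make_cols: split into one column per observed type
#    (provider_id_long, provider_id_string).
split_cols = dyf.resolveChoice(specs=[("provider_id", "make_cols")])

# 3. make_struct: keep both values in a struct with one field per type.
as_struct = dyf.resolveChoice(specs=[("provider_id", "make_struct")])

# 4. project: keep only the values matching one type and drop the rest.
projected = dyf.resolveChoice(specs=[("provider_id", "project:string")])

Character-encoding conversion, shown here with plain Spark CSV options rather than whatever the sample script itself does; the paths and the source encoding (Shift_JIS) are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a CSV that was written in a non-UTF-8 encoding.
df = (spark.read
      .option("header", "true")
      .option("encoding", "Shift_JIS")
      .csv("s3://my-bucket/input-sjis/"))

# Spark writes text formats as UTF-8 by default, so writing the data
# back out converts the encoding.
df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/output-utf8/")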

Utilities

Glue Custom Connectors

AWS Glue provides built-in support for the most commonly used data stores, such as Amazon Redshift, MySQL, and MongoDB. With Glue Custom Connectors, you can subscribe to a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.


  • Development

    A development guide with examples of connectors with simple, intermediate, and advanced functionality. These examples demonstrate how to implement Glue Custom Connectors based on the Spark Data Source or Amazon Athena Federated Query interfaces and plug them into the Glue Spark runtime.

  • Local Validation Tests

    This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.

  • Validation

    This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.

  • Glue Spark Script Examples

    Example Python scripts that use Spark, Amazon Athena, and JDBC connectors with the Glue Spark runtime.

  • Create and Publish Glue Connector to AWS Marketplace

    If you would like to partner with us or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details about your connector.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

aws-glue-samples's People

Contributors

ben-bourdin451, boykorr, dangereis, dependabot[bot], fss18, gherreros, haroldhenry, hyandell, jinet, junoha, leejianwei, lmbo-2020, markatwood, mashah, mitczach, mohitsax, moomindani, pemmasanikrishna, rmattsampson, romnempire, satyapreddy, stewartsmith, sumitya, tomtongue, xy1m, yyolk, zhukovalexander


aws-glue-samples's Issues

Spark_UI docker container-Error: Could not find or load main class $SPARK_HISTORY_OPTS

1.docker build -t glue/sparkui:latest .
2.docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://test/sparkui -Dspark.hadoop.fs.s3a.access.key=xxxxxx -Dspark.hadoop.fs.s3a.secret.key=yyyyyyy" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

Hi, when I run the second step, it prompts "Could not find or load main class $SPARK_HISTORY_OPTS" and I have not found a way to solve it.

post glue catalog upgrade

Hi, I just upgraded to the Glue Data Catalog and I could not see any crawlers created on their own. Also, if I want to create a crawler to update an existing table in the catalog, I don't get the option to choose a table of an existing database that it created, and I don't want the crawler to create a new table every time. Any help is much appreciated! Thank you

Tables created using join_and_relationalize.md example not showing any data

As per the https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md example, I created a crawler with s3://awsglue-datasets/examples/us-legislators. The crawler ran successfully and created tables in the database. When I try to retrieve data from these tables using Athena, I see a "Zero records returned." message. More info: crawler name: aws-glue-samples-raja0505, region: us-east-2.

persons_json
memberships_json
organizations_json
events_json
areas_json
countries_r_json

AWS Glue error converting data frame to dynamic frame

Here's my code, where I am trying to create a new data frame out of the result set of my left join on two other data frames and then trying to convert it to a dynamic frame.

dfs = sqlContext.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "SELECT hashkey as hash From randomtable").load()

#Source
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "randomtable", transformation_ctx = "datasource0")

#add hash value
df = datasource0.toDF()
df.cache()
df = df.withColumn("hashkey", sha2(concat_ws("||", *df.columns), 256))

#drop dupes
df1 = df.dropDuplicates(subset=['hashkey'])

#read incremental data
inc = df1.join(dfs, df1["hashkey"] == dfs["hash"], how='left').filter(col('hash').isNull())

#convert it back to glue context
datasource1 = DynamicFrame.fromDF(inc, glueContext, "datasource1")

Here is the error I get when trying to convert a data frame to a dynamic frame.

datasource1 = DynamicFrame.fromDF(inc, glueContext, "datasource1")
File
"/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/PyGlue.zip/awsglue/dynamicframe.py",

line 150, in fromDF
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.AttributeReference.(Ljava/lang/String;Lorg/apache/spark/sql/types/DataType;ZLorg/apache/spark/sql/types/Metadata;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;)V
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper$$anonfun$8.apply(QueryHelper.scala:66)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper$$anonfun$8.apply(QueryHelper.scala:65)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper.(QueryHelper.scala:64)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.SourceQuery.(SnowflakeQuery.scala:100)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.net$snowflake$spark$snowflake$pushdowns$querygeneration$QueryBuilder$$generateQueries(QueryBuilder.scala:98)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.liftedTree1$1(QueryBuilder.scala:63)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.treeRoot$lzycompute(QueryBuilder.scala:61)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.treeRoot(QueryBuilder.scala:60)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.tryBuild$lzycompute(QueryBuilder.scala:34)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.tryBuild(QueryBuilder.scala:33)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder$.getRDDFromPlan(QueryBuilder.scala:179)
at net.snowflake.spark.snowflake.pushdowns.SnowflakeStrategy.buildQueryRDD(SnowflakeStrategy.scala:42)
at net.snowflake.spark.snowflake.pushdowns.SnowflakeStrategy.apply(SnowflakeStrategy.scala:24)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)

Any help is greatly appreciated.

Spark_UI docker container doesn't run

Following the steps at utilities/Spark_UI, when I run:

docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_my_eventlog_dir -Dspark.hadoop.fs.s3a.access.key=my_key_id -Dspark.hadoop.fs.s3a.secret.key=my_secret_access_key" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

The container quickly craps out with:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/27 21:16:44 INFO HistoryServer: Started daemon with process name: 1@4b35fc26fd5a
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for TERM
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for HUP
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for INT
20/01/27 21:16:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/27 21:16:45 INFO SecurityManager: Changing view acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing view acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/01/27 21:16:45 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
	at com.amazonaws.partitions.PartitionsLoader.<clinit>(PartitionsLoader.java:54)
	at com.amazonaws.regions.RegionMetadataFactory.create(RegionMetadataFactory.java:30)
	at com.amazonaws.regions.RegionUtils.initialize(RegionUtils.java:65)
	at com.amazonaws.regions.RegionUtils.getRegionMetadata(RegionUtils.java:53)
	at com.amazonaws.regions.RegionUtils.getRegion(RegionUtils.java:107)
	at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:4040)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5039)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1413)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1349)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
	... 6 more

Bad coding practices (Python)

This repository and awslabs are full of bad practices which people unfortunately copy because they use these repositories as examples.

Would you consider connecting this repository to some linters or codeac.io, to prevent spreading bad practices?

example:
https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.py#L5

from x import * is an anti-pattern which might lead to unexpected behaviour and also makes the code much harder to read, because it is unclear where the functions come from.

more info: https://www.flake8rules.com/rules/F405.html

"Namespaces are one honking great idea -- let's do more of those!" (Zen of Python)
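
As a concrete contrast for the wildcard-import point above, a small sketch (the chosen names are just examples of transforms that exist in awsglue.transforms, not necessarily what the sample uses):

# Instead of a wildcard import:
# from awsglue.transforms import *

# import only what the script actually uses, so readers can see where
# each transform comes from:
from awsglue.transforms import ApplyMapping, Join, ResolveChoice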

toDF() isn't working on the shell

I get the same error (see attached) when trying orgs.toDF().show() or memberships.select_fields(['organization_id']).toDF().distinct().show()

Missing DynamicFrame import

In examples/data_cleaning_and_lambda.md, section 5 (Lambda functions (aka Python UDFs) and ApplyMapping), there's a call to DynamicFrame.fromDF() on line 222, but it seems like the DynamicFrame class wasn't imported as part of the boilerplate. I added from awsglue.dynamicframe import DynamicFrame and everything in the sample worked cleanly. Thanks for all the work on these. They're super helpful.

Join and relationalize fails on Step 6

Following the tutorial, I hit errors on Step 6 with this bit of code:

glueContext.write_dynamic_frame.from_options(frame = l_history,
    connection_type = "s3",
    connection_options = {"path": "s3://****-glue-sample-target/output-dir/legislator_history"},
    format = "parquet")

Response:
Traceback (most recent call last): File "/tmp/zeppelin_pyspark-5558815714183414326.py", line 349, in <module> raise Exception(traceback.format_exc()) Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-5558815714183414326.py", line 342, in <module> exec(code) File "<stdin>", line 4, in <module> File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/dynamicframe.py", line 563, in from_options format_options, transformation_ctx) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/context.py", line 176, in write_dynamic_frame_from_options format, format_options, transformation_ctx) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/context.py", line 199, in write_from_options return sink.write(frame_or_dfc) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/data_sink.py", line 32, in write return self.writeFrame(dynamic_frame_or_dfc, info) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors") File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o209.pyWriteDynamicFrame. : java.io.IOException: Failed to delete key: output-dir/legislator_history/_temporary at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:667) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:296) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:482) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:134) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123) at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 1 exceptions thrown from 1 batch deletes at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy46.deleteAll(Unknown Source) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1336) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665) ... 36 more Caused by: java.io.IOException: MultiObjectDeleteException thrown with 38 keys in error: output-dir/legislator_history/_temporary/0/task_20180315222929_0045_m_000000/part-00000-acd34e4e-daeb-4d24-9df5-2d1712cf7857.snappy.parquet, output-dir/legislator_history/_temporary/0/_temporary/attempt_20180315215639_0040_m_000000_0/part-00000-060907b6-a737-49bf-8966-bd2d9bf1af91.snappy.parquet, output-dir/legislator_history/_temporary/0/task_20180315222723_0045_m_000000/part-00000-acd34e4e-daeb-4d24-9df5-2d1712cf7857.snappy.parquet at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:360) ... 
45 more Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 4CEE1BABB3FC9F96), S3 Extended Request ID: jaXBFrKlvOlBHDjaZOty2v49zSjSsZ4XTTAzjeVf+aYyiSrcZEaPpejdMlLqoULpmtnjaTW7Xf0= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2107) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:26) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:12) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:125) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:355) ... 45 more

How to keep the partition structure of the folder after ETL?

I have original files in S3 with a folder structure like:
/data/year=2017/month=1/day=1/origin_files
/data/year=2017/month=1/day=2/origin_files

I use a Glue crawler to create a partitioned table data as the source of my Glue jobs.
Currently, after I use a Glue job to convert the files to ORC, I get:
/data_orc/transformed_data_files.orc

Is it possible to keep the same partition structure after the transform job? For example:
/data_orc/year=2017/month=1/day=1/transformed_data_files.orc
/data_orc/year=2017/month=1/day=2/transformed_data_files.orc

It doesn't have to be file-to-file matching, but I would like the data to keep the same partition folder structure.
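
For what it's worth, one commonly used approach is to carry the partition values as regular columns in the transformed data and then pass them as partition keys when writing. A sketch only, assuming glueContext and a transformed DynamicFrame transformed_dyf that still contains year/month/day columns, with placeholder paths:

# Glue S3 sink: partition the ORC output by the year/month/day columns.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/data_orc/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="orc")

# Equivalent with the plain Spark DataFrame API:
transformed_dyf.toDF().write.partitionBy("year", "month", "day").orc("s3://my-bucket/data_orc/")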

CONNECTION_LIST_CONNECTION_WITH_AZ_NULL Error

Hi,

I encounter this error when I try to test my connection, which I created via the AWS Glue web interface. I could not find anything on Google or the Amazon forums.

I am trying to connect to my PostgreSQL database, which is hosted on Amazon RDS.

(screenshot from 2018-02-24 attached)

partitioning the glue job output using crawler partitions

Is there any way to partition the data after transformation in a Glue job using the same partitions created by the crawler based on the folder structure in S3? I tried using "partitionKeys": ["type"] in the connection_options; however, I get an error stating that the partition created by the crawler is not a field in the data. Any help is much appreciated! Thanks

Add an example of a custom classifier

I'd like to see an example of a custom classifier that is proven to work with custom data. The reason for the request is my headache when trying to write my own: my efforts simply do not work. My code (and patterns) work perfectly in online Grok debuggers, but they do not work in AWS. I do not get any errors in the logs either. My data simply does not get classified, and table schemas are not created.

So, the classifier example should include a custom file to classify, maybe a log file of some sort. The file itself should include various types of information so that the example would demonstrate various pattern matches. Then the example should present the classifier rule, maybe even include a custom keyword to demonstrate the usage of that one too. Also, a deliberate mistake should also be demoed (both in input data and patterns) and how to debug this situation in AWS.

Thanks in advance!

What version of the AWS SDK are you using?

I've tried with the 1.11.297 and 2.0 preview versions, and they have a very different API from what your samples show. For example, these versions don't have a GlueContext under com.amazonaws.services.glue.

Data Cleaning Example not working

Hello,

I'm unable to get the Data Cleaning Example to work as expected.

Dev Endpoint & local Zeppelin setup as described here. Port forwarding active and working.

Database and crawler are created, crawler ran successfully, the Data Catalog objects exist.

Running the code of the example step by step and preceding it with %pyspark I run into the following errors:

%pyspark
medicare = spark.read.format(
   "com.databricks.spark.csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 349, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 342, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 149, in load
    return self._df(self._jreader.load(path))
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Path does not exist: s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv;'

As well as

medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
       database = "payments",
       table_name = "medicare")

medicare_dynamicframe.printSchema()

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 349, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 342, in <module>
    exec(code)
  File "<stdin>", line 12, in <module>
  File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/dynamicframe.py", line 108, in printSchema
    print self._jdf.schema().treeString()
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o341.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, ip-172-31-57-241.us-west-2.compute.internal, executor 2): java.io.FileNotFoundException: No such file or directory 's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:803)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1181)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
	at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.initialize(TapeHadoopRecordReaderSplittable.scala:123)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
	at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
	at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
	at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:230)
	at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:218)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:803)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1181)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
	at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.initialize(TapeHadoopRecordReaderSplittable.scala:123)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

At this moment I'm not sure what could be wrong on my end.

Library dependency broken May 5, 2020

Good Afternoon,

I am not certain that this is the right place for this issue. If someone knows better, please point me to the right location. I have a simple example project for testing locally in Scala, and have been using the AWSGlueETL library. As of this afternoon, it has stopped working.

You can see the error here:
https://github.com/Gamesight/aws-glue-local-scala/runs/647153542?check_suite_focus=true

It looks like the file was changed today:

<Contents>
<Key>
release/com/amazonaws/AWSGlueETL/1.0.0/AWSGlueETL-1.0.0.pom
</Key>
<LastModified>2020-05-05T16:53:39.000Z</LastModified>
<ETag>"05ec2f9f20bc535e4d00717fe1c902bf"</ETag>
<Size>17464</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>

Upon inspection of the file: https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueETL/1.0.0/AWSGlueETL-1.0.0.pom

I see this:

 <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.1</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spec2.version>4.2.0</spec2.version>
    <glue.artifacts.bucket>aws-glue-etl-artifacts-beta</glue.artifacts.bucket>
    <glue.sdk.artifactid>AWSSDKGlueJavaClient</glue.sdk.artifactid>
    <glue.sdk.version>1.0</glue.sdk.version>
    <aws.sdk.version>1.11.774</aws.sdk.version>
  </properties>

It looks like glue.sdk.artifactid may be referencing a Java client artifact instead of the usual aws-java-sdk-glue, or maybe the beta reference to the aws-glue-etl-artifacts-beta bucket wasn't removed? It doesn't appear that public access is enabled for that bucket. Whatever the cause, the AWSSDKGlueJavaClient artifact is not there.

Any help would be appreciated.

what "modifications" were made to the json

I have been unable to replicate the sample because I have been unable to get Glue to crawl the JSON file that I downloaded from EveryPolitician.com. Maybe you could share your .json file with the modifications that were made, or tell us what modifications need to be made.

"It contains data in JSON format about United States legislators and the seats they have held in the the House of Representatives and the Senate, and has been modified somewhat for purposes of this tutorial."

My crawler never detects these tables:
persons_json
memberships_json
organizations_json
events_json
areas_json
countries_r_json

Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog

Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog.
The statistics properties are included in the Glue table properties; however, it looks like Hive is not honoring them.

The Glue migration script is capable of migrating table and partition statistics from the Hive metastore; however, it appears that the migration script escapes some of the characters, making the statistics unusable in the Glue catalog:
https://github.com/aws-samples/aws-glue-samples/blob/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L455


When I created a table in Hive meta store, the table and column statistics looked like the following:

Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689


When I ran the migration script to migrate the Hive metastore to the Glue catalog, the same statistics became the following. I have found that the statistics below are unusable through Hive.

Table Parameters:
COLUMN_STATS_ACCURATE \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689

I then manually modified the table property (COLUMN_STATS_ACCURATE) in the Glue console and was able to convert COLUMN_STATS_ACCURATE back into a usable format.

I didn't check the compatibility of the migrated statistics with the other EMR tools (Spark, Presto) and AWS services (Glue ETL job, Athena, Redshift Spectrum).

Regards,
Simone

Issue with direct migration from hive metastore to glue catalog

While migrating a table with OpenCSVSerde from the Hive metastore to the Glue catalog, extra escape characters are added to the serde properties: a property value that is plain in the Hive metastore ends up backslash-escaped in the Glue catalog.

I think the problem is created by the below function (hive_metastore_migration.py):

def udf_escape_chars(param_value):
    ret_param_value = param_value.replace('\\', '\\\\') \
        .replace('|', '\\|') \
        .replace('"', '\\"') \
        .replace('{', '\\{') \
        .replace(':', '\\:') \
        .replace('}', '\\}')
    return ret_param_value
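
To make the effect concrete, here is the escaping applied to the statistics value quoted in the previous issue (a small standalone sketch using the function as reconstructed above):

def udf_escape_chars(param_value):
    # Backslash-escape characters that are special to the migration format.
    return (param_value.replace('\\', '\\\\')
                       .replace('|', '\\|')
                       .replace('"', '\\"')
                       .replace('{', '\\{')
                       .replace(':', '\\:')
                       .replace('}', '\\}'))

value = '{"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}'
print(udf_escape_chars(value))
# prints: \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}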

Sample for Searching Data Catalog

All over the web I've seen suggestions that Glue Data Catalog enables metadata searches....so that for example I could ask for all tables that contain a certain column name or that have a particular comment on a column.

To date I've not seen a single tutorial, sample, guide, how-to, cheat sheet, or documentation of any kind that actually shows how to search for a column name in the Glue Data Catalog. The closest I've come is this article, which suggests that I can use these keywords when searching Glue tables:

  • column_name
  • column_comment
  • table_desc

When I try using one of these (e.g. column_name: star*) it always returns nothing, even though I do have tables that have a column named with "star".

This capability is far too basic and obviously needed, and thus warrants having a sample or some documentation created somewhere.
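
For reference, the Glue API does expose this kind of search programmatically via the SearchTables operation; a rough boto3 sketch (the search text is a placeholder, and this is separate from whatever the console search box does):

import boto3

glue = boto3.client("glue")

# Free-text search across table metadata, including column names.
response = glue.search_tables(SearchText="star", MaxResults=10)
for table in response["TableList"]:
    print(table["DatabaseName"], table["Name"])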

Glue catalog to Hive Metastore Migration script not working with partition table

I'm running the script to migrate Glue catalog data, which was crawled from Hive-style (key=value) partitioned data in S3, to a Hive metastore (MySQL), and the partitions that get created in the Hive metastore are incorrect.
[+] https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration

Note: the Glue catalog data crawled from the partitioned S3 data looks fine. Launching a new EMR cluster with the Glue catalog works, the partition information is correct, and Athena using Glue is able to find the table's partitions properly. But the script migrating table information from the Glue catalog to the metastore gets it wrong, creating totally incorrect partition information in the Hive metastore.

Please find the steps carried out below:

  1. S3 path from which the crawler was executed (you can see the data is in the proper year=value/month=value layout):
a0999b1381a5:~ kuntalg$ aws s3 ls --recursive s3://kg-practice/elb_logging/test
2018-02-06 15:54:06          0 elb_logging/test/
2018-02-06 15:54:35          0 elb_logging/test/year=2015/
2018-02-06 15:55:00          0 elb_logging/test/year=2015/month=01/
2018-02-06 16:31:43         22 elb_logging/test/year=2015/month=01/test2.csv
2018-02-06 15:55:08          0 elb_logging/test/year=2015/month=02/
2018-02-06 16:31:57         22 elb_logging/test/year=2015/month=02/test3.csv

  2. Once the crawler was done, I launched a new EMR cluster pointing to the Glue Catalog. After that I executed the following commands on the cluster (Hive & Spark), and they show the partitions in the proper format:
scala> spark.sql("SHOW PARTITIONS test").show(30,false)
18/02/06 16:27:59 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
+------------------+
|partition         |
+------------------+
|year=2015/month=01|
|year=2015/month=02|
+------------------+

hive> show partitions test;
OK
year=2015/month=02
year=2015/month=01
Time taken: 0.54 seconds, Fetched: 2 row(s)

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.651 seconds, Fetched: 10 row(s)


hive> select * from test;
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
1	kuntal	2015	01
2	Rock	2015	01
3	Cena	2015	01
Time taken: 3.047 seconds, Fetched: 6 row(s)

hive> select * from test where year='2015' and month='02';
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
Time taken: 0.971 seconds, Fetched: 3 row(s)

================
3. Kindly note that the catalog-to-metastore migration script (export_from_datacatalog.py) did not work, failing with the following key constraint error:

"duplicate entry for key 'UNIQUE_DATABASE'
.....
java.sql.BatchUpdateException: Field 'IS_REWRITE_ENABLED' doesn't have a default value"

I found that the column 'IS_REWRITE_ENABLED' is in the table hive.TBLS. A strange thing I found is that this column can be NULL in the table definition; however, the Spark job complains about the default value. So I manually logged in to my Hive metastore and updated the default value:
ALTER TABLE hive.TBLS ALTER IS_REWRITE_ENABLED SET DEFAULT 1;

After this small change, the Glue ETL job completed successfully, but the partitions generated by the script are totally incorrect.

Partition messed up

hive> show partitions test;
OK
year(string),month(string)=2015,01
year(string),month(string)=2015,02
Time taken: 0.176 seconds, Fetched: 2 row(s)

Although the table description is the same:

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.492 seconds, Fetched: 10 row(s)

So it's totally an issue with the migration script, and I'm stuck with our migration process.

So kindly look into the issue on an urgent basis and fix the script, or provide me a workaround or solution.

does ML transform support model versioning? Can you specify this in FindMatches?

Can you reference which version of a trained ML Transform to use when running FindMatches?
example:
Train model with labels - version 1
Train model with more labels - version 2

Is there a way to specify in FindMatches to look for duplicates via version 1, even though on the same ML Transform job you have already run additional labels for tuning the model for version 2?

To sum it up: an A vs. B test on the same ML Transform job.

sample code missing import library

Data cleaning with AWS Glue
(aws-glue-samples/examples/data_cleaning_and_lambda.md)

The example does not include the DynamicFrame import from awsglue. Please insert the DynamicFrame class import:

from awsglue.dynamicframe import DynamicFrame

For example:

medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")

Spelling Errors in Instructions

The sentence "These are all still strings in the data. We can use the DynamicFrame's powerful apply_mapping tranform method to drop, rename, cast, and nest the data so that other data programming langages and sytems can easily access" has several spelling errors

Unable to Create GlueContext via GlueContext Function in Local Python/awsglue Environment

I'm having the issue described in issue #42.

I am attempting to run the following in my local PySpark console...

from awsglue.context import GlueContext
glueContext = GlueContext(sc)

We receive the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 47, in __init__
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 68, in _get_glue_scala_context
TypeError: 'JavaPackage' object is not callable

The attached screenshot shows the complete picture.

The environment looks like the following:

  • OS: 10.0.17134.0
  • Python: 3.7.3
  • Hadoop (winutils.exe): 2.8.5
  • Spark: 2.4.3
  • PySpark: 2.4.6
  • awsglue: 1.0

My environment variables look like the following...

  • SPARK_HOME: \bin\spark-2.4.3-bin-hadoop2.8\
  • SPARK_CONF_DIR: \bin\aws-glue-libs\conf\
  • HADOOP_HOME: \bin\hadoop-2.8.5\
  • SPARK_CONF_DIR: \bin\spark-2.4.3-bin-hadoop2.8\
  • JAVA_HOME: C:\Progra~2\Java\jdk1.8.0\
  • CLASSPATH:
    • \bin\aws-glue-libs\jarsv1*
    • \bin\spark-2.4.3-bin-hadoop2.8\jars*
  • PYTHONPATH:
    • ${SPARK_HOME}\python\lib\py4j
    • \bin\aws-glue-libs\PyGlue.zip

Just to confirm which version of the awsglue repo I'm working with (see the attached screenshot).

The attached screenshot shows the "netty" files in my ..\aws-glue-libs\jarsv1\.

I'm looking for a little guidance on how to tweak my configuration to resolve this issue.

AWS Glue job is failing for larger csv data on s3

For small S3 input files (~10 GB), the Glue ETL job works fine, but for the larger dataset (~200 GB), the job is failing.

Adding a part of the ETL code:

# Convert the dynamic frame to a data frame
df = dropnullfields3.toDF()
# Create a new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
# Store the data in Parquet format on S3
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

The job executed for 4 hours and then threw an error:

File "script_2017-11-23-15-07-32.py", line 49, in
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o172.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3385 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 30 more

End of LogType:stdout
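
A hedged sketch of two mitigations that are sometimes suggested for this error, assuming the Spark 2.x Glue runtime shown in the trace; neither is an official fix, and the values used here are only examples. The error means the accumulated task results of the write exceeded spark.driver.maxResultSize, so either the limit can be raised when the SparkContext is created, or the number of write tasks can be reduced.

from pyspark import SparkConf
from pyspark.context import SparkContext

# Option 1 (hypothetical tuning): raise the limit at context creation time.
conf = SparkConf().set("spark.driver.maxResultSize", "4g")
sc = SparkContext(conf=conf)

# Option 2: cut the number of write tasks before saving; fewer tasks means a
# smaller total of serialized task results sent back to the driver.
partitioned_dataframe = partitioned_dataframe.repartition(500, 'part_date')
partitioned_dataframe.write.partitionBy('part_date') \
    .format("parquet") \
    .save(output_lg_partitioned_dir, mode="append")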

Connection to Redshift times out from a SageMaker Jupyter notebook

I am trying to push data to Amazon Redshift as described in the join_and_relationalize.md sample.

The write operation works flawlessly when I run it within an AWS Glue job, but it times out when I run it from an Amazon SageMaker Jupyter PySpark notebook connected to an AWS Glue Dev Endpoint (both use the same AWS Glue connection). I'm guessing it has to do with networking, but the documentation doesn't say anything about this.

Can AWS elaborate on how we must configure an Amazon SageMaker Jupyter notebook in order to write to Amazon Redshift using PySpark?

An error was encountered:
An error occurred while calling o87.getJDBCSink.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
Caused by: com.amazon.support.exceptions.GeneralException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	... 31 more
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:454)
	at sun.nio.ch.Net.connect(Net.java:446)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645)
	at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:107)
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 665, in from_jdbc_conf
    redshift_tmp_dir, transformation_ctx)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/context.py", line 311, in write_dynamic_frame_from_jdbc_conf
    catalog_id)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/context.py", line 326, in write_from_jdbc_conf
    transformation_ctx, catalog_id)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o87.getJDBCSink.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
Caused by: com.amazon.support.exceptions.GeneralException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	... 31 more
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:454)
	at sun.nio.ch.Net.connect(Net.java:446)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645)
	at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:107)
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
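
This class of timeout usually points to VPC networking (subnets, security groups, routes) rather than the PySpark code itself. Below is a minimal diagnostic sketch, not an official answer; the connection and endpoint names are hypothetical. It compares the networking attached to the Glue connection with that of the dev endpoint, both of which need to be able to reach the Redshift cluster.

import boto3

glue = boto3.client("glue")

conn = glue.get_connection(Name="my-redshift-connection")["Connection"]
endpoint = glue.get_dev_endpoint(EndpointName="my-dev-endpoint")["DevEndpoint"]

# Compare where each one lives; the Redshift cluster's security group must
# also allow inbound traffic from these security groups on the cluster port.
print("Connection subnet / security groups:",
      conn["PhysicalConnectionRequirements"].get("SubnetId"),
      conn["PhysicalConnectionRequirements"].get("SecurityGroupIdList"))
print("Dev endpoint subnet / security groups:",
      endpoint.get("SubnetId"), endpoint.get("SecurityGroupIds"))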

Issue with Direct Migration from Glue Catalog to Hive Metastore

We have a Glue Catalog in our dev AWS account, and I am now trying to migrate it to the Hive Metastore of an EMR cluster. (I need to replace the Hive Metastore contents with the Glue Catalog metadata so that I can track our Glue Catalog data lineage using Apache Atlas installed on the EMR cluster.)

I followed all the steps and procedures to directly migrate the Glue Catalog to the Hive Metastore, but I get the Duplicate entry 'default' for key 'UNIQUE_DATABASE' error every time. I have tried various iterations but keep getting the same error when I run the Glue ETL job. See the full error message below:

py4j.protocol.Py4JJavaError: An error occurred while calling o1025.jdbc.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 77.0 failed 4 times, most recent failure: Lost task 1.3 in stage 77.0 (TID 1298, ip-00-00-000-000.ec2.internal, executor 32): java.sql.BatchUpdateException: Duplicate entry 'default' for key 'UNIQUE_DATABASE'
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1815)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1277)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:642)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:783)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:783)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'default' for key 'UNIQUE_DATABASE'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:971)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2530)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1907)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2141)
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1773)

As per the migration procedure, we need to provide values for the "--database-names" parameter, so I first tried specifying the list of databases separated by a semicolon (;) and then tried specifying just one database, such as "default", but neither works. The above error was thrown when I used just the default database.

Is anyone familiar with this error? Is there any workaround for this issue? Please let me know if I am missing something. Any help is appreciated.
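
One possible workaround, offered only as a sketch and not as part of the migration scripts: a Hive metastore always ships with a 'default' database, so databases that already exist in the target metastore can be filtered out of the migration DataFrame before the JDBC write. The databases_df name, JDBC URL, and credentials below are hypothetical.

# Read the names of databases already present in the target Hive metastore.
existing_dbs = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-metastore-host:3306/hive")  # hypothetical
    .option("dbtable", "DBS")
    .option("user", "hive")
    .option("password", "********")
    .load()
    .select("NAME"))

# Drop rows whose database name already exists (e.g. 'default') before appending.
new_databases_df = databases_df.join(
    existing_dbs, databases_df["NAME"] == existing_dbs["NAME"], "left_anti")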

Could not configure Spark Submit parameters

Hi Team,
I'm running an ETL job in AWS Glue which reads table data, processes it, and writes the result to S3. The job is failing with the following error message:
ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

When I tried to modify the memory configuration with the --conf parameter, it was not reflected in the job execution, and the job still fails with the same error.

When I searched for information online (link given below), it said we cannot modify Spark configuration parameters in Glue. Can you please confirm? If so, can you suggest any alternatives for a successful execution?

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

Thanks in Advance!

Regards,
Vamshi
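
Glue does not accept arbitrary spark-submit parameters, so the usual alternative is to give the job more or larger workers rather than tuning spark.yarn.executor.memoryOverhead directly. Below is a minimal sketch using the Glue API; the job name, role, and script location are hypothetical placeholders for the existing job definition, and JobUpdate replaces the whole definition, so all required fields must be supplied.

import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="my-etl-job",  # hypothetical
    JobUpdate={
        "Role": "my-glue-service-role",  # hypothetical; required by the API
        "Command": {"Name": "glueetl",
                    "ScriptLocation": "s3://my-bucket/scripts/my-etl-job.py"},
        "WorkerType": "G.2X",        # larger executors than the default
        "NumberOfWorkers": 20,       # scale out as well as up
    })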

Local execution of AWS Glue

Trying to run AWS Glue locally with AWSGlue.zip throws the following error:

~/opt/spark-2.2.0-bin-hadoop2.7/bin/pyspark
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/18 18:36:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/18 18:36:23 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.15 (default, Dec 14 2018 13:10:39)
SparkSession available as 'spark'.
>>> from awsglue.dynamicframe import DynamicFrameWriter
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'GlueContext' is not defined
>>> from awsglue.context import GlueContext
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "awsglue/context.py", line 44, in __init__
    self._glue_scala_context = self._get_glue_scala_context(**options)
  File "awsglue/context.py", line 64, in _get_glue_scala_context
    return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
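
This TypeError usually means the Glue Scala classes are not on Spark's classpath, so the JVM-side GlueContext cannot be instantiated. A minimal sketch of one way to run a standalone script locally, assuming the jars have already been built or fetched via awslabs/aws-glue-libs; the jar directory path is hypothetical.

from pyspark.sql import SparkSession
from awsglue.context import GlueContext

jars = "/path/to/aws-glue-libs/jarsv1/*"  # hypothetical location of the Glue jars

spark = (SparkSession.builder
         .appName("glue-local")
         .config("spark.driver.extraClassPath", jars)
         .config("spark.executor.extraClassPath", jars)
         .getOrCreate())

glueContext = GlueContext(spark.sparkContext)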

Example of a custom data sink in Python

I'm looking for information on how to load data into services other than S3 or Redshift -- something completely custom (e.g. some HTTP API). I'm not sure whether I need to implement a specific "DataSink" or "writer", or whether it's okay to just, e.g., Map.apply over my DataFrame and call some API while processing my records.
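
For what it's worth, one lightweight pattern, sketched here under the assumption of a Python 3 job, a DynamicFrame named dyf, and a hypothetical HTTP endpoint, is to skip a custom sink entirely and push records from the executors with foreachPartition:

import json
import urllib.request

API_URL = "https://example.com/ingest"  # hypothetical endpoint

def post_partition(rows):
    # One HTTP call per record for simplicity; batching per partition would
    # usually be preferable in practice.
    for row in rows:
        payload = json.dumps(row.asDict()).encode("utf-8")
        req = urllib.request.Request(API_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

dyf.toDF().foreachPartition(post_partition)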

Issue With Direct Migration from Hive Metastore to Glue Data Catalog

Hello,
I hope you are doing well. At my company, we are in the process of adopting Athena for our ad hoc data analytics tasks and have configured Athena to use the Glue Data Catalog as its metastore. However, the source of truth for our data is a Hive metastore hosted on an AWS RDS MySQL instance, and we want the Glue Data Catalog to stay in sync with our Hive metastore. For this, we followed the instructions outlined in awslabs' aws-glue-samples repo (https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration) to set up a Glue job that we will run (eventually periodically) to directly migrate the Hive metastore to the Glue Data Catalog.

However, we are running into a situation where the job finishes successfully but we don't see anything migrated into our Glue Data Catalog. I am wondering whether others have experienced this issue and whether it was resolved. I see a similar issue here (I even replied to that thread): #6, but it is closed and I don't see a solution posted in it.

Any help is greatly appreciated.

Thanks,
Kshitij

Question: Group by key

First, thanks for the great work on AWS Glue. One question: Can I use AWS Glue to group by key, and store the results in distinct files on S3 as follows?

Sample data in an S3 bucket:

{ "key1": "A", "key2": "1" }
{ "key1": "B", "key2": "3" }
{ "key1": "A", "key2": "2" }
{ "key1": "B", "key2": "4" }

Expected result:

file1:

{ "key1": "A", "key2": "1" }
{ "key1": "A", "key2", "2" }

file2:

{ "key1": "B", "key2", "3" }
{ "key1": "B", "key2", "4" }

This is what I tried:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db", 
    table_name = "my_input_table", 
    transformation_ctx = "datasource0")
partitioned_dataframe = datasource0.toDF().repartition(2, "key1")
partitioned_dynamicframe = DynamicFrame.fromDF(
    partitioned_dataframe, 
    glueContext, 
    "partitioned_df")
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = partitioned_dynamicframe, 
    connection_type = "s3", 
    connection_options = {"path": "s3://my-result-bucket/sub-bucket"}, 
    format = "json", 
    transformation_ctx = "datasink2")

Yet as the FAQ clearly states, there is no guarantee that data with different keys will end up in different partitions, hence they may also end up in the same file.

From the FAQ:

Note that while different records with the same value for this column 
will be assigned to the same partition, there is no guarantee that there 
will be a separate partition for each distinct value.

So I guess there must be a different way to achieve this. Any ideas?
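
One approach that gets close to this, sketched below rather than guaranteed, is to partition the output by key1 at write time using the partitionKeys connection option, so each distinct key1 value lands under its own S3 prefix (e.g. .../key1=A/ and .../key1=B/). It reuses the names from the snippet above.

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = partitioned_dynamicframe,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-result-bucket/sub-bucket",
        "partitionKeys": ["key1"]   # one S3 prefix per distinct key1 value
    },
    format = "json",
    transformation_ctx = "datasink2")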

How to connect to my AWS account

Hi,

In the sample code I do not see where we pass an AWS access key and secret key to connect to AWS. Can you tell me where I can add those details in order to run this code against my AWS account?

Thanks.
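
A short note with a minimal sketch, assuming local execution: when the samples run as Glue jobs, they use the job's IAM role, which is why no keys appear in the scripts. Locally, boto3 and the AWS CLI pick credentials up from the standard provider chain (environment variables, ~/.aws/credentials, or an assumed role); the profile name below is just an example.

import boto3

session = boto3.Session(profile_name="default")  # example profile from ~/.aws/credentials
glue = session.client("glue")
print(session.region_name,
      [db["Name"] for db in glue.get_databases()["DatabaseList"]])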

Join and Relationalize runs forever

Hello,

I'm trying to run the join_and_relationalize.py example. I let it run for 2h24m before killing it. It just keeps repeating messages like this:

19/09/26 20:22:17 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: [removed_ip]
ApplicationMaster RPC port: [removed_port]
queue: default
start time: 1569529172541
final status: UNDEFINED
tracking URL: http://ip-[removed].glue.dnsmasq:20888/proxy/application_[removed_id]/
user: root

It's creating tables in Redshift and populating them, but I would not expect 11 MB of data to take 2.5 hours to process.

Any idea how to make it work?

Thanks,

Matt

Add gradle build file

It would be nice if this project had a Gradle build file with all the right Scala and AWS library versions.

Migrate Directly from Hive to AWS Glue | No tables created

hive_databases_S3.txt
hive_tables_S3.txt

I am trying to migrate directly from Hive to AWS Glue.
I created the appropriate Glue job with a Hive connection.
I tested the connection, and it connected successfully.
I basically followed all the steps, and everything was successful.
But in the end I can't see any tables in the AWS Glue catalog.
There are no error logs for the job, and the normal logs show the run status as succeeded.

I even tried migrating from Hive to AWS Glue using Amazon S3 objects.
That too was successful, but no tables were created in the Glue catalog.
I could find the metastore exported from Hive in the S3 buckets (files attached).

Now I am thinking of running this code from my local Eclipse to debug it.
Can you please tell me how to debug this from local Eclipse or in the Glue console?

Unclear instructions

In this example:
https://github.com/awslabs/aws-glue-samples/blob/master/examples/join_and_relationalize.md
For "1. Crawl our sample dataset", can you specify whether the user is supposed to copy the dataset from your sample bucket to their own bucket before crawling, or just use your bucket as the source? The latter doesn't work; I got:
"[626034be-45ee-4e8d-ae26-e985e8131ff9] ERROR : Error Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: F24B84B451EAEC7A) retrieving file at s3://awsglue-datasets/examples/us-legislators/house/persons.json. Tables created did not infer schemas from this file."

Also, there's a typo in the S3 path (there is a dot at the end):
"s3://awsglue-datasets/examples/us-legislators."

Possible to pull incremental data from Source

Hey

I am trying to read incremental data from a source that is connected using JDBC. I need to pass a timestamp in the filter. Is this a valid use case, and if so, is there any documentation for it? I am not able to find any.

Thanks
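
A minimal sketch of one way this is commonly done, assuming the job can fall back to plain Spark JDBC for the incremental read; the URL, credentials, table, and watermark value below are hypothetical, and spark is the session obtained from glueContext.spark_session. Wrapping the filter in a subquery pushes it down to the source database, so only rows newer than the last run are pulled.

# Hypothetical watermark, e.g. read from a job argument or a bookmark table.
last_run_ts = "2020-01-01 00:00:00"

incremental_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-db-host:3306/mydb")  # hypothetical
    .option("user", "etl_user")                          # hypothetical
    .option("password", "********")
    .option("dbtable",
            "(SELECT * FROM orders WHERE updated_at > '{}') AS incremental"
            .format(last_run_ts))
    .load())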

Issue while accessing GlueContext locally

Hi,

To develop ETL scripts locally with Python, I followed the steps below:

  1. Downloaded the AWS Glue Python library from GitHub. For Glue version 1.0, I checked out the glue-1.0 branch, as this version supports Python 3.
  2. Installed Apache Maven and added the Maven path to the environment variables.
  3. Installed Apache Spark and set SPARK_HOME in the environment variables.
  4. Installed the JDK and set JAVA_HOME in the environment variables.
  5. Installed Hadoop and set HADOOP_HOME in the environment variables.
  6. Installed pyspark, awscli and boto3.

I chose an ETL script that was automatically generated by AWS Glue and ran the job.
It worked successfully from the AWS console, and the file from the source was loaded into the destination folder. But when I try to execute the same script locally, I encounter the following error:

return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable

Is there any workaround for this?

Thanks

Add description about Spark dependencies version

Scala examples were added to the project, but what Spark dependencies do they use, and which versions?

For example, the code uses spark-sql, but which version? Which Scala language version? Could you please add this information to the project readme.md, or link to where this information can be found?
