
aws-glue-samples's Introduction

AWS Glue Samples

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. This repository has samples that demonstrate various aspects of the AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Getting Started

Workshops

  • AWS Glue Learning Series

    In this comprehensive series, you'll learn everything from the basics of Glue to advanced optimization techniques.

Tutorials

General

Data migration

Open Table Format

Development, Test, and CI/CD

Cost and Performance

Glue for Ray

Glue Data Catalog

Glue Crawler

Glue Data Quality

Glue ETL Code Examples

You can run these sample job scripts on AWS Glue ETL jobs, in a container, or in a local environment.

  • Join and Relationalize Data in S3

    This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in Amazon S3 so that it can be queried and analyzed easily and efficiently (a minimal sketch of this pattern follows after this list).

  • Clean and Process

    This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.

  • The resolveChoice Method

    This sample explores all four ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method (see the sketch after this list).

  • Converting character encoding

    This sample ETL script shows you how to use an AWS Glue job to convert character encoding (see the sketch after this list).

  • Notebook using open data lake formats

    The sample notebook files show you how to use open data lake formats (Apache Hudi, Delta Lake, and Apache Iceberg) on AWS Glue Interactive Sessions and AWS Glue Studio Notebook.

  • Blueprint examples

    The sample Glue Blueprints show you how to implement blueprints addressing common ETL use cases. The samples are located in the aws-glue-blueprint-libs repository.
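
To make the first few of these concrete, here are minimal, hedged sketches of the patterns involved. They are illustrations only, not the samples themselves: the database, table, bucket, and column names are placeholders, and details may differ from the actual scripts.

Join and Relationalize, assuming a Data Catalog database named "legislators" populated by a crawler and a writable S3 bucket:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Join

glue_context = GlueContext(SparkContext.getOrCreate())

# Load two crawled tables from the Data Catalog as DynamicFrames.
persons = glue_context.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json")
memberships = glue_context.create_dynamic_frame.from_catalog(
    database="legislators", table_name="memberships_json")

# Join on the person id, then flatten nested and array fields into a
# collection of relational tables rooted at "l_history".
history = Join.apply(persons, memberships, "id", "person_id")
tables = history.relationalize("l_history", "s3://my-bucket/temp-dir/")

# Write each resulting table out to S3 as Parquet.
for name in tables.keys():
    glue_context.write_dynamic_frame.from_options(
        frame=tables.select(name),
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output-dir/" + name},
        format="parquet")

The four resolveChoice strategies, assuming a DynamicFrame dyf whose "provider_id" column has a choice type (it appears as both long and string in the data):

# 1. cast: force every value to a single type.
casted = dyf.resolveChoice(specs=[("provider_id", "cast:long")])

# 2. make_cols: split into one column per observed type
#    (provider_id_long, provider_id_string).
split_cols = dyf.resolveChoice(specs=[("provider_id", "make_cols")])

# 3. make_struct: keep both values in a struct with one field per type.
as_struct = dyf.resolveChoice(specs=[("provider_id", "make_struct")])

# 4. project: keep only the values matching one type and drop the rest.
projected = dyf.resolveChoice(specs=[("provider_id", "project:string")])

Character-encoding conversion, shown here with plain Spark CSV options rather than whatever the sample script itself does; the paths and the source encoding (Shift_JIS) are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a CSV that was written in a non-UTF-8 encoding.
df = (spark.read
      .option("header", "true")
      .option("encoding", "Shift_JIS")
      .csv("s3://my-bucket/input-sjis/"))

# Spark writes text formats as UTF-8 by default, so writing the data
# back out converts the encoding.
df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/output-utf8/")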

Utilities

Glue Custom Connectors

AWS Glue provides built-in support for the most commonly used data stores, such as Amazon Redshift, MySQL, and MongoDB. With Glue Custom Connectors, you can subscribe to a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.


  • Development

    A development guide with examples of connectors with simple, intermediate, and advanced functionality. These examples demonstrate how to implement Glue Custom Connectors based on the Spark Data Source or Amazon Athena Federated Query interfaces and plug them into the Glue Spark runtime.

  • Local Validation Tests

    This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.

  • Validation

    This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.

  • Glue Spark Script Examples

    Example Python scripts that use Spark, Amazon Athena, and JDBC connectors with the Glue Spark runtime.

  • Create and Publish Glue Connector to AWS Marketplace

    If you would like to partner with us or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details about your connector.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

aws-glue-samples's People

Contributors

ben-bourdin451, boykorr, dangereis, dependabot[bot], fss18, gherreros, haroldhenry, hyandell, jinet, junoha, leejianwei, lmbo-2020, markatwood, mashah, mitczach, mohitsax, moomindani, pemmasanikrishna, rmattsampson, romnempire, satyapreddy, stewartsmith, sumitya, tomtongue, xy1m, yyolk, zhukovalexander


aws-glue-samples's Issues

Spark_UI docker container-Error: Could not find or load main class $SPARK_HISTORY_OPTS

1.docker build -t glue/sparkui:latest .
2.docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://test/sparkui -Dspark.hadoop.fs.s3a.access.key=xxxxxx -Dspark.hadoop.fs.s3a.secret.key=yyyyyyy" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

Hi, when I run the second step, it prompts "Could not find or load main class $SPARK_HISTORY_OPTS" and I have not found a way to solve it.

post glue catalog upgrade

Hi, I just upgraded to the Glue Data Catalog and I could not see any crawlers created on their own. Also, if I want to create a crawler to update an existing table in the catalog, I don't get the option to choose a table of an existing database that it created, and I don't want the crawler to create a new table every time. Any help is much appreciated! Thank you

Tables created using join_and_relationalize.md example not showing any data

As per the https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md example, I created a crawler with s3://awsglue-datasets/examples/us-legislators. The crawler ran successfully and created tables in the database. When I try to retrieve data from these tables using Athena, I see a "Zero records returned." message. More info: crawler name: aws-glue-samples-raja0505, region: us-east-2.

persons_json
memberships_json
organizations_json
events_json
areas_json
countries_r_json

AWS Glue error converting data frame to dynamic frame

Here's my code, where I am trying to create a new data frame out of the result set of my left join on two other data frames and then trying to convert it to a dynamic frame.

dfs = sqlContext.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "SELECT hashkey as hash From randomtable").load()

#Source
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "randomtable", transformation_ctx = "datasource0")

#add hash value
df = datasource0.toDF()
df.cache()
df = df.withColumn("hashkey", sha2(concat_ws("||", *df.columns), 256))

#drop dupes
df1 = df.dropDuplicates(subset=['hashkey'])

#read incremental data
inc = df1.join(dfs, df1["hashkey"] == dfs["hash"], how='left').filter(col('hash').isNull())

#convert it back to glue context
datasource1 = DynamicFrame.fromDF(inc, glueContext, "datasource1")

Here is the error I get when trying to convert a data frame to a dynamic frame.

datasource1 = DynamicFrame.fromDF(inc, glueContext, "datasource1")
File
"/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/PyGlue.zip/awsglue/dynamicframe.py",

line 150, in fromDF
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1560272525947_0002/container_1560272525947_0002_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.AttributeReference.(Ljava/lang/String;Lorg/apache/spark/sql/types/DataType;ZLorg/apache/spark/sql/types/Metadata;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;)V
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper$$anonfun$8.apply(QueryHelper.scala:66)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper$$anonfun$8.apply(QueryHelper.scala:65)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryHelper.(QueryHelper.scala:64)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.SourceQuery.(SnowflakeQuery.scala:100)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.net$snowflake$spark$snowflake$pushdowns$querygeneration$QueryBuilder$$generateQueries(QueryBuilder.scala:98)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.liftedTree1$1(QueryBuilder.scala:63)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.treeRoot$lzycompute(QueryBuilder.scala:61)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.treeRoot(QueryBuilder.scala:60)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.tryBuild$lzycompute(QueryBuilder.scala:34)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder.tryBuild(QueryBuilder.scala:33)
at net.snowflake.spark.snowflake.pushdowns.querygeneration.QueryBuilder$.getRDDFromPlan(QueryBuilder.scala:179)
at net.snowflake.spark.snowflake.pushdowns.SnowflakeStrategy.buildQueryRDD(SnowflakeStrategy.scala:42)
at net.snowflake.spark.snowflake.pushdowns.SnowflakeStrategy.apply(SnowflakeStrategy.scala:24)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)

Any help is greatly appreciated.

Spark_UI docker container doesn't run

Following the steps at utilities/Spark_UI, when I run:

docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_my_eventlog_dir -Dspark.hadoop.fs.s3a.access.key=my_key_id -Dspark.hadoop.fs.s3a.secret.key=my_secret_access_key" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

The container quickly craps out with:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/27 21:16:44 INFO HistoryServer: Started daemon with process name: 1@4b35fc26fd5a
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for TERM
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for HUP
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for INT
20/01/27 21:16:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/27 21:16:45 INFO SecurityManager: Changing view acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing view acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/01/27 21:16:45 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
	at com.amazonaws.partitions.PartitionsLoader.<clinit>(PartitionsLoader.java:54)
	at com.amazonaws.regions.RegionMetadataFactory.create(RegionMetadataFactory.java:30)
	at com.amazonaws.regions.RegionUtils.initialize(RegionUtils.java:65)
	at com.amazonaws.regions.RegionUtils.getRegionMetadata(RegionUtils.java:53)
	at com.amazonaws.regions.RegionUtils.getRegion(RegionUtils.java:107)
	at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:4040)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5039)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1413)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1349)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
	... 6 more

Bad coding practices (Python)

This repository and awslabs are full of bad practices which people unfortunately copy because they use these repositories as examples.

Would you consider connecting this repository to some linters or codeac.io, to prevent spreading bad practices?

example:
https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.py#L5

from x import * is an anti-pattern which might lead to unexpected behaviour and also makes the code much harder to read, because it is unclear where the functions come from.

more info: https://www.flake8rules.com/rules/F405.html

"Namespaces are one honking great idea -- let's do more of those!" (Zen of Python)
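
As a concrete contrast for the wildcard-import point above, a small sketch (the chosen names are just examples of transforms that exist in awsglue.transforms, not necessarily what the sample uses):

# Instead of a wildcard import:
# from awsglue.transforms import *

# import only what the script actually uses, so readers can see where
# each transform comes from:
from awsglue.transforms import ApplyMapping, Join, ResolveChoice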

toDF() isn't working on the shell

I get the same error (see attached) when trying orgs.toDF().show() or memberships.select_fields(['organization_id']).toDF().distinct().show()

Missing DynamicFrame import

In examples/data_cleaning_and_lambda.md, section 5 (Lambda functions (aka Python UDFs) and ApplyMapping), there's a call to DynamicFrame.fromDF() on line 222, but it seems like the DynamicFrame class wasn't imported as part of the boilerplate. I added from awsglue.dynamicframe import DynamicFrame and everything in the sample worked cleanly. Thanks for all the work on these. They're super helpful.

Join and relationalize fails on Step 6

Following the tutorial, I hit errors on Step 6 with this bit of code:

glueContext.write_dynamic_frame.from_options(frame = l_history,
    connection_type = "s3",
    connection_options = {"path": "s3://****-glue-sample-target/output-dir/legislator_history"},
    format = "parquet")

Response:
Traceback (most recent call last): File "/tmp/zeppelin_pyspark-5558815714183414326.py", line 349, in <module> raise Exception(traceback.format_exc()) Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-5558815714183414326.py", line 342, in <module> exec(code) File "<stdin>", line 4, in <module> File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/dynamicframe.py", line 563, in from_options format_options, transformation_ctx) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/context.py", line 176, in write_dynamic_frame_from_options format, format_options, transformation_ctx) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/context.py", line 199, in write_from_options return sink.write(frame_or_dfc) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/data_sink.py", line 32, in write return self.writeFrame(dynamic_frame_or_dfc, info) File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors") File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o209.pyWriteDynamicFrame. : java.io.IOException: Failed to delete key: output-dir/legislator_history/_temporary at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:667) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:296) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:482) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:134) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123) at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 1 exceptions thrown from 1 batch deletes at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy46.deleteAll(Unknown Source) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1336) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665) ... 36 more Caused by: java.io.IOException: MultiObjectDeleteException thrown with 38 keys in error: output-dir/legislator_history/_temporary/0/task_20180315222929_0045_m_000000/part-00000-acd34e4e-daeb-4d24-9df5-2d1712cf7857.snappy.parquet, output-dir/legislator_history/_temporary/0/_temporary/attempt_20180315215639_0040_m_000000_0/part-00000-060907b6-a737-49bf-8966-bd2d9bf1af91.snappy.parquet, output-dir/legislator_history/_temporary/0/task_20180315222723_0045_m_000000/part-00000-acd34e4e-daeb-4d24-9df5-2d1712cf7857.snappy.parquet at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:360) ... 
45 more Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 4CEE1BABB3FC9F96), S3 Extended Request ID: jaXBFrKlvOlBHDjaZOty2v49zSjSsZ4XTTAzjeVf+aYyiSrcZEaPpejdMlLqoULpmtnjaTW7Xf0= at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2107) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:26) at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:12) at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176) at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:125) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:355) ... 45 more

How to keep the partition structure of the folder after ETL?

I have original files in S3 with a folder structure like:
/data/year=2017/month=1/day=1/origin_files
/data/year=2017/month=1/day=2/origin_files

I use a Glue crawler to create a partitioned table data as the source of my Glue jobs.
Currently, after I use a Glue job to convert the files to ORC, I get:
/data_orc/transformed_data_files.orc

Is it possible to keep the same partition structure after the transform job? For example:
/data_orc/year=2017/month=1/day=1/transformed_data_files.orc
/data_orc/year=2017/month=1/day=2/transformed_data_files.orc

It doesn't have to be file-to-file matching, but I would like the data to keep the same partition folder structure.
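
For what it's worth, one commonly used approach is to carry the partition values as regular columns in the transformed data and then pass them as partition keys when writing. A sketch only, assuming glueContext and a transformed DynamicFrame transformed_dyf that still contains year/month/day columns, with placeholder paths:

# Glue S3 sink: partition the ORC output by the year/month/day columns.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/data_orc/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="orc")

# Equivalent with the plain Spark DataFrame API:
transformed_dyf.toDF().write.partitionBy("year", "month", "day").orc("s3://my-bucket/data_orc/")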

CONNECTION_LIST_CONNECTION_WITH_AZ_NULL Error

Hi,

I encounter this error when I try to test my connection, which I created via the AWS Glue web interface. I could not find anything on Google or the Amazon forums.

I am trying to connect to my PostgreSQL database, which is hosted on Amazon RDS.

(screenshot from 2018-02-24 attached)

partitioning the glue job output using crawler partitions

Is there any way to partition the data after transformation in a Glue job using the same partitions created by the crawler based on the folder structure in S3? I tried using "partitionKeys": ["type"] in the connection_options; however, I get an error stating that the partition created by the crawler is not a field in the data. Any help is much appreciated! Thanks

Add an example of a custom classifier

I'd like to see an example of a custom classifier that is proven to work with custom data. The reason for the request is my headache when trying to write my own: my efforts simply do not work. My code (and patterns) work perfectly in online Grok debuggers, but they do not work in AWS. I do not get any errors in the logs either. My data simply does not get classified, and table schemas are not created.

So, the classifier example should include a custom file to classify, maybe a log file of some sort. The file itself should include various types of information so that the example would demonstrate various pattern matches. Then the example should present the classifier rule, maybe even include a custom keyword to demonstrate the usage of that one too. Also, a deliberate mistake should also be demoed (both in input data and patterns) and how to debug this situation in AWS.

Thanks in advance!

What version of the AWS SDK are you using?

I've tried with the 1.11.297 and 2.0 preview versions, and they have a very different API from what your samples show. For example, these versions don't have a GlueContext under com.amazonaws.services.glue.

Data Cleaning Example not working

Hello,

I'm unable to get the Data Cleaning Example to work as expected.

Dev Endpoint & local Zeppelin setup as described here. Port forwarding active and working.

Database and crawler are created, crawler ran successfully, the Data Catalog objects exist.

Running the code of the example step by step and preceding it with %pyspark I run into the following errors:

%pyspark
medicare = spark.read.format(
   "com.databricks.spark.csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 349, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 342, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 149, in load
    return self._df(self._jreader.load(path))
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Path does not exist: s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv;'

As well as

medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
       database = "payments",
       table_name = "medicare")

medicare_dynamicframe.printSchema()

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 349, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8799958238772365289.py", line 342, in <module>
    exec(code)
  File "<stdin>", line 12, in <module>
  File "/usr/share/aws/glue/etl/python/PyGlue.zip/awsglue/dynamicframe.py", line 108, in printSchema
    print self._jdf.schema().treeString()
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o341.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, ip-172-31-57-241.us-west-2.compute.internal, executor 2): java.io.FileNotFoundException: No such file or directory 's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:803)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1181)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
	at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.initialize(TapeHadoopRecordReaderSplittable.scala:123)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
	at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
	at org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:57)
	at com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:230)
	at com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:218)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv'
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:803)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1181)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
	at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.initialize(TapeHadoopRecordReaderSplittable.scala:123)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

At this moment I'm not sure what could be wrong on my end.

Library dependency broken May 5, 2020

Good Afternoon,

I am not certain that this is the right place for this issue. If someone knows better, please point me to the right location. I have a simple example project for testing locally in Scala, and have been using the AWSGlueETL library. As of this afternoon, it has stopped working.

You can see the error here:
https://github.com/Gamesight/aws-glue-local-scala/runs/647153542?check_suite_focus=true

It looks like the file was changed today:

<Contents>
<Key>
release/com/amazonaws/AWSGlueETL/1.0.0/AWSGlueETL-1.0.0.pom
</Key>
<LastModified>2020-05-05T16:53:39.000Z</LastModified>
<ETag>"05ec2f9f20bc535e4d00717fe1c902bf"</ETag>
<Size>17464</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>

Upon inspection of the file: https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueETL/1.0.0/AWSGlueETL-1.0.0.pom

I see this:

 <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.1</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spec2.version>4.2.0</spec2.version>
    <glue.artifacts.bucket>aws-glue-etl-artifacts-beta</glue.artifacts.bucket>
    <glue.sdk.artifactid>AWSSDKGlueJavaClient</glue.sdk.artifactid>
    <glue.sdk.version>1.0</glue.sdk.version>
    <aws.sdk.version>1.11.774</aws.sdk.version>
  </properties>

It looks like glue.sdk.artifactid may be referencing a Java client artifact instead of the usual aws-java-sdk-glue, or maybe the beta reference to the aws-glue-etl-artifacts-beta bucket wasn't removed? It doesn't appear that public access is enabled for that bucket. Whatever the cause, the AWSSDKGlueJavaClient artifact is not there.

Any help would be appreciated.

what "modifications" were made to the json

I have been unable to replicate the sample because I have been unable to get Glue to crawl the JSON file that I downloaded from EveryPolitician.com. Maybe you could share your .json file with the modifications that were made, or tell us what modifications need to be made.

"It contains data in JSON format about United States legislators and the seats they have held in the the House of Representatives and the Senate, and has been modified somewhat for purposes of this tutorial."

My crawler never detects these tables:
persons_json
memberships_json
organizations_json
events_json
areas_json
countries_r_json

Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog

Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog.
The statistics properties are included in the Glue table properties; however, it looks like Hive is not honoring them.

The Glue migration script is capable of migrating table and partition statistics from the Hive metastore; however, it appears that the migration script escapes some of the characters, making the statistics unusable in the Glue catalog:
https://github.com/aws-samples/aws-glue-samples/blob/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L455


When I created a table in Hive meta store, the table and column statistics looked like the following:

Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689


When I ran the migration script to migrate the Hive metastore to the Glue catalog, the same statistics became the following. I have found that the statistics below are unusable through Hive.

Table Parameters:
COLUMN_STATS_ACCURATE \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689

I then manually modified the table property (COLUMN_STATS_ACCURATE) in the Glue console and was able to convert COLUMN_STATS_ACCURATE back into a usable format.

I didn't check the compatibility of the migrated statistics with the other EMR tools (Spark, Presto) and AWS services (Glue ETL job, Athena, Redshift Spectrum).

Regards,
Simone

Issue with direct migration from hive metastore to glue catalog

While migrating a table with OpenCSVSerde from the Hive metastore to the Glue catalog, extra escape characters are added to the serde properties: a property value that is plain in the Hive metastore ends up backslash-escaped in the Glue catalog.

I think the problem is created by the below function (hive_metastore_migration.py):

def udf_escape_chars(param_value):
    ret_param_value = param_value.replace('\\', '\\\\') \
        .replace('|', '\\|') \
        .replace('"', '\\"') \
        .replace('{', '\\{') \
        .replace(':', '\\:') \
        .replace('}', '\\}')
    return ret_param_value
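
To make the effect concrete, here is the escaping applied to the statistics value quoted in the previous issue (a small standalone sketch using the function as reconstructed above):

def udf_escape_chars(param_value):
    # Backslash-escape characters that are special to the migration format.
    return (param_value.replace('\\', '\\\\')
                       .replace('|', '\\|')
                       .replace('"', '\\"')
                       .replace('{', '\\{')
                       .replace(':', '\\:')
                       .replace('}', '\\}'))

value = '{"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}'
print(udf_escape_chars(value))
# prints: \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}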

Sample for Searching Data Catalog

All over the web I've seen suggestions that Glue Data Catalog enables metadata searches....so that for example I could ask for all tables that contain a certain column name or that have a particular comment on a column.

To date I've not seen a single tutorial, sample, guide, how-to, cheat sheet, or documentation of any kind that actually shows how to search for a column name in the Glue Data Catalog. The closest I've come is this article, which suggests that I can use these keywords when searching Glue tables:

  • column_name
  • column_comment
  • table_desc

When I try using one of these (e.g. column_name: star*) it always returns nothing, even though I do have tables that have a column named with "star".

This capability is far too basic and obviously needed, and thus warrants having a sample or some documentation created somewhere.
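
For reference, the Glue API does expose this kind of search programmatically via the SearchTables operation; a rough boto3 sketch (the search text is a placeholder, and this is separate from whatever the console search box does):

import boto3

glue = boto3.client("glue")

# Free-text search across table metadata, including column names.
response = glue.search_tables(SearchText="star", MaxResults=10)
for table in response["TableList"]:
    print(table["DatabaseName"], table["Name"])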

Glue catalog to Hive Metastore Migration script not working with partition table

I'm running the script to migrate Glue catalog data, which was crawled from Hive-style (key=value) partitioned data in S3, to a Hive metastore (MySQL), and the partitions that get created in the Hive metastore are incorrect.
[+] https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration

Note: the Glue catalog data crawled from the partitioned S3 data looks fine. Launching a new EMR cluster with the Glue catalog works, the partition information is correct, and Athena using Glue is able to find the table's partitions properly. But the script migrating table information from the Glue catalog to the metastore gets it wrong, creating totally incorrect partition information in the Hive metastore.

Please find the steps carried out below:

  1. S3 path from which the crawler was executed (you can see the data is in the proper year=value/month=value layout):
a0999b1381a5:~ kuntalg$ aws s3 ls --recursive s3://kg-practice/elb_logging/test
2018-02-06 15:54:06          0 elb_logging/test/
2018-02-06 15:54:35          0 elb_logging/test/year=2015/
2018-02-06 15:55:00          0 elb_logging/test/year=2015/month=01/
2018-02-06 16:31:43         22 elb_logging/test/year=2015/month=01/test2.csv
2018-02-06 15:55:08          0 elb_logging/test/year=2015/month=02/
2018-02-06 16:31:57         22 elb_logging/test/year=2015/month=02/test3.csv

  2. Once the crawler was done, I launched a new EMR cluster pointing to the Glue Catalog. After that I executed the following commands on the cluster (Hive & Spark), and they show the partitions in the proper format:
scala> spark.sql("SHOW PARTITIONS test").show(30,false)
18/02/06 16:27:59 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
+------------------+
|partition         |
+------------------+
|year=2015/month=01|
|year=2015/month=02|
+------------------+

hive> show partitions test;
OK
year=2015/month=02
year=2015/month=01
Time taken: 0.54 seconds, Fetched: 2 row(s)

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.651 seconds, Fetched: 10 row(s)


hive> select * from test;
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
1	kuntal	2015	01
2	Rock	2015	01
3	Cena	2015	01
Time taken: 3.047 seconds, Fetched: 6 row(s)

hive> select * from test where year='2015' and month='02';
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
Time taken: 0.971 seconds, Fetched: 3 row(s)

================
3. Kindly note that the catalog-to-metastore migration script (export_from_datacatalog.py) did not work, failing with the following key constraint error:

"duplicate entry for key 'UNIQUE_DATABASE'
.....
java.sql.BatchUpdateException: Field 'IS_REWRITE_ENABLED' doesn't have a default value"

I found that the column 'IS_REWRITE_ENABLED' is in the table hive.TBLS. A strange thing I found is that this column can be NULL in the table definition; however, the Spark job complains about the default value. So I manually logged in to my Hive metastore and updated the default value:
ALTER TABLE hive.TBLS ALTER IS_REWRITE_ENABLED SET DEFAULT 1;

After this small change, the Glue ETL job completed successfully, but the partitions generated by the script are totally incorrect.

Partition messed up

hive> show partitions test;
OK
year(string),month(string)=2015,01
year(string),month(string)=2015,02
Time taken: 0.176 seconds, Fetched: 2 row(s)

Although the table description is the same:

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.492 seconds, Fetched: 10 row(s)

So it's totally an issue with the migration script, and I'm stuck with our migration process.

So kindly look into the issue on an urgent basis and fix the script, or provide me a workaround or solution.

does ML transform support model versioning? Can you specify this in FindMatches?

Can you reference which version of a trained ML Transform to use when running FindMatches?
example:
Train model with labels - version 1
Train model with more labels - version 2

Is there a way to specify in FindMatches to look for duplicates via version 1, even though on the same ML Transform job you have already run additional labels for tuning the model for version 2?

To sum it up: an A vs. B test on the same ML Transform job.

sample code missing import library

Data cleaning with AWS Glue
(aws-glue-samples/examples/data_cleaning_and_lambda.md)

The example does not include the DynamicFrame import from awsglue. Please insert the DynamicFrame class import:

from awsglue.dynamicframe import DynamicFrame

For example:

medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")

Spelling Errors in Instructions

The sentence "These are all still strings in the data. We can use the DynamicFrame's powerful apply_mapping tranform method to drop, rename, cast, and nest the data so that other data programming langages and sytems can easily access" has several spelling errors

Unable to Create GlueContext via GlueContext Function in Local Python/awsglue Environment

I'm having the issue described in issue #42.

I am attempting to run the following in my local PySpark console...

from awsglue.context import GlueContext
glueContext = GlueContext(sc)

We receive the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 47, in __init__
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 68, in _get_glue_scala_context
TypeError: 'JavaPackage' object is not callable

The attached screenshot shows the complete picture.

The environment looks like the following:

  • OS: 10.0.17134.0
  • Python: 3.7.3
  • Hadoop (winutils.exe): 2.8.5
  • Spark: 2.4.3
  • PySpark: 2.4.6
  • awsglue: 1.0

My environment variables look like the following...

  • SPARK_HOME: \bin\spark-2.4.3-bin-hadoop2.8\
  • SPARK_CONF_DIR: \bin\aws-glue-libs\conf\
  • HADOOP_HOME: \bin\hadoop-2.8.5\
  • SPARK_CONF_DIR: \bin\spark-2.4.3-bin-hadoop2.8\
  • JAVA_HOME: C:\Progra~2\Java\jdk1.8.0\
  • CLASSPATH:
    • \bin\aws-glue-libs\jarsv1*
    • \bin\spark-2.4.3-bin-hadoop2.8\jars*
  • PYTHONPATH:
    • ${SPARK_HOME}\python\lib\py4j
    • \bin\aws-glue-libs\PyGlue.zip

Just to confirm which version of the awsglue repo I'm working with (see the attached screenshot).

The attached screenshot shows the "netty" files in my ..\aws-glue-libs\jarsv1\.

I'm looking for a little guidance on how to tweak my configuration to resolve this issue.

AWS Glue job is failing for larger csv data on s3

For small S3 input files (~10 GB), the Glue ETL job works fine, but for the larger dataset (~200 GB), the job is failing.

Adding a part of the ETL code:

# Convert the dynamic frame to a data frame
df = dropnullfields3.toDF()
# Create a new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
# Store the data in Parquet format on S3
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

The job executed for 4 hours and then threw an error:

File "script_2017-11-23-15-07-32.py", line 49, in
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o172.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3385 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 30 more

End of LogType:stdout
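
A hedged sketch of two mitigations that are sometimes suggested for this error, assuming the Spark 2.x Glue runtime shown in the trace; neither is an official fix, and the values used here are only examples. The error means the accumulated task results of the write exceeded spark.driver.maxResultSize, so either the limit can be raised when the SparkContext is created, or the number of write tasks can be reduced.

from pyspark import SparkConf
from pyspark.context import SparkContext

# Option 1 (hypothetical tuning): raise the limit at context creation time.
conf = SparkConf().set("spark.driver.maxResultSize", "4g")
sc = SparkContext(conf=conf)

# Option 2: cut the number of write tasks before saving; fewer tasks means a
# smaller total of serialized task results sent back to the driver.
partitioned_dataframe = partitioned_dataframe.repartition(500, 'part_date')
partitioned_dataframe.write.partitionBy('part_date') \
    .format("parquet") \
    .save(output_lg_partitioned_dir, mode="append")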

Connection to Redshift times out from a SageMaker Jupyter notebook

I am trying to push data to Amazon Redshift as described in the join_and_relationalize.md sample.

The write operation works flawlessly when I run it within an AWS Glue job, but it times out when I run it from an Amazon SageMaker Jupyter PySpark notebook connected to an AWS Glue Dev Endpoint (both use the same AWS Glue connection). I'm guessing it has to do with networking, but the documentation doesn't say anything about this.

Can AWS elaborate on how we must configure an Amazon SageMaker Jupyter notebook in order to write to Amazon Redshift using PySpark?

An error was encountered:
An error occurred while calling o87.getJDBCSink.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
Caused by: com.amazon.support.exceptions.GeneralException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	... 31 more
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:454)
	at sun.nio.ch.Net.connect(Net.java:446)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645)
	at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:107)
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 665, in from_jdbc_conf
    redshift_tmp_dir, transformation_ctx)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/context.py", line 311, in write_dynamic_frame_from_jdbc_conf
    catalog_id)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/PyGlue.zip/awsglue/context.py", line 326, in write_from_jdbc_conf
    transformation_ctx, catalog_id)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/yarn/usercache/livy/appcache/application_1588365285924_0011/container_1588365285924_0011_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o87.getJDBCSink.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
Caused by: com.amazon.support.exceptions.GeneralException: [Amazon](500150) Error setting/closing connection: Connection timed out.
	... 31 more
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.Net.connect0(Native Method)
	at sun.nio.ch.Net.connect(Net.java:454)
	at sun.nio.ch.Net.connect(Net.java:446)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:645)
	at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:107)
	at com.amazon.redshift.client.PGClient.connect(Unknown Source)
	at com.amazon.redshift.client.PGClient.<init>(Unknown Source)
	at com.amazon.redshift.core.PGJDBCConnection.connect(Unknown Source)
	at com.amazon.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
	at com.amazon.jdbc.common.AbstractDriver.connect(Unknown Source)
	at com.amazon.redshift.jdbc.Driver.connect(Unknown Source)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:895)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:891)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:847)
	at scala.Option.getOrElse(Option.scala:121)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:847)
	at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:890)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:670)
	at com.amazonaws.services.glue.util.JDBCWrapper.getRawConnection(JDBCUtils.scala:683)
	at com.amazonaws.services.glue.RedshiftDataSink.<init>(RedshiftDataSink.scala:40)
	at com.amazonaws.services.glue.GlueContext.getSink(GlueContext.scala:650)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:463)
	at com.amazonaws.services.glue.GlueContext.getJDBCSink(GlueContext.scala:445)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
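
This class of timeout usually points to VPC networking (subnets, security groups, routes) rather than the PySpark code itself. Below is a minimal diagnostic sketch, not an official answer; the connection and endpoint names are hypothetical. It compares the networking attached to the Glue connection with that of the dev endpoint, both of which need to be able to reach the Redshift cluster.

import boto3

glue = boto3.client("glue")

conn = glue.get_connection(Name="my-redshift-connection")["Connection"]
endpoint = glue.get_dev_endpoint(EndpointName="my-dev-endpoint")["DevEndpoint"]

# Compare where each one lives; the Redshift cluster's security group must
# also allow inbound traffic from these security groups on the cluster port.
print("Connection subnet / security groups:",
      conn["PhysicalConnectionRequirements"].get("SubnetId"),
      conn["PhysicalConnectionRequirements"].get("SecurityGroupIdList"))
print("Dev endpoint subnet / security groups:",
      endpoint.get("SubnetId"), endpoint.get("SecurityGroupIds"))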

Issue with Direct Migration from Glue Catalog to Hive Metastore

We have a Glue Catalog in our dev AWS account, and I am now trying to migrate it to the Hive Metastore of an EMR cluster. (I need to replace the Hive Metastore contents with the Glue Catalog metadata so that I can track our Glue Catalog data lineage using Apache Atlas installed on the EMR cluster.)

I followed all the steps and procedures to directly migrate the Glue Catalog to the Hive Metastore, but I get the Duplicate entry 'default' for key 'UNIQUE_DATABASE' error every time. I have tried various iterations but keep getting the same error when I run the Glue ETL job. See the full error message below:

py4j.protocol.Py4JJavaError: An error occurred while calling o1025.jdbc.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 77.0 failed 4 times, most recent failure: Lost task 1.3 in stage 77.0 (TID 1298, ip-00-00-000-000.ec2.internal, executor 32): java.sql.BatchUpdateException: Duplicate entry 'default' for key 'UNIQUE_DATABASE'
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1815)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1277)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:642)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:783)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:783)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'default' for key 'UNIQUE_DATABASE'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:971)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2530)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1907)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2141)
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1773)

As per the migration procedure, we need to provide values for the "--database-names" parameter, so I first tried specifying the list of databases separated by a semicolon (;) and then tried specifying just one database, such as "default", but neither works. The above error was thrown when I used just the default database.

Is anyone familiar with this error? Is there any workaround for this issue? Please let me know if I am missing something. Any help is appreciated.
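
One possible workaround, offered only as a sketch and not as part of the migration scripts: a Hive metastore always ships with a 'default' database, so databases that already exist in the target metastore can be filtered out of the migration DataFrame before the JDBC write. The databases_df name, JDBC URL, and credentials below are hypothetical.

# Read the names of databases already present in the target Hive metastore.
existing_dbs = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-metastore-host:3306/hive")  # hypothetical
    .option("dbtable", "DBS")
    .option("user", "hive")
    .option("password", "********")
    .load()
    .select("NAME"))

# Drop rows whose database name already exists (e.g. 'default') before appending.
new_databases_df = databases_df.join(
    existing_dbs, databases_df["NAME"] == existing_dbs["NAME"], "left_anti")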

Could not configure Spark Submit parameters

Hi Team,
I'm running an ETL job in AWS Glue which reads table data, processes it, and writes the result to S3. The job is failing with the following error message:
ExecutorLostFailure (executor 38 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

When I tried to modify the memory configuration with the --conf parameter, it was not reflected in the job execution, and the job still fails with the same error.

When I searched for information online (link given below), it said we cannot modify Spark configuration parameters in Glue. Can you please confirm? If so, can you suggest any alternatives for a successful execution?

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

Thanks in Advance!

Regards,
Vamshi
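
Glue does not accept arbitrary spark-submit parameters, so the usual alternative is to give the job more or larger workers rather than tuning spark.yarn.executor.memoryOverhead directly. Below is a minimal sketch using the Glue API; the job name, role, and script location are hypothetical placeholders for the existing job definition, and JobUpdate replaces the whole definition, so all required fields must be supplied.

import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="my-etl-job",  # hypothetical
    JobUpdate={
        "Role": "my-glue-service-role",  # hypothetical; required by the API
        "Command": {"Name": "glueetl",
                    "ScriptLocation": "s3://my-bucket/scripts/my-etl-job.py"},
        "WorkerType": "G.2X",        # larger executors than the default
        "NumberOfWorkers": 20,       # scale out as well as up
    })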

Local execution of AWS Glue

Trying to run AWS Glue locally with AWSGlue.zip throws the following error:

~/opt/spark-2.2.0-bin-hadoop2.7/bin/pyspark
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/18 18:36:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/18 18:36:23 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.15 (default, Dec 14 2018 13:10:39)
SparkSession available as 'spark'.
>>> from awsglue.dynamicframe import DynamicFrameWriter
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'GlueContext' is not defined
>>> from awsglue.context import GlueContext
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "awsglue/context.py", line 44, in __init__
    self._glue_scala_context = self._get_glue_scala_context(**options)
  File "awsglue/context.py", line 64, in _get_glue_scala_context
    return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
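
This TypeError usually means the Glue Scala classes are not on Spark's classpath, so the JVM-side GlueContext cannot be instantiated. A minimal sketch of one way to run a standalone script locally, assuming the jars have already been built or fetched via awslabs/aws-glue-libs; the jar directory path is hypothetical.

from pyspark.sql import SparkSession
from awsglue.context import GlueContext

jars = "/path/to/aws-glue-libs/jarsv1/*"  # hypothetical location of the Glue jars

spark = (SparkSession.builder
         .appName("glue-local")
         .config("spark.driver.extraClassPath", jars)
         .config("spark.executor.extraClassPath", jars)
         .getOrCreate())

glueContext = GlueContext(spark.sparkContext)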

Example of a custom data sink in Python

I'm looking for information on how to load data into services other than S3 or Redshift -- something completely custom (e.g. some HTTP API). I'm not sure whether I need to implement a specific "DataSink" or "writer", or whether it's okay to just, e.g., Map.apply over my DataFrame and call some API while processing my records.
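
For what it's worth, one lightweight pattern, sketched here under the assumption of a Python 3 job, a DynamicFrame named dyf, and a hypothetical HTTP endpoint, is to skip a custom sink entirely and push records from the executors with foreachPartition:

import json
import urllib.request

API_URL = "https://example.com/ingest"  # hypothetical endpoint

def post_partition(rows):
    # One HTTP call per record for simplicity; batching per partition would
    # usually be preferable in practice.
    for row in rows:
        payload = json.dumps(row.asDict()).encode("utf-8")
        req = urllib.request.Request(API_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

dyf.toDF().foreachPartition(post_partition)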

Issue With Direct Migration from Hive Metastore to Glue Data Catalog

Hello,
I hope you are doing well. At my company, we are in the process of adopting Athena for our ad hoc data analytics tasks and have configured Athena to use the Glue Data Catalog as its metastore. However, the source of truth for our data is a Hive metastore hosted on an AWS RDS MySQL instance, and we want the Glue Data Catalog to stay in sync with our Hive metastore. For this, we followed the instructions outlined in awslabs' aws-glue-samples repo (https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration) to set up a Glue job that we will run (eventually periodically) to directly migrate the Hive metastore to the Glue Data Catalog.

However, we are running into a situation where the job finishes successfully but we don't see anything migrated into our Glue Data Catalog. I am wondering whether others have experienced this issue and whether it was resolved. I see a similar issue here (I even replied to that thread): #6, but it is closed and I don't see a solution posted in it.

Any help is greatly appreciated.

Thanks,
Kshitij

Question: Group by key

First, thanks for the great work on AWS Glue. One question: Can I use AWS Glue to group by key, and store the results in distinct files on S3 as follows?

Sample data in an S3 bucket:

{ "key1": "A", "key2": "1" }
{ "key1": "B", "key2": "3" }
{ "key1": "A", "key2": "2" }
{ "key1": "B", "key2": "4" }

Expected result:

file1:

{ "key1": "A", "key2": "1" }
{ "key1": "A", "key2", "2" }

file2:

{ "key1": "B", "key2", "3" }
{ "key1": "B", "key2", "4" }

This is what I tried:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db", 
    table_name = "my_input_table", 
    transformation_ctx = "datasource0")
partitioned_dataframe = datasource0.toDF().repartition(2, "key1")
partitioned_dynamicframe = DynamicFrame.fromDF(
    partitioned_dataframe, 
    glueContext, 
    "partitioned_df")
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = partitioned_dynamicframe, 
    connection_type = "s3", 
    connection_options = {"path": "s3://my-result-bucket/sub-bucket"}, 
    format = "json", 
    transformation_ctx = "datasink2")

Yet as the FAQ clearly states, there is no guarantee that data with different keys will end up in different partitions, hence they may also end up in the same file.

From the FAQ:

Note that while different records with the same value for this column 
will be assigned to the same partition, there is no guarantee that there 
will be a separate partition for each distinct value.

So I guess there must be a different way to achieve this. Any ideas?
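
One approach that gets close to this, sketched below rather than guaranteed, is to partition the output by key1 at write time using the partitionKeys connection option, so each distinct key1 value lands under its own S3 prefix (e.g. .../key1=A/ and .../key1=B/). It reuses the names from the snippet above.

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = partitioned_dynamicframe,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-result-bucket/sub-bucket",
        "partitionKeys": ["key1"]   # one S3 prefix per distinct key1 value
    },
    format = "json",
    transformation_ctx = "datasink2")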

How to connect to my AWS account

Hi,

In the sample code I do not see where we pass an AWS access key and secret key to connect to AWS. Can you tell me where I can add those details in order to run this code against my AWS account?

Thanks.
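
A short note with a minimal sketch, assuming local execution: when the samples run as Glue jobs, they use the job's IAM role, which is why no keys appear in the scripts. Locally, boto3 and the AWS CLI pick credentials up from the standard provider chain (environment variables, ~/.aws/credentials, or an assumed role); the profile name below is just an example.

import boto3

session = boto3.Session(profile_name="default")  # example profile from ~/.aws/credentials
glue = session.client("glue")
print(session.region_name,
      [db["Name"] for db in glue.get_databases()["DatabaseList"]])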

Join and Relationalize runs forever

Hello,

I'm trying to run the join_and_relationalize.py example. I let it run for 2h24m before killing it. It just keeps repeating messages like this:

19/09/26 20:22:17 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: [removed_ip]
ApplicationMaster RPC port: [removed_port]
queue: default
start time: 1569529172541
final status: UNDEFINED
tracking URL: http://ip-[removed].glue.dnsmasq:20888/proxy/application_[removed_id]/
user: root

It's creating tables in Redshift and populating them, but I would not expect 11 MB of data to take 2.5 hours to process.

Any idea how to make it work?

Thanks,

Matt

Add gradle build file

It would be nice if this project had a Gradle build file with all the right Scala and AWS library versions.

Migrate Directly from Hive to AWS Glue | No tables created

hive_databases_S3.txt
hive_tables_S3.txt

I am trying to migrate directly from Hive to AWS Glue.
I created the appropriate Glue job with a Hive connection.
I tested the connection, and it connected successfully.
I basically followed all the steps, and everything was successful.
But in the end I can't see any tables in the AWS Glue catalog.
There are no error logs for the job, and the normal logs show the run status as succeeded.

I even tried migrating from Hive to AWS Glue using Amazon S3 objects.
That too was successful, but no tables were created in the Glue catalog.
I could find the metastore exported from Hive in the S3 buckets (files attached).

Now I am thinking of running this code from my local Eclipse to debug it.
Can you please tell me how to debug this from local Eclipse or in the Glue console?

Unclear instructions

In this example:
https://github.com/awslabs/aws-glue-samples/blob/master/examples/join_and_relationalize.md
For "1. Crawl our sample dataset", can you specify whether the user is supposed to copy the dataset from your sample bucket to their own bucket before crawling, or just use your bucket as the source? The latter doesn't work; I got:
"[626034be-45ee-4e8d-ae26-e985e8131ff9] ERROR : Error Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: F24B84B451EAEC7A) retrieving file at s3://awsglue-datasets/examples/us-legislators/house/persons.json. Tables created did not infer schemas from this file."

Also, there's a typo in the S3 path (there is a dot at the end):
"s3://awsglue-datasets/examples/us-legislators."

Possible to pull incremental data from Source

Hey

I am trying to read incremental data from a source that is connected using JDBC. I need to pass a timestamp in the filter. Is this a valid use case, and if so, is there any documentation for it? I am not able to find any.

Thanks
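
A minimal sketch of one way this is commonly done, assuming the job can fall back to plain Spark JDBC for the incremental read; the URL, credentials, table, and watermark value below are hypothetical, and spark is the session obtained from glueContext.spark_session. Wrapping the filter in a subquery pushes it down to the source database, so only rows newer than the last run are pulled.

# Hypothetical watermark, e.g. read from a job argument or a bookmark table.
last_run_ts = "2020-01-01 00:00:00"

incremental_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-db-host:3306/mydb")  # hypothetical
    .option("user", "etl_user")                          # hypothetical
    .option("password", "********")
    .option("dbtable",
            "(SELECT * FROM orders WHERE updated_at > '{}') AS incremental"
            .format(last_run_ts))
    .load())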

Issue while accessing GlueContext locally

Hi,

To develop ETL scripts locally with Python, I followed the steps below:

  1. Downloaded the AWS Glue Python library from GitHub. For Glue version 1.0, I checked out the glue-1.0 branch, as this version supports Python 3.
  2. Installed Apache Maven and added the Maven path to the environment variables.
  3. Installed Apache Spark and set SPARK_HOME in the environment variables.
  4. Installed the JDK and set JAVA_HOME in the environment variables.
  5. Installed Hadoop and set HADOOP_HOME in the environment variables.
  6. Installed pyspark, awscli and boto3.

I chose an ETL script that was automatically generated by AWS Glue and ran the job.
It worked successfully from the AWS console, and the file from the source was loaded into the destination folder. But when I try to execute the same script locally, I encounter the following error:

return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable

Is there any workaround for this?

Thanks

Add description about Spark dependencies version

Scala examples were added to the project, but what Spark dependencies do they use, and which versions?

For example, the code uses spark-sql, but which version? Which Scala language version? Could you please add this information to the project readme.md, or link to where this information can be found?
