Comments (12)
I just ran into this same issue. The problem was that in order to test an updated classifier, you need to create a whole new crawler. Simply updating the classifier and rerunning the crawler will NOT result in the updated classifier being used. This is not intuitive at all and lacks documentation in relevant places.
The only place I found this explicitly mentioned is https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html: "To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier."
This nugget of information needs to be added to every other place custom classifiers are documented in bold capital letters.
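For anyone landing here, a minimal sketch of that workaround with boto3: instead of rerunning the old crawler, create a fresh one that references the updated classifier. The crawler naming scheme, role, database, and S3 path below are placeholders I made up for illustration; only the Glue client calls create_crawler and start_crawler are real boto3 operations.

```python
def build_crawler_request(base_name, version, role_arn, database, s3_path, classifier_name):
    """Build keyword arguments for Glue's create_crawler call.

    The "-v<version>" suffix is a hypothetical naming scheme so that every
    classifier change gets a brand-new crawler.
    """
    return {
        "Name": f"{base_name}-v{version}",
        "Role": role_arn,
        "DatabaseName": database,
        # Point at a directory (prefix), not at an individual file.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Custom classifiers are tried in the order listed here.
        "Classifiers": [classifier_name],
    }


def create_and_start_crawler(glue, request):
    """Create a new crawler from the request and start it.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    return request["Name"]
```

Usage would look roughly like create_and_start_crawler(boto3.client("glue"), build_crawler_request("ourlogs", 2, role_arn, "logs_db", "s3://my-bucket/logs/", "our-grok-classifier")), with all of those values replaced by your own.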
from aws-glue-samples.
Here is one of mine. For log lines like this:
some-log-type: source-host-name 2017-07-01 00:00:01 - {"foo":1,"bar":2}
I set up a Glue custom classifier with:
Grok pattern: %{OURLOGWITHJSON}
Custom patterns:
OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})
OURWORDWITHDASHES \b[\w-]+\b
OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}
GREEDYJSON (\{.*\})
OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$
(Note: Logstash works with GREEDYJSON ({.*}), but Glue's Grok parser rejects that; the braces have to be escaped as shown above.)
and I get rows with four fields:
ourevent: some-log-type
logsource: source-host-name
ourtimestamp: 2017-07-01 00:00:01
json: {"foo":1,"bar":2}
The Grok patterns are a bit more complicated than the minimum to match that,
in particular the colon after "some-log-type" is optional, the ' - ' may
or may not be present, and the timestamp might be in ISO8601 format.
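Since Glue gives little feedback when a pattern fails, it can help to sanity-check the logic locally first. Below is a rough Python translation of the patterns above, not Glue's actual Grok engine: SYSLOGHOST and the timestamp patterns are replaced with simplified stand-ins I chose for this sketch, so it only approximates what Glue will accept.

```python
import json
import re

# Hand-expanded approximation of OURLOGWITHJSON. The hostname and timestamp
# parts are simplified assumptions, not the real Grok definitions.
LOG_LINE = re.compile(
    r"^(?P<ourevent>\b[\w-]+\b):? "                    # OURWORDWITHDASHES, optional colon
    r"(?P<logsource>[\w.-]+)"                          # simplified SYSLOGHOST
    r"(?: (?P<pid>[1-9][0-9]*))? "                     # optional POSINT
    r"(?P<ourtimestamp>"
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"          # simplified ISO8601
    r"|\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"           # YEAR/MONTHNUM/MONTHDAY TIME
    r"(?: - )?[^{]+"                                   # optional ' - ', then filler
    r"(?P<json>\{.*\})$"                               # GREEDYJSON with escaped braces
)


def parse_line(line):
    """Return the named fields for a matching line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Drop optional groups (such as pid) that did not participate in the match.
    return {key: value for key, value in match.groupdict().items() if value is not None}


sample = 'some-log-type: source-host-name 2017-07-01 00:00:01 - {"foo":1,"bar":2}'
fields = parse_line(sample)
print(fields["ourevent"], fields["logsource"], fields["ourtimestamp"])
print(json.loads(fields["json"]))
```

If a line fails here, it will almost certainly fail in Glue too; the reverse is not guaranteed, because Glue's parser is stricter about things like unescaped braces.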
from aws-glue-samples.
I have a lot of text files in S3, under different folders, that contain data laid out in columns. The automatic crawler does not recognize the schema in those files. How do we set up a custom crawler for text files with columnar data?
from aws-glue-samples.
I have the same issue as @vatjujar.
from aws-glue-samples.
We are also experiencing the same issue while trying to parse Apache-style log lines. Everything works perfectly in online Grok debuggers, but manually running a crawler shows nothing. A more detailed example would be greatly appreciated!
from aws-glue-samples.
I updated the text above, so the backslashes are now correctly shown in the GREEDYJSON pattern... (The text above elided the backslashes in front of the braces in the GREEDYJSON pattern -- I needed to add those in order for Glue's Grok parser to accept the pattern.)
from aws-glue-samples.
I have tried many times, but it is not working: all my Grok patterns work well in the Grok debugger but not in AWS Glue.
from aws-glue-samples.
I tried writing a pattern for a single-quoted, semi-JSON data file, and it works in the debugger but not in Glue. Any help is much appreciated!
from aws-glue-samples.
As shown above, I had to include backslashes before the brace characters (see "GREEDYJSON") to get it to match the JSON part of my log lines. The match goes into a string field named json, which I later unbox in a Glue script like this:
...
unbox5 = Unbox.apply(frame = priorframe4, path = "json", format = "json", transformation_ctx = "unbox5")
...
The backslashes weren't necessary in the online Grok debugger or in Logstash, but were necessary in Glue's Grok patterns. Dunno if that's your issue or not, but you might try throwing around some backslashes to see if it helps!
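To make the unbox step concrete, here is a plain-Python stand-in for roughly what it does to one row: the string in the json field is parsed and replaced by the nested structure. This is not the Glue Unbox API, just a local illustration; the function name and the dict-based row are assumptions of this sketch.

```python
import json


def unbox_json_field(row, path="json"):
    """Return a copy of `row` with the JSON string at `path` parsed into a dict.

    Local illustration only: a real Glue job would apply Unbox to a
    DynamicFrame rather than to a plain dict.
    """
    unboxed = dict(row)
    unboxed[path] = json.loads(row[path])
    return unboxed


row = {"ourevent": "some-log-type", "json": '{"foo":1,"bar":2}'}
print(unbox_json_field(row))
# → {'ourevent': 'some-log-type', 'json': {'foo': 1, 'bar': 2}}
```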
from aws-glue-samples.
How do we set up a crawler on S3 buckets that contain "ini" files?
from aws-glue-samples.
Thank you @vkubushyn, you saved me some time. I ran into the same thing here.
from aws-glue-samples.
In addition:
- Glue Grok classifiers and Grok debugger patterns are not exactly the same.
- Don't crawl specific files; crawl the directories instead.
- Multiline records and embedded newlines are not supported, so you need to transform the file contents with a script first.
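For the multiline point, a minimal sketch of such a pre-processing script. It assumes continuation lines are the ones that start with whitespace (as in stack traces); that rule is a guess about the data, so adjust it to whatever marks the start of a record in your logs.

```python
def fold_continuation_lines(lines, joiner=" "):
    """Yield one single-line record per logical record.

    Assumption: a line starting with a space or tab continues the previous
    record. Blank lines are skipped.
    """
    record = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line[0] in (" ", "\t") and record is not None:
            # Continuation: append to the record being built.
            record += joiner + line.strip()
        else:
            # New record: emit the previous one, if any.
            if record is not None:
                yield record
            record = line
    if record is not None:
        yield record


raw_lines = ["ERROR boom\n", "  at foo()\n", "\tat bar()\n", "INFO ok\n"]
for rec in fold_continuation_lines(raw_lines):
    print(rec)
# → ERROR boom at foo() at bar()
# → INFO ok
```

The folded output can then be written back to a separate S3 prefix and crawled from there.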
from aws-glue-samples.