
Comments (12)

vkubushyn commented on August 24, 2024

I just ran into this same issue. The problem is that in order to test an updated classifier, you need to create a whole new crawler. Simply updating the classifier and rerunning the existing crawler will NOT result in the updated classifier being used. This is not intuitive at all, and it is not documented in the relevant places.

The only place this is explicitly mentioned (that I found) is https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html: "To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier."

This nugget of information needs to be added, in bold capital letters, to every other place custom classifiers are documented.
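
If you are scripting this, here is a minimal sketch of the workaround using boto3. The crawler name, role ARN, database, bucket path, and classifier name below are all placeholders I made up, not anything from this thread:

import boto3

glue = boto3.client("glue")

# Updating a classifier does not change the behavior of a crawler that
# already ran with it; the workaround is to create a brand-new crawler
# that references the updated classifier, then run that one instead.
glue.create_crawler(
    Name="my-updated-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueRole",  # placeholder role
    DatabaseName="my_database",  # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/logs/"}]},  # placeholder path
    Classifiers=["my-custom-classifier"],  # the updated classifier, by name
)
glue.start_crawler(Name="my-updated-crawler")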


wintersky commented on August 24, 2024

Here is one of mine. For log lines like this:
some-log-type: source-host-name 2017-07-01 00:00:01 - {"foo":1,"bar":2}

I set up a Glue custom classifier with:

Grok pattern: %{OURLOGWITHJSON}

Custom patterns:
OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})
OURWORDWITHDASHES \b[\w-]+\b
OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}
GREEDYJSON (\{.*\})
OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$

(Note that Logstash accepts GREEDYJSON ({.*}) without the backslashes, but Glue's Grok parser rejects that form.)

and I get rows with four fields:
ourevent: some-log-type
logsource: source-host-name
ourtimestamp: 2017-07-01 00:00:01
json: {"foo":1,"bar":2}

The Grok patterns are a bit more complicated than the minimum needed to match that line: in particular, the colon after "some-log-type" is optional, the ' - ' may or may not be present, and the timestamp might be in ISO8601 format.
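
For anyone setting this up programmatically rather than in the console, this is roughly how the same classifier could be created with boto3. A sketch only; the classifier name and classification label are placeholders, and only the patterns come from the comment above:

import boto3

glue = boto3.client("glue")

CUSTOM_PATTERNS = r"""
OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})
OURWORDWITHDASHES \b[\w-]+\b
OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}
GREEDYJSON (\{.*\})
OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$
""".strip()

glue.create_classifier(
    GrokClassifier={
        "Classification": "our-logs",  # placeholder classification label
        "Name": "our-log-with-json",   # placeholder classifier name
        "GrokPattern": "%{OURLOGWITHJSON}",
        "CustomPatterns": CUSTOM_PATTERNS,
    }
)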


naginenisridhar commented on August 24, 2024

I have a lot of text files in our S3 buckets, under different folders, with columnar data sections. The automatic crawler does not recognize the schema in those files. How do we set up a custom crawler for text files with columnar data?


billmetangmo commented on August 24, 2024

I ran into the same issue as @vatjujar.


loudmouth commented on August 24, 2024

We are also experiencing the same issue while trying to parse Apache-style log lines: everything works perfectly in online Grok debuggers, but manually running a crawler shows nothing. A more detailed example would be greatly appreciated!


wintersky commented on August 24, 2024

I updated the text above so that the backslashes are now correctly shown in the GREEDYJSON pattern. (The original text elided the backslashes in front of the braces in the GREEDYJSON pattern; I needed to add those in order for Glue's Grok parser to accept the pattern.)


ramzanfarooq commented on August 24, 2024

I have given it many tries, but it is not working. All my Grok patterns work well in the Grok debugger, but not in AWS Glue.


bmardimani commented on August 24, 2024

I tried writing a pattern for a single-quoted, semi-JSON data file, and it works in the debugger but not in Glue. Any help is much appreciated!


wintersky commented on August 24, 2024

As shown above, I had to include backslashes before the brace characters (see "GREEDYJSON") to get the pattern to match the JSON part of my log lines. That part is captured into a string field named json, which I later unbox in a Glue script like this:
...
unbox5 = Unbox.apply(frame = priorframe4, path = "json", format = "json", transformation_ctx = "unbox5")
...

The backslashes weren't necessary in the online Grok debugger or in Logstash, but were necessary in Glue's Grok patterns. Dunno if that's your issue or not, but you might try throwing around some backslashes to see if it helps!
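
For context, here is a minimal sketch of where an Unbox call like that sits in a Glue job. Only the Unbox line comes from the comment above; the database and table names are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Unbox
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the catalog table that the crawler (using the Grok classifier) produced.
priorframe4 = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",  # placeholder catalog database
    table_name="our_logs",   # placeholder table created by the crawler
    transformation_ctx="priorframe4",
)

# Unbox the string field "json" into a nested struct.
unbox5 = Unbox.apply(
    frame=priorframe4, path="json", format="json", transformation_ctx="unbox5"
)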


naginenisridhar commented on August 24, 2024

How do we set up a crawler on S3 buckets with "ini" file formats?


danilocgomes commented on August 24, 2024

Thank you @vkubushyn, you saved me some time. I faced the same issue here.


BwL1289 commented on August 24, 2024

In addition:

  1. Glue Grok classifiers and Grok debugger patterns are not exactly the same.
  2. Don't crawl specific files; instead, crawl the directories.
  3. Multiline and newline records are not supported, so you need to transform the file contents via a script first (see the sketch below).
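
On point 3, a minimal sketch of that kind of preprocessing step. Purely illustrative: the record-start regex and file names are assumptions, not anything Glue-specific; the idea is just to collapse each multiline record onto a single line before crawling:

import re

# Grok classifiers in Glue match one record per line, so multiline
# records have to be collapsed into single lines before crawling.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")  # placeholder: records start with a date

def collapse_multiline(lines):
    record = []
    for line in lines:
        # A new record begins: flush whatever we buffered so far.
        if RECORD_START.match(line) and record:
            yield " ".join(record)
            record = []
        record.append(line.rstrip("\n"))
    if record:
        yield " ".join(record)

with open("raw.log") as src, open("flattened.log", "w") as dst:
    for rec in collapse_multiline(src):
        dst.write(rec + "\n")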

