awslabs / athena-glue-service-logs Goto Github PK
View Code? Open in Web Editor NEWGlue scripts for converting AWS Service Logs for use in Athena
License: Apache License 2.0
Glue scripts for converting AWS Service Logs for use in Athena
License: Apache License 2.0
I left and returned to AWS. Since that time, Glue has:
I want to research what impact these changes have and if they should be incorporated. Initial thoughts are that:
In addition, I need to go through the backlog. :)
I cannot run the CloudFront task without getting this responses.
ERROR TaskSetManager: Total size of serialized results of 3055 tasks (1052.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I've tried this out and it works great. The only thing missing is support for org Cloudtrails or aggregated log account buckets. If this could be pointed at an aggregator account and create a single table for all accounts with accounts as partitions, I could easily search for what a user has done across all our accounts.
There are several places where INSIDE_QS
is used instead of QS
particularly in ALB regex. As noted in #3 this can cause issues in certain cases.
However, we have to be careful in replacing this - there are some fields where if they are empty, they will be a simple hyphen (-
) or a quoted hyphen ("-"
) depending on the log entry. This pops up specifically with S3 access logs, where some entries can be quoted strings, a quoted hyphen, or a non-quoted hyphen. If the grok expression doesn't match, data won't be returned so could go missing.
Further, these situations can be difficult to detect. In a simple test, I replaced INSIDE_QS
with QS
and did a COUNT(*)
of the two tables with Athena. The same number was returned. However, when I did a COUNT(*)
with a WHERE
statement filtering by request_id prefix...different results were returned. I'm guessing this is due to a COUNT(*)
not deserializing everything(?).
I ran into an issue where we aggregate S3 distribution logs from a variety of sources into one account, and the logs are broken down into sub-prefix:
s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/<log-file-name>
I was trying to run a single Glue job for s3://<log-bucket>/s3_distribution_logs/
to populate all <deployment-name-sub-prefix>
logs into the same CONVERTED_TABLE_NAME
. In this case, the RAW_TABLE_NAME
athena table was getting populated, the job would initially error withe the below error, then on subsequent runs would run "successfully". Unfortunately, I wouldn't get any logs into my CONVERTED_TABLE_NAME
Athena table.
With continuous logging enabled, and a little tinkering, I tracked the issue down to _get_first_key_in_prefix()
:
line 128, in _get_first_key_in_prefix
first_object = response.get('Contents')[0].get('Key')
TypeError: 'NoneType' object is not subscriptable
The values going into self.s3_client.list_objects_v2(**query_params)
were:
{'Bucket': 'reformated-log-bucket', 'Prefix': 's3_access/', 'MaxKeys': 10}
from the glue_jobs.json
:
"S3_CONVERTED_TARGET":"s3://reformated-log-bucket/s3_access/"
Its entirely unclear to my why, since I'm VERY new to both this project and Glue in general, but if I supply
"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/"
instead of:
"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/"
...it just work. Albeit with a smaller subset of data than I wanted. This may also be related to #30
Hi,
My log bucket is fairly large in size, however we have Glaicered anything older than three months. When I run the job, I get the following, as it completes in a minute or two:
19/09/25 13:26:02 WARN HadoopDataSource: Skipping Partition
{}as no new files detected @ s3://<BUCKET>/ / or path does not exist
where is the name of my S3 access log storage bucket.
My logs are being saved at top-level in the S3 bucket, i.e. all log files are at s3:///
What could be happening here? I know there are logs in the bucket that are not partitioned, and the converted DB/tables are empty when I preview them. I have given the classification of the raw data table as CSV, but I am not sure what is correct.
Any pointers would be appreciated!
Not working if bucket naming is not as specified format.
When tested on 6 files only, It still created the Athena table in Parquet format but no results on query.
For larger load in GB's, it just kept failing with OOM or error code 137
Crawlers can now use existing tables as a crawler source, which may give us the ability to deprecate our custom partitioning code that searches S3 for new partitions.
In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.
Another AWS log format that would be very useful to process with a Glue Job and query with Athena is AWS WAF logs. Today we can save it to an ElasticSearch, Redshift, Splunk, or S3. Saving to ElasticSearch, Redshift, and Splunk would require infrastructure and VPC on us-east-1 which is not always desirable. Save to S3 is posing a difficulty because the json format makes files too large (want to save money on S3). Save to S3 on parquet format is not helping us because we don't know the format the data is in inside.
I'm trying to run some analysis on a collection of S3 Access Logs, and set up a Glue job using the steps in the README to do so. The set of logs is about 14 GB over 12.8 million files. Whenever I kick off the job, it runs for about 13 minutes and then fails with a Command failed with exit code 1
message. Looking at the logs, I see this line that seems important:
Diagnostics: Container [pid=11027,containerID=container_1569865532923_0001_01_000001] is running beyond physical memory limits. Current usage: 5.5 GB of 5.5 GB physical memory used; 7.7 GB of 27.5 GB virtual memory used. Killing container.
This is corroborated by CloudWatch metrics, which show the driver memory usage steadily climbing and the executor staying low.
Based on the athena_glue_service_logs
blog post here, it seems like my volume of data is well within the expected limits. I retried the job after adding the --conf
parameter set to spark.yarn.executor.memoryOverhead=1G
, but it failed in the same way.
Any advice for getting this to work are appreciated - otherwise I'll follow the Glue documentation suggestion of writing a script to do the conversion using DynamicFrames.
I am working with a large set of S3 access logs (90 days). The first 60 days have been moved to Glacier. I really only need the last 30 days. Its still a large amount of objects (~1.5 million). Could you point me to the easiest way to modify this code base to run for smaller date range? Even one day at a time would suffice. Thanks.
Hi
We got some S3 access logs stored in S3 and we tried to use this lib but we were unable to make the job run on that dataset: 25GB s3 access logs/day for 30days.
I've tried with:
150 standard DPUs
100 G.1X
50 G.2X
all with many combinations of memory settings to no avail.
I instead went to the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66
I also had to add spark.hadoop.fs.s3.maxRetries=20 since it now makes quite a lot S3 calls which caused throttling.
The job succeeded with 100 'standard' workers after only 4hours.
The drawback is of course that more objects were created: between 50-140 per day-partition. For smaller datasets the amount of files are higher: some thousands.
But for us at least it is better to have the jobs succeeding, than having no log data at all. Also, for our use case, the athena query performance will be good enough.
Would it make sense to make the repartitioning step configurable? I.e being able to skip it.
I can foresee that someone will mention the option to use coalesce
instead of repartition
. I have tried that already, and that only failed as well.
Another option is to have a (separate?) step that reduces the number of objects but more efficiently.
Another AWS log format that would be very useful to process with a Glue Job and query with Athena (or Redshift Spectrum) is Redshift audit logs, specifically the user log, user activity log, and connection log file formats created in S3 by the feature. See here for the Redshift feature that creates these logs.
We are testing version 6.0.0 of the tool using Glue 3.0 and have noticed that some access log data is being deduped when files are converted into hive/parquet format.
An example from our access logs are
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT f1683055189142x766494105173435800/IMG_1138.jpeg - 204 - - - - - - - - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE - "POST /BUCKET_NAME/?delete HTTP/1.1" 200 - 305 - 29 - "-" "-" - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
The Athena query output is as follows
# bucket_owner bucket time remote_ip requester request_id operation key request_uri http_status error_code bytes_sent object_size total_time turnaround_time referrer user_agent version_id host_id signature_version cipher_suite authentication_type host_header tls_version year month day
1 9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME 2023-05-02 19:19:55.000 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE POST /BUCKET_NAME/?delete HTTP/1.1 200 305 29 Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 - - 2023 05 02
This is concerning because the Athena query output does not show which file was deleted. It appears the second s3 access log entry is overwriting the first when the file is converted.
Thank you for taking the time to look into this.
Just wondering if bytes_sent and object_size could be switched from type string to int or bigint for the optimized table. Is there a reason these are set the way they are?
Previously the TempDir
was only used in the case you were doing Redshift transformations. However, it's also now used for bookmark tracking. We need to ensure a valid location is provided as part of the Makefile.
We have multiple CloudFront distributions (across a couple different AWS accounts). It seems like the job definition file only allows a single job per service type. It seems like allowing the job definition file to be overridden could help both those situations.
Like maybe changing this line https://github.com/awslabs/athena-glue-service-logs/blob/master/Makefile#L6 to:
JOB_DEFINITION_FILE?=scripts/glue_jobs.json
?
Using latest from this repo ...
The databases get created, but just after the access_optimized table is created, we receive this error:
glue_service_logs/utils.py", line 128, in _get_first_key_in_prefix TypeError: 'NoneType' object has no attribute '__getitem__' End of LogType:stdout
Full trace:
`20/07/21 09:47:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 09:47:30 INFO MemoryStore: MemoryStore cleared
20/07/21 09:47:30 INFO BlockManager: BlockManager stopped
20/07/21 09:47:30 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 09:47:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 09:47:30 INFO SparkContext: Successfully stopped SparkContext
20/07/21 09:47:30 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
20/07/21 09:47:30 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
20/07/21 09:47:30 INFO ApplicationMaster: Deleting staging directory hdfs://ip-172-32-99-27.ec2.internal:8020/user/root/.sparkStaging/application_1595322435097_0001
20/07/21 09:47:30 INFO ShutdownHookManager: Shutdown hook called
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f/pyspark-11265b43-46c7-46a1-a043-8f182446bdf0
End of LogType:stderr
LogType:stdout
Log Upload Time:Tue Jul 21 09:47:32 +0000 2020
LogLength:2765
Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Initial run, scanning S3 for partitions.
INFO:athena_glue_service_logs.catalog_manager:Creating database aws_service_logs
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_raw
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): rodin-glue.s3.amazonaws.com
Parse yarn logs get error message: TypeError: 'NoneType' object has no attribute 'getitem'
Traceback (most recent call last):
File "script_2020-07-21-09-45-54.py", line 4, in
job_run.convert_and_partition()
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 156, in convert_and_partition
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 137, in add_new_optimized_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/catalog_manager.py", line 89, in get_and_create_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/partitioners/date_partitioner.py", line 34, in build_partitions_from_s3
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/utils.py", line 80, in get_first_hivecompatible_date_in_prefix
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_`
example:
Error:
make create_job service_name=cloudfront
'service' variable must be defined
make: *** [require_service] Error 1
Success:
make create_job service=cloudfront
test -n CloudFront_Partitioner || (echo "cloudfront service not found" && echo 1)
aws glue create-job --name CloudFront_Partitioner_LogMaster_v5.3.0 \
--description "CloudFront_Partitioner Log infra generator" \
--role AWSGlueServiceRoleDefault \
...
...
$ make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.