awslabs / athena-glue-service-logs Goto Github PK

View Code? Open in Web Editor NEW

140.0 14.0 44.0 390 KB

Glue scripts for converting AWS Service Logs for use in Athena

License: Apache License 2.0

Makefile 3.47% Python 96.53%

glue-scripts cloudtrail-logs elb-logs athena glue-job aws-glue s3-log-parser cloudfront-logs alb-logs vpc-flow-logs

athena-glue-service-logs's People

Contributors

Stargazers

Watchers

athena-glue-service-logs's Issues

Bring current with changes in Glue

I left and returned to AWS. Since that time, Glue has:

Added support for defining table names with crawlers
Added support for schema evolution
Upgraded to Python 3 (#16)
Added support for incremental crawler runs

I want to research what impact these changes have and if they should be incorporated. Initial thoughts are that:

Some of the manual table creation I'm doing could be replaced with crawlers
Python3 upgrade is a must

In addition, I need to go through the backlog. :)

ERROR TaskSetManager: Total size ... is bigger than spark.driver.maxResultSize (1024.0 MB)

I cannot run the CloudFront task without getting this responses.

ERROR TaskSetManager: Total size of serialized results of 3055 tasks (1052.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

aws_account_id partition

I've tried this out and it works great. The only thing missing is support for org Cloudtrails or aggregated log account buckets. If this could be pointed at an aggregator account and create a single table for all accounts with accounts as partitions, I could easily search for what a user has done across all our accounts.

Replace INSIDE_QS with QS

There are several places where INSIDE_QS is used instead of QS particularly in ALB regex. As noted in #3 this can cause issues in certain cases.

However, we have to be careful in replacing this - there are some fields where if they are empty, they will be a simple hyphen (-) or a quoted hyphen ("-") depending on the log entry. This pops up specifically with S3 access logs, where some entries can be quoted strings, a quoted hyphen, or a non-quoted hyphen. If the grok expression doesn't match, data won't be returned so could go missing.

Further, these situations can be difficult to detect. In a simple test, I replaced INSIDE_QS with QS and did a COUNT(*) of the two tables with Athena. The same number was returned. However, when I did a COUNT(*) with a WHERE statement filtering by request_id prefix...different results were returned. I'm guessing this is due to a COUNT(*) not deserializing everything(?).

Clarify documentation to note that ingesting s3 access logs from a sub-prefix inside prefix does not work

I ran into an issue where we aggregate S3 distribution logs from a variety of sources into one account, and the logs are broken down into sub-prefix:

s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/<log-file-name>

I was trying to run a single Glue job for s3://<log-bucket>/s3_distribution_logs/ to populate all <deployment-name-sub-prefix> logs into the same CONVERTED_TABLE_NAME. In this case, the RAW_TABLE_NAME athena table was getting populated, the job would initially error withe the below error, then on subsequent runs would run "successfully". Unfortunately, I wouldn't get any logs into my CONVERTED_TABLE_NAME Athena table.

With continuous logging enabled, and a little tinkering, I tracked the issue down to _get_first_key_in_prefix():

line 128, in _get_first_key_in_prefix
    first_object = response.get('Contents')[0].get('Key')
    TypeError: 'NoneType' object is not subscriptable

The values going into self.s3_client.list_objects_v2(**query_params) were:

{'Bucket': 'reformated-log-bucket', 'Prefix': 's3_access/', 'MaxKeys': 10}

from the glue_jobs.json:

"S3_CONVERTED_TARGET":"s3://reformated-log-bucket/s3_access/"

Its entirely unclear to my why, since I'm VERY new to both this project and Glue in general, but if I supply

"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/"

instead of:

"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/"

...it just work. Albeit with a smaller subset of data than I wanted. This may also be related to #30

Skipping partition as no new files detected

Hi,

My log bucket is fairly large in size, however we have Glaicered anything older than three months. When I run the job, I get the following, as it completes in a minute or two:

19/09/25 13:26:02 WARN HadoopDataSource: Skipping Partition
{}as no new files detected @ s3://<BUCKET>/ / or path does not exist

where is the name of my S3 access log storage bucket.

My logs are being saved at top-level in the S3 bucket, i.e. all log files are at s3:///

What could be happening here? I know there are logs in the bucket that are not partitioned, and the converted DB/tables are empty when I preview them. I have given the classification of the raw data table as CSV, but I am not sure what is correct.

Any pointers would be appreciated!

S3 Server Access not working

Not working if bucket naming is not as specified format.

When tested on 6 files only, It still created the Athena table in Parquet format but no results on query.

For larger load in GB's, it just kept failing with OOM or error code 137

Investigate Glue Crawlers and Workflows

Crawlers can now use existing tables as a crawler source, which may give us the ability to deprecate our custom partitioning code that searches S3 for new partitions.

In combination with Workflows, we could easily trigger a Crawler to run after our job is finished.

Add AWS WAF logs

Another AWS log format that would be very useful to process with a Glue Job and query with Athena is AWS WAF logs. Today we can save it to an ElasticSearch, Redshift, Splunk, or S3. Saving to ElasticSearch, Redshift, and Splunk would require infrastructure and VPC on us-east-1 which is not always desirable. Save to S3 is posing a difficulty because the json format makes files too large (want to save money on S3). Save to S3 on parquet format is not helping us because we don't know the format the data is in inside.

Error: Container is running beyond physical memory limits

I'm trying to run some analysis on a collection of S3 Access Logs, and set up a Glue job using the steps in the README to do so. The set of logs is about 14 GB over 12.8 million files. Whenever I kick off the job, it runs for about 13 minutes and then fails with a Command failed with exit code 1 message. Looking at the logs, I see this line that seems important:

Diagnostics: Container [pid=11027,containerID=container_1569865532923_0001_01_000001] is running beyond physical memory limits. Current usage: 5.5 GB of 5.5 GB physical memory used; 7.7 GB of 27.5 GB virtual memory used. Killing container.

This is corroborated by CloudWatch metrics, which show the driver memory usage steadily climbing and the executor staying low.

Based on the athena_glue_service_logs blog post here, it seems like my volume of data is well within the expected limits. I retried the job after adding the --conf parameter set to spark.yarn.executor.memoryOverhead=1G, but it failed in the same way.

Any advice for getting this to work are appreciated - otherwise I'll follow the Glue documentation suggestion of writing a script to do the conversion using DynamicFrames.

Partition subset of S3 logs based on date

I am working with a large set of S3 access logs (90 days). The first 60 days have been moved to Glacier. I really only need the last 30 days. Its still a large amount of objects (~1.5 million). Could you point me to the easiest way to modify this code base to run for smaller date range? Even one day at a time would suffice. Thanks.

Overcoming memory issues by configurable repartition step

Hi
We got some S3 access logs stored in S3 and we tried to use this lib but we were unable to make the job run on that dataset: 25GB s3 access logs/day for 30days.

I've tried with:

150 standard DPUs
100 G.1X
50 G.2X
all with many combinations of memory settings to no avail.

I instead went to the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66
I also had to add spark.hadoop.fs.s3.maxRetries=20 since it now makes quite a lot S3 calls which caused throttling.

The job succeeded with 100 'standard' workers after only 4hours.
The drawback is of course that more objects were created: between 50-140 per day-partition. For smaller datasets the amount of files are higher: some thousands.

But for us at least it is better to have the jobs succeeding, than having no log data at all. Also, for our use case, the athena query performance will be good enough.

Would it make sense to make the repartitioning step configurable? I.e being able to skip it.
I can foresee that someone will mention the option to use coalesce instead of repartition. I have tried that already, and that only failed as well.

Another option is to have a (separate?) step that reduces the number of objects but more efficiently.

Add Redshift audit logs processing.

Another AWS log format that would be very useful to process with a Glue Job and query with Athena (or Redshift Spectrum) is Redshift audit logs, specifically the user log, user activity log, and connection log file formats created in S3 by the feature. See here for the Redshift feature that creates these logs.

Duplicate S3 access log request IDs and glue deduping

We are testing version 6.0.0 of the tool using Glue 3.0 and have noticed that some access log data is being deduped when files are converted into hive/parquet format.

An example from our access logs are

9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT f1683055189142x766494105173435800/IMG_1138.jpeg - 204 - - - - - - - - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE - "POST /BUCKET_NAME/?delete HTTP/1.1" 200 - 305 - 29 - "-" "-" - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -

The Athena query output is as follows

#	bucket_owner	bucket	time	remote_ip	requester	request_id	operation	key	request_uri	http_status	error_code	bytes_sent	object_size	total_time	turnaround_time	referrer	user_agent	version_id	host_id	signature_version	cipher_suite	authentication_type	host_header	tls_version	year	month	day
1	9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8	BUCKET_NAME	2023-05-02 19:19:55.000	52.12.241.113	IAM_ARN_HERE	1C25EZNCB2HBMQQY	REST.POST.MULTI_OBJECT_DELETE		POST /BUCKET_NAME/?delete HTTP/1.1	200		305		29					Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg==	SigV2	ECDHE-RSA-AES128-GCM-SHA256			TLSv1.2 - -	2023	05	02

This is concerning because the Athena query output does not show which file was deleted. It appears the second s3 access log entry is overwriting the first when the file is converted.

Is this expected?
Is there a way to log both access log entries?

Thank you for taking the time to look into this.

Better datatypes for S3 access logs?

Just wondering if bytes_sent and object_size could be switched from type string to int or bigint for the optimized table. Is there a reason these are set the way they are?

https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/s3_access.py#L112

TempDir is now required to be a valid location

Previously the TempDir was only used in the case you were doing Redshift transformations. However, it's also now used for bookmark tracking. We need to ensure a valid location is provided as part of the Makefile.

Support for multiple jobs for a service type?

We have multiple CloudFront distributions (across a couple different AWS accounts). It seems like the job definition file only allows a single job per service type. It seems like allowing the job definition file to be overridden could help both those situations.

Like maybe changing this line https://github.com/awslabs/athena-glue-service-logs/blob/master/Makefile#L6 to:
JOB_DEFINITION_FILE?=scripts/glue_jobs.json ?

Get error running initial job for S3 logs

Using latest from this repo ...

The databases get created, but just after the access_optimized table is created, we receive this error:

glue_service_logs/utils.py", line 128, in _get_first_key_in_prefix TypeError: 'NoneType' object has no attribute '__getitem__' End of LogType:stdout

Full trace:

`20/07/21 09:47:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 09:47:30 INFO MemoryStore: MemoryStore cleared
20/07/21 09:47:30 INFO BlockManager: BlockManager stopped
20/07/21 09:47:30 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 09:47:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 09:47:30 INFO SparkContext: Successfully stopped SparkContext
20/07/21 09:47:30 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
20/07/21 09:47:30 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
20/07/21 09:47:30 INFO ApplicationMaster: Deleting staging directory hdfs://ip-172-32-99-27.ec2.internal:8020/user/root/.sparkStaging/application_1595322435097_0001
20/07/21 09:47:30 INFO ShutdownHookManager: Shutdown hook called
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f/pyspark-11265b43-46c7-46a1-a043-8f182446bdf0
End of LogType:stderr

LogType:stdout
Log Upload Time:Tue Jul 21 09:47:32 +0000 2020
LogLength:2765
Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Initial run, scanning S3 for partitions.
INFO:athena_glue_service_logs.catalog_manager:Creating database aws_service_logs
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_raw
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): rodin-glue.s3.amazonaws.com
Parse yarn logs get error message: TypeError: 'NoneType' object has no attribute 'getitem'
Traceback (most recent call last):
File "script_2020-07-21-09-45-54.py", line 4, in
job_run.convert_and_partition()
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 156, in convert_and_partition
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 137, in add_new_optimized_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/catalog_manager.py", line 89, in get_and_create_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/partitioners/date_partitioner.py", line 34, in build_partitions_from_s3
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/utils.py", line 80, in get_first_hivecompatible_date_in_prefix
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_`

README says to use "service_name=xxx" but Makefile expects "service=xxx"

example:

Error:

make create_job service_name=cloudfront
'service' variable must be defined
make: *** [require_service] Error 1

Success:

make create_job service=cloudfront
test -n CloudFront_Partitioner || (echo "cloudfront service not found" && echo 1)
aws glue create-job --name CloudFront_Partitioner_LogMaster_v5.3.0 \
		--description "CloudFront_Partitioner Log infra generator" \
		--role AWSGlueServiceRoleDefault \
...
...

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.

awslabs / athena-glue-service-logs Goto Github PK

athena-glue-service-logs's People

Contributors

Stargazers

Watchers

Forkers

athena-glue-service-logs's Issues

Recommend Projects

Recommend Topics

Recommend Org