
censoredplanet-analysis's People

Contributors

agiix, avirkud, dependabot[bot], eltsai, fortuna, ohnorobo, ramakrishnansr


censoredplanet-analysis's Issues

Scaling issue with backfilling data

We're currently seeing an issue where running a full backfill (over all data from 2018-2023) produces a job that reports success but ends up dropping about 2/3 of the expected rows. The dropping is not spread out evenly; instead, roughly 2/3 of the server IPs are dropped entirely.

Here's an example of what the missing data looks like:
[image]

Running over smaller amounts of data causes the jobs to correctly write all the data. In particular, writing the data out one year per job works correctly.

Examples that succeeded:

  • 2022 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2022-01-01" --end_date="2022-12-31" --scan_type=http --full
  • 2021 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2021-01-01" --end_date="2021-12-31" --scan_type=http
  • 2020 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2020-01-01" --end_date="2020-12-31" --scan_type=http
  • 2019 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2019-01-01" --end_date="2019-12-31" --scan_type=http
  • 2018 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2018-07-27" --end_date="2018-12-31" --scan_type=http
  • 2023 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2023-01-01" --end_date="2023-03-14" --scan_type=http

Example that dropped data:

  • Full https backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.https_scan_full --scan_type=https --full

One thing we're seeing in the jobs with issues is this autoscaling message:
Autoscaling: Unable to reach resize target in zone us-east1-c. QUOTA_EXCEEDED: Instance 'abc' creation failed: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 575.0 in region us-east1.
We're also seeing this error:
Autoscaling: Unable to reach resize target in zone us-east1-c. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: Instance 'abc' creation failed: The zone 'projects/censoredplanet-analysisv1/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.

The job is also not scaling to as many workers as it wants:
[image]
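
Since the per-year jobs above are the runs that write all rows correctly, one workaround is to script the backfill a year at a time. A minimal, hypothetical wrapper (not part of the repo; the --user_dataset value is copied from the examples above and would likely need to change per run):

import subprocess

# One backfill job per year; per-year jobs have not dropped rows so far.
DATE_RANGES = [
    ('2018-07-27', '2018-12-31'),
    ('2019-01-01', '2019-12-31'),
    ('2020-01-01', '2020-12-31'),
    ('2021-01-01', '2021-12-31'),
    ('2022-01-01', '2022-12-31'),
    ('2023-01-01', '2023-03-14'),
]

for start, end in DATE_RANGES:
    # --user_dataset is illustrative here; adjust per run as needed.
    subprocess.run([
        'python3.9', '-m', 'pipeline.run_beam_tables',
        '--env=prod', '--user_dataset=agix.http_scan_full_2022',
        f'--start_date={start}', f'--end_date={end}',
        '--scan_type=http',
    ], check=True)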

Add blockpage matching to pipeline

  • Additional field in the schema for blockpage matching
    • True if match, False if false-positive match, Null otherwise (use a script to update the Null fields later); see the sketch below
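
A minimal sketch of the tri-state field, assuming a hypothetical pattern-based matcher; the pattern lists and function name below are illustrative, not the pipeline's real signatures:

from typing import List, Optional

# Illustrative signature lists; a real matcher would load known blockpage
# and false-positive signatures from data files.
BLOCKPAGE_PATTERNS: List[str] = ['This website has been blocked']
FALSE_POSITIVE_PATTERNS: List[str] = ['404 Not Found']

def blockpage_match(page_body: str) -> Optional[bool]:
    # True if a known blockpage matches, False if a known false positive
    # matches, None otherwise (None rows get updated later by a script
    # once new signatures are added).
    if any(pattern in page_body for pattern in BLOCKPAGE_PATTERNS):
        return True
    if any(pattern in page_body for pattern in FALSE_POSITIVE_PATTERNS):
        return False
    return None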

Hard to read the pipeline

I noticed that we are ingesting files of different types with the same functions in the pipeline. Furthermore, the types are not only different but sometimes only slightly different, causing even more confusion. Add to that the fact that all the different data types are simply called "row", and it becomes a huge effort to make sense of how data is flowing.

We have to clean that up so we can make sense of what's going on, especially with Satellite, which has many types.

Split flows

First and foremost, this call has to change:

lines = _read_scan_text(p, new_filenames)

We should not create a single PCollection with different types. Instead, each file should be its own PCollection.

Then process_satellite_lines should be removed in favor of separate flows that process and join the different datasets.
The partition logic can then be removed.

As a rule of thumb, consider any selection logic based on filenames, like this, harmful:

if filename == SATELLITE_BLOCKPAGES_FILE:

Another similar practice that is also harmful is detecting the source file based on the presence of certain fields.

Both practices can be replaced by creating a separate PCollection for each input type.

This cleanup can be incremental. For instance, it seems that extracting the blockpage logic is quite easy. Just call _process_satellite_blockpages on a PCollection with the blockpage files only.
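
A minimal Apache Beam sketch of that split, with each input type read into its own PCollection; the glob patterns and the json.loads parsing are illustrative placeholders for the real file lists and processing functions such as _process_satellite_blockpages:

import json

import apache_beam as beam

# Illustrative globs standing in for the real per-type file lists.
BLOCKPAGE_GLOB = 'gs://example-bucket/CP_Satellite-*/blockpages.json'
RESULTS_GLOB = 'gs://example-bucket/CP_Satellite-*/results.json'

with beam.Pipeline() as p:
    # One PCollection per input type, so no downstream step ever needs to
    # ask which file a line came from.
    blockpage_rows = (
        p
        | 'read blockpages' >> beam.io.ReadFromText(BLOCKPAGE_GLOB)
        | 'parse blockpages' >> beam.Map(json.loads))
    result_rows = (
        p
        | 'read results' >> beam.io.ReadFromText(RESULTS_GLOB)
        | 'parse results' >> beam.Map(json.loads))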

Then each tag input file can be extracted into its own flow.

I have the impression that this cleanup will speed up the pipeline, as you can have more lightweight workers for many parts of the flow, instead of having to load all the data for all the flows. It will also shuffle less data for joins (each flow can be sorted separately).

Define and clarify row types

Another significant improvement is to name each data type and create type aliases for the existing Row (e.g. SatelliteScan or ResolverTags). We should not see the Row type anywhere. Note that they can all be a generic dictionary type, but the type annotations will help understand what goes where. We should also rename the variables to reflect their type.
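
A minimal sketch of what the aliases could look like, assuming rows remain plain dictionaries; the alias names follow the examples in this issue, and the helper function is hypothetical:

from typing import Any, Dict

# Each input type gets its own alias; the runtime type is still a plain
# dict, but signatures now document which kind of row is flowing where.
SatelliteScan = Dict[str, Any]   # one parsed Satellite measurement
ResolverTags = Dict[str, Any]    # metadata tags for a single resolver
BlockpageRow = Dict[str, Any]    # one parsed blockpage entry

def add_resolver_tags(scan: SatelliteScan, tags: ResolverTags) -> SatelliteScan:
    # The annotations make it obvious which kind of "row" each value is.
    tagged = dict(scan)
    tagged.update(tags)
    return tagged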

/cc @ohnorobo @avirkud

TODOs

  • triple cogroup
  • add strict types
  • remove all filename-based selection logic
  • switch to one 'big cogroup' instead of doing individual joins (see the sketch below)
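
A minimal sketch of the 'big cogroup' idea, assuming each flow has already been keyed on a shared join key; the key and the small in-memory inputs are illustrative:

import apache_beam as beam

with beam.Pipeline() as p:
    # Illustrative keyed inputs; in the pipeline these would be the scan,
    # tag, and blockpage flows keyed on a shared join key such as the IP.
    scans = p | 'scans' >> beam.Create([('1.2.3.4', {'domain': 'example.com'})])
    tags = p | 'tags' >> beam.Create([('1.2.3.4', {'asn': 64512})])
    blockpages = p | 'blockpages' >> beam.Create([('1.2.3.4', {'blockpage': True})])

    # One three-way CoGroupByKey instead of chaining pairwise joins.
    joined = (
        {'scans': scans, 'tags': tags, 'blockpages': blockpages}
        | 'big cogroup' >> beam.CoGroupByKey())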

URL field in the Satellite data

Satellite is about testing domain names. Still, we have a "url" field instead of "domain" or "domain_name". That makes me question whether the value is always a domain name. It's also unclear what an actual URL would mean there.

We should rename it to a domain name field instead.
