
censoredplanet-analysis's People

Contributors

agiix, avirkud, dependabot[bot], eltsai, fortuna, ohnorobo, ramakrishnansr


censoredplanet-analysis's Issues

Scaling issue with backfilling data

We're currently seeing an issue where running a full backfill (over all data from 2018-2023) produces a job that reports success but ends up dropping about 2/3 of the expected rows. The dropping is not spread out evenly; instead, roughly 2/3 of the server IPs are dropped entirely.

Here's an example of what the missing data looks like:
[image]

Running over smaller amounts of data causes the jobs to correctly write all the data. In particular, writing the data out one year per job works correctly.

Examples that succeeded:

  • 2022 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2022-01-01" --end_date="2022-12-31" --scan_type=http --full
  • 2021 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2021-01-01" --end_date="2021-12-31" --scan_type=http
  • 2020 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2020-01-01" --end_date="2020-12-31" --scan_type=http
  • 2019 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2019-01-01" --end_date="2019-12-31" --scan_type=http
  • 2018 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2018-07-27" --end_date="2018-12-31" --scan_type=http
  • 2023 backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.http_scan_full_2022 --start_date="2023-01-01" --end_date="2023-03-14" --scan_type=http

Example that dropped data:

  • Full https backfill - python3.9 -m pipeline.run_beam_tables --env=prod --user_dataset=agix.https_scan_full --scan_type=https --full

One thing we're seeing in the jobs with issues is this autoscaling message:
Autoscaling: Unable to reach resize target in zone us-east1-c. QUOTA_EXCEEDED: Instance 'abc' creation failed: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 575.0 in region us-east1.
We're also seeing this error:
Autoscaling: Unable to reach resize target in zone us-east1-c. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: Instance 'abc' creation failed: The zone 'projects/censoredplanet-analysisv1/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.

The job is also not scaling to as many workers as it wants:
[image]
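
Since the per-year jobs above are the runs that write all rows correctly, one workaround is to script the backfill a year at a time. A minimal, hypothetical wrapper (not part of the repo; the --user_dataset value is copied from the examples above and would likely need to change per run):

import subprocess

# One backfill job per year; per-year jobs have not dropped rows so far.
DATE_RANGES = [
    ('2018-07-27', '2018-12-31'),
    ('2019-01-01', '2019-12-31'),
    ('2020-01-01', '2020-12-31'),
    ('2021-01-01', '2021-12-31'),
    ('2022-01-01', '2022-12-31'),
    ('2023-01-01', '2023-03-14'),
]

for start, end in DATE_RANGES:
    # --user_dataset is illustrative here; adjust per run as needed.
    subprocess.run([
        'python3.9', '-m', 'pipeline.run_beam_tables',
        '--env=prod', '--user_dataset=agix.http_scan_full_2022',
        f'--start_date={start}', f'--end_date={end}',
        '--scan_type=http',
    ], check=True)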

Add blockpage matching to pipeline

  • Additional field in the schema for blockpage matching
    • True if match, False if false-positive match, Null otherwise (use a script to update the Null fields later); see the sketch below
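
A minimal sketch of the tri-state field, assuming a hypothetical pattern-based matcher; the pattern lists and function name below are illustrative, not the pipeline's real signatures:

from typing import List, Optional

# Illustrative signature lists; a real matcher would load known blockpage
# and false-positive signatures from data files.
BLOCKPAGE_PATTERNS: List[str] = ['This website has been blocked']
FALSE_POSITIVE_PATTERNS: List[str] = ['404 Not Found']

def blockpage_match(page_body: str) -> Optional[bool]:
    # True if a known blockpage matches, False if a known false positive
    # matches, None otherwise (None rows get updated later by a script
    # once new signatures are added).
    if any(pattern in page_body for pattern in BLOCKPAGE_PATTERNS):
        return True
    if any(pattern in page_body for pattern in FALSE_POSITIVE_PATTERNS):
        return False
    return None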

Hard to read the pipeline

I noticed that we are ingesting files of different types with the same functions in the pipeline. Furthermore, the types are not only different but sometimes only slightly different, causing even more confusion. Add to that the fact that all the different data types are simply called "row", and it becomes a huge effort to make sense of how data is flowing.

We have to clean that up so we can make sense of what's going on, especially with Satellite, which has many types.

Split flows

First and foremost, this call has to change:

lines = _read_scan_text(p, new_filenames)

We should not create a single PCollection with different types. Instead, each file should be its own PCollection.

Then process_satellite_lines should be removed in favor of separate flows that process and join the different datasets.
The partition logic can then be removed.

As a rule of thumb, consider any selection logic based on filenames, like this, harmful:

if filename == SATELLITE_BLOCKPAGES_FILE:

Another similar practice that is also harmful is detecting the source file based on the presence of certain fields.

Both practices can be replaced by creating a separate PCollection for each input type.

This cleanup can be incremental. For instance, it seems that extracting the blockpage logic is quite easy. Just call _process_satellite_blockpages on a PCollection with the blockpage files only.
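
A minimal Apache Beam sketch of that split, with each input type read into its own PCollection; the glob patterns and the json.loads parsing are illustrative placeholders for the real file lists and processing functions such as _process_satellite_blockpages:

import json

import apache_beam as beam

# Illustrative globs standing in for the real per-type file lists.
BLOCKPAGE_GLOB = 'gs://example-bucket/CP_Satellite-*/blockpages.json'
RESULTS_GLOB = 'gs://example-bucket/CP_Satellite-*/results.json'

with beam.Pipeline() as p:
    # One PCollection per input type, so no downstream step ever needs to
    # ask which file a line came from.
    blockpage_rows = (
        p
        | 'read blockpages' >> beam.io.ReadFromText(BLOCKPAGE_GLOB)
        | 'parse blockpages' >> beam.Map(json.loads))
    result_rows = (
        p
        | 'read results' >> beam.io.ReadFromText(RESULTS_GLOB)
        | 'parse results' >> beam.Map(json.loads))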

Then each tag input file can be extracted into its own flow.

I have the impression that this cleanup will speed up the pipeline, as you can have more lightweight workers for many parts of the flow, instead of having to load all the data for all the flows. It will also shuffle less data for joins (each flow can be sorted separately).

Define and clarify row types

Another significant improvement is to name each data type and create type aliases for the existing Row (e.g. SatelliteScan or ResolverTags). We should not see the Row type anywhere. Note that they can all be a generic dictionary type, but the type annotations will help understand what goes where. We should also rename the variables to reflect their type.
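
A minimal sketch of what the aliases could look like, assuming rows remain plain dictionaries; the alias names follow the examples in this issue, and the helper function is hypothetical:

from typing import Any, Dict

# Each input type gets its own alias; the runtime type is still a plain
# dict, but signatures now document which kind of row is flowing where.
SatelliteScan = Dict[str, Any]   # one parsed Satellite measurement
ResolverTags = Dict[str, Any]    # metadata tags for a single resolver
BlockpageRow = Dict[str, Any]    # one parsed blockpage entry

def add_resolver_tags(scan: SatelliteScan, tags: ResolverTags) -> SatelliteScan:
    # The annotations make it obvious which kind of "row" each value is.
    tagged = dict(scan)
    tagged.update(tags)
    return tagged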

/cc @ohnorobo @avirkud

TODOs

  • triple cogroup
  • add strict types
  • remove all filename-based selection logic
  • switch to one 'big cogroup' instead of doing individual joins (see the sketch below)
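
A minimal sketch of the 'big cogroup' idea, assuming each flow has already been keyed on a shared join key; the key and the small in-memory inputs are illustrative:

import apache_beam as beam

with beam.Pipeline() as p:
    # Illustrative keyed inputs; in the pipeline these would be the scan,
    # tag, and blockpage flows keyed on a shared join key such as the IP.
    scans = p | 'scans' >> beam.Create([('1.2.3.4', {'domain': 'example.com'})])
    tags = p | 'tags' >> beam.Create([('1.2.3.4', {'asn': 64512})])
    blockpages = p | 'blockpages' >> beam.Create([('1.2.3.4', {'blockpage': True})])

    # One three-way CoGroupByKey instead of chaining pairwise joins.
    joined = (
        {'scans': scans, 'tags': tags, 'blockpages': blockpages}
        | 'big cogroup' >> beam.CoGroupByKey())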

URL field in the Satellite data

Satellite is about testing domain names. Still, we have a "url" field instead of "domain" or "domain_name". That makes me question whether the value is always a domain name. It's also unclear what an actual URL would mean there.

We should rename it to a domain name field instead.
