censoredplanet / censoredplanet-analysis
Analysis of the CensoredPlanet data.
License: Apache License 2.0
We're currently seeing an issue where a full backfill (over all data from 2018-2023) runs a job that succeeds, but which ends up having dropped 2/3 of the expected rows. The dropping is not spread out evenly: instead, roughly 2/3 of the server IPs are dropped entirely.
Here's an example of what the missing data look like:
Running over smaller amounts of data causes the jobs to correctly write all the data. In particular, writing the data out one year per job works correctly.
Example that succeeded:
Example that dropped data:
One thing we're seeing in the jobs with issues is the scaling message
Autoscaling: Unable to reach resize target in zone us-east1-c. QUOTA_EXCEEDED: Instance 'abc' creation failed: Quota 'IN_USE_ADDRESSES' exceeded. Limit: 575.0 in region us-east1.
We're also seeing the error
Autoscaling: Unable to reach resize target in zone us-east1-c. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: Instance 'abc' creation failed: The zone 'projects/censoredplanet-analysisv1/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.
I noticed that we are ingesting files of different types with the same functions in the pipeline. Worse, the types are sometimes only slightly different, which causes even more confusion. Add to that the fact that all the different data types are simply called "row", and it becomes a huge effort to make sense of how data is flowing.
We have to clean that up so we can make sense of what's going on. Especially with Satellite, which has many types.
First and foremost, this call has to change:
We should not create a single PCollection with different types. Instead, each file should be its own PCollection.
Then process_satellite_lines should be removed in favor of different flows that process and join the different datasets.
The partition logic can go away entirely.
As a rule of thumb, consider any logic that selects behavior based on filenames, like this, harmful:
Another similar practice that is also harmful is to detect the source file based on the presence of fields:
Those practices can be replaced by creating a separate PCollection for each input type.
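As a plain-Python sketch of that idea (the parse helpers and input lines here are hypothetical stand-ins, with lists standing in for PCollections):

```python
from typing import Any, Dict, List

# Hypothetical parsers -- one per input type, instead of a single
# function that branches on the filename or on field presence.
def parse_blockpage_line(line: str) -> Dict[str, Any]:
    # In the real pipeline this would parse a JSON blockpage record.
    return {'type': 'blockpage', 'raw': line}

def parse_resolver_tag_line(line: str) -> Dict[str, Any]:
    return {'type': 'resolver_tag', 'raw': line}

# Preferred shape: each file pattern feeds its own collection, so
# every downstream step knows exactly which type it is handling.
blockpage_rows: List[Dict[str, Any]] = [
    parse_blockpage_line(line) for line in ['b1', 'b2']]
resolver_tag_rows: List[Dict[str, Any]] = [
    parse_resolver_tag_line(line) for line in ['r1']]
```

In the actual pipeline each of these would be built with its own read transform over only the matching file pattern, so no later stage needs to sniff where a row came from.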
This cleanup can be incremental. For instance, it seems that extracting the blockpage logic is quite easy. Just call _process_satellite_blockpages on a PCollection with the blockpage files only.
Then you can extract each tag input file into its own flow.
I have the impression that this cleanup will also speed up the pipeline: many parts of the flow can run on lighter-weight workers instead of having to load all the data for all the flows, and joins will shuffle less data (each flow can be sorted separately).
Another significant improvement is to name each data type and create type aliases for the existing Row (e.g. SatelliteScan or ResolverTags). We should not see the Row type anywhere. Note that they can all be a generic dictionary type, but the type annotations will help us understand what goes where. We should also rename the variables to reflect their types.
Jigsaw uses CAIDA routeview data. Add Maxmind + Censys.
Satellite is about testing domain names. Still, we have a "url" field instead of "domain" or "domain_name". That makes me question whether the value is always a domain name; it's also unclear what an actual URL would mean there.
We should use a domain-name field name instead.
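The rename could be done with a small migration helper along these lines (a sketch; the function name and the assumption that rows are plain dicts are mine):

```python
from typing import Any, Dict

def rename_url_to_domain(row: Dict[str, Any]) -> Dict[str, Any]:
    # Copy the row, renaming the ambiguous 'url' field to 'domain'.
    renamed = dict(row)
    if 'url' in renamed:
        renamed['domain'] = renamed.pop('url')
    return renamed
```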
Remove faulty measurements (e.g. fail_sanity = True, invalid dates)
Flag vantage points that return different countries
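The faulty-measurement filter above could look like this minimal sketch, assuming rows are dicts carrying a fail_sanity flag and a date string (the YYYY-MM-DD date format is an assumption on my part):

```python
from datetime import datetime
from typing import Any, Dict

def is_valid_measurement(row: Dict[str, Any]) -> bool:
    # Drop rows flagged by the sanity check.
    if row.get('fail_sanity'):
        return False
    # Drop rows whose date does not parse as YYYY-MM-DD.
    try:
        datetime.strptime(row.get('date', ''), '%Y-%m-%d')
    except ValueError:
        return False
    return True
```

In the pipeline this predicate would sit in a filter step applied before any joins, so faulty rows never reach the output tables.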