crawling-workshop's Introduction

Notes

Goal: discover and record the version message of as many live nodes as we can.

Just run it once.

Ignoring: duration of connection, who told us which addresses (topology).

Code plan

Should we attempt to prevent duplicates?

  • If yes, this would be the main thread's job. Workers could just send back (VersionMessage(), AddrMessage()). The main thread can persist the VersionMessage and feed any new addresses into the address queue, but it would need to filter out duplicates using SQL at that point ...
  • I tend to say "no". This is an optimization. Maybe do this if there is time left over ...
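
If duplicates do get filtered with SQL, one possible approach is to let SQLite enforce uniqueness and only queue the addresses that were actually inserted. This is a minimal sketch, not the workshop's code: the function names and the UNIQUE (host, port) constraint are assumptions layered on top of the "addresses table: id, host, port" idea in these notes.

```python
import sqlite3


def create_addresses_table(conn):
    # Assumption: uniqueness is defined by the (host, port) pair.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS addresses (
            id INTEGER PRIMARY KEY,
            host TEXT NOT NULL,
            port INTEGER NOT NULL,
            UNIQUE (host, port)
        )
        """
    )
    conn.commit()


def filter_new_addresses(conn, addresses):
    """Record addresses and return only the ones not seen before."""
    fresh = []
    for host, port in addresses:
        cur = conn.execute(
            "INSERT OR IGNORE INTO addresses (host, port) VALUES (?, ?)",
            (host, port),
        )
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        if cur.rowcount == 1:
            fresh.append((host, port))
    conn.commit()
    return fresh
```

The main thread would then put only the returned addresses onto the address queue.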

Should I even use a queue? How to test a minimal program???

Main thread

  • Start workers
  • Seed addresses queue
  • loop
    • just read results from the result queue and save them into SQLite
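
The main-thread plan above could be sketched roughly as follows. This also suggests one way to test a minimal program: the network call is passed in as a function, so a stub can stand in for a real connection. The names crawl, worker, fetch and save are hypothetical, and this sketch only visits the seeded addresses; it does not feed newly learned addresses back into the queue.

```python
import queue
import threading


def worker(address_queue, result_queue, fetch):
    # Pull addresses until the None sentinel arrives.
    while True:
        address = address_queue.get()
        if address is None:
            break
        try:
            result_queue.put((address, fetch(address)))
        except Exception:
            # Report the failure as None so the main loop can count it.
            result_queue.put((address, None))


def crawl(seeds, fetch, save, num_workers=4):
    seeds = list(seeds)
    address_queue = queue.Queue()
    result_queue = queue.Queue()

    # Start workers
    workers = [
        threading.Thread(
            target=worker, args=(address_queue, result_queue, fetch), daemon=True
        )
        for _ in range(num_workers)
    ]
    for thread in workers:
        thread.start()

    # Seed addresses queue
    for address in seeds:
        address_queue.put(address)

    # Loop: read one result per seeded address and save the successes
    for _ in range(len(seeds)):
        address, version_message = result_queue.get()
        if version_message is not None:
            save(address, version_message)

    # Shut the workers down
    for _ in workers:
        address_queue.put(None)
    for thread in workers:
        thread.join()
```

Only the main thread calls save, so a single SQLite connection never has to be shared between threads.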

Should the worker thread add a timestamp?

SQLITE

Should I make a separate table to store every IP observed? Basically a backup for the queue?

Done

  • Eliminate results queues

Next

  • create_table
  • save_observation
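
A possible shape for these two functions is sketched below. The notes do not fix the column layout, so this sketch makes assumptions: the whole version message is stored as JSON text in a single column, and the timestamp defaults to the time of saving when the caller does not supply one.

```python
import json
import sqlite3
import time


def create_table(conn):
    # Assumed layout: one row per successful observation of a node.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS observations (
            id INTEGER PRIMARY KEY,
            host TEXT NOT NULL,
            port INTEGER NOT NULL,
            version_message TEXT NOT NULL,
            timestamp REAL NOT NULL
        )
        """
    )
    conn.commit()


def save_observation(conn, host, port, version_message, timestamp=None):
    # version_message is assumed to be a JSON-serializable dict.
    if timestamp is None:
        timestamp = time.time()
    conn.execute(
        "INSERT INTO observations (host, port, version_message, timestamp) "
        "VALUES (?, ?, ?, ?)",
        (host, port, json.dumps(version_message), timestamp),
    )
    conn.commit()
```

If the worker thread ends up adding the timestamp itself, it can be passed through the timestamp argument.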

Others

  • get_node_count()
  • get_protocol_distribution()
    • ipv4, ipv6, tor
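
These two reporting helpers might look something like the sketch below. It assumes an observations table with host and port columns, counts each (host, port) pair once, and treats any host ending in ".onion" as Tor. The classify_host helper is hypothetical.

```python
import ipaddress
import sqlite3
from collections import Counter


def classify_host(host):
    # Assumption: Tor nodes are stored by their .onion hostname.
    if host.endswith(".onion"):
        return "tor"
    try:
        return "ipv6" if ipaddress.ip_address(host).version == 6 else "ipv4"
    except ValueError:
        return "unknown"


def get_node_count(conn):
    # Count each (host, port) pair once, however often it was observed.
    row = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT host, port FROM observations)"
    ).fetchone()
    return row[0]


def get_protocol_distribution(conn):
    rows = conn.execute("SELECT DISTINCT host, port FROM observations")
    return Counter(classify_host(host) for host, _port in rows)
```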

SQL Schema

addresses table: id, host, port

observations table: all attributes of version message, timestamp, perhaps the list of addresses shared?

  • an observations table would still allow us the flexibility to crawl multiple times and see what changes ...

What to do when the connection fails, or we can't connect for some reason? Just throw out the address, or add it to a blacklist? The simplest thing is to only keep track of successes ...

Logistics Plan

Logistics

  • Make a separate server for just me during the presentation. This way we can have massive congestion and the talk still goes well.
  • Deploy more than 1 server
  • Use a HUGE box
  • Tune the resource limits?
  • Raise the open file limit to something like 100000.
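
For a single crawler process, the open file limit can also be raised from inside Python with the standard resource module (Unix only). This is a sketch under assumptions: the function name is hypothetical, and it only raises the soft limit as far as the existing hard limit allows. Raising the hard limit itself would still need system-level configuration on the server.

```python
import resource


def raise_open_file_limit(target=100000):
    """Raise the soft RLIMIT_NOFILE toward target, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some platforms reject large values; keep the current limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_open_file_limit() at the start of the crawler returns the (soft, hard) pair actually in effect, which is worth logging before opening many sockets.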

Questions

Should I copy the whole git repository to /etc/skel? This way they wouldn't have to do any binder setup ...

How to set up

Install "tljh" (The Littlest JupyterHub) on Digital Ocean

Open terminal

sudo -E pip install --upgrade pip
sudo -E pip install PySocks requests pytest

Start Tor and check that it is running

sudo netstat -plnt | grep 9050
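
The same check can be done from Python, for example from inside a notebook where sudo is not available. This is a minimal sketch with a hypothetical helper name. It assumes Tor's SOCKS listener is on 127.0.0.1:9050, the port used in the netstat command above, and it only confirms that something accepts connections there, not that it is Tor.

```python
import socket


def is_port_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```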

Users choose their passwords

sudo tljh-config set auth.type firstuseauthenticator.FirstUseAuthenticator
sudo tljh-config reload

Anyone can sign up

sudo tljh-config set auth.FirstUseAuthenticator.create_users true
sudo tljh-config reload

Copy Setup.ipynb to /etc/skel

These are the base files everyone gets. I should probably just keep the repo in here ... Then I can git pull right before the talk starts and everyone should get up-to-date code ...

Notes

Python is installed here: /opt/tljh/user/bin/python3
