
tor-browser-crawler

DISCLAIMER: experimental - PLEASE BE CAREFUL. Intended for research purposes.

Version of the tor-browser-crawler that we used with Onionpop.

We have frozen the repository with the source code that we used to collect data for our NDSS 2018 paper, “Inside Job: Applying Traffic Analysis to Measure Tor from Within”.

The crawler uses a modified tor to collect traces from a middle node. It is based on Selenium to drive the Tor Browser and stem to control tor. Our implementation started as a fork of tor-browser-crawler (by the @webfp team).
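
For illustration, the snippet below drives the Tor Browser through tor-browser-selenium, a related @webfp package; it is a minimal sketch rather than this repo's own driver code, and the bundle path is a hypothetical example of the layout described under "Getting started".

    # Minimal sketch (not this repo's driver code): drive the Tor Browser with
    # the tor-browser-selenium package ("pip install tbselenium"). The bundle
    # path below is a hypothetical example; adjust it to your extracted TBB.
    from tbselenium.tbdriver import TorBrowserDriver

    TBB_PATH = "./tbb/tor-browser-linux64-7.0.11_en-US/"  # hypothetical version

    with TorBrowserDriver(TBB_PATH) as driver:
        driver.get("https://check.torproject.org")  # plain Selenium navigation
        print(driver.title)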

For crawl parameters such as batch and instance, refer to the ACM WPES’13 paper by Wang and Goldberg.

Differences with respect to tor-browser-crawler

This crawler implements the functionality of tor-browser-crawler and extends it to collect data from the middle position. In particular, we use OnionPerf to collect cell-level information. For ethical reasons, as we describe in the paper, we also implement a signaling mechanism that indicates to specific middle nodes that they should capture traffic only from circuits that our crawler has initiated, so that we do not capture traffic from real Tor users.
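
The signaling protocol itself is specified in the paper; as a rough, hypothetical illustration of the client side only, the sketch below uses stem to build a circuit through a chosen middle relay and record its ID, so that captured cells can later be matched against crawler-initiated circuits. The relay fingerprints are placeholders.

    # Rough illustration only, not the paper's actual signaling mechanism:
    # build a circuit through our instrumented middle relay with stem and
    # remember its ID so only crawler-initiated circuits are analyzed.
    from stem.control import Controller

    GUARD = "GUARD_FINGERPRINT"    # placeholder relay fingerprints
    MIDDLE = "MIDDLE_FINGERPRINT"
    EXIT = "EXIT_FINGERPRINT"

    crawler_circuits = set()

    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        circ_id = controller.new_circuit([GUARD, MIDDLE, EXIT], await_build=True)
        crawler_circuits.add(circ_id)
        print("crawler-initiated circuit: %s" % circ_id)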

Getting started

1. Configure the environment

  • We recommend running crawls in a VM or a container (e.g., LXC) to avoid perturbations introduced by background network traffic and system-level network settings. Please note that the crawler will not only store the Tor traffic but will capture all the network traffic generated during a visit to a website. That’s why it’s extremely important to disable all automatic/background network traffic, such as auto-updates. See, for example, the instructions for disabling automatic connections for Ubuntu.

  • You’ll need to grant capture capabilities to your user: sudo setcap 'CAP_NET_RAW+eip CAP_NET_ADMIN+eip' /usr/bin/dumpcap

  • Download the Tor Browser Bundle (TBB) and extract it to ./tbb/tor-browser-linux<arch>-<version>_<locale>/.

  • You might want to change the MTU of your network interface and disable NIC offloads; otherwise the traffic collected by tcpdump may look different from how it would have been seen on the wire (a helper sketch follows this list).

  • Change MTU to standard ethernet MTU (1500 bytes): sudo ifconfig <interface> mtu 1500

  • Disable offloads: sudo ethtool -K <interface> tx off rx off tso off gso off gro off lro off

  • See the Wireshark Offloading page for more info.
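
The two commands above can be wrapped in a small helper so that every crawl starts from the same interface state. A minimal sketch, assuming ifconfig and ethtool are installed and the script runs with root privileges:

    # Minimal sketch: apply the MTU and offload settings documented above.
    # Assumes ifconfig/ethtool are installed and we run as root.
    import subprocess

    def prepare_interface(iface):
        # Standard Ethernet MTU so captures match on-the-wire packet sizes.
        subprocess.check_call(["ifconfig", iface, "mtu", "1500"])
        # Disable NIC offloads that coalesce/split packets before tcpdump.
        subprocess.check_call(
            ["ethtool", "-K", iface, "tx", "off", "rx", "off", "tso", "off",
             "gso", "off", "gro", "off", "lro", "off"])

    prepare_interface("eth0")  # placeholder interface name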

2. Run a crawl with the defaults

python main.py -u ./etc/localized-urls-100-top.csv -e wang_and_goldberg

To see all available command-line parameters and usage, run:

python main.py --help

3. Check out the results

The collected data can be found in the results folder:

  • Pcaps: `./results/latest`
  • Logs: `./results/latest_crawl_log`
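
To sanity-check a finished crawl, the captures can be counted and sized before any analysis. A sketch using the dpkt package (not a dependency of this crawler):

    # Sketch: count packets and bytes per pcap with dpkt ("pip install dpkt",
    # which is not a dependency of this crawler).
    import glob
    import dpkt

    for path in glob.glob("./results/latest/*.pcap"):
        with open(path, "rb") as f:
            n_pkts = n_bytes = 0
            for _ts, buf in dpkt.pcap.Reader(f):
                n_pkts += 1
                n_bytes += len(buf)
        print("%s: %d packets, %d bytes" % (path, n_pkts, n_bytes))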

Run multiple instances of the crawler (optional)

We also provide Ansible roles to provision AWS images automatically. See the AUTOMATION.md readme for more info.

Notes

  • Tested on Xubuntu 14.04 and Debian 7.8.
