
OpenWPM Introduction


OpenWPM is a web privacy measurement framework which makes it easy to collect data for privacy studies on a scale of thousands to millions of websites. OpenWPM is built on top of Firefox, with automation provided by Selenium. It includes several hooks for data collection. Check out the instrumentation section below for more details.


Installation

OpenWPM is tested on Ubuntu 18.04 via GitHub Actions and is commonly used via the Docker container that this repo builds, which is also based on Ubuntu. Although we don't officially support other platforms, mamba is a cross-platform utility and the install script can be expected to work on macOS and other Linux distributions.

OpenWPM does not support Windows: #503

Pre-requisites

The main pre-requisite for OpenWPM is mamba, a fast cross-platform package management tool.

Mamba is open-source, and can be installed from https://mamba.readthedocs.io/en/latest/installation.html.

Mamba is a reimplementation of conda, so sometimes a conda command has to be invoked instead of the mamba one.

Install

An installation script, install.sh, is included to install the conda environment, install unbranded Firefox, and build the instrumentation extension.

All installation is confined to your conda environment and should not affect your machine. The installation script will, however, override any existing conda environment named openwpm.

To run the install script, run

./install.sh

After running the install script, activate your conda environment by running:

conda activate openwpm

macOS

You may need to install make/gcc in order to build the extension. The necessary packages are part of Xcode: xcode-select --install

We do not run CI tests for Mac, so new issues may arise. We welcome PRs to fix these issues and add full CI testing for Mac.

Running Firefox with xvfb on macOS will require the user to install an X11 server; we suggest XQuartz. This setup has not been tested, and we welcome feedback as to whether it works.

Quick Start

Once installed, it is easy to run a quick test of OpenWPM. Check out demo.py for an example. It uses the default settings specified in openwpm/config.py::ManagerParams and openwpm/config.py::BrowserParams, with the exception of the changes specified in demo.py.

The demo script also includes a sample of how to use the Tranco top sites list via the optional command line flag demo.py --tranco. Note that since this is a real top-sites list, it will include NSFW websites, some of which will be highly ranked.
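For orientation, here is a condensed sketch of what demo.py does, assuming the current package layout; treat the exact class names and signatures as version-dependent and defer to demo.py itself:

    from pathlib import Path

    from openwpm.command_sequence import CommandSequence
    from openwpm.commands.browser_commands import GetCommand
    from openwpm.config import BrowserParams, ManagerParams
    from openwpm.storage.sql_provider import SQLiteStorageProvider
    from openwpm.task_manager import TaskManager

    manager_params = ManagerParams(num_browsers=1)
    browser_params = [BrowserParams(display_mode="headless")]

    structured = SQLiteStorageProvider(Path("./datadir/crawl-data.sqlite"))
    with TaskManager(manager_params, browser_params, structured, None) as manager:
        # visit one site and record the data via the structured provider
        sequence = CommandSequence("https://example.com")
        sequence.append_command(GetCommand(url="https://example.com", sleep=3), timeout=60)
        manager.execute_command_sequence(sequence)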

More information on the instrumentation and configuration parameters is given below.

The docs provide a more in-depth tutorial, and a description of the methods of data collection available.

Troubleshooting

  1. WebDriverException: Message: The browser appears to have exited before we could connect...

    This error indicates that Firefox exited during startup (or was prevented from starting). There are many possible causes of this error:

    • If you are seeing this error for all browser spawn attempts check that:

      • Both Selenium and Firefox are the appropriate versions. Run the following commands and check that the version output matches the required versions in install.sh and environment.yaml. If not, re-run the install script.

        cd firefox-bin/
        ./firefox --version

        and

        conda list selenium
      • If you are running in a headless environment (e.g. a remote server), ensure that all browsers have the headless browser parameter set to True before launching.
    • If you are seeing this error randomly during crawls, it can be caused by an overtaxed system (high memory or CPU usage). Try lowering the number of concurrent browsers.

  2. In older versions of Firefox (pre-74) the setting to enable legacy extensions was called extensions.legacy.enabled. If you need to work with an earlier Firefox, update the extensions.experiments.enabled setting name in openwpm/deploy_browsers/configure_firefox.py accordingly.

  3. Make sure your conda environment is activated (conda activate openwpm). You can list your environments and see which one is active by running conda env list; the active environment is marked with a *.

  4. make and gcc may need to be installed in order to build the web extension. On Ubuntu, this is achieved with apt-get install make. On macOS the necessary packages are part of Xcode: xcode-select --install.

  5. On a very sparse operating system additional dependencies may need to be installed. See the Dockerfile for more inspiration, or open an issue if you are still having problems.

  6. If you see errors related to incompatible or non-existing python packages, try re-running the file with the environment variable PYTHONNOUSERSITE set. E.g., PYTHONNOUSERSITE=True python demo.py. If that fixes your issues, you are experiencing issue 689, which can be fixed by clearing your python user site packages directory, by prepending PYTHONNOUSERSITE=True to a specific command, or by setting the environment variable for the session (e.g., export PYTHONNOUSERSITE=True in bash). Please also add a comment to that issue to let us know you ran into this problem.

Documentation

Further information is available at OpenWPM's documentation page.

Advice for Measurement Researchers

OpenWPM is often used for web measurement research. We recommend the following for researchers using the tool:

Use a versioned release. We aim to follow Firefox's release cadence, which is roughly once every four weeks. If we happen to fall behind on checking in new releases, please file an issue. Versions more than a few months out of date will use unsupported versions of Firefox, which are likely to have known security vulnerabilities. Versions prior to v0.10.0 are from a previous architecture and should not be used.

Include the OpenWPM version number in your publication. As of v0.10.0 OpenWPM pins all python, npm, and system dependencies. Including this information alongside your work will allow other researchers to contextualize the results, and can be helpful if future versions of OpenWPM have instrumentation bugs that impact results.

Developer instructions

If you want to contribute to OpenWPM, have a look at our CONTRIBUTING.md.

Instrumentation and Configuration

OpenWPM provides a breadth of configuration options, which can be found in Configuration.md. More detail on the output is available below.

Storage

OpenWPM distinguishes between two types of data: structured and unstructured. Structured data is all data captured by the instrumentation or emitted by the platform. Generally speaking, all data you download is unstructured data.

For each of the data classes we offer a variety of storage providers, and you are encouraged to implement your own, should the provided backends not be enough for you.

We have an outstanding issue to enable saving content generated by commands, such as screenshots and page dumps, to unstructured storage (see #232).
For now, they are saved to manager_params.data_directory.

Local Storage

For storing structured data locally we offer two StorageProviders:

  • The SQLiteStorageProvider which writes all data into a SQLite database
    • This is the recommended approach for getting started as the data is easily explorable
  • The LocalArrowProvider which stores the data in Parquet files.
    • This method integrates well with NumPy/Pandas
    • Ad-hoc processing of the data can be harder

For storing unstructured data locally we also offer two solutions:

  • The LevelDBProvider which stores all data into a LevelDB
    • This is the recommended approach
  • The LocalGzipProvider that gzips and stores the files individually on disk
    • Please note that file systems usually don't like thousands of files in one folder
    • Use with care or for single site visits
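A minimal sketch of wiring these local providers into a TaskManager follows; the module paths match the current source tree but may differ across versions:

    from pathlib import Path

    from openwpm.storage.leveldb import LevelDbProvider
    from openwpm.storage.sql_provider import SQLiteStorageProvider

    # structured records go to SQLite, unstructured blobs to LevelDB
    structured = SQLiteStorageProvider(Path("./datadir/crawl-data.sqlite"))
    unstructured = LevelDbProvider(Path("./datadir/content.ldb"))
    # both are then handed to TaskManager(manager_params, browser_params,
    # structured, unstructured)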

Remote storage

When running in the cloud, saving records to disk is not a reasonable thing to do, so we offer remote StorageProviders for S3 (see #823) and GCP. Currently, all remote StorageProviders write to the respective object storage service (S3/GCS). The structured providers use the Parquet format.

NOTE: The Parquet and SQL schemas should be kept in sync except for output-specific columns (e.g., instance_id in the Parquet output). You can compare the two schemas by running diff -y openwpm/DataAggregator/schema.sql openwpm/DataAggregator/parquet_schema.py.

Docker Deployment for OpenWPM

OpenWPM can be run in a Docker container. This is similar to running OpenWPM in a virtual machine, only with less overhead.

Building the Docker Container

Step 1: install Docker on your system. Most Linux distributions have Docker in their repositories. It can also be installed from docker.com. For Ubuntu you can use: sudo apt-get install docker.io

You can test the installation with: sudo docker run hello-world

Note: in order to run Docker without root privileges, add your user to the docker group (sudo usermod -a -G docker $USER). You will have to log out and back in for the change to take effect, and possibly also restart the Docker service.

Step 2: to build the image, run the following command from a terminal within the root OpenWPM directory:

    docker build -f Dockerfile -t openwpm .

After a few minutes, the container is ready to use.

Running Measurements from inside the Container

You can run the demo measurement from inside the container, as follows:

First, you need to give the container permission to access your local X server. You can do this by running: xhost +local:docker

Then you can run the demo script using:

    mkdir -p docker-volume && docker run -v $PWD/docker-volume:/opt/OpenWPM/datadir \
    -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix --shm-size=2g \
    -it --init openwpm

Note: the --shm-size=2g parameter is required, as it increases the amount of shared memory available to Firefox. Without this parameter you can expect Firefox to crash on 20-30% of sites.

This command uses bind-mounts to share scripts and output between the container and host, as explained below (note the paths in the command assume it's being run from the root OpenWPM directory):

  • run starts the openwpm container and executes the python /opt/OpenWPM/demo.py command.

  • -v binds a directory on the host ($PWD/docker-volume) to a directory in the container (/opt/OpenWPM/datadir). Binding allows the script's output to be saved on the host (./docker-volume), and also allows you to pass inputs to the docker container (if necessary). We first create the docker-volume directory (if it doesn't exist), as Docker would otherwise create it with root permissions.

  • The -it option states the command is to be run interactively (use -d for detached mode).

  • The demo script runs instances of Firefox that are not headless. As such, this command requires a connection to the host display server. If you are running headless crawls you can remove the following options: -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix.

Alternatively, it is possible to run jobs as the user openwpm inside the container, but this might cause problems with non-headless browsers. It is therefore only recommended for headless crawls.

macOS GUI applications in Docker

Requirements: Install XQuartz by following these instructions.

Given properly installed prerequisites (including a reboot), the helper script run-on-osx-via-docker.sh in the project root folder can be used to facilitate working with Docker on macOS.

To open a bash session within the environment:

./run-on-osx-via-docker.sh /bin/bash

Or, run commands directly:

./run-on-osx-via-docker.sh python demo.py
./run-on-osx-via-docker.sh python -m test.manual_test
./run-on-osx-via-docker.sh python -m pytest
./run-on-osx-via-docker.sh python -m pytest -vv -s

Citation

If you use OpenWPM in your research, please cite our CCS 2016 publication on the infrastructure. You can use the following BibTeX.

@inproceedings{englehardt2016census,
    author    = "Steven Englehardt and Arvind Narayanan",
    title     = "{Online tracking: A 1-million-site measurement and analysis}",
    booktitle = {Proceedings of ACM CCS 2016},
    year      = "2016",
}

OpenWPM has been used in over 75 studies.

License

OpenWPM is licensed under GNU GPLv3. Additional code has been included from FourthParty and Privacy Badger, both of which are licensed GPLv3+.


OpenWPM Issues

Profile information lost during timeout

Some profile information is lost during timeouts or crashes. By profile information I mean rows in places.sqlite and cookies.sqlite. The amount of data loss varies by the site, but can sometimes be all data from the site visit.

My suspicion is that this is the result of sqlite's write-ahead logging. The temporary WAL files that were written prior to the timeout or crash are never successfully synced with the main database after the profile is recovered and are instead discarded.

I haven't confirmed this as the cause, but if it is, it should be possible to connect to the databases directly and force the WAL sync to occur before copying over the database.
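If WAL is indeed the culprit, a sketch of that fix: open each database and force a checkpoint so the -wal file is merged into the main database before the profile is copied (the profile path below is hypothetical):

    import sqlite3

    def checkpoint(db_path):
        # TRUNCATE merges the write-ahead log into the main database and resets it
        con = sqlite3.connect(db_path)
        try:
            con.execute("PRAGMA wal_checkpoint(TRUNCATE);")
        finally:
            con.close()

    for name in ("places.sqlite", "cookies.sqlite"):
        checkpoint("/path/to/profile/" + name)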

LevelDB database clears if the system exceeds memory limits during the crawl

If the system runs out of memory, it may cause the LevelDBAggregator to crash with the following error: terminate called after throwing an instance of 'std::bad_alloc'. When this happens, all future browser launch attempts will fail to connect to the aggregator (since it's down) and the crawl will terminate with a CommandExecutionError. If the crawl is subsequently restarted, the LevelDB database will lose all of the previous data.

This comes as a surprise, as the aggregator commits continually (by writing a batch) at least every 5 seconds. I will need to re-create the issue to determine the exact cause. I suspect it results from the database not closing correctly during the crash.

SQL Insert abstraction

Right now most of the code is littered with SQL INSERT statements. Inspired by loggingDB.createInsert(), we should switch over to building inserts from dictionaries.
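A sketch of what such a helper could look like (the name and placement are hypothetical, loosely mirroring loggingDB.createInsert()):

    def create_insert(table, record):
        # Build a parameterized INSERT from a dict; keys become column names.
        columns = ", ".join(record.keys())
        placeholders = ", ".join("?" for _ in record)
        query = "INSERT INTO %s (%s) VALUES (%s)" % (table, columns, placeholders)
        return query, tuple(record.values())

    # usage: cursor.execute(*create_insert("http_requests", row))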

Use unicode and utf-8 encoded strings internally

In #14 I addressed how I've run into multiple encodings at different stages of the infrastructure, and have patched in conversions where necessary. This is a bit of a hack. Let's transition to storing all strings as unicode in memory and as utf-8 encoded byte strings on disk / in the DB.

Add support for issuing a sequence of commands to a single browser.

When crawling with multiple browsers, it is currently more difficult than it should be to issue multiple commands to the same browser instance. For example, consider the scenario where you are running a measurement with two crawlers, browser_1 and browser_2, and you want to issue a get, parse, and dump command. If you write:

for site in sites:
    TaskManager.get(site)
    TaskManager.parse()
    TaskManager.dump()

browser_1 will issue a get, browser_2 will issue a parse, and the one that finishes first will issue dump. It's possible to avoid this by explicitly issuing the commands with an index, so:

TaskManager.get(site, index=1)
TaskManager.parse(index=1)
TaskManager.dump(index=1)
TaskManager.get(site, index=2)
TaskManager.parse(index=2)
TaskManager.dump(index=2)

but this is less than ideal since it will serialize your program. Of course, it's possible to avoid serialization by checking whether a browser is available before issuing the command and using some system to queue the commands in the crawl script.

This should be supported by the infrastructure. The simplest solution is just to queue the commands in the TaskManager per browser, so a user could issue commands by index without serializing their program. However, for very large crawls these queues may grow to take up a large amount of memory.

Instead, I think it makes sense to introduce a notion of a command sequence. The user can initialize an instance of a CommandSequence object, add the respective commands to it, and ask the TaskManager to execute the CommandSequence on the first available browser. This would look something like:

for site in sites:
    command_sequence = CommandSequence()
    command_sequence.get(site)
    command_sequence.parse()
    command_sequence.dump()

    TaskManager.execute(command_sequence)

random_attributes does not change navigator.platform, ...

When random_attributes is enabled, it only chooses a user_agent_string at random; navigator.platform, navigator.appVersion, navigator.appName, etc. are not changed to match. So it can easily be detected that you are announcing a fake platform and browser.

AdBlock Plus crawls do not appear correct

ABP crawls seem to fail to block a significant portion of trackers. I'm guessing this has something to do with list distribution for the crawls, such that the block lists never have a chance to load on stateless crawls. The solution is likely to preload up-to-date lists when the crawl starts and stop the extension from checking for updates.

Some SSL connections fail when MITMProxy is enabled

Specifically the error is: ssl_error_rx_record_too_long.

This seems to be an issue for other users as well; see this issue and this issue on the mitmproxy bug tracker.

Since we're still on version 0.13 (and using a MITM cert even older than that), upgrading and refreshing the cert store (#36) will hopefully fix this.

Exception in Proxy should notify browser manager

Currently the proxy simply shuts itself down, causing the following javascript command to time out. This means that there is likely proxy data missing for the site that caused the exception, and the following site needlessly times out.

Update to newer version of jpm

We've stuck with version 0.0.23 of jpm since newer versions require Firefox >= 38, and Selenium did not support Firefox 38 at the time. Since we are now on Firefox 41, we should be safe to update.

There are some syntax changes after 0.0.23, so changes to the extension manifest will be required.

Cookie parsing breaks for expiry dates <1900

From Green C.:

Basically time.strptime() varies in the lowest dates it can handle, but the limit is set at >= 1900 for all platforms. I ran into a cookie with expiry value "Wed, 01 Jan 1800 00:00:00 GMT" and it didn't like it, so I added a check to PostProcessing/build_cookie_table.py::select_date_format:

...
if time_obj.tm_year > 1900:
    return time.strftime("%Y-%m-%d %H:%M:%S", time_obj)
else:
    return None
...

Update key3.db and cert.db

Both were generated two years ago, and exist to allow us to programmatically add the MITM cert to the browser. They are likely in need of a refresh.

Consider switching tests to localtest.me

Gunes points out that hosting some tests on localhost may cause issues since localhost is handled differently than most origins. localtest.me always points to 127.0.0.1, so it may be an easy way to test locally with normal security policies.

Another option is to host the tests on an OpenWPM github page, but this fragments the code (since the test pages live on a separate branch), and makes it more difficult to test the tests before pushing them public.

Improve parameter passing

Having parameters passed in a dictionary is prone to errors, since a single mistyped key silently fails to apply. We should investigate alternative configuration options that are safer.
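One safer direction (and the one later OpenWPM versions took with their ManagerParams/BrowserParams dataclasses) is a typed configuration object, so a mistyped field fails loudly instead of being silently ignored. A sketch:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class BrowserParams:
        headless: bool = False
        http_instrument: bool = False

    BrowserParams(headless=True)   # ok
    BrowserParams(headles=True)    # TypeError: unexpected keyword argument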

[QUESTION]

Dear all,
I have checked your source code but it is still not clear to me how to make two browsers execute a command at the same time (for example, every hour they should query data from a website). Could you give me some ideas?
Thank you

Split dump_storage_vectors command

The dump_storage_vectors command saves both profile cookies from cookies.sqlite and flash objects. These should be two separate commands, given that we have several ways to collect cookies.

Replace subprocess calls with native python calls

Use of the subprocess module should be avoided. I believe we primarily use it to remove directories by calling rm -rf, but we should move over to shutil.rmtree for directories and os.remove for files, which gives us proper access to errors and avoids potential platform differences.
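A sketch of the swap (paths hypothetical); the native calls raise OSError on failure instead of silently returning a nonzero exit status:

    import os
    import shutil

    profile_dir = "/tmp/openwpm_profile"
    log_path = "/tmp/openwpm.log"

    shutil.rmtree(profile_dir)   # was: subprocess.call(["rm", "-rf", profile_dir])
    os.remove(log_path)          # was: subprocess.call(["rm", log_path])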

Fourthparty page-manager / cookie-instrument unreliable

The javascript instrumentation ported from Fourthparty could use some TLC. page-manager.js and cookie-instrument.js fail to activate reliably. There are times when all page loads and cookies are logged correctly, but other times only a handful or no pages/cookies are logged. I haven't been able to track down the cause.

The page-manager.js table is only used by content-policy-instrument.js, but nearly 1/3 to 1/2 of the records in that table fail to label correctly (i.e. are labeled with a -1) so even when page-manager.js activates correctly the labeling is incomplete.

The cookie-instrument.js instrumentation is quite redundant with monitoring HTTP cookies or reading the cookies directly from the firefox sqlite databases on disk. Reading the sqlite databases provides a full picture of cookies (rather than just those set / sent over HTTP), but tends to be slow, so monitoring via javascript might speed things up.

Profile fails to save when new browser launch times out on new profile creation

The browser's profile will fail to save following 5 launch timeouts with the following stack trace.

Traceback (most recent call last):
  File "/crawl.py", line 64, in <module>
    manager.dump_flash_cookies('http://'+sites[i], start_time, timeout=90)
  File "/OpenWPM/automation/TaskManager.py", line 460, in dump_flash_cookies
    self._distribute_command(('DUMP_FLASH_COOKIES', url, start_time), index, timeout)
  File "/OpenWPM/automation/TaskManager.py", line 327, in _distribute_command
    self._start_thread(browser, command, reset)
  File "/OpenWPM/automation/TaskManager.py", line 377, in _start_thread
    self._cleanup_before_fail()
  File "/OpenWPM/automation/TaskManager.py", line 307, in _cleanup_before_fail
    self._shutdown_manager(failure=True, during_init=during_init)
  File "/OpenWPM/automation/TaskManager.py", line 287, in _shutdown_manager
    browser.shutdown_browser(during_init)
  File "/OpenWPM/automation/BrowserManager.py", line 225, in shutdown_browser
    save_flash=self.browser_params['disable_flash'] is False)
  File "/OpenWPM/automation/Commands/profile_commands.py", line 99, in dump_profile
    browser_profile_folder = browser_profile_folder if browser_profile_folder.endswith("/")\
AttributeError: 'NoneType' object has no attribute 'endswith'

This implies that self.current_profile_path is None when browser.shutdown_browser() is called during the infrastructure shutdown. This specifically happens when the browser launch timeout occurs while creating the browser profile, i.e. lines 19-25 of deploy_firefox.py. The timeout either occurs while connecting to the logging server or while setting up the profile.

I'll need to investigate why self.current_profile_path is None at this point.

Throw an exception when a profile fails to load

Currently, when a profile isn't found, the TaskManager logs the error and continues with the crawl. This doesn't make sense, as the crawl requires state which is not present. A top-level exception (like CommandExecutionError) should be thrown.

Refactor how browsers handle failures and restarts

The Browser.restart_browser_manager(), Browser.reset(), and Browser.launch_browser_manager() methods are needlessly complex and confusing. This has led to a multitude of bugs, particularly around profile handling (e.g. the simple one in 6e384db). This section of the infrastructure needs to be refactored in the following way:

  • Have a single pipeline for restarts, which handles restarts triggered from timeouts, excessive memory usage, crashes, and restarts for stateless crawls.
  • Infer whether or not the profile should be cleared by the flag on the command. No part of the infrastructure should be hard-coding this.
  • Make the logic of launch_browser_manager() clear. It should be obvious when a launch is post-crash, is loading a clean profile, or is loading an external (user-specified) profile. Right now we condition on a bunch of nested logic (i.e. is a directory None or not).

It is still possible to crash without saving the profile

In commits 7bc545f and d2fda45 we added support for automatic profile archiving during shutdown from child exceptions or TaskManager closures. Test cases for these errors are included in tests/test_profile.py, yet it still seems possible to end a crawl without saving the profile. See:

TaskManager          - INFO     - BROWSER 1: Timeout while executing command, DUMP_STORAGE_VECTORS, killing browser manager
deploy_mitm_proxy    - INFO     - BROWSER 1: Intercepting Proxy listening on 40586
BrowserManager       - INFO     - BROWSER 1: Crash in driver, restarting browser manager
 <class 'socket.error'>
 [Errno 111] Connection refused
Exception in thread Thread-6039:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/openwpm-census/OpenWPM/automation/TaskManager.py", line 438, in _issue_command
    success = browser.restart_browser_manager(clear_profile = reset)
  File "/home/ubuntu/openwpm-census/OpenWPM/automation/BrowserManager.py", line 167, in restart_browser_manager
    return self.launch_browser_manager()
  File "/home/ubuntu/openwpm-census/OpenWPM/automation/BrowserManager.py", line 133, in launch_browser_manager
    self.logger.error("BROWSER %i: Spawn unsuccessful %s" % error_string)
TypeError: %d format: a number is required, not str

TaskManager          - INFO     - BROWSER 1: Timeout while executing command, GET, killing browser manager
TaskManager          - CRITICAL - BROWSER 1: Command execution failure pushes failure count above the allowable limit. Setting failure_flag.
CLOSING TaskManager after batch
TaskManager          - ERROR    - TaskManager already closed
Total time: 306.544947147

as well as:

2015-11-09 08:13:30,470 - MainProcess[Thread-6039] - TaskManager          - INFO    : BROWSER 1: Timeout while executing command, DUMP_STORAGE_VECTORS, killing browser manager
2015-11-09 08:13:30,471 - MainProcess[Thread-6039] - BrowserManager       - DEBUG   : BROWSER 1: Display process does not exit
2015-11-09 08:13:30,471 - MainProcess[Thread-6039] - BrowserManager       - DEBUG   : BROWSER 1: Screen lockfile already removed
2015-11-09 08:13:30,471 - MainProcess[Thread-6039] - BrowserManager       - DEBUG   : BROWSER 1: Browser process does not exist
2015-11-09 08:13:30,471 - MainProcess[Thread-6039] - BrowserManager       - DEBUG   : BROWSER 1: Spawn attempt 0
2015-11-09 08:13:30,490 - Process-415[Thread-6039] - deploy_mitm_proxy    - INFO    : BROWSER 1: Intercepting Proxy listening on 40586
2015-11-09 08:13:30,490 - Process-415[Thread-6039] - BrowserManager       - INFO    : BROWSER 1: Crash in driver, restarting browser manager
 <class 'socket.error'>
 [Errno 111] Connection refused
2015-11-09 08:17:00,694 - MainProcess[Thread-6040] - TaskManager          - INFO    : BROWSER 1: Timeout while executing command, GET, killing browser manager
2015-11-09 08:17:00,694 - MainProcess[Thread-6040] - TaskManager          - CRITICAL: BROWSER 1: Command execution failure pushes failure count above the allowable limit. Setting failure_flag.
2015-11-09 08:17:00,793 - MainProcess[MainThread] - TaskManager          - DEBUG   : TaskManager failure threshold exceeded, raising CommandExecutionError
2015-11-09 08:17:00,793 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: Joining command thread
2015-11-09 08:17:00,793 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: 0.000008 seconds to join command thread
2015-11-09 08:17:00,793 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: Killing browser manager...
2015-11-09 08:17:00,793 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: Display process does not exit
2015-11-09 08:17:00,794 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: Screen lockfile already removed
2015-11-09 08:17:00,794 - MainProcess[MainThread] - BrowserManager       - DEBUG   : BROWSER 1: Browser process does not exist
2015-11-09 08:17:00,794 - MainProcess[MainThread] - TaskManager          - DEBUG   : Telling the DataAggregator to shut down...
2015-11-09 08:17:03,825 - MainProcess[MainThread] - TaskManager          - DEBUG   : DataAggregator took 3.03015780449 seconds to close
2015-11-09 08:17:03,825 - MainProcess[MainThread] - TaskManager          - DEBUG   : Telling the LevelDBAggregator to shut down...
2015-11-09 08:17:03,825 - MainProcess[MainThread] - TaskManager          - DEBUG   : LevelDBAggregator took 9.05990600586e-06 seconds to close

Commit 6b43027 fixes the socket bug that caused this crash. It's still unclear to me how the TaskManager shut down without archiving the profile.

XRayWrapper error caused by mozRTCPeerConnection instrumentation

Instrumenting mozRTCPeerConnection.prototype causes the following error:

XrayWrapper denied access to property 0 (reason: value is callable). See https://developer.mozilla.org/en-US/docs/Xray_vision for more information. Note that only the first denied property access from a given global object will be reported.

on line 282 of content.js, which is the return statement of logFunction:

function logFunction(object, objectName, method) {
  var originalMethod = object[method];
  object[method] = function () {
    var scriptUrl = getOriginatingScriptUrl();
    logCall(objectName + '.' + method, arguments, scriptUrl);
    return originalMethod.apply(this, arguments);
  };
}

The error is triggered by lines 28-30 of the test page. Specifically, the code block:

connection.createOffer(function(a) {
    connection.setLocalDescription(a)
}, function(err) {})

This only occurs when mozRTCPeerConnection.createOffer is instrumented. It seems that the Xray Vision wrapper blocks the call to apply when a page script defined function is included as an argument.

A security boundary exists between our content script (content.js) and the page script (included in webrtc_localip.html). To waive Xray Vision, we set window = unsafeWindow at the start of the content script, which is enough for editing the properties of built-in objects. This is confirmed by the Xray Vision documentation.

In Add-on SDK content scripts and GreaseMonkey user scripts, you can use the global unsafeWindow...

However, it seems that since we save the method in a function variable (i.e. var originalMethod = object[method]; from logFunction) it is saved in the elevated context. Thus when we try to call apply to it later from the page script we receive a security error. To fix this, we'll need to hook originalMethod to something in the page script's context.

moz_cookies table is always empty

About a week ago I tried the new updates on the master branch. It seems to me that the moz_cookies table is always empty when the get/browse and dump_profile_cookie commands are run sequentially. It used to work before. Is there any explanation for it?

Close browser gracefully when possible

It's definitely not best practice to kill everything indiscriminately. We should attempt to first close each child process and kill if the close operation times out.

Consider making relative location urls absolute in the HTTP Response table

We currently grab the location field of HTTP response headers and save it in its own column to make certain queries easier. It seems the location field of an HTTP response header can be relative to the requested resource's URL. We save the string exactly as it appears in the header, but it might make analysis a bit easier if we resolve relative URLs in real time. Some sample code to do this is:

from urlparse import urlparse, urljoin

hostname = urlparse(location).hostname
if location.startswith('/') or location.startswith('.') or hostname == '' or hostname is None:
    location = urljoin(url, location)

where url is the requested resource URL (from the HTTP request) and location is the URL from a 3XX HTTP response header. Note the location.startswith(...) checks may be unnecessary.
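For reference, the same check wrapped in a function under Python 3, where urlparse moved to urllib.parse (a sketch):

    from urllib.parse import urljoin, urlparse

    def absolutize(url, location):
        # Resolve a (possibly relative) Location header against the request URL.
        hostname = urlparse(location).hostname
        if location.startswith(('/', '.')) or not hostname:
            location = urljoin(url, location)
        return location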

Add the option to install some plugins alongside Firefox

Issue #23 mentions that the javascript instrumentation for plugins doesn't seem to be working. I've confirmed that it does work, but nothing will be recorded when no plugins are installed.

We can add some additional (optional) plugins to install.sh to fix this.

Replace pickle with safe serialization

Pickling data from an unknown source is unsafe, and occasionally does cause stray code to be evaluated during a crawl.

We should be able to replace it with JSON serialization, which is both safe and better supported in javascript.
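A sketch of the swap; JSON round-trips the plain dicts, lists, and strings we serialize, though unlike pickle it won't handle arbitrary Python objects:

    import json

    record = {"url": "https://example.com", "status": 200}  # example payload

    payload = json.dumps(record).encode("utf-8")    # was: pickle.dumps(record)
    restored = json.loads(payload.decode("utf-8"))  # was: pickle.loads(payload)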

Prevent Location Services and Tiles from pinging Mozilla

The newer versions of Firefox make requests to Mozilla urls on startup. This is an issue for stateless crawls where browsers are restarted with each new page visit, as the proxy data contains requests to mozilla.com for every page visit. It should be possible to fix this with prefs:

  • Set browser.newtabpage.enabled to False
  • Perhaps give a blank url for browser.search.geoip.url

The urls that show up in the database are:

  • browser.search.geoip.url;https://location.services.mozilla.com/v1/country?key=%MOZILLA_API_KEY%
  • https://tiles.services.mozilla.com/v3/links/fetch/en-US/release
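A sketch of applying those prefs through Selenium's FirefoxProfile, the API of that era (newer Selenium versions move set_preference onto Options):

    from selenium import webdriver

    # disable the new-tab tiles and blank out the geoip lookup URL
    profile = webdriver.FirefoxProfile()
    profile.set_preference("browser.newtabpage.enabled", False)
    profile.set_preference("browser.search.geoip.url", "")
    driver = webdriver.Firefox(firefox_profile=profile)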

Add support for a unique page visit ID

Right now the database is keyed by crawl_id and top_url. If we want to visit the same url multiple times with the same crawler we will have multiple visits in the database with the same effective key. Instead, we should develop a notion of a visit_id which keys all data for a specific site visit.

It's possible to implement this in the following way:

  1. Add a page_visits table to the database with schema:

    CREATE TABLE page_visits (
        id INTEGER PRIMARY KEY,
        url VARCHAR(500) NOT NULL
    );
  2. When TaskManager initializes, it pulls max(id) from page_visits and sets a class property max_visit_id.
  3. For each get (or browse) command issued, TaskManager:
    • assigns max_visit_id to a last_visit_id property of the Browser object corresponding to the browser the get/browse command was issued to.
    • saves a new record in the page_visits table corresponding to the new id and url
    • increments max_visit_id
  4. Any new commands issued to that Browser will use the last_visit_id property to key all data generated by that command. Since the Browser object exists in the main thread, this ID will need to be passed explicitly with each command. A sketch of this bookkeeping is shown below.
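A sketch of that bookkeeping, with hypothetical names and a plain sqlite3 connection standing in for the real database layer:

    class TaskManager:
        def __init__(self, db):
            self.db = db
            # resume the counter from whatever is already in page_visits
            row = db.execute("SELECT MAX(id) FROM page_visits").fetchone()
            self.max_visit_id = (row[0] or 0) + 1

        def get(self, url, browser):
            # key all data generated by this command to a fresh visit_id
            browser.last_visit_id = self.max_visit_id
            self.db.execute("INSERT INTO page_visits (id, url) VALUES (?, ?)",
                            (self.max_visit_id, url))
            self.max_visit_id += 1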

Save the configuration parameters and environment details to crawl database

It would be helpful to save the full configuration dictionary and version numbers to the crawl database to make it easier to keep track of these settings when distributing data.

The approach I see is:

  • OpenWPM release number / commit
  • Serialize browser_params (and possibly manager_params) and write the strings
  • Save the Firefox and Adobe Flash version numbers
  • Save the version numbers for any extensions launched (these can be pulled from extension file names)

Provide a warning when the user has enabled Flash but hasn't installed it

Right now it's possible to set manager_params['disable_flash'] = False when Flash isn't installed. The user won't be notified of the issue when launching the crawl, and Flash isn't included in the default install script (for users who might be testing on their personal machine and might not want Flash in their personal browsers).

Virtual Machine OpenWPM-0.1.1.ova password

I need root access to the virtual machine OpenWPM-0.1.1.ova to install software. I searched the documentation but did not find the password.

Thanks :)


EDIT 1
The password for the OpenWPM user is "password" and this user is in the sudoer group.


Add support for forwarding proxy

There are several use cases when a user may want to both use MITMProxy for instrumentation and route the traffic through an upstream proxy.

This is possible by utilizing the -f flag in mitmproxy; however, we'll need to take care to support an upstream proxy both when mitmproxy is used and when it isn't.

Use a key-value store for javascript files

The current method of storing all javascript files in a folder doesn't scale for 1 million site crawls. Although the overall size is reasonable, there are far too many files stored in the directory. This significantly increases the time it takes to tar and untar it. A key-value store should fix this, and LevelDB is a good option.

Javascript instrumentation breaks functionality

Javascript instrumentation in the most recent release of FourthParty is broken. I've attempted to fix and expand upon it (see automation/Extension/firefox/lib/javascript-instrument.js) while porting the code, with some success.

I found it possible to monitor certain things, such as canvas and localStorage, by intercepting calls and property updates on the prototype objects. This works well for method calls, but causes property updates to fail to apply which can very easily break the functionality of the object being instrumented.

For example, setting fonts / colors for canvas rendering contexts will fail to apply when the instrumentation is enabled.

Firefox binary path hard coded

The path to the binary is

./firefox/firefox

(in deploy_firefox.py:170)

Is the assumption that the user has a custom Firefox install inside the OpenWPM directory? We don't seem to mention this in the tutorial. The demo fails out of the box for this reason (probably what's causing the error in #15).
