Venmo Research

The code used in the paper Contact Tracing With Venmo as a part of UT Austin's Computational Media Lab.

Usage & Replication

Database Setup

You'll need a fresh postgres database hosted on a server with at least ~100GB of free storage. If you have the disk space and reasonable hardware, you could also just download and run postgres locally. I also strongly recommend pgAdmin for debugging and exploring the database.

You'll need to set the following environment variables when running all the commands below: POSTGRES_PASS, POSTGRES_ADDR, POSTGRES_USER, POSTGRES_DB. For example:

export POSTGRES_PASS=password
export POSTGRES_ADDR=127.0.0.1:5432
export POSTGRES_USER=postgres
export POSTGRES_DB=venmo
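
As a rough illustration of how these four variables fit together, the sketch below assembles them into a postgresql:// connection URL in Python. The helper name postgres_url is hypothetical and not part of this repo, and the scrape binary may consume the variables differently.

```python
import os
from urllib.parse import quote


def postgres_url(env=None):
    """Build a postgresql:// URL from the four POSTGRES_* variables.

    Hypothetical helper for illustration only; not part of this repo.
    """
    env = os.environ if env is None else env
    names = ["POSTGRES_USER", "POSTGRES_PASS", "POSTGRES_ADDR", "POSTGRES_DB"]
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Percent-encode the credentials so characters like '@' or ':' stay valid.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASS"], safe="")
    # POSTGRES_ADDR is already in host:port form, e.g. 127.0.0.1:5432.
    return f"postgresql://{user}:{password}@{env['POSTGRES_ADDR']}/{env['POSTGRES_DB']}"


# Same values as the export lines above, passed as a dict instead of os.environ.
example = {
    "POSTGRES_PASS": "password",
    "POSTGRES_ADDR": "127.0.0.1:5432",
    "POSTGRES_USER": "postgres",
    "POSTGRES_DB": "venmo",
}
print(postgres_url(example))  # postgresql://postgres:password@127.0.0.1:5432/venmo
```

Calling postgres_url() with no argument reads the real environment and raises a KeyError naming any variable that is unset, which is a quick way to check your shell setup before starting a long scrape.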

Download Research Code

Download and extract the latest binaries from releases.

If you're familiar with Go, you could also clone this repo and use go run instead.

Venmo Collection

  1. Create a Venmo account
  2. Use your Venmo login to generate an API key with scripts/login.py. This only has to be done once as the API key does not expire.
  3. Collect data
Randomly scrape transactions by user
./scrape -mode transactions -token <your API key here> -random
Scrape transactions of users with an ID between 0 and 95000000 using 5 parallel workers.
./scrape -mode transactions -token <your API key here> -start_id 0 -end_id 95000000 -workers 5
As machine 2 of 10 (0-indexed), scrape transactions of users with an ID between 0 and 95000000 using 5 parallel workers.
./scrape -mode transactions -token <your API key here> -start_id 0 -end_id 95000000 -workers 5 -shard_idx 2 -shard_cnt 10
Continuously scrape the latest transactions from https://venmo.com/api/v5/public.
./scrape -mode transactions2 -token <your API key here>
View help
./scrape -h
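
To make the -shard_idx / -shard_cnt idea concrete, here is a minimal Python sketch of one way an ID range could be divided among machines. It assumes contiguous, half-open slices; how scrape actually partitions the range is not documented here and may differ (it could, for example, assign IDs by modulo). The function name shard_range is hypothetical.

```python
def shard_range(start_id, end_id, shard_idx, shard_cnt):
    """Return the half-open [lo, hi) slice of [start_id, end_id) for one shard.

    Assumption: shards are contiguous and roughly equal in size. This is an
    illustration, not the partitioning used by the scrape binary.
    """
    if not 0 <= shard_idx < shard_cnt:
        raise ValueError("shard_idx must satisfy 0 <= shard_idx < shard_cnt")
    total = end_id - start_id
    # Integer arithmetic keeps neighbouring shards flush: one shard's hi
    # is exactly the next shard's lo, so no ID is skipped or duplicated.
    lo = start_id + total * shard_idx // shard_cnt
    hi = start_id + total * (shard_idx + 1) // shard_cnt
    return lo, hi


# Machine 2 of 10 (0-indexed) over IDs 0..95000000, as in the command above.
print(shard_range(0, 95_000_000, 2, 10))  # (19000000, 28500000)
```

Under this scheme each machine only needs its own index and the total machine count, so no coordination between machines is required.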

Name Search (finding social media profiles)

Randomly sample Venmo users from the database and look them up on Bing, DuckDuckGo, and PeekYou.
./scrape -mode namesearch -workers 1

Geotag Extraction (scraping Facebook)

  1. Create a Facebook account (the account must be created with a phone number to avoid being blocked)
  2. Install Chrome
  3. Download chromedriver as well as the latest selenium server
  4. Collect data
Randomly sample users with PeekYou matches and extract geotags
./scrape -mode peekyoulocs -fb_user <facebook phone number> -fb_pass <facebook password> -sel_driver chromedriver -sel_headless -workers 3

Analysis & Visualization

  1. Open a jupyter notebook in this repo
  2. Pip install necessary dependencies
  3. Edit the connect() function to match the parameters for your database
  4. Run the notebook cells in order

Our Dataset

Creating our dataset (135M transactions, 22.1M users) took several months, and after several API changes Venmo collection may no longer be possible at this scale. Open an issue here or contact us if you would like to receive a copy of our dataset (note: we'll need to verify your use case and intentions beforehand, and additional restrictions may apply).

Use pg_restore (with parameters adjusted for your postgres installation) to replicate the database used when running our notebooks:

$ pg_restore --host "127.0.0.1" --port "5432" --username "postgres" --no-password --dbname "venmo" --verbose "dataset.sql"

Schema

{
	'users':{
		'created':'timestamp without time zone',
		'bing_results':'json',
		'facebook_results':'json',
		'facebook_profile':'json',
		'peek_you_results':'json',
		'is_business':'boolean',
		'cancelled':'boolean',
		'id':'bigint',
		'last_name':'character varying',
		'username':'character varying',
		'picture_url':'character varying',
		'name':'character varying',
		'ddg_results':'text',
		'external_id':'character varying',
		'first_name':'character varying'
	},
	'transactions':{
		'created':'timestamp without time zone',
		'updated':'timestamp without time zone',
		'actor_user_id':'bigint',
		'recipient_id':'bigint',
		'id':'bigint',
		'message':'character varying',
		'story':'character varying',
		'type':'character varying'
	},
	'user_to_transactions':{
		'user_id':'bigint',
		'transaction_id':'bigint',
		'is_actor':'boolean'
	}
}
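
The sketch below shows how the three tables might be joined to list a user's transactions. It makes two assumptions: that user_to_transactions.transaction_id refers to transactions.id (inferred from the column names, not confirmed here), and that an in-memory SQLite database with a trimmed-down set of columns is an acceptable stand-in for the real postgres database. All rows are made up for the example.

```python
import sqlite3

# In-memory SQLite stand-in for the postgres database; only a few of the
# columns from the schema above are recreated, with simplified types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE transactions (id INTEGER PRIMARY KEY, actor_user_id INTEGER,
                           recipient_id INTEGER, message TEXT, type TEXT);
CREATE TABLE user_to_transactions (user_id INTEGER, transaction_id INTEGER,
                                   is_actor BOOLEAN);
""")

# Made-up rows: user 1 is the actor in one transaction with user 2.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.execute("INSERT INTO transactions VALUES (10, 1, 2, 'pizza', 'payment')")
conn.executemany(
    "INSERT INTO user_to_transactions VALUES (?, ?, ?)",
    [(1, 10, True), (2, 10, False)],
)


def transactions_for(conn, user_id):
    """Return (transaction id, message, is_actor) rows for one user."""
    rows = conn.execute(
        """
        SELECT t.id, t.message, ut.is_actor
        FROM user_to_transactions AS ut
        JOIN transactions AS t ON t.id = ut.transaction_id
        WHERE ut.user_id = ?
        ORDER BY t.id
        """,
        (user_id,),
    )
    return rows.fetchall()


print(transactions_for(conn, 1))
print(transactions_for(conn, 2))
```

Going through user_to_transactions returns a transaction for both participants, with is_actor distinguishing who initiated it. Note that SQLite stores booleans as 0 and 1, whereas postgres returns true boolean values.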

TACC Suggestions

TACC can be a huge pain compared to any cloud provider, but it can be useful as a free (for us at UT) compute resource. Personally, I only used it for jobs running the transactions and namesearch modes. You can use scripts/scrape.tacc.job as a template for doing this. Keep in mind that you'll need to download and extract the latest release, update the environment variables (see placeholders in the script), and run $ sbatch scrape.tacc.job while on a stampede2.tacc.utexas.edu login node.

It would be extremely useful to run postgres directly on TACC, but running a database as a job is pretty weird (I contacted them and that's the only way of doing it right now): it will only run for a fixed amount of time (e.g. 6 hours) before shutting down, and you'll have to wait in the job queue before it even starts. If you still want to try this, I've left some snippets below that may be useful.

# after starting an interactive job w/idev
# use docker (TACC uses docker alt called singularity) to run postgres server
module load tacc-singularity
singularity pull docker://postgres
SINGULARITYENV_POSTGRES_PASSWORD=pgpass SINGULARITYENV_PGDATA=$SCRATCH/pgdata singularity run --cleanenv --bind $SCRATCH:/var postgres_latest.sif

# portforwarding with ssh magic (copied from VNC demo script), you could maybe ngrok tcp 5432 instead (?)
NODE_HOSTNAME=`hostname -s`
for i in `seq 4`; do
    ssh -q -f -g -N -R 15426:$NODE_HOSTNAME:15426 login$i
done
ssh -f -N -L 15426:stampede2.tacc.utexas.edu:15426 <your username>@stampede2.tacc.utexas.edu
