pull_facebook_data_for_good's Introduction

[DEPRECATED] pull_facebook_data_for_good


⛔ [DEPRECATED] ⛔

This library has been deprecated following changes to the portal for downloading data from Facebook Data for Good, including enhanced security mechanisms in the new download portal. The recommended way to download data is through the newly available "Partner Portal". This library is only compatible with the previous "Geoinsights Portal" and will no longer be maintained.


Imitate an API for downloading data from Facebook Data for Good.

This library uses selenium webdriver to imitate the behaviour of an API for downloading the full timeseries of a data collection.

This library is developed and tested in Python 3.8.

Disclaimer: This download routine will only work for those with access to the Facebook Geoinsights platform, and will only function for datasets to which the user has been granted access. This tool is not developed by or associated with Facebook; it is simply a utility to automate downloading data from the Geoinsights platform.

Installation

From a clone:

To develop this project locally, clone it onto your machine:

git clone https://github.com/hamishgibbs/pull_facebook_data_for_good.git

Enter the project directory:

cd pull_facebook_data_for_good

Install the package with:

pip install .

From GitHub:

To install the package directly from GitHub run:

pip install git+https://github.com/hamishgibbs/pull_facebook_data_for_good.git

Usage

Currently functional for TileMovement and TilePopulation datasets only.

Use the CLI from the directory where you would like data to be downloaded:

cd path/to/downloaded/data

The CLI follows the format:

pull_fb --dataset_name <dataset_name> --area <area>

For example, to pull the TileMovement dataset for Britain:

pull_fb --dataset_name TileMovement --area Britain

or:

pull_fb -d TileMovement -a Britain

The area name must exactly match the name stored in the .config file. In multi-word names, the words are separated by '_', e.g. New_Zealand.
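As an illustration of the naming rule above, a small hypothetical helper (not part of this library) could turn a human-readable area name into the underscore-separated form. It only joins words; it does not check the result against the .config file, so spelling and capitalisation still need to match exactly.

```python
def to_config_area(name: str) -> str:
    """Join the words of an area name with underscores."""
    # str.split() with no argument also collapses repeated whitespace.
    return "_".join(name.split())


print(to_config_area("New Zealand"))  # New_Zealand
```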

For a full API reference, please see the Reference.

Please Note:

If the .config file is missing variables for a given dataset, please alter the .config file and open a pull request to share with others.

Chrome Web Driver

To download data, this library relies on selenium and ChromeDriver.

This requires a chromedriver executable which can be downloaded here. Make sure that your Chrome version is the same as your chromedriver version.

pull_facebook_data_for_good assumes that the chromedriver executable is located at Applications/chromedriver. To supply a different path, use the argument --driver_path or -driver from the command line.

Credentials

Credentials must be input manually on each download.

Credentials are not stored on your computer and are passed directly to the Facebook login page by the web driver.

Reference

--dataset_name (-d)

Name of the data collection to be downloaded (e.g. "TileMovement", "TilePopulation").

--area (-a)

Name of the area of the data collection (e.g. "Britain").

--outdir (-o)

Directory where datasets will be downloaded (default: current directory - os.getcwd()).

--end_date (-e)

Dataset end date (default: datetime.datetime.now()).

--frequency (-f)

Dataset update frequency in hours (default: 8).

--driver_path (-driver)

Path to ChromeDriver (see download options here).

--config_path (-config)

Path to .config file (default: config in remote repo. Also accepts local paths).

--username (-user)

Facebook username (or registered email address). Supply this to run without being prompted for login credentials.

--password (-pass)

Facebook password. Supply this to run without being prompted for login credentials.

--driver_flags (-driver_flags)

Flags passed to chromedriver (allows multiple flags). Flags are passed with selenium.webdriver.ChromeOptions.add_argument.

--driver_prefs (-driver_prefs)

Preferences passed to chromedriver as dict. Preferences are passed with selenium.webdriver.ChromeOptions.add_experimental_option.

Tests

This project is tested with tox.

To run unit tests:

tox

Contributions

Issues:

To request a feature or report an issue with this tool, please open an issue.

Adding a dataset:

Dataset attributes are stored in the .config file.

Each time you use the library, pull_facebook_data_for_good will look for dataset configuration variables here unless you specify a path to another .config file.

To add the ability to download another dataset, alter the .config file with two pieces of information:

  1. The dataset id, embedded in the url of the Geoinsights download page. For example, the dataset ID for the collection stored at https://www.facebook.com/geoinsights-portal/downloads/?id=243071640406689 is 243071640406689.

  2. The date origin of the dataset (the earliest date of data publication), in the format year_month_day(_hour), e.g. 2020_01_01_00.
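The two values described above can be extracted and checked with the standard library before they are added to the .config file. This is a minimal sketch: the function names are hypothetical and are not part of pull_facebook_data_for_good, and the layout of the .config file itself is not shown here.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def dataset_id_from_url(url: str) -> str:
    """Return the value of the 'id' query parameter of a download page URL."""
    query = parse_qs(urlparse(url).query)
    return query["id"][0]


def parse_date_origin(text: str) -> datetime:
    """Parse 'year_month_day' or 'year_month_day_hour' into a datetime."""
    # Three underscores means the optional hour component is present.
    fmt = "%Y_%m_%d_%H" if text.count("_") == 3 else "%Y_%m_%d"
    return datetime.strptime(text, fmt)


url = "https://www.facebook.com/geoinsights-portal/downloads/?id=243071640406689"
print(dataset_id_from_url(url))            # 243071640406689
print(parse_date_origin("2020_01_01_00"))  # 2020-01-01 00:00:00
```

A malformed date origin raises ValueError, which makes a typo visible before a pull request is opened.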

Please open a pull request to share the config variables for a new dataset with everyone.

Other contributions:

Other contributions are welcome.

Please look for open issues with the Help Wanted tag.

pull_facebook_data_for_good's Issues

Add messaging to driver initiation

Add messaging to indicate whether each step of initiating the driver was successful.

These steps are:

  • Accepting Facebook cookies.
  • Passing Username and Password to the login form.
  • Submitting the login form.

This will help with debugging when operating with a headless driver.
This messaging could be controlled with a -v (--verbose) flag.

Add stable release branch

As this tool matures, it would be good to have a stable release branch rather than installing from master.

Pass webdriver cookies to requests session

After authentication with webdriver, session cookies should be sent to a requests session to systematically download data.

This makes downloading data more reliable and (hopefully) quicker.

Setup multiprocess workers

Having multiple download workers would speed downloading but may also increase the risk of a temporary ban.

Datasets that are no longer being updated

Incorrect files will be moved if length of urls > the length of actual files that have been downloaded. For datasets that are no longer being updated, this will result in incorrect files being renamed and moved to the output directory.

Allow passing chrome options and flags

Static Chrome options mean that selenium can crash in different system configurations.

Currently - configuration options prevent headless downloading etc.

Files downloaded without column names

Files can be downloaded from Geoinsights without column headers. These files are successfully downloaded at the moment but should cause an error to be thrown as they cause headaches down the processing chain.

No way to detect end of a dataset update.

Currently, file download URL dates are computed from the earliest dataset date to datetime.now().

When downloading mobility data this means that the final download urls may not yet have been released. Other files in the user's downloads folder may be moved into the output directory if datasets are not returned from facebook.

Need to detect when a file has been downloaded from a url and when it has not, then only move the number of files that have actually been downloaded from the downloads folder to the output directory.
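One possible way to move only the files that were actually downloaded is to snapshot the downloads folder before the run and move only the names that appear afterwards. The sketch below uses hypothetical function names and is not the library's current behaviour. It assumes nothing else writes to the downloads folder while the run is in progress.

```python
import shutil
from pathlib import Path


def snapshot(download_dir):
    """Return the set of file names currently in download_dir."""
    return {p.name for p in Path(download_dir).iterdir() if p.is_file()}


def move_new_files(download_dir, outdir, before):
    """Move files that are not in the 'before' snapshot; return their names."""
    download_dir, outdir = Path(download_dir), Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(download_dir.iterdir()):
        if path.is_file() and path.name not in before:
            shutil.move(str(path), str(outdir / path.name))
            moved.append(path.name)
    return moved
```

Calling snapshot() before the downloads start and move_new_files() after they finish leaves unrelated files in the downloads folder untouched, even when some URLs return nothing.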

Improve test coverage

Test coverage is generally low. More tests should be written, specifically to cover file reading and writing.

Pull only the most recent files

Check the dates of the files already in the output directory and only generate download URLs for dates that have not yet been downloaded.
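A rough sketch of this idea follows. It assumes, purely for illustration, that each downloaded file is named after its timestamp in the year_month_day_hour form (for example 2020_01_01_00.csv); the real file names may differ, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path

STAMP_FORMAT = "%Y_%m_%d_%H"  # assumed file-name stamp, not confirmed by the source


def expected_stamps(origin, end, frequency_hours=8):
    """Return every datetime from origin up to and including end."""
    step = timedelta(hours=frequency_hours)
    stamps = []
    current = origin
    while current <= end:
        stamps.append(current)
        current += step
    return stamps


def missing_stamps(outdir, origin, end, frequency_hours=8):
    """Return the expected datetimes with no '<stamp>.csv' file in outdir."""
    present = {p.stem for p in Path(outdir).glob("*.csv")}
    return [
        stamp
        for stamp in expected_stamps(origin, end, frequency_hours)
        if stamp.strftime(STAMP_FORMAT) not in present
    ]
```

Download URLs would then be built only for the datetimes returned by missing_stamps().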

Separate CLI and scraping functionality

The library should be accessible from the CLI but also from a script. The CLI should be a thin wrapper around a pull_fb function that provides the dataset downloading functionality.

Check step for downloaded files.

When there is no data available for some date, some data collections return an empty csv file. Others return nothing.

Once files have been moved, open all files and check for any empty files, then remove them.

Store in utils (same procedure for all collections).
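A utility of this kind might look like the sketch below. The function name is hypothetical, and "empty" is taken here to mean zero bytes or whitespace only; a file that contains just a header row would not be removed.

```python
from pathlib import Path


def remove_empty_files(outdir, pattern="*.csv"):
    """Delete whitespace-only files in outdir and return their names."""
    removed = []
    for path in sorted(Path(outdir).glob(pattern)):
        if not path.is_file():
            continue
        # Reads the whole file; acceptable for a sketch, slow for very large files.
        content = path.read_text(encoding="utf-8", errors="replace")
        if not content.strip():
            path.unlink()
            removed.append(path.name)
    return removed
```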

Get outdir from driver_prefs

Currently, outdir is passed in driver_prefs and -o (--outdir).

-o should be removed and the outdir should be taken from driver_prefs.

Read csv response as pandas

Try to read csv response as a pandas DataFrame and assert that it has >1 row.

This will prevent downloading non-csv files.
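The issue proposes pandas for this check. As a dependency-free stand-in, the sketch below uses the standard csv module instead, and it is only a heuristic: it accepts text that has a header, more than one data row, at least two columns, and the same number of fields on every row. The two-column minimum and the function name are assumptions made for illustration.

```python
import csv
import io


def looks_like_csv(text, min_data_rows=2, min_columns=2):
    """Heuristic check that text is a rectangular CSV with enough rows."""
    # Blank lines come back from csv.reader as empty lists and are skipped.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) < 1 + min_data_rows:
        return False
    width = len(rows[0])
    return width >= min_columns and all(len(row) == width for row in rows)


print(looks_like_csv("a,b\n1,2\n3,4\n"))        # True
print(looks_like_csv("<html>\n</html>\n"))      # False
```

An HTML error or login page would typically fail the column checks, so it could be rejected before being written to the output directory.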

Slow downloading with requests session

The transition to a requests session has significantly sped up downloads. Suggest adding a 1 second wait to each download to ensure that files have enough time to be downloaded.
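A minimal sketch of such a wait is shown below. Here fetch is a hypothetical callable standing in for whatever performs a single download (for example, a wrapper around a requests session); it is not a function provided by this library.

```python
import time


def download_all(urls, fetch, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing wait_seconds after each call."""
    results = []
    for url in urls:
        results.append(fetch(url))
        # Give each file time to finish before requesting the next one.
        sleep(wait_seconds)
    return results
```

Passing sleep as a parameter keeps the delay easy to replace in unit tests.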
