Foundation Models For Data Wrangling

This is the official repo for Can Foundation Models Wrangle Your Data?, which will be appearing in VLDB'23.

Please check out our blog post to learn more about this project and our motivations!

A sampling of these tasks can also be found in the HELM Benchmark. Please check out this benchmark if you are interested in seeing how a wide range of models perform on these tasks!

We are excited to have you try out our methods on your own structured data tasks! If you have other data tasks where our methods could be useful, feel free to shoot us a note!

Contact: Avanika Narayan ([email protected])

Install

Download the code:

git clone [email protected]:HazyResearch/fm_data_tasks.git
cd fm_data_tasks

Install:

pip install poetry
poetry install
poetry run pre-commit install

or

make install

Download and unpack the data:

mkdir data
wget https://fm-data-tasks.s3.us-west-1.amazonaws.com/datasets.tar.gz -P data
tar xvf data/datasets.tar.gz -C data/

Setup

You need to set your OpenAI key to run GPT inference. You can also change where the datasets are read from in case you want to run the code on other data. Both are set through environment variables:

export OPENAI_API_KEY="<YOUR API KEY>"
export DATASET_PATH="$PWD/data/datasets"
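As a minimal sketch of how a script could pick up these two variables, assuming a hypothetical helper named `load_settings` (the helper, its error message, and the fallback path are illustrative assumptions, not code from this repo):

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Read OPENAI_API_KEY and DATASET_PATH from the environment.

    Hypothetical helper for illustration only; this repo may read
    the variables differently.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed fallback: ./data/datasets, mirroring the export shown above.
    default_path = Path.cwd() / "data" / "datasets"
    dataset_path = Path(env.get("DATASET_PATH", default_path))
    return api_key, dataset_path
```

Passing a plain dict as `env` makes the helper easy to try without touching your real environment.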

Run

To run inference, use

poetry run python3 -m fm_data_tasks.run_inference --help

This prints the available options. Importantly, the --dry_run flag prints out examples without querying OpenAI.

We cache all inputs/outputs in sqlite so that runs can be repeated without querying OpenAI again. To override the cache, add the --overwrite_cache flag.
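To illustrate the general idea of a sqlite-backed prompt cache, here is a minimal sketch. The class name `PromptCache`, the table layout, and the `model_fn` callback are assumptions made for this example; the repo's actual cache may be structured differently.

```python
import sqlite3


class PromptCache:
    """Tiny prompt -> response cache backed by SQLite (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def query(self, prompt, model_fn, overwrite_cache=False):
        """Return a cached response; call model_fn only on a miss or overwrite."""
        if not overwrite_cache:
            row = self.conn.execute(
                "SELECT response FROM cache WHERE prompt = ?", (prompt,)
            ).fetchone()
            if row is not None:
                return row[0]
        # Cache miss (or forced overwrite): ask the model and store the answer.
        response = model_fn(prompt)
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()
        return response
```

Here `model_fn` stands in for whatever function sends the prompt to the model. With `overwrite_cache=True` the stored response is ignored and replaced, which is the behavior the flag above is described as providing.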

To see a full set of scripts with output results for 200-example samples of each dataset, see scripts/run_results.zsh.

Some examples are as follows.

To dry run 10 examples for Fodors-Zagats entity matching with a random selection of 3 examples to add to the prompt,

python3 -m fm_data_tasks.run_inference \
    --dry_run \
    --num_run 10 \
    --k 3 \
    --sample_method random \
    --data_dir data/datasets/entity_matching/structured/Fodors-Zagats
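Conceptually, random selection with `--k 3` means drawing three labeled examples and placing them in the prompt before the query. The sketch below shows that idea only; the function name `build_prompt`, the `(text, label)` row format, and the `seed` parameter are hypothetical and not taken from this repo.

```python
import random


def build_prompt(train_rows, query, k=3, seed=0):
    """Prepend k randomly chosen labeled examples to the query.

    train_rows is a list of (text, label) pairs. The row format and
    prompt layout are assumptions for illustration.
    """
    rng = random.Random(seed)  # seeded so repeated runs pick the same examples
    demos = rng.sample(train_rows, k)
    parts = [f"{text} {label}" for text, label in demos]
    parts.append(query)
    return "\n\n".join(parts)
```

Seeding the generator keeps the selected examples stable across runs, which can make cached results easier to reproduce.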

To run 100 examples for 3 trials for Restaurant data imputation on the test data with manual prompt selection,

python3 -m fm_data_tasks.run_inference \
    --num_run 100 \
    --num_trials 3 \
    --do_test \
    --sample_method manual \
    --data_dir data/datasets/data_imputation/Restaurant
