
sota-extractor's Introduction

Automatic SOTA (state-of-the-art) extraction

Aggregate public SOTA tables that are shared under free licences.

Download the scraped data or run the scrapers yourself to get the latest data.

In the future, we are planning to automate the process of extracting tasks, datasets and results from papers.

Getting the data

The data is kept in the data directory. All data is shared under the CC-BY-SA-4 licence.

The data has been parsed into a consistent JSON format, described below.

JSON format description

The format consists of five primary data types: Task, Dataset, Sota, SotaRow and Link.

A valid JSON file is a list of Task objects. You can see examples in the data/tasks folder.

Task

A Task consists of the following fields:

  • task - name of the task (string)
  • description - short description of the task, in markdown (string)
  • subtasks - a list of zero or more Task objects that are children to this task (list)
  • datasets - a list of zero or more Dataset objects on which the tasks are evaluated (list)
  • source_link - an optional Link object to the original source of the task

Dataset

A Dataset consists of the following fields:

  • dataset - name of the dataset (string)
  • description - a short description in markdown (string)
  • subdatasets - zero or more children Dataset objects (e.g. dataset subsets or dataset partitions) (list)
  • dataset_links - zero or more Link objects, representing the links to the dataset download page or any other relevant external pages (list)
  • dataset_citations - zero or more Link objects, representing the papers that are the primary citations for the dataset (list)
  • sota - the Sota object representing the state-of-the-art table on this dataset

Link

A Link object describes a URL, and has these two fields:

  • title - title of the link, i.e. anchor text (string)
  • url - target URL (string)

Sota

A Sota object represents one state-of-the-art table, with these fields:

  • metrics - a list of metric names used to evaluate the methods (list of strings)
  • rows - a list of SotaRow objects, one per row of the SOTA table (list)

SotaRow

A SotaRow object represents one row of the SOTA table and has these fields:

  • model_name - name of the evaluated model (string)
  • paper_title - title of the primary paper (string)
  • paper_url - URL of the primary paper (string)
  • paper_date - publication date of the paper, if available (string)
  • code_links - a list of zero or more Link objects, with links to relevant code implementations (list)
  • model_links - a list of zero or more Link objects, with links to relevant pretrained model files (list)
  • metrics - a dictionary of measured values, where the keys are metric names from the parent Sota.metrics list and the values are the measured performance (dictionary)
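
As a minimal sketch of how these pieces fit together, the snippet below loads one of the task files and walks tasks, datasets and SOTA rows (the filename is illustrative; any file in data/tasks should work):

import json

# Minimal sketch: load one of the task files and walk the structure
# described above (the filename is illustrative; any file in data/tasks
# should work).
with open("data/tasks/nlpprogress.json") as f:
    tasks = json.load(f)

def walk(task, depth=0):
    indent = "  " * depth
    print(indent + task["task"])
    for dataset in task.get("datasets", []):
        sota = dataset.get("sota") or {"metrics": [], "rows": []}
        print(indent + "  " + dataset["dataset"], sota["metrics"])
        for row in sota["rows"]:
            # row["metrics"] keys come from the parent table's metric names
            print(indent + "    " + row["model_name"], row["metrics"])
    for subtask in task.get("subtasks", []):
        walk(subtask, depth + 1)

for task in tasks:
    walk(task)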

Running the scrapers

Installation

Requires Python 3.6+.

pip install -r requirements.txt

NLP-progress

NLP-progress is a hand-annotated collection of SOTA results from NLP tasks.

The scraper is part of the NLP-progress project.

Licence: MIT

EFF

EFF has annotated a set of SOTA results for a small number of tasks and produced a report based on them.

To convert the current content run:

python -m scrapers.eff

Licence: CC-BY-SA-4

SQuAD

The Stanford Question Answering Dataset is an active project for evaluating the question answering task using a hidden test set.

To scrape the current content run:

python -m scrapers.squad

Licence: CC-BY-SA-4

RedditSota

The RedditSota repository lists the best-performing methods for a variety of tasks across all of ML.

To scrape the current content run:

python -m scrapers.redditsota

Licence: Apache-2

SNLI

The Stanford Natural Language Inference (SNLI) Corpus is an active project for natural language inference.

To scrape the current content run:

python -m scrapers.snli

Licence: CC-BY-SA

Cityscapes

Cityscapes is a benchmark for semantic segmentation.

To scrape the current content run:

python -m scrapers.cityscapes

Evaluating the SOTA extraction performance

In the future, this repository will also contain the automatic SOTA extraction pipeline. The aim is to automatically extract tasks, datasets and results from papers.

To evaluate the current prediction performance for all tasks:

python -m extractor.eval_all

The most current report can be seen here: eval_all_report.csv.

sota-extractor's People

Contributors

alefnula, gcucurull, mkardas, omarsar, rjt1990, rstojnic, thatch, zeke


sota-extractor's Issues

Year not found in evaluation-tables.json

Hi, I just noticed that in the case of https://paperswithcode.com/sota/visual-question-answering-on-gqa-test2019

The agent lxmert-adv-txt appears twice (that is correct from the JSON) but the year is 2020 (there is no date in evaluation-tables.json, see paper_date as null):

{
  "code_links": [],
  "metrics": {
    "Accuracy": "61.12",
    "Binary": "78.07",
    "Consistency": "91.13",
    "Distribution": "5.55",
    "Open": "46.16",
    "Plausibility": "84.8",
    "Validity": "96.36"
  },
  "model_links": [],
  "model_name": "lxmert-adv-txt",
  "paper_date": null,
  "paper_title": "",
  "paper_url": "",
  "uses_additional_data": false
},
{
  "code_links": [],
  "metrics": {
    "Accuracy": "61.1",
    "Binary": "77.99",
    "Consistency": "91.08",
    "Distribution": "5.52",
    "Open": "46.19",
    "Plausibility": "84.82",
    "Validity": "96.36"
  },
  "model_links": [],
  "model_name": "lxmert-adv-txt",
  "paper_date": null,
  "paper_title": "",
  "paper_url": "",
  "uses_additional_data": false
}

Where do you get the year from?
Thank you
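
A rough sketch for locating such undated rows, assuming evaluation-tables.json nests tasks -> datasets -> sota -> rows as in the snippet above:

import json

# Rough sketch: print every SOTA row whose paper_date is missing.
def undated_rows(node):
    for dataset in node.get("datasets", []):
        for row in (dataset.get("sota") or {}).get("rows", []):
            if row.get("paper_date") is None:
                yield dataset.get("dataset"), row.get("model_name")
    for sub in node.get("subtasks", []):
        yield from undated_rows(sub)

with open("evaluation-tables.json") as f:
    for task in json.load(f):
        for dataset_name, model_name in undated_rows(task):
            print(dataset_name, model_name)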

Make paperswithcode website open source

I love paperswithcode, and know that I would love to contribute to it and add new features if the website was open source.

I would be very interested in adding more details about specific datasets (e.g. links/where they're hosted), along with possibly showing extra details like the affiliation of specific papers.

I'd also love to help work on a better system for automatically extracting results from new papers instead of relying 100% on crowdsourcing.

I think paperswithcode is a very valuable tool and could become an integral part of the ML community if you make it fully open source, encourage contributions, and focus on the features that the community wants.

Thanks!

Get all papers from one SOTA table and extract all tables

Choose one SOTA table where it's easy to acquire the papers from arxiv (NOTE: can use the pwc database to translate titles into arxiv IDs).

Then, process all the papers using the pipeline from #1 and see if there is a way of clustering them according to overlap, or any other language cues.

Create a table extractor

Create a function that takes as input a LaTeX file, and extracts all tables in a consistent format.

Perhaps the right output format is a list of rows, as this is how the tables are specified within LaTeX.
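
A rough sketch of such a function, using a simple regex over tabular environments (illustrative only; it ignores nested environments, \multicolumn/\multirow and LaTeX comments):

import re

# Rough sketch: return a list of tables, each table a list of rows,
# each row a list of cell strings. Ignores nested environments,
# \multicolumn, \multirow and LaTeX comments.
TABULAR = re.compile(r"\\begin\{tabular\}.*?\n(.*?)\\end\{tabular\}", re.DOTALL)

def extract_tables(latex_source):
    tables = []
    for body in TABULAR.findall(latex_source):
        rows = []
        for line in body.split("\\\\"):  # rows are separated by \\
            line = line.replace("\\hline", "").strip()
            if line:
                rows.append([cell.strip() for cell in line.split("&")])
        tables.append(rows)
    return tables

extract_tables(open("paper.tex").read()) would then return every tabular environment as a list of rows, matching the output format suggested above.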

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.0.10. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary sota-extractor -w /tmp/ext sota-extractor==0.0.10
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting sota-extractor==0.0.10
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/2c9/40e98af84fb9b/sota-extractor-0.0.10.tar.gz (22 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-vx3qe4ay/sota-extractor/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-vx3qe4ay/sota-extractor/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-vx3qe4ay/sota-extractor/pip-egg-info
         cwd: /tmp/pip-wheel-vx3qe4ay/sota-extractor/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-vx3qe4ay/sota-extractor/setup.py", line 26, in <module>
        install_requires=io.open("requirements.txt").read().splitlines(),
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
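
One likely fix (an assumption, not something the maintainers have confirmed) is to ship requirements.txt in the source distribution so that the io.open("requirements.txt") call in setup.py can find it, e.g. with a MANIFEST.in entry:

# MANIFEST.in (hypothetical fix)
include requirements.txt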

How do you add a link for a reproducible example of a model?

Am not sure when this feature was introduced, but I love that Colab notebooks for models on Papers with Code are automatically referenced immediately below their repos (example):

[Screenshot: a Colab notebook link shown below a repository entry on Papers with Code]

Am assuming that there is some sort of logic in the website that looks for colab.research.google.com links in models' READMEs? What would a user need to do to add a similar link for a demo of a model in HuggingFace's Spaces, or in a reproducible Codespaces or Replicate.ai instance?

Thank you!

cc: @alefnula @gcucurull @rstojnic
