
MMInA: Benchmarking Multihop Multimodal Internet Agents

Project Page | Paper | Dataset

Overview

News

  • [04/15/2024] Release the paper and the codebase of MMInA.

Release Plan

  • More data subsets for multihop tasks
  • Enhanced arguments design for one-stop usage of MMInA
  • Paper, codebase, and dataset release

Installation

Prerequisites

MMInA is built on top of the WebArena environment, so its prerequisites are the same as WebArena's.

To set up the environment and install its dependencies, run the following commands:

# one-step installation from the environment script
git clone https://github.com/shulin16/MMInA
cd MMInA
conda env create -f environment.yml

# or install step by step with Python 3.10+
conda create -n mmina python=3.10; conda activate mmina
pip install -r requirements.txt
playwright install
pip install -e .

# optional, dev only
pip install -e ".[dev]"
mypy --install-types --non-interactive browser_env agents evaluation_harness
pip install pre-commit
pre-commit install

MMInA Dataset Structure

The MMInA dataset is a collection of tasks that require long-chain reasoning over multimodal information. It is divided into several subsets, each containing tasks with a different number of hops. The dataset is stored in the following structure:

Data Root
├── normal/ # All tasks here are 2-hop tasks.
│   ├── x.json
│   └── ...
├── multi567/ # All 5-hop, 6-hop, and 7-hop tasks are here.
│   ├── x.json
│   └── ...
├── compare/ # All tasks here require answering a comparison question first.
│   ├── x.json
│   └── ...
├── multipro/ # All 8-hop, 9-hop, and 10-hop tasks are here.
│   ├── x.json
│   └── ...
├── shopping/ # All tasks here are about items in OneStopMarket.
│   ├── x.json
│   └── ...
└── wikipedia/ # All tasks here are limited to Wikipedia.
    ├── x.json
    └── ...

The dataset consists of multimodal web agent tasks and can be downloaded from this Google Drive link. See the "Download the data" step under Usage for detailed download instructions.

To test a specific subset of the dataset, pass the subset name through the --domain argument when running the code. For example, to test the shopping subset, set --domain shopping.
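Before choosing a subset, it can help to confirm that the extracted data matches the layout above. The following is a minimal sketch, not part of the MMInA codebase: `count_tasks` is a hypothetical helper, and the `mmina` data root path is an assumption about where the archive was extracted.

```python
from pathlib import Path

# Subset folder names taken from the dataset layout above.
SUBSETS = ["normal", "multi567", "compare", "multipro", "shopping", "wikipedia"]


def count_tasks(data_root: str) -> dict[str, int]:
    """Return the number of *.json task files found in each subset folder."""
    root = Path(data_root)
    counts = {}
    for name in SUBSETS:
        folder = root / name
        # A missing folder is reported as 0 rather than raising an error.
        counts[name] = len(list(folder.glob("*.json"))) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    # "mmina" is an assumed location for the extracted data root.
    for subset, n in count_tasks("mmina").items():
        print(f"{subset}: {n} task files")
```

A subset that reports 0 files is either missing or was extracted to a different location.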

Usage

Quick Start

1. Prepare the environment

Edit prepare.sh to set environment variables such as your working directory and API keys (from OpenAI, Google, etc.). Then run the following command to prepare the environment.

bash prepare.sh

2. Download the data

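cd $WORK_DIR
mkdir -p mmina
# Fetching a Google Drive "view" link with curl saves an HTML page, not the archive,
# so this uses gdown instead (install it with: pip install gdown).
gdown --fuzzy "https://drive.google.com/file/d/1QBSxTXG3_RXhlUEyWQikqyOEit4deDj6/view?usp=drive_link" -O mmina.zip
unzip mmina.zip && rm mmina.zip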
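A download from Google Drive can silently save an HTML page under the archive's name. The sketch below checks that the file is a real zip archive before extracting it. `extract_if_valid` is a hypothetical helper, not part of MMInA, and the `mmina.zip` and `mmina` paths simply follow the commands above.

```python
import zipfile
from pathlib import Path


def extract_if_valid(archive: str, dest: str) -> bool:
    """Extract `archive` into `dest` only if it is a real zip file."""
    if not zipfile.is_zipfile(archive):
        # Typically means the file is missing or an HTML page was saved instead.
        return False
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return True


if __name__ == "__main__":
    if not extract_if_valid("mmina.zip", "mmina"):
        print("mmina.zip is missing or is not a zip archive; re-download it.")
```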

3. Test the developed agents

If you want to try agents without history memories:

CUDA_VISIBLE_DEVICES=0 python run.py \
--test_start_idx 1 --test_end_idx 10 \
--provider custom --model MODEL_NAME \
--domain DOMAIN_NAME \
--result_dir RESULT_DIR 

If you want to try agents with history memories, set --hist to True and specify the number of history entries (--hist_num) and the folder where the history data is stored (--hist_fold):

CUDA_VISIBLE_DEVICES=0 python run.py \
--test_start_idx 1 --test_end_idx 10 \
--provider custom --model MODEL_NAME \
--domain DOMAIN_NAME \
--result_dir RESULT_DIR \
--hist True --hist_num NUM --hist_fold HIST_FOLDER
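When evaluating several subsets, the two commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this README. `build_run_command` is a hypothetical helper, and `MODEL_NAME` and the `results/` paths are placeholders.

```python
import shlex


def build_run_command(model, domain, result_dir, start=1, end=10,
                      hist_num=None, hist_fold=None):
    """Assemble the run.py argument list using only the flags shown above."""
    cmd = [
        "python", "run.py",
        "--test_start_idx", str(start), "--test_end_idx", str(end),
        "--provider", "custom", "--model", model,
        "--domain", domain,
        "--result_dir", result_dir,
    ]
    # History flags are added only when both values are provided.
    if hist_num is not None and hist_fold is not None:
        cmd += ["--hist", "True", "--hist_num", str(hist_num),
                "--hist_fold", hist_fold]
    return cmd


if __name__ == "__main__":
    # MODEL_NAME and the result paths are placeholders, as in the commands above.
    for domain in ["shopping", "wikipedia"]:
        cmd = build_run_command("MODEL_NAME", domain, f"results/{domain}")
        print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch each run in turn.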

4. Test your own agents

You can also implement customized LLMs or VLMs as agents to test the models' long-chain reasoning ability. After downloading the model weights, implement the agent in agent.py under the agent/ folder.

Remember to initialize a new agent instance and modify the respective configs in run.py to test your own agents.

# Code snippets to initialize and customize the agent configs
llm_config.gen_config["temperature"] = args.temperature
llm_config.gen_config["top_p"] = args.top_p
llm_config.gen_config["context_length"] = args.context_length
llm_config.gen_config["max_tokens"] = args.max_tokens
llm_config.gen_config["stop_token"] = args.stop_token
llm_config.gen_config["max_obs_length"] = args.max_obs_length   
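The assignments above assume a config object that carries a `gen_config` dictionary. The sketch below shows one way such an object could be filled from parsed arguments. `LLMConfig` and `fill_gen_config` are hypothetical stand-ins, not the classes used in run.py, and the example values are illustrative rather than the repository's defaults.

```python
from argparse import Namespace
from dataclasses import dataclass, field
from typing import Any

# Generation options copied from the snippet above.
GEN_KEYS = ["temperature", "top_p", "context_length",
            "max_tokens", "stop_token", "max_obs_length"]


@dataclass
class LLMConfig:
    """Hypothetical stand-in for the config object used in run.py."""
    provider: str
    model: str
    gen_config: dict[str, Any] = field(default_factory=dict)


def fill_gen_config(llm_config: LLMConfig, args: Namespace) -> LLMConfig:
    """Copy each generation option from parsed arguments into gen_config."""
    for key in GEN_KEYS:
        llm_config.gen_config[key] = getattr(args, key)
    return llm_config


if __name__ == "__main__":
    # Example values are illustrative, not the repository's defaults.
    args = Namespace(temperature=1.0, top_p=0.9, context_length=0,
                     max_tokens=384, stop_token=None, max_obs_length=1920)
    cfg = fill_gen_config(LLMConfig("custom", "MODEL_NAME"), args)
    print(cfg.gen_config)
```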

Citation

If you use our environment or data, please cite our paper:

@misc{zhang2024mmina,
      title={MMInA: Benchmarking Multihop Multimodal Internet Agents}, 
      author={Ziniu Zhang and Shulin Tian and Liangyu Chen and Ziwei Liu},
      year={2024},
      eprint={2404.09992},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

Our benchmark is built upon the WebArena environment, which is a standalone, self-hostable web environment for building autonomous agents with textual inputs. We thank the authors for their great work and the open-source codebase.

