peekacross's Introduction

Improving Multi-Document Modeling via Cross-Document Question-Answering

This repository contains the accompanying code for the paper:

"Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering." Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan and Arman Cohan. In ACL, 2023. [PDF]

You can either pre-train the model yourself or use the pre-trained QAmden model weights and tokenizer files, which are available on HuggingFace.

Pre-trained Model Usage

Code for loading and using the QAmden pre-trained model:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden')

Note that we used document separators during pre-training (as in PRIMERA), so you may want to add them to your data. The document separator token is <doc-sep> (the last token in the vocabulary).
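As a minimal sketch of adding the separator, the helper below joins a cluster of related documents into a single input string. The function name `join_documents` is hypothetical and not part of this repository. Placing the separator between documents with a space on each side is an assumption: this README does not specify the exact placement, nor whether a separator should also follow the final document.

```python
from typing import Sequence

DOC_SEP = "<doc-sep>"  # separator token named in this README


def join_documents(documents: Sequence[str], sep: str = DOC_SEP) -> str:
    """Concatenate related documents into one string, separated by `sep`.

    Assumption: the separator goes between documents only, padded by spaces.
    Empty or whitespace-only documents are skipped.
    """
    cleaned = [doc.strip() for doc in documents if doc and doc.strip()]
    return f" {sep} ".join(cleaned)


cluster = ["First article text.", "Second article text."]
joined = join_documents(cluster)
# `joined` can then be passed to the tokenizer loaded above, for example:
# inputs = tokenizer(joined, return_tensors="pt", truncation=True)
```

Joining at the string level keeps the example independent of the tokenizer; the tokenizer is expected to map <doc-sep> to its dedicated vocabulary entry.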

We also provide QAmden fine-tuned over the multinews dataset:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden-multinews')

Pre-training your own QAmden model

To generate the pre-training data for your own QAmden model:

  1. Download and untar the preprocessed newshead data.
  2. Process the data by running pretrain_preprocess_qasem.py.
  3. Filter the processed data and create the csv files by running preprocess_and_filter_data.py.

Alternatively, you can download and use the already preprocessed data:

from datasets import load_dataset
qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")
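
Before launching pre-training, it can help to check what was actually downloaded. The sketch below is a hypothetical helper (`split_sizes` is not part of this repository) that reports the number of examples per split. It only assumes that the object returned by `load_dataset` behaves like a mapping from split names to sized datasets, as a `DatasetDict` does. The split names of this dataset are not documented here, so none are assumed.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of examples in each split of a loaded dataset."""
    return {name: len(split) for name, split in dataset_dict.items()}
```

Calling `split_sizes(qamden_pretraining_dataset)` on the object loaded above should return a dictionary keyed by whatever splits the dataset provides.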

Once you have the data, launch pre-training using the pretrain_qamden.py script.

Evaluating the QAmden model on multi-document summarization

Use the finetune_summarization.py script to evaluate on the multi-news or multi_x_science_sum datasets.


Citation:

If you find our work useful, please cite the paper as:

@article{caciularu2023Peekacross,
  title={Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering},
  author={Caciularu, Avi and Peters, Matthew E. and Goldberger, Jacob and Dagan, Ido and Cohan, Arman},
  journal={The Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}

peekacross's Issues

Request for bash script

Hi! Thanks for your great work; it means a lot to me.
I would appreciate it if you could share the parameters you used.
