Improving Multi-Document Modeling via Cross-Document Question-Answering

This repository contains the accompanying code for the paper:

"Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering." Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan and Arman Cohan. In ACL, 2023. [PDF]

You can either pre-train the model yourself or use the pre-trained QAmden model weights and tokenizer files, which are available on HuggingFace.

Pre-trained Model Usage

Code for loading and using the QAmden pre-trained model:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden')

Please note that during pre-training we used document separators (as in PRIMERA), which you may want to add to your data. The document separator is <doc-sep> (the last token in the vocabulary).
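As a minimal sketch of adding the separator to your own data, the snippet below joins a cluster of documents into a single input string. The helper name join_documents and the choice to place one space-padded separator between consecutive documents are assumptions made for illustration; check the pre-training scripts for the exact layout used in the paper.

```python
DOC_SEP = "<doc-sep>"  # separator token named in this README


def join_documents(documents, sep=DOC_SEP):
    """Concatenate a cluster of documents into one input string.

    Assumption (not confirmed by the repository): a single separator is
    placed between consecutive documents, padded with one space per side.
    Empty or whitespace-only documents are skipped.
    """
    cleaned = [doc.strip() for doc in documents if doc and doc.strip()]
    return f" {sep} ".join(cleaned)


# Hypothetical example cluster; replace with your own documents.
cluster = ["First article text.", "  Second article text.  ", ""]
text = join_documents(cluster)
print(text)
```

The resulting string can then be passed to the tokenizer loaded above, for example tokenizer(text, return_tensors="pt").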

We also provide a QAmden model fine-tuned on the multinews dataset:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden-multinews')

Pre-training your own QAmden model

To generate the pre-training data for your own QAmden model:

Download and untar the preprocessed newshead data.
Process the data by running pretrain_preprocess_qasem.py.
Filter the processed data and create the csv files by running preprocess_and_filter_data.py.

Alternatively, you can download and use the already preprocessed data:

from datasets import load_dataset
qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")

Once you have the data, launch pre-training using the pretrain_qamden.py script.

Evaluating the QAmden model on multi-document summarization

Use the finetune_summarization.py script to evaluate on multi-news or multi_x_science_sum.

Citation:

If you find our work useful, please cite the paper as:

@article{caciularu2023Peekacross,
  title={Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering},
  author={Caciularu, Avi and Peters, Matthew E. and Goldberger, Jacob and Dagan, Ido and Cohan, Arman},
  journal={The Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}
