apt's Introduction

APT

This repository accompanies the paper "Improving Paraphrase Detection with the Adversarial Paraphrasing Task", accepted at ACL 2021.

Repository vs Paper

  • APP in this repository is the human split of the adversarial dataset (AP_H in the paper)
  • NAP in this repository is the neural network's attempt at the APT (AP_T5 in the paper)

Packages needed:

All the packages used in this repository are listed below, each linked to its installation instructions. You may not need all of them if you only want to run part of the repository; for instance, none of the flask packages are needed unless you run the web-based APT. Check the import statements in the scripts you want to run before installing anything, to avoid installing unnecessary packages.

Main scripts in this repository

Here is a list of all scripts in this repository along with a brief description of what they do:

  • apt.py runs the web-based APT. This is what we used for our mTurk study.
  • graph.py generates the graphs which can be used to compare datasets.
  • nap_generation.py uses a fine-tuned T5 model to generate paraphrases of source sentences taken from MSRP and TwitterPPDB. The code to fine-tune the T5 model can be found inside paraphraser-for-apt/.

Model and Datasets

The fine-tuned T5 paraphraser can be accessed using huggingface as follows:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
paraphrasing_tokenizer = T5Tokenizer.from_pretrained("t5-base")
paraphrasing_model = T5ForConditionalGeneration.from_pretrained("coderpotter/T5-for-Adversarial-Paraphrasing").to(device)

# this function takes a sentence plus the top_k and top_p values used for sampling
# (do_sample=True, so decoding is sampling-based rather than beam search)
def generate_paraphrases(sentence, top_k=120, top_p=0.95):
    text = "paraphrase: " + sentence + " </s>"
    # truncation=True makes the 256-token limit explicit
    encoding = paraphrasing_tokenizer.encode_plus(text, max_length=256, padding="max_length", truncation=True, return_tensors="pt")
    input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)
    sampled_outputs = paraphrasing_model.generate(
        input_ids=input_ids,
        attention_mask=attention_masks,
        do_sample=True,
        max_length=256,
        top_k=top_k,
        top_p=top_p,
        early_stopping=True,  # only relevant to beam search; kept from the original call
        num_return_sequences=10,
    )
    # keep each distinct paraphrase once and drop copies of the input sentence
    final_outputs = []
    for sampled_output in sampled_outputs:
        sent = paraphrasing_tokenizer.decode(sampled_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        if sent.lower() != sentence.lower() and sent not in final_outputs:
            final_outputs.append(sent)
    return final_outputs

Please refer to nap_generation.py for ways to better utilize this model using concepts of top-k sampling and top-p sampling.
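For readers unfamiliar with the two sampling concepts mentioned above, here is a minimal sketch of what top-k and top-p (nucleus) filtering do to a next-token distribution. It is a toy illustration on a plain dictionary of probabilities, not the code in nap_generation.py and not how transformers implements it internally; the names top_k_top_p_filter and sample_token are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k=120, top_p=0.95):
    """Filter a {token: probability} dict with top-k, then top-p.

    Step 1 (top-k): keep only the top_k most probable tokens and renormalize.
    Step 2 (top-p): keep the shortest prefix of those tokens whose cumulative
    probability reaches top_p, then renormalize again.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(token, p / k_total) for token, p in ranked]

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    p_total = sum(p for _, p in kept)
    return {token: p / p_total for token, p in kept}


def sample_token(probs, top_k=120, top_p=0.95, rng=random):
    """Draw one token from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Lower top_k or top_p values restrict sampling to the most likely tokens and give more conservative paraphrases; higher values admit less likely tokens and give more varied ones.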

The fine-tuned paraphrase detector can be accessed through huggingface as follows:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("coderpotter/adversarial-paraphrasing-detector")
model = AutoModelForSequenceClassification.from_pretrained("coderpotter/adversarial-paraphrasing-detector")
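One possible way to score a sentence pair with the detector is sketched below. This is an assumption-laden sketch rather than the authors' evaluation code: score_pair and softmax are hypothetical helper names, and the label names are read from the checkpoint's config without verifying which label means "paraphrase".

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_pair(sentence_a, sentence_b,
               name="coderpotter/adversarial-paraphrasing-detector"):
    """Return {label: probability} for one sentence pair."""
    # Heavy imports stay local so softmax() is usable without them.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    model.eval()

    # Encode both sentences together as a single pair input.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    # Label names come from the checkpoint config; their meaning is an
    # assumption to check against the model card before relying on it.
    id2label = model.config.id2label
    return {id2label[i]: p for i, p in enumerate(softmax(logits))}
```

Calling score_pair("He left early.", "He departed ahead of time.") downloads the checkpoint on first use and returns one probability per label.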

The Adversarial dataset can be found here.

apt's People

Contributors

coderpotter

apt's Issues

calculating ...

After the user clicks "Check", show "calculating..." until the results arrive.

dataset

Use a combination of eng-eng and MRPC, and keep track of which dataset each sentence came from.

dollar amount

Pay 50 cents per sentence, but only if the sentence is valid, so check validity first.
If the sentence is invalid, show a message saying "you won't get paid for this".
If they still agree, proceed as usual.

session or code

Everyone gets a unique code. Implement sessions or find another way to do this.
The final amount can't be more than $20 per session.
They can quit mid-session whenever they want.
The minimum amount before quitting is $5.
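The payment rules from the two issues above can be sketched as a small session object. This is a hypothetical illustration, not code from apt.py: the Session class and its method names are invented here, amounts are kept in integer cents to avoid rounding problems, and the "$5 minimum" is read as a threshold that must be reached before a payout applies, which is only one interpretation of the issue text.

```python
import secrets

RATE_CENTS = 50            # 50 cents per valid sentence
SESSION_CAP_CENTS = 2000   # at most $20 per session
MIN_PAYOUT_CENTS = 500     # $5 minimum (interpretation, see lead-in)


class Session:
    """Tracks earnings for one participant, identified by a unique code."""

    def __init__(self, code=None):
        # A random hex string stands in for the unique code (assumption).
        self.code = code or secrets.token_hex(4)
        self.earned_cents = 0

    def submit(self, is_valid):
        """Record one submitted sentence and return the cents credited."""
        if not is_valid:
            return 0  # invalid sentences are accepted but not paid
        credit = min(RATE_CENTS, SESSION_CAP_CENTS - self.earned_cents)
        self.earned_cents += credit
        return credit

    def meets_minimum(self):
        """True once the session has reached the $5 minimum."""
        return self.earned_cents >= MIN_PAYOUT_CENTS
```

Capping inside submit keeps the running total from ever exceeding the session limit, so no separate check is needed at payout time.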

Disable submit

The Submit button should be disabled if Check has never been pressed.
