conda-gen-text-detection's Introduction

ConDA-gen-text-detection

Code for the paper "ConDA: Contrastive Domain Adaptation for AI-generated Text Detection", accepted at IJCNLP-AACL 2023 (paper link).

🌟 Great News! [Nov 4, 2023] 🌟 Our paper won the Outstanding Paper Award at IJCNLP-AACL 2023 held in Bali, Indonesia.

ConDA Framework Diagram

Setup

Set up a separate environment and install requirements via pip install -r requirements.txt

Make directories for the models, output logs and huggingface model files.

mkdir models huggingface_repos output_logs

Download roberta-base from here and/or roberta-large from here and place these repositories in huggingface_repos.
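The directory setup above can also be scripted and checked. The following is a minimal sketch, not part of the repository: the helper name prepare_workspace is hypothetical, and the folder names roberta-base and roberta-large inside huggingface_repos are an assumption about how the downloaded repositories are named locally.

```python
from pathlib import Path

# Directory names taken from the mkdir step in this README.
WORK_DIRS = ("models", "huggingface_repos", "output_logs")
# Assumed local folder names for the downloaded checkpoints; adjust as needed.
ASSUMED_MODEL_DIRS = ("roberta-base", "roberta-large")


def prepare_workspace(root="."):
    """Create the working directories and list the assumed model folders found."""
    root = Path(root)
    for name in WORK_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    repos = root / "huggingface_repos"
    return [name for name in ASSUMED_MODEL_DIRS if (repos / name).is_dir()]
```

Calling prepare_workspace() from the repository root returns the model folders that are already in place, so an empty list means no checkpoint has been placed in huggingface_repos yet.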

contrast_training_with_da.py is the ConDA training script, and multi_domain_runner.py is the runner script for training ConDA models. Update the arguments in multi_domain_runner.py to train models as needed.

Use the evaluation.py script for evaluating models. Change arguments within the evaluation.py script as needed.

TuringBench

Link to the dataset website: link
Link to the TuringBench paper: link

The data should be split into three jsonl files: train, valid, and test. Each line in a jsonl file is one data instance with text and label fields.
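As an illustration of this format, here is a minimal sketch for writing and validating one split. The helper names write_split and read_split are hypothetical, and the label values used in any example call (for instance 0 and 1) are an assumption; use whatever label encoding the training script expects.

```python
import json


def write_split(path, instances):
    """Write one JSON object per line, each with `text` and `label` fields."""
    with open(path, "w", encoding="utf-8") as f:
        for inst in instances:
            record = {"text": inst["text"], "label": inst["label"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_split(path):
    """Read a jsonl file and check that every line has `text` and `label`."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            inst = json.loads(line)
            missing = {"text", "label"} - set(inst)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing {sorted(missing)}")
            instances.append(inst)
    return instances
```

Running read_split on each of the three files before training is a cheap way to catch a malformed line early.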

Links to best performing models for each target generator

Here we provide links to pre-trained ConDA models, one per target generator, each trained with its best performing source:

| Target | Best performing source | Dropbox Link |
|---|---|---|
| CTRL | GROVER_mega | link |
| FAIR_wmt19 | GPT2_xl | link |
| GPT2_xl | FAIR_wmt19 | link |
| GPT3 | GROVER_mega | link |
| GROVER_mega | CTRL | link |
| XLM | GROVER_mega | link |
| ChatGPT | FAIR_wmt19 | link |

Citation

If you use (part of) this code, please cite our paper as:

@InProceedings{bhattacharjee-EtAl:2023:ijcnlp,
  author    = {Bhattacharjee, Amrita  and  Kumarage, Tharindu  and  Moraffah, Raha  and  Liu, Huan},
  title     = {ConDA: Contrastive Domain Adaptation for AI-generated Text Detection},
  booktitle      = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month          = {November},
  year           = {2023},
  address        = {Nusa Dua, Bali},
  publisher      = {Association for Computational Linguistics},
  pages     = {598--610},
  url       = {https://aclanthology.org/2023.ijcnlp-long.40}
}

Contact

For any questions, comments, and feedback, contact Amrita Bhattacharjee at [email protected]

conda-gen-text-detection's Issues

Assistance Request for ConDA-gen-text-detection Code Implementation

Background
I am a college student currently working on implementing the ConDA-gen-text-detection code from Amrita Bhattacharjee's GitHub repository. During this process, I have encountered some issues and would appreciate some guidance and assistance.

Process
I downloaded the data from TuringBench, specifically the TuringBench.zip file, and successfully processed all CSV files into real_dataset.jsonl and fake_dataset.jsonl files.
After completing the preprocessing steps, I generated the relevant scrambled text and ran the multi_domain_runner.py script, which produced a .pt model file.

Problem
However, when evaluating the model with the evaluation.py script, I did not obtain the F1 score reported in Amrita Bhattacharjee's paper. I am using roberta-base as the encoder for this implementation.

Request
I would greatly appreciate some insights into the correct approach for obtaining the results mentioned in the paper. There may be specific parameters or steps that I overlooked during the process.

Relevant Information
Encoder used: roberta-base
Data source: TuringBench.zip file
Processed files: real_dataset.jsonl and fake_dataset.jsonl

Contact Information
Name: Ruifan Zhao
University: Mongolian University
Email: [email protected]
Thank you for your time and assistance. Looking forward to your response.

chatgpt data and title?

Apologies if I missed it somewhere. Are the chatgpt articles posted somewhere? Are the article titles used for generation posted somewhere? thanks!

Different Batch Sizes for src_loader and tgt_loader

Hi Amrita, thank you for the great work!

I was trying to apply your model in the repo to my own dataset. However, while I was running the training code, I encountered an issue:

The source training data contains 29080 items, while the target training data contains 3832 items. I set batch_size to 256. As a result, the final batch of both src_loader and tgt_loader is smaller than 256, and the two final batches differ in size (for tgt_loader, the final batch has 248 items). The negatives_mask in SimCLRContrastiveLoss is therefore not compatible with batches of different sizes (e.g. 256 in a source batch, 248 in a target batch), which causes an error in denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature). Can I ask how you tackled this issue? Thanks!
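The batch arithmetic in this question can be checked with a short pure-Python sketch. It is not the author's fix: the helper names are hypothetical, and it assumes the two loaders are iterated in lockstep (zip-style), which the repository may or may not do. One common workaround is PyTorch's DataLoader argument drop_last=True, which discards an incomplete final batch so every batch has exactly batch_size items.

```python
def batch_sizes(n_items, batch_size, drop_last=False):
    """Sizes of the batches a loader would yield for n_items examples."""
    full, remainder = divmod(n_items, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # incomplete final batch
    return sizes


def mismatched_steps(src_items, tgt_items, batch_size, drop_last=False):
    """Steps at which paired source/target batches differ in size.

    Assumes lockstep iteration, so the shorter loader sets the step count.
    """
    src = batch_sizes(src_items, batch_size, drop_last)
    tgt = batch_sizes(tgt_items, batch_size, drop_last)
    return [(step, s, t) for step, (s, t) in enumerate(zip(src, tgt)) if s != t]


# Numbers from the question: 29080 source items, 3832 target items, batch 256.
print(mismatched_steps(29080, 3832, 256))                  # [(14, 256, 248)]
print(mismatched_steps(29080, 3832, 256, drop_last=True))  # []
```

With these numbers the target loader ends on a batch of 248 while the paired source batch still has 256, which matches the mismatch described above. Dropping the incomplete batch removes it at the cost of skipping a few examples per epoch; building the mask from the actual batch size would be another option.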
