amritabh / conda-gen-text-detection

Code for the paper: ConDA: Contrastive Domain Adaptation for AI-generated Text Detection

License: MIT License

Python 91.41% Jupyter Notebook 8.59%

conda-gen-text-detection's Introduction

ConDA-gen-text-detection

Code for the paper "ConDA: Contrastive Domain Adaptation for AI-generated Text Detection", accepted at IJCNLP-AACL 2023 (paper link).

🌟 Great News! [Nov 4, 2023] 🌟 Our paper won the Outstanding Paper Award at IJCNLP-AACL 2023 held in Bali, Indonesia.

Setup

Set up a separate environment and install requirements via pip install -r requirements.txt

Make directories for the models, output logs and huggingface model files.

mkdir models huggingface_repos output_logs

Download roberta-base from here and/or roberta-large from here and place these repositories in huggingface_repos.
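As a quick sanity check on the two steps above, the following is a minimal sketch that creates the three directories and reports which checkpoint files are missing from a downloaded model folder. The folder name `roberta-base` inside huggingface_repos and the list of expected files are assumptions based on how Hugging Face RoBERTa checkpoints are usually laid out; the README does not state which files the training scripts actually read. Both helper names are hypothetical.

```python
from pathlib import Path

# Assumption: files a Hugging Face RoBERTa checkpoint normally ships with.
# The exact set this repository's scripts need is not stated in the README.
EXPECTED_FILES = ("config.json", "vocab.json", "merges.txt")


def make_project_dirs(base="."):
    """Create the directories named in the setup step (same as the mkdir command)."""
    for name in ("models", "huggingface_repos", "output_logs"):
        (Path(base) / name).mkdir(parents=True, exist_ok=True)


def missing_model_files(repo_root, model_name):
    """Return the expected files that are absent from repo_root/model_name."""
    model_dir = Path(repo_root) / model_name
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
```

Calling `make_project_dirs()` and then `missing_model_files("huggingface_repos", "roberta-base")` should return an empty list once the download is in place, under the layout assumed here.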

contrast_training_with_da.py is the ConDA training script, and multi_domain_runner.py is the runner script for training ConDA models. Update the arguments in multi_domain_runner.py to train models as needed.

Use the evaluation.py script for evaluating models. Change arguments within the evaluation.py script as needed.

TuringBench

Link to the dataset website: link
Link to the TuringBench paper: link

Files should be split into 3 jsonl splits: train, valid, test. Each line in the jsonl is a data instance with text and label fields.
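The file layout described above can be sketched as follows. This is an illustration only: the split fractions, the seed, the helper names, and the file names (for example `train.jsonl`) are assumptions, and the README does not say which label values the scripts expect, so labels are passed through unchanged.

```python
import json
import random


def split_records(records, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle records and split them into train/valid/test lists.

    The fractions and seed are illustrative defaults, not values from the paper.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


def write_jsonl(path, records):
    """Write one JSON object per line, keeping only the text and label fields."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            row = {"text": rec["text"], "label": rec["label"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a jsonl file and check that every line has text and label fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            missing = {"text", "label"} - set(rec)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing fields: {sorted(missing)}")
            records.append(rec)
    return records
```

Reading each split back with `read_jsonl` before training is a cheap way to catch lines that lack one of the two required fields.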

Links to best performing models for each target generator

Here we provide links to the best-performing pre-trained ConDA model for each target generator:

Target	Best performing source	Dropbox Link
CTRL	GROVER_mega	link
FAIR_wmt19	GPT2_xl	link
GPT2_xl	FAIR_wmt19	link
GPT3	GROVER_mega	link
GROVER_mega	CTRL	link
XLM	GROVER_mega	link
ChatGPT	FAIR_wmt19	link

Citation

If you use (part of) this code, please cite our paper as:

@InProceedings{bhattacharjee-EtAl:2023:ijcnlp,
  author    = {Bhattacharjee, Amrita  and  Kumarage, Tharindu  and  Moraffah, Raha  and  Liu, Huan},
  title     = {ConDA: Contrastive Domain Adaptation for AI-generated Text Detection},
  booktitle      = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month          = {November},
  year           = {2023},
  address        = {Nusa Dua, Bali},
  publisher      = {Association for Computational Linguistics},
  pages     = {598--610},
  url       = {https://aclanthology.org/2023.ijcnlp-long.40}
}

Contact

For any questions, comments, and feedback, contact Amrita Bhattacharjee at [email protected]

conda-gen-text-detection's Issues

Assistance Request for ConDA-gen-text-detection Code Implementation

Background
I am a college student currently working on implementing the ConDA-gen-text-detection code from Amrita Bhattacharjee's GitHub repository. During this process, I have encountered some issues and would appreciate some guidance and assistance.

Process
I downloaded the data from TuringBench, specifically the TuringBench.zip file, and successfully processed all CSV files into real_dataset.jsonl and fake_dataset.jsonl files.
After completing the preprocessing steps, I generated the relevant scrambled text and ran the multi_domain_runner.py script, which produced a .pt model file.

Problem
However, when I evaluated the model using the evaluation.py script, I did not obtain the F1 score reported in Amrita Bhattacharjee's paper. I am using roberta-base as the encoder for this implementation.

Request
I would greatly appreciate some insights into the correct approach for obtaining the results mentioned in the paper. There may be specific parameters or steps that I overlooked during the process.

Relevant Information
Encoder used: roberta-base
Data source: TuringBench.zip file
Processed files: real_dataset.jsonl and fake_dataset.jsonl

Contact Information
Name: Ruifan Zhao
University: Mongolian University
Email: [email protected]

Thank you for your time and assistance. Looking forward to your response.

chatgpt data and title?

Apologies if I missed it somewhere. Are the ChatGPT articles posted somewhere? Are the article titles used for generation posted somewhere? Thanks!

Different Batch Sizes for src_loader and tgt_loader

Hi Amrita, thank you for the great work!

I was trying to apply your model in the repo to my own dataset. However, while I was running the training code, I encountered an issue:

The source training data contains 29080 items, while the target training data contains 3832 items. I set batch_size to 256. As a result, the last batch from each of src_loader and tgt_loader is smaller than 256, and the two sizes differ (for tgt_loader, the last batch has 248 items). The size of negatives_mask in SimCLRContrastiveLoss is therefore not compatible with batches of different sizes (e.g. 256 in a source batch, 248 in a target batch). This causes an error in denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature). Can I ask how you tackled this issue? Thanks!
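The arithmetic behind this report can be reproduced with a small pure-Python sketch. It assumes the training loop steps through the two loaders in lockstep (as with `zip`), which the issue implies but does not state, and the helper names are hypothetical. It is not the authors' fix. The `drop_last` flag mirrors the `drop_last=True` option of PyTorch's `DataLoader`, which discards an incomplete final batch.

```python
def batch_sizes(num_items, batch_size, drop_last=False):
    """Sizes of the batches a sequential loader would yield."""
    full, remainder = divmod(num_items, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # incomplete final batch
    return sizes


def paired_batch_sizes(num_src, num_tgt, batch_size, drop_last=False):
    """(source, target) batch sizes when both loaders are iterated with zip()."""
    return list(zip(batch_sizes(num_src, batch_size, drop_last),
                    batch_sizes(num_tgt, batch_size, drop_last)))


# Numbers from the issue: 29080 source items, 3832 target items, batch size 256.
pairs = paired_batch_sizes(29080, 3832, 256)
mismatched = [(s, t) for s, t in pairs if s != t]
print(mismatched)  # [(256, 248)]

# Dropping the incomplete final batch leaves only equal-sized pairs.
fixed = paired_batch_sizes(29080, 3832, 256, drop_last=True)
print(all(s == t for s, t in fixed))  # True
```

Under these assumptions, the only mismatched step is the last target batch (248 items) paired with a full source batch. Two possible workarounds are to construct both loaders with `drop_last=True`, or to build the negatives mask from the actual batch size at each step instead of a fixed 256.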
