
🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL (Findings) paper, Fingerprinting Fine-tuned Language Models in the wild.
The paper is available here.

Clone the repo and install the dependencies

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
cd ACL-FFLM
pip3 install -r requirements.txt

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts on r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit, r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting off with the main post followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment/reply as their prompt. In total, the subreddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (or classes).

  • Organic -

    Collected from comments on the 108 subreddits that the GPT2 bots were fine-tuned on. We randomly collected about 2,000 comments posted between June 2019 and January 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the 2 folders organic and synthetic, containing the comments from individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv

Note -
For the TL;DR run below, you also need to download the dataset.json and dataset.pkl files, which contain the pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
│   ├── splits (Folder already present)
│   │   └── 6 (Folder already present)
│   │       └── 108_800_100_200_dataset.json (File already present)
│   ├── dataset.json (to be added via drive link)
│   └── dataset.pkl (to be added via drive link)
└── authors.csv (File already present)
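
A quick way to confirm the files are in place before the TL;DR run (a minimal sketch; the paths come from the tree above):

import os

# Check that the files from the drive link and the pre-packaged
# split file are where the TL;DR run expects them.
required = [
    "dataset/synthetic/dataset.json",
    "dataset/synthetic/dataset.pkl",
    "dataset/synthetic/splits/6/108_800_100_200_dataset.json",
    "dataset/authors.csv",
]
for path in required:
    print(("OK     " if os.path.exists(path) else "MISSING"), path)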

108_800_100_200_dataset.json is a custom dataset file containing the comment IDs, the labels, and their separation into train, test, and validation splits.
When the models are run, the comments for each split are fetched from dataset.json using the comment IDs in the 108_800_100_200_dataset.json file.
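
To make the roles of the two files concrete, here is a minimal loading sketch. The schemas are assumptions for illustration (the split file is assumed to map comment IDs to labels per split, and dataset.json to map comment IDs to text); the repo's actual format may differ.

import json

with open("dataset/synthetic/splits/6/108_800_100_200_dataset.json") as f:
    splits = json.load(f)  # assumed: {"train": {comment_id: label, ...}, ...}
with open("dataset/synthetic/dataset.json") as f:
    comments = json.load(f)  # assumed: {comment_id: comment_text, ...}

# Pair each train-split comment's text with its label.
train = [(comments[cid], label) for cid, label in splits["train"].items()]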

Running the code

TL;DR

You can skip the pre-processing and the Create Splits steps if you want to run the code on the custom datasets already available in the dataset/synthetic/splits folder. Make sure to follow the instructions in the Note of the Dataset section for setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the script present in the folder create-splits. Select the type of data (organic/synthetic) you want to pre-process via the parameter synthetic in the file. By default the parameter is set for synthetic data, i.e., True. This creates pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.
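
The sketch below illustrates only the inputs and outputs expected of this step, not the repo's actual pre-processing logic; the contents written to the two files are placeholders.

import json
import os
import pickle

synthetic = True  # the parameter described above; set to False for organic data

data_type = "synthetic" if synthetic else "organic"
out_dir = os.path.join("dataset", data_type)
os.makedirs(out_dir, exist_ok=True)

# Placeholder: the real script reads and cleans the raw comments
# stored under data/[organic OR synthetic].
comments = {"example_comment_id": "example comment text"}

# The two files the later steps expect to find.
with open(os.path.join(out_dir, "dataset.json"), "w") as f:
    json.dump(comments, f)
with open(os.path.join(out_dir, "dataset.pkl"), "wb") as f:
    pickle.dump(comments, f)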

Create Train, Test and Validation Splits

We create train, test, and validation splits of the data. Parameters such as the minimum length of sentences (default 6), whether to lowercase sentences, the size of the train set (max and default 800/class), validation set (max and default 100/class), and test set (max and default 200/class), and the number of classes (max and default 108) can be set internally in create_splits.py, in the create-splits folder, under the commented PARAMETERS section; a sketch of that section follows below.
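
For reference, a rough sketch of what that PARAMETERS section might look like; the variable names are hypothetical, and only the default values are taken from the description above.

# PARAMETERS (hypothetical names; defaults from the README)
MIN_LEN = 6        # minimum number of tokens per sentence
LOWERCASE = True   # whether to lowercase sentences
TRAIN_SIZE = 800   # max comments per class in the train split
VAL_SIZE = 100     # max comments per class in the validation split
TEST_SIZE = 200    # max comments per class in the test split
NUM_CLASSES = 108  # number of fine-tuned LM classes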

cd create-splits
python3 create_splits.py

This creates a folder in dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation, and test datasets are all stored in the same file, with the filename [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json, e.g., 108_800_100_200_dataset.json.

Running the model

Now set the same parameters in the seq_classification.py file. To train and test the best model (fine-tuned GPT2/RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated containing the results of each epoch.

Note -
For the other models (pre-trained and writeprints), first generate the embeddings using the scripts in the folders models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then use the scripts in models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on the generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.
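
As a rough illustration of that second step, here is a minimal sketch of fitting an sklearn classifier on saved embeddings. The file names, the (features, labels) pickle format, and the choice of logistic regression are assumptions, not the repo's actual scripts.

import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical embedding files produced by the generate-embed step,
# each assumed to hold a (features, labels) pair.
with open("results/generate-embed/pre-trained/train.pkl", "rb") as f:
    X_train, y_train = pickle.load(f)
with open("results/generate-embed/pre-trained/test.pkl", "rb") as f:
    X_test, y_test = pickle.load(f)

# Any sklearn classifier could be swapped in here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))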

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email nirav17072[at]iiitd[dot]ac[dot]in.

⚖️ License

MIT


acl-fflm's Issues

dependency issues

Hi @nirav0999 and other folks, thanks for contributing the code. I was trying to install it in a fresh Google Colab. After loading all the code and data following the instructions, I still got some errors when executing requirements.txt. This doesn't seem to block the later training and testing.

%cd ACL-FFLM
!pip3 install -r requirements.txt

... ...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.24.1 which is incompatible.
xarray 2022.12.0 requires numpy>=1.20, but you have numpy 1.18.5 which is incompatible.
xarray 2022.12.0 requires packaging>=21.3, but you have packaging 20.4 which is incompatible.
xarray 2022.12.0 requires pandas>=1.3, but you have pandas 1.1.4 which is incompatible.
xarray-einstats 0.4.0 requires numpy>=1.20, but you have numpy 1.18.5 which is incompatible.
xarray-einstats 0.4.0 requires scipy>=1.6, but you have scipy 1.4.1 which is incompatible.
torchvision 0.14.1+cu116 requires torch==1.13.1, but you have torch 1.4.0 which is incompatible.
torchtext 0.14.1 requires torch==1.13.1, but you have torch 1.4.0 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 1.4.0 which is incompatible.
tifffile 2022.10.10 requires numpy>=1.19.2, but you have numpy 1.18.5 which is incompatible.
tables 3.7.0 requires numpy>=1.19.0, but you have numpy 1.18.5 which is incompatible.
pymc 4.1.4 requires cachetools>=4.2.1, but you have cachetools 4.1.1 which is incompatible.
pydata-google-auth 1.4.0 requires google-auth<3.0dev,>=1.25.0; python_version >= "3.6", but you have google-auth 1.23.0 which is incompatible.
proto-plus 1.22.1 requires protobuf<5.0.0dev,>=3.19.0, but you have protobuf 3.13.0 which is incompatible.
plotnine 0.8.0 requires numpy>=1.19.0, but you have numpy 1.18.5 which is incompatible.
plotnine 0.8.0 requires scipy>=1.5.0, but you have scipy 1.4.1 which is incompatible.
pathy 0.10.1 requires smart-open<7.0.0,>=5.2.1, but you have smart-open 3.0.0 which is incompatible.
pandas-gbq 0.17.9 requires google-auth>=1.25.0, but you have google-auth 1.23.0 which is incompatible.
pandas-gbq 0.17.9 requires pyarrow<10.0dev,>=3.0.0, but you have pyarrow 2.0.0 which is incompatible.
jaxlib 0.3.25+cuda11.cudnn805 requires numpy>=1.20, but you have numpy 1.18.5 which is incompatible.
jaxlib 0.3.25+cuda11.cudnn805 requires scipy>=1.5, but you have scipy 1.4.1 which is incompatible.
jax 0.3.25 requires numpy>=1.20, but you have numpy 1.18.5 which is incompatible.
jax 0.3.25 requires scipy>=1.5, but you have scipy 1.4.1 which is incompatible.
gym 0.25.2 requires importlib-metadata>=4.8.0; python_version < "3.10", but you have importlib-metadata 2.0.0 which is incompatible.
grpcio-status 1.48.2 requires grpcio>=1.48.2, but you have grpcio 1.33.2 which is incompatible.
googleapis-common-protos 1.57.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-colab 1.0.0 requires requests>=2.25.1, but you have requests 2.25.0 which is incompatible.
google-cloud-translate 3.8.4 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-cloud-storage 2.7.0 requires google-auth<3.0dev,>=1.25.0, but you have google-auth 1.23.0 which is incompatible.
google-cloud-language 2.6.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-cloud-firestore 2.7.3 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-cloud-datastore 2.11.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-cloud-core 2.3.2 requires google-auth<3.0dev,>=1.25.0, but you have google-auth 1.23.0 which is incompatible.
google-cloud-bigquery 3.4.1 requires grpcio<2.0dev,>=1.47.0, but you have grpcio 1.33.2 which is incompatible.
google-cloud-bigquery 3.4.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-cloud-bigquery-storage 2.17.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
google-api-core 2.11.0 requires google-auth<3.0dev,>=2.14.1, but you have google-auth 1.23.0 which is incompatible.
google-api-core 2.11.0 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 3.13.0 which is incompatible.
fastai 2.7.10 requires torch<1.14,>=1.7, but you have torch 1.4.0 which is incompatible.
db-dtypes 1.0.5 requires pyarrow>=3.0.0, but you have pyarrow 2.0.0 which is incompatible.
confection 0.0.3 requires srsly<3.0.0,>=2.4.0, but you have srsly 1.0.5 which is incompatible.
cmdstanpy 1.0.8 requires numpy>=1.21, but you have numpy 1.18.5 which is incompatible.
Successfully installed Keras-2.4.3 Markdown-3.3.3 Pillow-8.0.1 PyYAML-5.3.1 absl-py-0.11.0 blis-0.7.4 boto3-1.16.14 botocore-1.19.14 cachetools-4.1.1 catalogue-1.0.0 certifi-2020.11.8 chardet-3.0.4 cycler-0.10.0 cymem-2.0.5 datasets-1.1.2 dill-0.3.3 en-core-web-sm-2.3.1 filelock-3.0.12 future-0.18.2 gast-0.3.3 gensim-3.8.3 google-auth-1.23.0 google-auth-oauthlib-0.4.2 grpcio-1.33.2 h5py-2.10.0 hypopt-1.0.9 importlib-metadata-2.0.0 jmespath-0.10.0 joblib-0.17.0 jsonlines-1.2.0 kiwisolver-1.3.1 matplotlib-3.3.2 multiprocess-0.70.11.1 murmurhash-1.0.5 nltk-3.5 numpy-1.18.5 oauthlib-3.1.0 packaging-20.4 pandas-1.1.4 plac-1.1.3 preshed-3.0.5 protobuf-3.13.0 pyarrow-2.0.0 pyparsing-2.4.7 python-dateutil-2.8.1 pytz-2020.4 readability-0.3.1 regex-2020.10.28 requests-2.25.0 requests-oauthlib-1.3.0 rsa-4.6 s3transfer-0.3.3 sacremoses-0.0.43 scikit-learn-0.24.1 scipy-1.4.1 sentencepiece-0.1.91 sklearn-0.0 smart-open-3.0.0 sortedcontainers-2.3.0 spacy-2.3.5 srsly-1.0.5 tensorboard-2.3.0 tensorboard-plugin-wit-1.7.0 tensorflow-2.3.1 tensorflow-estimator-2.3.0 tensorflow-gpu-2.3.1 termcolor-1.1.0 thinc-7.4.5 threadpoolctl-2.1.0 tokenizers-0.9.2 torch-1.4.0 tqdm-4.49.0 urllib3-1.25.11 wasabi-0.8.0 wget-3.2 wrapt-1.12.1 writeprints-0.2.1 xgboost-1.2.1 xxhash-2.0.0 zipp-3.4.0
