
A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text

Home Page: https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Topics: cognition, linguistics, machine-learning, nlg, large-language-models

echo's Introduction

ECHO

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text

Table of Contents

  1. Road Map
  2. Repository Overview
  3. Usage
  4. Datasets Overview
  5. Models

Road Map

Refer to the project description here for more detailed information.

  1. 🚀 Prompting (Completed)

  2. 📈 Generating Data at Large Scale (Completed)

  3. 📊 Extracting Metrics (Completed)

  4. 🤖 Training Classifiers (In progress)

  5. 🧪 Experimental Design (Upcoming)

Repository Overview

The main contents of the repository are listed below.

  • datasets: the original human datasets (human_datasets), described in the overview below, and the generated ai_datasets
  • src: scripts for generating data, running PCA/computing distances, and extracting metrics; see src/README.md for greater detail
  • results: preliminary results (distance plots, length distributions, etc.)
  • metrics: text metrics for each dataset (human and AI), extracted with textdescriptives
  • notes: Jupyter notebooks used for meetings with the echo team to present progress
  • tokens: place your Hugging Face Hub access token (.txt) here to run the Llama 2 models
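As a minimal illustration of how such metrics can be extracted with textdescriptives (the example text, spaCy model, and metric groups below are illustrative, not the repository's exact configuration):

import textdescriptives as td

# extract a couple of metric groups for a short example text (illustrative settings)
metrics_df = td.extract_metrics(
    text="the cat sat on the mat. it was a quiet afternoon.",
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats", "readability"],
)
print(metrics_df.T)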

Usage

The setup was tested on Ubuntu 22.04 (UCloud, Coder Python 1.87.2) using Python 3.10.12.

Setup

To install the necessary requirements in a virtual environment (env), please run setup.sh in the terminal:

bash setup.sh

Generating Text

To reproduce the generation of text implemented with vLLM, run in the terminal:

bash src/generate/run.sh

Note that this will run several models on all datasets for various temperatures.

If you wish to play around with individual models/datasets or use the Hugging Face pipeline implementation, please refer to the instructions in src/generate/README.md.
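For reference, a minimal sketch of what vLLM-based generation looks like in general; the model name, prompt, and sampling settings below are illustrative and not taken from the repository's scripts:

from vllm import LLM, SamplingParams

# illustrative settings; the repository's run.sh sweeps several models and temperatures
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=1.0, max_tokens=256)

prompts = ["summarize this in a few sentences: ..."]  # placeholder prompt
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)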

Running Other Parts of the Pipeline

To run other parts of the pipeline, such as analysis or cleaning of data, please refer to the individual subfolders and their READMEs, for instance src/metrics/README.md.

Datasets Overview

All datasets can be found under datasets/human_datasets.

In each folder, data.ndjson contains the processed version of the dataset (lowercased). Each folder also contains additional files used, e.g., to generate or inspect the datasets.
Our datasets are sampled from the following sources:

  • dailymail_cnn: https://huggingface.co/datasets/cnn_dailymail. This is a summarization dataset, which includes both extractive and abstractive summarization. Currently, 3000 examples have been sampled;
  • dailydialog: https://huggingface.co/datasets/daily_dialog. Dialog dataset. We sampled n-1 turns as context, and the last turn is tagged as human completion. Currently, 5000 examples have been sampled, with varying context length. This dataset also includes manual emotion and speech act annotations for both context and completions;
  • mrpc: https://paperswithcode.com/dataset/mrpc. Paraphrase corpus, from which we extract only examples that are manually labelled as paraphrases. Currently, we have 3900 examples;
  • stories: prompts and completions for story generation. The dataset is described here: https://aclanthology.org/P18-1082/. Currently, we have 5000 examples.

README files within each folder include further details for each dataset.

Preprocessing

For dailydialog, punctuation has been standardized and irregular transcriptions have been normalized (see datasets/dailydialog/utils.py). Text for all datasets is lowercased, but further preprocessing may be needed. Unprocessed datasets are kept under datasets/*/raw.ndjson.
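As a minimal sketch, a processed dataset can be loaded with pandas (the path below is illustrative):

import pandas as pd

# newline-delimited JSON, one record per line (path is illustrative)
df = pd.read_json("datasets/human_datasets/stories/data.ndjson", lines=True)
print(df.head())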

Models

The models currently used for data generation (as of 19 March 2024):

  1. llama-chat 7b (meta-llama/Llama-2-7b-chat-hf)
  2. beluga 7b (stabilityai/StableBeluga-7B)
  3. mistral 7b (mistralai/Mistral-7B-Instruct-v0.2)
  4. llama-chat 13b (meta-llama/Llama-2-13b-chat-hf)
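For illustration, a hedged sketch of loading one of these models through the Hugging Face transformers pipeline (this is not the repository's generation script; the prompt and sampling settings are illustrative):

from transformers import pipeline

# illustrative; the repository's main generation path uses vLLM (see src/generate)
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

out = generator(
    "paraphrase this: the cat sat on the mat.",
    max_new_tokens=100,
    do_sample=True,
    temperature=1.0,
)
print(out[0]["generated_text"])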

echo's People

Contributors

andersroen, minaalmasi, rbkhb, rbroc, rdkm89, yuri-bizzoni


Forkers

x-tabdeveloping

echo's Issues

Source datasets for NLG

(All these tasks will probably require model-specific prompt engineering. Consider doing evaluation, either through external metrics or human validation.)

Number of examples per dataset: cap at 5000 (expand if possible)

Paraphrasing:
MRPC: paraphrases, 5,801 examples

Summarization:
DailyMail / CNN: 300,000 examples (sample 5 for iteration)

Dialogue:
DailyDialog: multi-turn dialogues (15k).
Approach (a minimal sketch follows this list):

  • Randomly sample the number of turns fed as context to the model.
  • Iteratively pass examples with an increasing number of turns as context.
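A minimal sketch of this turn-sampling idea, assuming each dialogue is a list of turn strings (the function name and splitting scheme are illustrative, not the repository's implementation):

import random

def split_dialogue(turns: list[str], min_context: int = 1) -> tuple[str, str]:
    # randomly choose how many turns to feed as context;
    # the turn immediately after the sampled context is tagged as the human completion
    n_context = random.randint(min_context, len(turns) - 1)
    context = " ".join(turns[:n_context])
    human_completion = turns[n_context]
    return context, human_completion

dialogue = ["hi , how are you ?", "i am fine , thanks .", "what are you doing today ?", "just reading a book ."]
context, completion = split_dialogue(dialogue)
print(context, "->", completion)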

Socratic Questions (context - smart question (HG) - AI generated (HG))
https://aclanthology.org/2023.eacl-main.12.pdf

Story Generation
GitHub: one repository for story generation: https://github.com/facebookresearch/fairseq/blob/main/examples/stories/README.md
Kaggle: Writing Prompts
https://www.kaggle.com/datasets/ratthachat/writing-prompts

Additional datasets
GEM: https://aclanthology.org/2021.gem-1.10.pdf

dailymail_cnn: weird cleaning or weird formatting?

DailyMail

In the meeting yesterday #42, @rdkm89 noticed that the cleaned dailymail_cnn version was weirdly formatted:

richard griffiths laid to rest at holy trinity church in stratford-upon-avon .
daniel radcliffe weeps as he leads tributes to beloved star of withnail and i .
nigel havers, lord fellowes and jack whitehall attend moving ceremony .
richard e. grant sends card: 'to my beloved .
uncle monty, chin chin'

However, our sampled raw data looks the same:

Richard Griffiths laid to rest at Holy Trinity church in Stratford-upon-Avon .
Daniel Radcliffe weeps as he leads tributes to beloved star of Withnail and I .
Nigel Havers, Lord Fellowes and Jack Whitehall attend moving ceremony .
Richard E. Grant sends card: 'To my beloved .
Uncle Monty, chin chin'

Looking at the HF dataset (https://huggingface.co/datasets/cnn_dailymail/viewer/3.0.0/train?q=Richard+Griffiths&row=212018), it seems that it is formatted this way from the beginning.

I'm thinking we can't really fix this - there should ideally not be a dot between the two sentences "To my beloved" and "Uncle Monty", but this is a semantic issue. However, I could clean the dataset so that there is no whitespace between dots and words (the source text is also like this - not just the human completion).

Thoughts @rbroc ? (not urgent)

Models

Looking at both foundation and instruction-tuned models. For this project, the latter are probably going to be the only target, as they are likely to work better.

Available

Maybe for later

Not open-source

  • GPT-4 (pricing: $0.03 / 1k tokens for prompts; $0.06 / 1k tokens for completions) - (on hold, because an instruction-tuned version is not available)
  • PaLM - (on hold)
  • BARD

Open-source

  • Cerebras GPT: https://huggingface.co/cerebras/Cerebras-GPT-6.7B
  • Blender for dialogue

ditching foundation models

This is just for the record -- no action needed right now.
We have decided to ditch foundation models for now, as they do not generate plausible outputs. We may consider resurrecting them later on (with whatever prompt we use for the other models) to use them as a baseline for the instruction-tuned models.

how to inspect/compare extracted features at scale?

Currently, even with TextDescriptives only (i.e., no additional feature sets for cognitive features), we have dozens of features, and it is infeasible to look at their distributions individually, both in the context of simple data exploration and for actual modeling purposes.

There are a few options to deal with exploding dimensionality here:

  • Do nothing, and simply feed raw features to tree-based models for AI vs. human text discrimination, looking at which features are most predictive post hoc;
  • Apply dimensionality reduction a priori, e.g., through PCA, then visualize/compare (as a function of prompts, or as a function of whether text is produced by humans or models). This reduces dimensionality, but potentially also reduces interpretability;
  • Compare features one-by-one with statistical tests, e.g., across prompt types, or across human vs. models;

On top of this, additional output-driven dimensionality reduction (e.g., LASSO) could be applied to select the feature set that is a) most affected by prompting, or b) most discriminative between human- and model-generated text.
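A minimal sketch of the PCA option, assuming the extracted metrics sit in a pandas dataframe with one feature per column (the file path, column selection, and component count are illustrative):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# illustrative: load extracted metrics and keep only numeric feature columns without NAs
metrics = pd.read_json("metrics/human/stories.ndjson", lines=True)  # path is illustrative
features = metrics.select_dtypes("number").dropna(axis=1)

# standardize, then reduce to a handful of components for exploration
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=5)
components = pca.fit_transform(X)

print(pca.explained_variance_ratio_)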

release dataset as first milestone?

we could consider releasing the datasets (& dataset generation/expansion code) as a first milestone for the project -- independently of the experiment and predictive modeling parts of the project, this could be a valuable resource and a nice publication (LREC?) per se.

Dropping raw features with many zero values prior to PCA?

Problem

Should we remove columns with a large proportion of zero values prior to PCA?
Only the NA cols make PCA (scikit-learn) throw an error. Does it matter when doing PCA that a lot of values are zeroes in our case? Google is unclear on this.

Initially in #63, I had removed any column that has 90% or more zero values:

na_and_zero_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE', 'contains_lorem ipsum', 'duplicate_line_chr_fraction', 'duplicate_ngram_chr_fraction_10', 'duplicate_ngram_chr_fraction_7', 'duplicate_ngram_chr_fraction_8', 'duplicate_ngram_chr_fraction_9', 'duplicate_paragraph_chr_fraction', 'pos_prop_SYM', 'pos_prop_X', 'proportion_bullet_points', 'proportion_ellipsis', 'symbol_to_word_ratio_#']

However in #64, I went back to only removing cols that have NA:

na_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE']

Without having inspected the results thoroughly, I doubt it changes the actual classification results significantly. But it could make the interpretation of the individual principal components more difficult?

Code here for the curious

Code chunk from identify_NA_metrics.py

def identify_NA_metrics(df, percent_zero: float = None):
    '''
    Identify columns with NA values (and optionally also columns with a high percentage of 0s)

    Args:
        df: dataframe to check
        percent_zero: threshold for the proportion of 0s in a column for it to be flagged for removal.
                      Default is None (keep cols with many 0s).

    Returns:
        list of column names to drop
    '''
    # all columns containing at least one NA value
    na_cols = df.columns[df.isna().any()].tolist()

    # optionally also flag columns dominated by zero values
    if percent_zero is not None:
        if percent_zero < 0 or percent_zero > 1:  # proportion must lie between 0 and 1
            raise ValueError("percent_zero must be between 0 and 1")

        zero_cols = [col for col in df.columns if df[col].eq(0).sum() / len(df) >= percent_zero]
    else:
        zero_cols = []

    return na_cols + zero_cols
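A minimal usage sketch, assuming the metrics are stored as ndjson files and loaded with pandas (the file paths below are illustrative):

import pandas as pd

# load metrics for all datasets at once so the dropped columns are identical across datasets
# (paths are illustrative)
metrics = pd.concat(
    [pd.read_json(path, lines=True) for path in ["metrics/human/stories.ndjson", "metrics/ai/stories.ndjson"]],
    ignore_index=True,
)

cols_to_drop = identify_NA_metrics(metrics, percent_zero=0.9)
metrics = metrics.drop(columns=cols_to_drop)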

Small notes:

  1. I identify NA/zero metrics by loading in the metrics from all datasets at once, so that the columns removed are the same across datasets. That is, if stories has an NA value in first_order_coherence but the other datasets do not, the column will still be removed for all datasets.
  2. I also remove pos_prop_PUNCT manually, as we discussed that features related to SPACE and PUNCTUATION should be removed because we have manipulated those aspects of the text in our cleaning.

Refactoring: Weird steps in the pipeline (cleaning at various steps that could be streamlined)

The current pipeline is displayed in the image below.

Some steps that may need to be reconsidered

  • When extracting metrics (step 3) for both human and AI text, the AI text is lowercased/cleaned first, but this could be done in a separate step and saved/stored. The reason I haven't done this is that the repo would end up a little big.
  • When using the metrics for classification (step 4B), I only then remove the few faulty generations that are below the minimum length. These should ideally be removed prior to steps 4A and 4B to avoid any mistakes (accidentally including them in other analysis work).
[Screenshot of the current pipeline (2024-08-06)]

preprocess/normalize text across all datasets

e.g., make sure punctuation is used sensibly, and that there are no weird prefixes or other features that may cause artefacts when comparing model- and human-generated text. Done for dailydialog already; it would be great to do for the other datasets too.

principled experimentation with prompting

  • Create a few different prompt options (2 to 5) for each task
  • Qualitatively inspect how models behave with each of these
  • Run generation at scale with all prompt options
  • Run text through text descriptives to look at overall distributional differences (how to do the comparison is an open question, see #13)

DailyDialog: Regenerate dataset with correct lengths + extract new metrics for it

Regenerating DailyDialog

Dailydialog was created with incorrect min and max lengths and will therefore be regenerated, as discussed in the March meeting (#51). As it has not been urgent, focus has been on building the metrics extraction pipeline.

Dailydialog will be regenerated at some point before the classifiers have been finalised. NB: remember to also extract metrics for the dataset again once it has been regenerated for all language models.

flag weird model-generated text

Related to #51, this is just to keep in mind that, at the moment, we are okay keeping examples where models don't fully follow instructions, or start producing gibberish. But later on, we might consider flagging (e.g., using few-shot learning with SetFit) weird examples -- both to quantify how many there are, and compare how the amount of weird stuff changes across datasets, models and decoding parameters.

data cleanup before fitting models

there's some dataset-specific stuff like " < newline > " annotations in WritingPrompts which we may want to standardize and remove before fitting predictive models at scale (this should not affect median distances between human and LLM completions used for prompt selection, but we may also later want to recompute these medians to provide "cleaner" absolute values in the paper)
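A minimal sketch of such normalization, assuming the annotations look like " < newline > " (the exact regex pattern and helper name are illustrative):

import re

def strip_newline_tokens(text: str) -> str:
    # collapse WritingPrompts-style "< newline >" tokens (illustrative pattern) into regular spaces
    text = re.sub(r"\s*<\s*newline\s*>\s*", " ", text)
    # squeeze any resulting double spaces
    return re.sub(r" {2,}", " ", text).strip()

print(strip_newline_tokens("he opened the door < newline > < newline > and stepped outside"))
# he opened the door and stepped outside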

Computing Perplexity outside of TextDescriptives (and Entropy)

Discussed with Yuri today (7/08/24). From the meeting notes in April (23/04/24), I noted down that we were considering computing perplexity using HF's evaluate library, using a baseline model like GPT-2 to serve as an "oracle" for perplexity.

Some notes for future meetings:

Picking a baseline model & general thoughts about interpretation of the metric

The approach entails that the perplexity will change depending on the baseline model: for instance, using a model that has seen much more data than GPT-2 may produce lower perplexity scores than GPT-2 for the same text.

Therefore, the interpretation would not be about whether the text has high or low perplexity in general, but rather whether the models (and humans) have higher or lower perplexity relative to each other.

With that being said, I'm still unsure about the importance of choosing a model (should I just run with GPT-2?).

Plan for Entropy

Planning to just compute entropy by taking the log of perplexity, given that they are directly related. Formula here.

Note that the formula in the link above expresses perplexity as $\text{Perplexity}(X) = 2^{H(X)}$, but the HF readme explains that the perplexity "is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base e." (see link). Therefore I'm taking np.log(perplexity) to compute it rather than np.log2(perplexity).

But which units are these perplexity and entropy scores in? (With base e the entropy would be in nats; with base 2 it would be in bits.)
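A minimal sketch of this computation, assuming the Hugging Face evaluate library's perplexity metric with GPT-2 as the baseline "oracle" model (the example texts are illustrative):

import numpy as np
import evaluate

# perplexity via Hugging Face evaluate, with GPT-2 as the baseline model
perplexity_metric = evaluate.load("perplexity", module_type="metric")
texts = ["the cat sat on the mat.", "colorless green ideas sleep furiously."]  # illustrative

results = perplexity_metric.compute(predictions=texts, model_id="gpt2")
perplexities = np.array(results["perplexities"])

# evaluate defines perplexity with exponent base e, so entropy (in nats) is the natural log
entropies = np.log(perplexities)
print(perplexities, entropies)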

Landing on prompts

Notes mostly for documentation purposes!

The new prompts have been selected and I'm in the midst of generating data across models and temperatures (currently 1, 1.5 and 2). See below for notes on this process.

New Task Prompts

#2.0 prompts
"dailymail_cnn_21": "summarize this in a few sentences: ",
"mrpc_21": "paraphrase this: ",
"stories_21": "write a story based on this: ", 
"stories_22": "continue the story: ", 
"dailydialog_21": "continue the conversation between A and B by writing a single response to the latest speaker. write only a concise response and nothing else: ", 

Principles behind the task prompts

I have tried to keep prompts as short and general as possible. That is, something that a user would write and not something that requires a lot of engineering, and the generations seem OK (with some expected issues, e.g., models producing "Sure! I will paraphrase this!"). The exception is dailydialog, which requires more specific instructions.

For the stories dataset, the prompts lead the models to be repetitive and to start with "Title: ". The BSc thesis project used the prompt "continue the story" as it seemed to produce this behaviour less often. As an addition, I have included the prompt "write a story based on this" to see how the models would fare with it. It was inspired by the instructions on the subreddit that the dataset was based on ("If you see a prompt you like, simply write a short story based on it.").

Removing the System Prompt from Llama2

We initially had a custom system prompt like this:

"llama2_chat": "You are an AI, but you do not deviate from the task prompt and you do not small talk. Never begin your response with 'Sure, here is my response: ' or anything of the like. It is important that you finish without getting cut off."

This custom prompt was created to see if we could get around the model producing completions beginning with e.g., "Sure, I'll paraphrase ...", but since it produces such artefacts regardless, I think it is best to remove it, as the devs behind Llama now recommend removing it by default due to performance issues (see their repository). This would also adhere to our principle of using these models as a regular user would (Hugging Face also treats system prompts as an extra feature rather than a default on their Llama interactive space).

Moving on from here

If you have any comments (@rbroc), please let me know. My current plan is to finish generating data with varying temperatures with the prompts above, and then we can manually inspect data and only regenerate/redo prompts if something looks very very weird.

UPDATE 14/03

We settled on all prompts with the 21 suffix, as mentioned in #51:

#2.0 prompts
"dailymail_cnn_21": "summarize this in a few sentences: ",
"mrpc_21": "paraphrase this: ",
"stories_21": "write a story based on this: ", 
"dailydialog_21": "continue the conversation between A and B by writing a single response to the latest speaker. write only a concise response and nothing else: ", 
