
A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text

Home Page: https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Topics: cognition, linguistics, machine-learning, nlg, large-language-models

echo's Introduction

ECHO

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text

Table of Contents

  1. Road Map
  2. Repository Overview
  3. Usage
  4. Datasets Overview
  5. Models

Road Map

Refer to the project description here for more detailed information.

  1. 🚀 Prompting (Completed)

  2. 📈 Generating Data at Large Scale (Completed)

  3. 📊 Extracting Metrics (Completed)

  4. 🤖 Training Classifiers (In progress)

  5. 🧪 Experimental Design (Upcoming)

Repository Overview

The main contents of the repository are listed below.

  • datasets: the original human datasets (human_datasets), described in the overview below, and the generated ai_datasets
  • src: scripts for generating data, running PCA/computing distances, and extracting metrics; see src/README.md for greater detail
  • results: preliminary results (distance plots, length distributions, etc.)
  • metrics: text metrics for each dataset (human and AI), extracted with textdescriptives
  • notes: Jupyter notebooks used for meetings with the echo team to present progress
  • tokens: place your Hugging Face Hub access token (.txt) here to run the Llama 2 models
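As a minimal illustration of how such metrics can be extracted with textdescriptives (the example text, spaCy model, and metric groups below are illustrative, not the repository's exact configuration):

import textdescriptives as td

# extract a couple of metric groups for a short example text (illustrative settings)
metrics_df = td.extract_metrics(
    text="the cat sat on the mat. it was a quiet afternoon.",
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats", "readability"],
)
print(metrics_df.T)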

Usage

The setup was tested on Ubuntu 22.04 (UCloud, Coder Python 1.87.2) using Python 3.10.12.

Setup

To install the necessary requirements in a virtual environment (env), please run setup.sh in the terminal:

bash setup.sh

Generating Text

To reproduce the generation of text implemented with vLLM, run in the terminal:

bash src/generate/run.sh

Note that this will run several models on all datasets for various temperatures.

If you wish to play around with individual models/datasets or use the Hugging Face pipeline implementation, please refer to the instructions in src/generate/README.md.
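For reference, a minimal sketch of what vLLM-based generation looks like in general; the model name, prompt, and sampling settings below are illustrative and not taken from the repository's scripts:

from vllm import LLM, SamplingParams

# illustrative settings; the repository's run.sh sweeps several models and temperatures
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=1.0, max_tokens=256)

prompts = ["summarize this in a few sentences: ..."]  # placeholder prompt
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)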

Running Other Parts of the Pipeline

To run other parts of the pipeline, such as analysis or cleaning of data, please refer to the individual subfolders and their READMEs, for instance src/metrics/README.md.

Datasets Overview

All datasets can be found under datasets/human_datasets.

In each folder, data.ndjson contains the processed version of the dataset (lowercased). Each folder also contains additional files used, e.g., to generate or inspect the datasets.
Our datasets are sampled from the following sources:

  • dailymail_cnn: https://huggingface.co/datasets/cnn_dailymail. This is a summarization dataset, which includes both extractive and abstractive summarization. Currently, 3000 examples have been sampled;
  • dailydialog: https://huggingface.co/datasets/daily_dialog. Dialog dataset. We sampled n-1 turns as context, and the last turn is tagged as human completion. Currently, 5000 examples have been sampled, with varying context length. This dataset also includes manual emotion and speech act annotations for both context and completions;
  • mrpc: https://paperswithcode.com/dataset/mrpc. Paraphrase corpus, from which we extract only examples that are manually labelled as paraphrases. Currently, we have 3900 examples;
  • stories: prompts and completions for story generation. The dataset is described here: https://aclanthology.org/P18-1082/. Currently, we have 5000 examples.

README files within each folder include further details for each dataset.

Preprocessing

For dailydialog, punctuation has been standardized and irregular transcriptions have been normalized (see datasets/dailydialog/utils.py). Text for all datasets is lowercased, but further preprocessing may be needed. Unprocessed datasets are kept under datasets/*/raw.ndjson.
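As a minimal sketch, a processed dataset can be loaded with pandas (the path below is illustrative):

import pandas as pd

# newline-delimited JSON, one record per line (path is illustrative)
df = pd.read_json("datasets/human_datasets/stories/data.ndjson", lines=True)
print(df.head())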

Models

The models currently used for data generation (as of 19 March 2024):

  1. llama-chat 7b (meta-llama/Llama-2-7b-chat-hf)
  2. beluga 7b (stabilityai/StableBeluga-7B)
  3. mistral 7b (mistralai/Mistral-7B-Instruct-v0.2)
  4. llama-chat 13b (meta-llama/Llama-2-13b-chat-hf)
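For illustration, a hedged sketch of loading one of these models through the Hugging Face transformers pipeline (this is not the repository's generation script; the prompt and sampling settings are illustrative):

from transformers import pipeline

# illustrative; the repository's main generation path uses vLLM (see src/generate)
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

out = generator(
    "paraphrase this: the cat sat on the mat.",
    max_new_tokens=100,
    do_sample=True,
    temperature=1.0,
)
print(out[0]["generated_text"])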

echo's People

Contributors

andersroen, minaalmasi, rbkhb, rbroc, rdkm89, yuri-bizzoni


Forkers

x-tabdeveloping

echo's Issues

Source datasets for NLG

(All these tasks will probably require model-specific prompt engineering. Consider doing evaluation, either through external metrics or human validation.)

Number of examples per dataset: cap at 5000 (expand if possible)

Paraphrasing:
MRPC: paraphrases, 5,801 examples

Summarization:
DailyMail / CNN: 300,000 examples (sample 5 for iteration)

Dialogue:
DailyDialog: multi-turn dialogues (15k).
Approach (a minimal sketch follows this list):

  • Randomly sample the number of turns fed as context to the model.
  • Iteratively pass examples with an increasing number of turns as context.
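A minimal sketch of this turn-sampling idea, assuming each dialogue is a list of turn strings (the function name and splitting scheme are illustrative, not the repository's implementation):

import random

def split_dialogue(turns: list[str], min_context: int = 1) -> tuple[str, str]:
    # randomly choose how many turns to feed as context;
    # the turn immediately after the sampled context is tagged as the human completion
    n_context = random.randint(min_context, len(turns) - 1)
    context = " ".join(turns[:n_context])
    human_completion = turns[n_context]
    return context, human_completion

dialogue = ["hi , how are you ?", "i am fine , thanks .", "what are you doing today ?", "just reading a book ."]
context, completion = split_dialogue(dialogue)
print(context, "->", completion)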

Socratic Questions (context - smart question (HG) - AI generated (HG))
https://aclanthology.org/2023.eacl-main.12.pdf

Story Generation
GitHub: one repository for story generation: https://github.com/facebookresearch/fairseq/blob/main/examples/stories/README.md
Kaggle: Writing Prompts
https://www.kaggle.com/datasets/ratthachat/writing-prompts

Additional datasets
GEM: https://aclanthology.org/2021.gem-1.10.pdf

dailymail_cnn: weird cleaning or weird formatting?

DailyMail

In the meeting yesterday #42, @rdkm89 noticed that the cleaned dailymail_cnn version was weirdly formatted:

richard griffiths laid to rest at holy trinity church in stratford-upon-avon .
daniel radcliffe weeps as he leads tributes to beloved star of withnail and i .
nigel havers, lord fellowes and jack whitehall attend moving ceremony .
richard e. grant sends card: 'to my beloved .
uncle monty, chin chin'

However, our sampled raw data looks the same:

Richard Griffiths laid to rest at Holy Trinity church in Stratford-upon-Avon .
Daniel Radcliffe weeps as he leads tributes to beloved star of Withnail and I .
Nigel Havers, Lord Fellowes and Jack Whitehall attend moving ceremony .
Richard E. Grant sends card: 'To my beloved .
Uncle Monty, chin chin'

Looking at the HF dataset (https://huggingface.co/datasets/cnn_dailymail/viewer/3.0.0/train?q=Richard+Griffiths&row=212018), it seems that it is formatted this way from the beginning.

I'm thinking we can't really fix this - there should ideally not be a dot between the two sentences "To my beloved" and "Uncle Monty", but this is a semantic issue. However, I could clean the dataset so that there is no whitespace between dots and words (the source text is also like this - not just the human completion).

Thoughts @rbroc ? (not urgent)

Models

Looking at both foundation and instruction-tuned models. For this project, the latter are probably going to be the only target, as they are likely to work better.

Available

Maybe for later

Not open-source

  • GPT-4 (pricing: $0.03 / 1k tokens for prompts; $0.06 / 1k tokens for completions) - (on hold, because an instruction-tuned version is not available)
  • PaLM - (on hold)
  • BARD

Open-source

  • Cerebras GPT: https://huggingface.co/cerebras/Cerebras-GPT-6.7B
  • Blender for dialogue

ditching foundation models

This is just for the record -- no action needed right now.
We have decided to ditch foundation models for now, as they do not generate plausible outputs. We may consider resurrecting them later on (with whatever prompt we use for the other models) to use them as a baseline for the instruction-tuned models.

how to inspect/compare extracted features at scale?

Currently, even with TextDescriptives only (i.e., no additional feature sets for cognitive features), we have dozens of features, and it is infeasible to look at their distributions individually, both in the context of simple data exploration and for actual modeling purposes.

There are a few options to deal with exploding dimensionality here:

  • Do nothing, and simply feed raw features to tree-based models for AI vs. human text discrimination, looking at which features are most predictive post hoc;
  • Apply dimensionality reduction a priori, e.g., through PCA, then visualize/compare (as a function of prompts, or as a function of whether text is produced by humans or models). This reduces dimensionality, but potentially also reduces interpretability;
  • Compare features one-by-one with statistical tests, e.g., across prompt types, or across human vs. models;

On top of this, additional output-driven dimensionality reduction (e.g., LASSO) could be applied to select the feature set that is a) most affected by prompting, or b) most discriminative between human- and model-generated text.
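A minimal sketch of the PCA option, assuming the extracted metrics sit in a pandas dataframe with one feature per column (the file path, column selection, and component count are illustrative):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# illustrative: load extracted metrics and keep only numeric feature columns without NAs
metrics = pd.read_json("metrics/human/stories.ndjson", lines=True)  # path is illustrative
features = metrics.select_dtypes("number").dropna(axis=1)

# standardize, then reduce to a handful of components for exploration
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=5)
components = pca.fit_transform(X)

print(pca.explained_variance_ratio_)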

release dataset as first milestone?

we could consider releasing the datasets (& dataset generation/expansion code) as a first milestone for the project -- independently of the experiment and predictive modeling parts of the project, this could be a valuable resource and a nice publication (LREC?) per se.

Dropping raw features with many zero values prior to PCA?

Problem

Should we remove columns with a large proportion of zero values prior to PCA?
Only the NA cols make PCA (scikit-learn) throw an error. Does it matter when doing PCA that a lot of values are zeroes in our case? Google is unclear on this.

Initially in #63, I had removed any column that has 90% or more zero values:

na_and_zero_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE', 'contains_lorem ipsum', 'duplicate_line_chr_fraction', 'duplicate_ngram_chr_fraction_10', 'duplicate_ngram_chr_fraction_7', 'duplicate_ngram_chr_fraction_8', 'duplicate_ngram_chr_fraction_9', 'duplicate_paragraph_chr_fraction', 'pos_prop_SYM', 'pos_prop_X', 'proportion_bullet_points', 'proportion_ellipsis', 'symbol_to_word_ratio_#']

However in #64, I went back to only removing cols that have NA:

na_cols = ['first_order_coherence', 'second_order_coherence', 'smog', 'pos_prop_SPACE']

Without having inspected the results thoroughly, I doubt it changes the actual classification results significantly. But it could make the interpretation of the individual principal components more difficult?

Code here for the curious

Code chunk from identify_NA_metrics.py

def identify_NA_metrics(df, percent_zero: float = None):
    '''
    Identify columns with NA values (and optionally also columns with a high percentage of 0s)

    Args:
        df: dataframe to check
        percent_zero: threshold for the proportion of 0s in a column for it to be flagged for removal.
                      Default is None (keep cols with many 0s).

    Returns:
        list of column names to drop
    '''
    # all columns containing at least one NA value
    na_cols = df.columns[df.isna().any()].tolist()

    # optionally also flag columns dominated by zero values
    if percent_zero is not None:
        if percent_zero < 0 or percent_zero > 1:  # proportion must lie between 0 and 1
            raise ValueError("percent_zero must be between 0 and 1")

        zero_cols = [col for col in df.columns if df[col].eq(0).sum() / len(df) >= percent_zero]
    else:
        zero_cols = []

    return na_cols + zero_cols
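A minimal usage sketch, assuming the metrics are stored as ndjson files and loaded with pandas (the file paths below are illustrative):

import pandas as pd

# load metrics for all datasets at once so the dropped columns are identical across datasets
# (paths are illustrative)
metrics = pd.concat(
    [pd.read_json(path, lines=True) for path in ["metrics/human/stories.ndjson", "metrics/ai/stories.ndjson"]],
    ignore_index=True,
)

cols_to_drop = identify_NA_metrics(metrics, percent_zero=0.9)
metrics = metrics.drop(columns=cols_to_drop)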

Small notes:

  1. I identify NA/zero metrics by loading in the metrics from all datasets at once, so that the columns removed are the same across datasets. That is, if stories has an NA value in first_order_coherence but the other datasets do not, the column will still be removed for all datasets.
  2. I also remove pos_prop_PUNCT manually, as we discussed that features related to SPACE and PUNCTUATION should be removed because we have manipulated those aspects of the text in our cleaning.

Refactoring: Weird steps in the pipeline (cleaning at various steps that could be streamlined)

The current pipeline is displayed in the image below.

Some steps that may need to be reconsidered

  • When extracting metrics (step 3) for both human and AI text, the AI text is lowercased/cleaned first, but this could be done in a separate step and saved/stored. The reason I haven't done this is that the repo would end up a little big.
  • When using the metrics for classification (step 4B), I only then remove the few faulty generations that are below the minimum length. These should ideally be removed prior to steps 4A and 4B to avoid any mistakes (accidentally including them in other analysis work).
[Screenshot of the current pipeline (2024-08-06)]

preprocess/normalize text across all datasets

e.g., make sure punctuation is used sensibly, and that there are no weird prefixes or other features that may cause artefacts when comparing model- and human-generated text. Done for dailydialog already; it would be great to do for the other datasets too.

principled experimentation with prompting

  • Create a few different prompt options (2 to 5) for each task
  • Qualitatively inspect how models behave with each of these
  • Run generation at scale with all prompt options
  • Run text through text descriptives to look at overall distributional differences (how to do the comparison is an open question, see #13)

DailyDialog: Regenerate dataset with correct lengths + extract new metrics for it

Regenerating DailyDialog

Dailydialog was created with incorrect min and max lengths and will therefore be regenerated, as discussed in the March meeting (#51). As it has not been urgent, focus has been on building the metrics extraction pipeline.

Dailydialog will be regenerated at some point before the classifiers have been finalised. NB: remember to also extract metrics for the dataset again once it has been regenerated for all language models.

flag weird model-generated text

Related to #51, this is just to keep in mind that, at the moment, we are okay keeping examples where models don't fully follow instructions, or start producing gibberish. But later on, we might consider flagging (e.g., using few-shot learning with SetFit) weird examples -- both to quantify how many there are, and compare how the amount of weird stuff changes across datasets, models and decoding parameters.

data cleanup before fitting models

there's some dataset-specific stuff like " < newline > " annotations in WritingPrompts which we may want to standardize and remove before fitting predictive models at scale (this should not affect median distances between human and LLM completions used for prompt selection, but we may also later want to recompute these medians to provide "cleaner" absolute values in the paper)
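A minimal sketch of such normalization, assuming the annotations look like " < newline > " (the exact regex pattern and helper name are illustrative):

import re

def strip_newline_tokens(text: str) -> str:
    # collapse WritingPrompts-style "< newline >" tokens (illustrative pattern) into regular spaces
    text = re.sub(r"\s*<\s*newline\s*>\s*", " ", text)
    # squeeze any resulting double spaces
    return re.sub(r" {2,}", " ", text).strip()

print(strip_newline_tokens("he opened the door < newline > < newline > and stepped outside"))
# he opened the door and stepped outside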

Computing Perplexity outside of TextDescriptives (and Entropy)

Discussed with Yuri today (7/08/24). From the meeting notes in April (23/04/24), I noted down that we were considering computing perplexity using HF's evaluate library, using a baseline model like GPT-2 to serve as an "oracle" for perplexity.

Some notes for future meetings:

Picking a baseline model & general thoughts about interpretation of the metric

The approach entails that the perplexity will change depending on the baseline model: for instance, using a model that has seen much more data than GPT-2 may produce lower perplexity scores than GPT-2 for the same text.

Therefore, the interpretation would not be about whether the text has high or low perplexity in general, but rather whether the models (and humans) have higher or lower perplexity relative to each other.

With that being said, I'm still unsure about the importance of choosing a model (should I just run with GPT-2?).

Plan for Entropy

Planning to just compute entropy by taking the log of perplexity, given that they are directly related. Formula here.

Note that the formula in the link above expresses perplexity as $\text{Perplexity}(X) = 2^{H(X)}$, but the HF readme explains that the perplexity "is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base e." (see link). Therefore I'm taking np.log(perplexity) to compute it rather than np.log2(perplexity).

But which units are these perplexity and entropy scores in? (With base e the entropy would be in nats; with base 2 it would be in bits.)
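A minimal sketch of this computation, assuming the Hugging Face evaluate library's perplexity metric with GPT-2 as the baseline "oracle" model (the example texts are illustrative):

import numpy as np
import evaluate

# perplexity via Hugging Face evaluate, with GPT-2 as the baseline model
perplexity_metric = evaluate.load("perplexity", module_type="metric")
texts = ["the cat sat on the mat.", "colorless green ideas sleep furiously."]  # illustrative

results = perplexity_metric.compute(predictions=texts, model_id="gpt2")
perplexities = np.array(results["perplexities"])

# evaluate defines perplexity with exponent base e, so entropy (in nats) is the natural log
entropies = np.log(perplexities)
print(perplexities, entropies)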

Landing on prompts

Notes mostly for documentation purposes!

The new prompts have been selected and I'm in the midst of generating data across models and temperatures (currently 1, 1.5 and 2). See below for notes on this process.

New Task Prompts

#2.0 prompts
"dailymail_cnn_21": "summarize this in a few sentences: ",
"mrpc_21": "paraphrase this: ",
"stories_21": "write a story based on this: ", 
"stories_22": "continue the story: ", 
"dailydialog_21": "continue the conversation between A and B by writing a single response to the latest speaker. write only a concise response and nothing else: ", 

Principles behind the task prompts

I have tried to keep prompts as short and general as possible. That is, something that a user would write and not something that requires a lot of engineering, and the generations seem OK (with some expected issues, e.g., models producing "Sure! I will paraphrase this!"). The exception is dailydialog, which requires more specific instructions.

For the stories dataset, the prompts lead the models to be repetitive and to start with "Title: ". The BSc thesis project used the prompt "continue the story" as it seemed to produce this behaviour less often. As an addition, I have included the prompt "write a story based on this" to see how the models would fare with it. It was inspired by the instructions on the subreddit that the dataset was based on ("If you see a prompt you like, simply write a short story based on it.").

Removing the System Prompt from Llama2

We initially had a custom system prompt like this:

"llama2_chat": "You are an AI, but you do not deviate from the task prompt and you do not small talk. Never begin your response with 'Sure, here is my response: ' or anything of the like. It is important that you finish without getting cut off."

This custom prompt was created to see if we could get around the model producing completions beginning with e.g., "Sure, I'll paraphrase ...", but since it produces such artefacts regardless, I think it is best to remove it, as the devs behind Llama now recommend removing it by default due to performance issues (see their repository). This would also adhere to our principle of using these models as a regular user would (Hugging Face also treats system prompts as an extra feature rather than a default on their Llama interactive space).

Moving on from here

If you have any comments (@rbroc), please let me know. My current plan is to finish generating data with varying temperatures with the prompts above, and then we can manually inspect data and only regenerate/redo prompts if something looks very very weird.

UPDATE 14/03

We settled on all prompts with the 21 suffix, as mentioned in #51:

#2.0 prompts
"dailymail_cnn_21": "summarize this in a few sentences: ",
"mrpc_21": "paraphrase this: ",
"stories_21": "write a story based on this: ", 
"dailydialog_21": "continue the conversation between A and B by writing a single response to the latest speaker. write only a concise response and nothing else: ", 
