
data-toolbox's Issues

Generate synthetic negative data

To test the implementation of the CRINGE loss in our training code, we need some examples of what the model should not generate.

I have some filters in the data-toolbox that drop training examples based on certain criteria (e.g., messages that are too similar to each other, indicating looping, or messages that are too short on average). If we add a flag to generate using only these dropped examples, we can build a training set of negative examples to test with.
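A minimal sketch of what that flag could look like, assuming a hypothetical passes_filters predicate standing in for the toolbox's existing filter logic:

def generate_episodes(episodes, passes_filters, negative_only=False):
    # Normal runs yield the episodes that pass the filters; with
    # negative_only=True, yield exactly the ones the filters would drop.
    for episode in episodes:
        if passes_filters(episode) != negative_only:
            yield episode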

multilingual

Hi guys.

If I translate the datasets, will they work with Pygmalion? I want to translate the datasets into Portuguese.

Implement data handling for RP forum dumps

Summary

The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.

Source file formats

Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.

Implementation details

An EnjinDataset class should be implemented under toolbox/datasets/enjim.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
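A rough sketch of that mapping (the table and column names here are guesses for illustration; the actual schema produced by encuum will likely differ):

import sqlite3

def yield_thread_episodes(db_path):
    # Assumed schema: a threads table and a posts table keyed by thread_id.
    con = sqlite3.connect(db_path)
    for thread_id, title in con.execute("SELECT id, title FROM threads"):
        posts = con.execute(
            "SELECT author, body FROM posts WHERE thread_id = ? ORDER BY id",
            (thread_id,),
        ).fetchall()
        # One forum thread becomes one Episode; each post becomes a Turn.
        yield {
            "title": title,
            "turns": [{"speaker": author, "utterance": body}
                      for author, body in posts],
        }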

An EnjinPDM should then be implemented under toolbox/modules/enjim_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.

A lot of data processing will then need to take place. Off the top of my head:

  • BBCode will need to be converted to its nearest Markdown representation, or dropped entirely where there's no sensible equivalent (e.g. font colors, images); see the sketch after this list
  • Irrelevant threads and posts need to be pruned (e.g. non-roleplay threads, announcements and so on)
  • Overly short posts will need to be carefully pruned (these are usually OOC talk)

...and maybe more. This will probably be the trickiest part. Feel free to reach out; we can discuss these points here or in the Matrix.
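As a starting point for the BBCode conversion, something along these lines could cover the easy tags (a real parser, e.g. the bbcode package on PyPI, would handle nesting better than regexes):

import re

# Tags with a direct Markdown equivalent.
SIMPLE_TAGS = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
]
# Tags with no equivalent: strip the tags themselves, keep any wrapped text.
DROP_TAGS = re.compile(r"\[/?(?:color|size|font|center|img)[^\]]*\]", re.I)

def bbcode_to_markdown(text):
    for pattern, replacement in SIMPLE_TAGS:
        text = pattern.sub(replacement, text)
    return DROP_TAGS.sub("", text)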

Investigate and possibly include some ParlAI data

Theoretically, all the data used to train Meta's BlenderBot 3 is available in the ParlAI library (repository). In practice, a significant number of the "teachers" and "tasks" there are broken, so the configurations they've released, which are supposed to replicate the BB3 training data, don't actually work.

Still, a significant number of teachers can be made to work with small changes, and others work out of the box. We already have plenty of open-ended conversational data, so I'd like to see if we can find some good data for external knowledge grounding (so we can use it for long-term memories/internet search/world info/etc.).
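For triaging which teachers still load, each one can be spot-checked individually before committing to a full data build. A sketch using ParlAI's display_data script from Python (Wizard of Wikipedia being one of the knowledge-grounded tasks; the exact task name may need adjusting):

from parlai.scripts.display_data import DisplayData

# Print a handful of examples from a single teacher; a broken task
# fails fast here instead of derailing a larger data-building run.
DisplayData.main(task="wizard_of_wikipedia", num_examples=5)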

Improve training data

I haven't been able to make any significant improvements to the models by twiddling around with hyperparameters and training objectives since around experiment 2, so I'm going to shift my focus to improving the training data instead.

Some relevant points to consider:

  • I'd like to improve handling of example dialogue.
    • As of now this is being done in a really stupid way: during training the data is mostly discarded, and at inference time it's handled as regular chat history. Obviously not ideal.
  • I'd like the model to stick closer to the example dialogues, even if the user responds in a completely different format.
    • As of now, if you give a character a short greeting, it'll get stuck responding with short messages.
    • If you add some example dialogue where the character is very descriptive and detailed, it'll stay that way for the first few messages and then degrade to short responses again as the example dialogue is pushed out of the chat history.
    • Ideally, I'd like characters to follow the format in their example dialogue more closely even after a lot of conversation.
  • It might be worth looking into new data, even if it's non-conversational.
    • I'm thinking this might help the model generate more creative and interesting responses, given that dialogue datasets are usually boring as hell (hey. how have you been? good. thanks. ok nice talking to you)
  • It would be nice to add some special tokens so we can inject external knowledge for the model to ground its responses on (see the sketch after this list).
    • That way, Kobold users can make use of Author's Notes and World Info, and we could make use of internet search/long-term memory stores/whatever else on the official service.
    • This might imply looking up new datasets to add (the BlenderBot 3 paper might be useful here thanks to its retrieval and grounding modules + their relevant datasets) or generating synthetic data, so this is more of a medium/long-term goal.
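On the special-tokens point, the standard way to do it with a Hugging Face tokenizer looks roughly like this (the <|knowledge|> token name and the gpt2 checkpoint are placeholders, not decisions):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register a sentinel token for external knowledge, then grow the
# embedding matrix so the model gets a (randomly initialized) row for it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|knowledge|>"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "<|knowledge|>Gina is afraid of spiders.<|knowledge|>Gina: What was that?!"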

Implement VN + VNDB data handling

Summary

The scope of this task is to implement support for Visual Novel data in the data-toolbox, augmenting it with external information sourced from VNDB.

Source file formats

Each VN will consist of one or two files. Assuming {title} as the VN's title, there will be a mandatory {title}.txt file which contains the actual script text. Here's a made-up example:

{name of character}: {text said by character}
{name of character}: {text said by character}
{name of character}: {text said by character}
some narration text
================================================================================
{name of character}: {text said by character}
{name of character}: {text said by character}
narration text
{name of character}: {text said by character}

The sequence of === characters separates episodes from each other.

The VN might optionally also have a {title}.chars.json file, where each key is the name of a character seen in the .txt file and each value is that character's VNDB character ID. An example:

{
  "first character's name": "c67681",
  "second character's name": "c52103",
  "third character's name": "c11620"
}

Implementation details

A VisualNovelDataset class should be implemented under toolbox/datasets/visual_novels.py, following the general format of the other datasets. It should yield individual episodes (i.e. the sequences of dialog separated by the === lines), accompanied by the relevant characters if a matching .chars.json is found. Feel free to structure this however you think is best, but I recommend basing the implementation off any of the other datasets in that folder.
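For reference, the episode splitting could look something like this (a sketch only; the "name: text" heuristic will mislabel narration lines that happen to contain a colon):

def parse_script(path):
    episodes, turns = [], []
    with open(path, encoding="utf-8") as script:
        for line in script:
            line = line.rstrip("\n")
            if line.startswith("===="):
                # A separator line closes the current episode.
                if turns:
                    episodes.append(turns)
                turns = []
            elif line.strip():
                speaker, sep, text = line.partition(": ")
                if sep:
                    turns.append({"speaker": speaker, "utterance": text})
                else:
                    # No "name:" prefix, so treat the line as narration.
                    turns.append({"speaker": None, "utterance": line})
    if turns:
        episodes.append(turns)
    return episodes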

A VisualNovelPDM should then be implemented under toolbox/modules/visual_novel_pdm.py. Again, basing it off of an existing PDM is likely a good call - I'd suggest looking at LightPDM. The catch here is that the generated Episodes should contain persona data whenever possible. This should be done by using the VNDB character IDs specified in the matching .chars.json file to look up character information in the VNDB databases. These are made available for download here.

What specific character data to include is still undecided; we can discuss this here or in the Matrix.

Publish/Share the share_gpt.json file

Hi, based on your toolbox it seems Pygmalion has the ShareGPT dataset. Since the GET operation for this is now closed, can you share the data you gathered for other model-generation research purposes? Thanks!
