
data-toolbox's Issues

Generate synthetic negative data

To test the implementation of the CRINGE loss in our training code, we need some examples of what the model should not generate.

I have some filters in the data-toolbox that drop training examples based on certain criteria (e.g., messages that are too similar to each other, indicating looping, or messages that are too short on average). If we add a flag to generate using only these dropped examples, we can build a training set of negative examples to test with.
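A minimal sketch of what that flag could look like, assuming a hypothetical passes_filters predicate standing in for the toolbox's existing filter logic:

def generate_episodes(episodes, passes_filters, negative_only=False):
    # Normal runs yield the episodes that pass the filters; with
    # negative_only=True, yield exactly the ones the filters would drop.
    for episode in episodes:
        if passes_filters(episode) != negative_only:
            yield episode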

multilingual

Hi guys.

If I translate the datasets, will they work with Pygmalion? I want to translate the datasets into Portuguese.

Implement data handling for RP forum dumps

Summary

The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.

Source file formats

Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.

Implementation details

An EnjinDataset class should be implemented under toolbox/datasets/enjim.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
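A rough sketch of that mapping (the table and column names here are guesses for illustration; the actual schema produced by encuum will likely differ):

import sqlite3

def yield_thread_episodes(db_path):
    # Assumed schema: a threads table and a posts table keyed by thread_id.
    con = sqlite3.connect(db_path)
    for thread_id, title in con.execute("SELECT id, title FROM threads"):
        posts = con.execute(
            "SELECT author, body FROM posts WHERE thread_id = ? ORDER BY id",
            (thread_id,),
        ).fetchall()
        # One forum thread becomes one Episode; each post becomes a Turn.
        yield {
            "title": title,
            "turns": [{"speaker": author, "utterance": body}
                      for author, body in posts],
        }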

An EnjinPDM should then be implemented under toolbox/modules/enjim_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.

A lot of data processing will then need to take place. Off the top of my head:

  • BBCode will need to be converted to its nearest Markdown representation, or dropped entirely where there's no sensible equivalent (e.g. font colors, images); see the sketch after this list
  • Irrelevant threads and posts need to be pruned (e.g. non-roleplay threads, announcements and so on)
  • Overly short posts will need to be carefully pruned (these are usually OOC talk)

...and maybe more. This will probably be the trickiest part. Feel free to reach out; we can discuss these points here or in the Matrix.
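As a starting point for the BBCode conversion, something along these lines could cover the easy tags (a real parser, e.g. the bbcode package on PyPI, would handle nesting better than regexes):

import re

# Tags with a direct Markdown equivalent.
SIMPLE_TAGS = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
]
# Tags with no equivalent: strip the tags themselves, keep any wrapped text.
DROP_TAGS = re.compile(r"\[/?(?:color|size|font|center|img)[^\]]*\]", re.I)

def bbcode_to_markdown(text):
    for pattern, replacement in SIMPLE_TAGS:
        text = pattern.sub(replacement, text)
    return DROP_TAGS.sub("", text)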

Investigate and possibly include some ParlAI data

Theoretically, all the data used to train Meta's BlenderBot 3 is available in the ParlAI library (repository). In practice, a significant number of the "teachers" and "tasks" there are broken, so the configurations they've released, which are supposed to replicate the BB3 training data, don't actually work.

Still, a significant number of teachers can be made to work with small changes, and others work out of the box. We already have plenty of open-ended conversational data, so I'd like to see if we can find some good data for external knowledge grounding (so we can use it for long-term memories/internet search/world info/etc.).
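For triaging which teachers still load, each one can be spot-checked individually before committing to a full data build. A sketch using ParlAI's display_data script from Python (Wizard of Wikipedia being one of the knowledge-grounded tasks; the exact task name may need adjusting):

from parlai.scripts.display_data import DisplayData

# Print a handful of examples from a single teacher; a broken task
# fails fast here instead of derailing a larger data-building run.
DisplayData.main(task="wizard_of_wikipedia", num_examples=5)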

Improve training data

I haven't been able to make any significant improvements to the models by twiddling around with hyperparameters and training objectives since around experiment 2, so I'm going to shift my focus to improving the training data instead.

Some relevant points to consider:

  • I'd like to improve handling of example dialogue.
    • As of now this is being done in a really stupid way: during training the data is mostly discarded, and at inference time it's handled as regular chat history. Obviously not ideal.
  • I'd like the model to stick closer to the example dialogues, even if the user responds in a completely different format.
    • As of now, if you give a character a short greeting, it'll get stuck responding with short messages.
    • If you add some example dialogue where the character is very descriptive and detailed, it'll stay that way for the first few messages and then degrade to short responses again as the example dialogue is pushed out of the chat history.
    • Ideally, I'd like characters to follow the format in their example dialogue more closely even after a lot of conversation.
  • It might be worth looking into new data, even if it's non-conversational.
    • I'm thinking this might help the model generate more creative and interesting responses, given that dialogue datasets are usually boring as hell (hey. how have you been? good. thanks. ok nice talking to you)
  • It would be nice to add some special tokens so we can inject external knowledge for the model to ground its responses on (see the sketch after this list).
    • That way, Kobold users can make use of Author's Notes and World Info, and we could make use of internet search/long-term memory stores/whatever else on the official service.
    • This might imply looking up new datasets to add (the BlenderBot 3 paper might be useful here thanks to its retrieval and grounding modules + their relevant datasets) or generating synthetic data, so this is more of a medium/long-term goal.
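On the special-tokens point, the standard way to do it with a Hugging Face tokenizer looks roughly like this (the <|knowledge|> token name and the gpt2 checkpoint are placeholders, not decisions):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register a sentinel token for external knowledge, then grow the
# embedding matrix so the model gets a (randomly initialized) row for it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|knowledge|>"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "<|knowledge|>Gina is afraid of spiders.<|knowledge|>Gina: What was that?!"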

Implement VN + VNDB data handling

Summary

The scope of this task is to implement support for Visual Novel data in the data-toolbox, augmenting it with external information sourced from VNDB.

Source file formats

Each VN will consist of one or two files. Assuming {title} as the VN's title, there will be a mandatory {title}.txt file which contains the actual script text. Here's a made-up example:

{name of character}: {text said by character}
{name of character}: {text said by character}
{name of character}: {text said by character}
some narration text
================================================================================
{name of character}: {text said by character}
{name of character}: {text said by character}
narration text
{name of character}: {text said by character}

The sequence of === characters separates episodes from each other.

The VN might optionally also have a {title}.chars.json file, where each key is the name of a character seen in the .txt file and each value is that character's VNDB character ID. An example:

{
  "first character's name": "c67681",
  "second character's name": "c52103",
  "third character's name": "c11620"
}

Implementation details

A VisualNovelDataset class should be implemented under toolbox/datasets/visual_novels.py, following the general format of the other datasets. It should yield individual episodes (i.e. the sequences of dialog separated by the === lines), accompanied by the relevant characters if a matching .chars.json is found. Feel free to structure this however you think is best, but I recommend basing the implementation off any of the other datasets in that folder.
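For reference, the episode splitting could look something like this (a sketch only; the "name: text" heuristic will mislabel narration lines that happen to contain a colon):

def parse_script(path):
    episodes, turns = [], []
    with open(path, encoding="utf-8") as script:
        for line in script:
            line = line.rstrip("\n")
            if line.startswith("===="):
                # A separator line closes the current episode.
                if turns:
                    episodes.append(turns)
                turns = []
            elif line.strip():
                speaker, sep, text = line.partition(": ")
                if sep:
                    turns.append({"speaker": speaker, "utterance": text})
                else:
                    # No "name:" prefix, so treat the line as narration.
                    turns.append({"speaker": None, "utterance": line})
    if turns:
        episodes.append(turns)
    return episodes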

A VisualNovelPDM should then be implemented under toolbox/modules/visual_novel_pdm.py. Again, basing it off of an existing PDM is likely a good call - I'd suggest looking at LightPDM. The catch here is that the generated Episodes should contain persona data whenever possible. This should be done by using the VNDB character IDs specified in the matching .chars.json file to look up character information in the VNDB databases. These are made available for download here.

What specific character data to include is still undecided; we can discuss this here or in the Matrix.

Publish/Share the share_gpt.json file

Hi, based on your toolbox it seems Pygmalion has the ShareGPT dataset. Since the GET operation for this is now closed, can you share the data you gathered for other model-generation research purposes? Thanks!
