pygmalionai / data-toolbox

Our data munging code.
License: GNU Affero General Public License v3.0
To test the implementation of the CRINGE loss in our training code, we need some examples of what the model should not generate.
I have some filters in the data-toolbox that drop training examples based on certain criteria (e.g. messages that are too similar to each other, indicating looping, or messages that are too short on average). If we add a flag to generate using only these dropped examples, we can build a training set of negative examples to test against.
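The flag described above could be a simple inversion of the existing filter logic. Here is a minimal sketch, assuming the real toolbox exposes filters as callables over examples (the names `build_negative_set`, `too_short`, and `looping` are illustrative, not the actual toolbox API):

```python
# Sketch of a hypothetical "negatives only" mode: instead of discarding
# examples that trip a filter, keep *only* those, so the dropped data can
# serve as negative examples for testing the CRINGE loss.

def build_negative_set(examples, filters, negatives_only=True):
    for example in examples:
        dropped = any(f(example) for f in filters)
        # A normal run keeps clean examples; the flag inverts the selection.
        if dropped == negatives_only:
            yield example

# Toy filters mirroring the criteria mentioned above.
def too_short(example):
    turns = example["turns"]
    return sum(len(t) for t in turns) / len(turns) < 5

def looping(example):
    turns = example["turns"]
    return len(set(turns)) < len(turns)  # repeated messages suggest looping

data = [
    {"turns": ["hi", "ok"]},                      # too short on average
    {"turns": ["same message", "same message"]},  # looping
    {"turns": ["a longer reply", "another detailed reply"]},
]
negatives = list(build_negative_set(data, [too_short, looping]))
```

Running the same function with `negatives_only=False` reproduces the current behaviour, so a single flag covers both modes.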
Hi guys.
If I translate the datasets, will they still work with Pygmalion? I want to translate the datasets into Portuguese.
Paper, code, summary in the form of a Twitter thread. Claims to beat supervised fine-tuning (what we're currently doing) and RLHF (what we're not doing due to data and compute constraints at the moment).
If we're to faithfully follow the paper, we'll need multiple generations for a given prompt. RankGen + the existing models can help us generate synthetic data for this.
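The per-prompt data might end up shaped something like the sketch below: several candidate generations per prompt, each with a score (e.g. from RankGen) that induces a ranking. This is an assumption about the schema, not the toolbox's actual format:

```python
from dataclasses import dataclass

@dataclass
class RankedCandidates:
    prompt: str
    candidates: list  # (generation_text, score) pairs

    def ordered(self):
        # Highest-scoring candidate first, e.g. ranked by RankGen score.
        return sorted(self.candidates, key=lambda c: c[1], reverse=True)

sample = RankedCandidates(
    prompt="Hello there!",
    candidates=[
        ("Hi! How can I help?", 0.82),
        ("hi", 0.31),
        ("Hello hello hello", 0.12),
    ],
)
```

The ranking is what the paper's objective consumes: the top candidate plays the positive role and the rest act as negatives.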
Scope of this task is to implement support for Enjin forum dumps in the data-toolbox.
Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
An EnjinDataset class should be implemented under toolbox/datasets/enjim.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
An EnjinPDM should then be implemented under toolbox/modules/enjim_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.
A lot of data processing will then need to take place. Off the top of my head:
...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.
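The thread-to-Episode mapping could start out along these lines. This is a sketch under assumptions: the encuum SQLite dump is assumed to have `threads` and `posts` tables with the columns used below, which may not match the real schema:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Turn:
    author: str
    text: str

@dataclass
class Episode:
    title: str
    turns: list

def enjin_episodes(db_path):
    """Yield one Episode per forum thread, with posts as Turns.

    Table and column names here are assumptions about the encuum dump.
    """
    con = sqlite3.connect(db_path)
    try:
        for thread_id, title in con.execute("SELECT id, title FROM threads"):
            rows = con.execute(
                "SELECT author, body FROM posts"
                " WHERE thread_id = ? ORDER BY posted_at",
                (thread_id,),
            )
            yield Episode(title=title, turns=[Turn(a, b) for a, b in rows])
    finally:
        con.close()
```

Generating lazily keeps memory flat even for large forum dumps, which matches how the other datasets in the toolbox yield their examples.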
Theoretically, all the data used to train Meta's BlenderBot 3 is available in the ParlAI library (repository). In practice, a significant number of the "teachers" and "tasks" there are broken, so the configurations they've released, which are supposed to replicate the BB3 training data, don't actually work.
Still, a significant number of teachers can be made to work with small changes, and others work out of the box. We already have plenty of open-ended conversational data, so I'd like to see if we can find some good data for doing external knowledge grounding (so we can use it for long-term memories/internet search/world info/etc.).
If the thread has the following messages:

m1
m2
m3
m4

the current code will yield two threads:

m1

and

m2, m3, m4
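A minimal regression check for the intended behaviour could look like this. `split_thread` is a hypothetical stand-in for whatever function does the splitting in the real code; the sketch simply asserts that one thread yields one episode containing all of its messages:

```python
def split_thread(messages):
    # Intended behaviour: one thread in, one episode out, with every
    # message preserved in order (no spurious split after the first one).
    return [list(messages)] if messages else []

episodes = split_thread(["m1", "m2", "m3", "m4"])
```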
I haven't been able to make any significant improvements on the models by twiddling around with hyperparameters and training objectives ever since around experiment 2, so I'm going to shift into focusing on improving the training data instead.
Some relevant points to consider:
Scope of this task is to implement support for Visual Novel data in the data-toolbox, augmenting it with external information sourced from VNDB.
Each VN will consist of one or two files. Assuming {title} as the VN's title, there will be a mandatory {title}.txt file which contains the actual script text. Here's a made-up example:
{name of character}: {text said by character}
{name of character}: {text said by character}
{name of character}: {text said by character}
some narration text
================================================================================
{name of character}: {text said by character}
{name of character}: {text said by character}
narration text
{name of character}: {text said by character}
The sequence of === characters separates episodes from each other.
The VN might optionally also have a {title}.chars.json file, where each key is the name of a character seen in the .txt file, and each value is their VNDB character ID. An example:
{
"name of character": "c67681",
"name of character": "c52103",
"name of character": "c11620"
}
A VisualNovelDataset class should be implemented under toolbox/datasets/visual_novels.py, following the general format of the other datasets. It should yield individual episodes (a.k.a. sequences of dialog that have been separated by the === lines), accompanied by the relevant characters if a matching .chars.json is found. Feel free to structure this how you feel is best, but I recommend basing the implementation off any of the other datasets in that folder.
A VisualNovelPDM should then be implemented under toolbox/modules/visual_novel_pdm.py. Again, basing it off of an existing PDM is likely a good call - I'd suggest looking at LightPDM. The catch here is that the generated Episodes should contain persona data whenever possible. This should be done by using the VNDB character IDs specified in the matching .chars.json file to look up character information in the VNDB databases. These are made available for download here.
What specific character data to include is still undecided; we can discuss this here or in the Matrix.
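The persona-attachment step might reduce to a small lookup once the VNDB dump is loaded. In this sketch, `vndb_descriptions` is a hypothetical stand-in for whatever structure the downloaded VNDB database provides, and the persona string format is only an illustrative choice:

```python
def build_personas(chars, vndb_descriptions):
    """Build persona strings for an episode's characters.

    chars: {name: vndb_id} as parsed from {title}.chars.json.
    vndb_descriptions: {vndb_id: description}, assumed to come from the
    downloaded VNDB dump (the real lookup may differ).
    """
    personas = {}
    for name, char_id in chars.items():
        description = vndb_descriptions.get(char_id)
        if description:  # skip characters with no entry in the dump
            personas[name] = f"{name}'s Persona: {description}"
    return personas

chars = {"Alice": "c67681", "Bob": "c99999"}
dump = {"c67681": "A cheerful student."}
personas = build_personas(chars, dump)
```

Characters missing from the dump are silently skipped here, which matches the "whenever possible" wording above; logging those misses would make data gaps easier to audit.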
Hi, based on your toolbox it seems Pygmalion has the ShareGPT dataset. Since the get operation for this is now closed, can you share the data you gathered, for other model generation research purposes? Thanks!