Hi, this is a great project. Can you provide some sample data for local development te


sample data for local development testing about lamda-rlhf-pytorch HOT 4 OPEN

conceptofmind commented on May 25, 2024

sample data for local development testing

from lamda-rlhf-pytorch.

Comments (4)

conceptofmind commented on May 25, 2024

Hi @htthYjh,

The repository initially consisted of just the pre-training architecture, but I am actively updating it on a daily basis. When completed, the full repository will allow for scalable and distributed pre-training.

I am working on reproducing the OPT and GODEL pre-training corpora. I will be uploading both of them to Hugging Face Datasets. Currently, there is a Hugging Face streaming data loader implemented, which allows you to use the Pile dataset by EleutherAI for pre-training. I will be updating the repository to include a local environment data loader to go along with the streaming one.
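
For local testing, the main appeal of a streaming loader is that you can pull a handful of records without downloading the whole corpus. Here is a minimal sketch of that idea, not the repository's actual loader. The helper names `open_text_stream` and `take_examples` are hypothetical, the `"the_pile"` name and `"text"` key are taken from the configuration defaults shown later in this thread, and whether that dataset name currently resolves on the hub is an assumption.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def open_text_stream(dataset_name: str = "the_pile", split: str = "train"):
    """Open a Hugging Face dataset in streaming mode (hypothetical helper)."""
    # Imported lazily so take_examples() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_name, split=split, streaming=True)


def take_examples(
    stream: Iterable[Dict[str, Any]], n: int, text_key: str = "text"
) -> List[str]:
    """Read only the first `n` records from a (possibly huge) stream."""
    return [record[text_key] for record in islice(stream, n)]


# Stand-in records shaped like text dataset rows (illustrative, not real Pile data).
fake_stream = iter(
    [
        {"text": "hello", "meta": {}},
        {"text": "world", "meta": {}},
        {"text": "!", "meta": {}},
    ]
)
sample = take_examples(fake_stream, 2)
print(sample)  # ['hello', 'world']
```

With the `datasets` package installed and network access, `take_examples(open_text_stream(), 100)` would fetch a small sample the same way.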

Best,

Enrico


htthYjh commented on May 25, 2024

Thank you so much, great work, looking forward to your progress.


conceptofmind commented on May 25, 2024

Hi @htthYjh,

I rebuilt the data loader to work locally: https://github.com/conceptofmind/LaMDA-pytorch/blob/main/lamda_pytorch/build_dataloader.py

A few things you are going to have to take into consideration if you are going to use the provided Pile dataset:

The Pile dataset is over 1TB of data. You will likely need a storage device with up to 2TB of space for everything including the tokenizer and saved models.
If you want to use a different dataset, I provided a configuration file with different fields that can be adjusted. You can find many different text generation datasets on the Hugging Face Datasets hub. I will still be uploading the GODEL conversational dataset as well.
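
Given the storage figure above, it may be worth checking free disk space before starting a download. This is a small standard-library sketch; the helper name `has_free_space` and the decimal definition of a terabyte are my assumptions, not part of the repository.

```python
import shutil

TB = 10**12  # decimal terabyte in bytes (assumption; use 2**40 for TiB)


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Check the current directory against the roughly 2 TB mentioned above.
if not has_free_space(".", 2 * TB):
    print("Less than 2 TB free here; consider a smaller dataset or another disk.")
```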

The configuration for the data loader looks like this:

"""
Configuration for data loader.
"""

use_huggingface: bool = field(
    default = True,
    metadata = {'help': 'Whether to use huggingface datasets'}
)

train_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face training dataset."}
)

eval_dataset_name: Optional[str] = field(
    default="the_pile", 
    metadata={"help": "Path to Hugging Face validation dataset."}
)

choose_train_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face training dataset split."}
)

choose_eval_split: Optional[str] = field(
    default="train", 
    metadata={"help": "Choose Hugging Face validation dataset split."}
)

remove_train_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Train dataset columns to remove."}
)

remove_eval_columns: ClassVar[list[str]] = field(
    default = ['meta'], 
    metadata={"help": "Validation dataset columns to remove."}
)

seed: Optional[int] = field(
    default=42, 
    metadata={"help": "Random seed used for reproducibility."}
)

tokenizer_name: Optional[str] = field(
    default="gpt2",
    metadata={"help": "Tokenizer name."}
)

tokenizer_seq_length: Optional[int] = field(
    default=512, 
    metadata={"help": "Sequence lengths used for tokenizing examples."}
)

select_input_string: Optional[str] = field(
    default="text", 
    metadata={"help": "Select the key to use as the input string column."}
)

batch_size: Optional[int] = field(
    default=16, 
    metadata={"help": "Batch size for training and validation."}
)

save_to_path: Optional[str] = field(
    default="''", 
    metadata={"help": "Save the dataset to local disk."}
)
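
The fields above look like `dataclasses.field` declarations, so one way to get a small local test setup is to override a few defaults. The sketch below assumes they live in a dataclass; the class name `DataLoaderConfig` and the dataset name `"my_small_dataset"` are hypothetical placeholders, and only a subset of the fields is reproduced.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Optional


@dataclass
class DataLoaderConfig:  # hypothetical name; the class header is not shown above
    train_dataset_name: Optional[str] = field(
        default="the_pile",
        metadata={"help": "Path to Hugging Face training dataset."},
    )
    choose_train_split: Optional[str] = field(
        default="train",
        metadata={"help": "Choose Hugging Face training dataset split."},
    )
    tokenizer_seq_length: Optional[int] = field(
        default=512,
        metadata={"help": "Sequence lengths used for tokenizing examples."},
    )
    batch_size: Optional[int] = field(
        default=16,
        metadata={"help": "Batch size for training and validation."},
    )


# Copy the defaults, swapping in smaller values for a local smoke test.
small_cfg = replace(
    DataLoaderConfig(),
    train_dataset_name="my_small_dataset",  # placeholder, not a real dataset
    tokenizer_seq_length=128,
    batch_size=2,
)

# The help strings stay available through the field metadata.
for f in fields(small_cfg):
    print(f"{f.name} = {getattr(small_cfg, f.name)!r}  # {f.metadata['help']}")
```

Using `replace` leaves the default configuration untouched, so the full-size settings remain available for an actual pre-training run.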

Let me know if this solves your issue.

Best,

Enrico


htthYjh commented on May 25, 2024

OK, let me check. Thank you very much.

