Hi 🤗 I would like to pre-train a T5 Base model with [...]

Using seqio for T5X Dataset Generation about seqio HOT 7 CLOSED

google commented on April 28, 2024 1

Using seqio for T5X Dataset Generation

from seqio.

Comments (7)

PhilipMay commented on April 28, 2024 1

I am also interested in doing what @stefan-it described.
Maybe @lmthang or @clarkkev could also help us here?

gauravmishra commented on April 28, 2024 1

Sorry for the late response. You don't necessarily need to convert your text data source into TFRecords; instead, you can create a SeqIO Task with a TextLineDataSource [1]. You can also set your custom vocabulary in the output_features dict when registering your SeqIO Task.

Hope this helps; happy to answer any follow up questions!

[1] https://github.com/google/seqio/blob/main/seqio/dataset_providers.py#L549

stefan-it commented on April 28, 2024

Maybe @adarob could help here 😅

adarob commented on April 28, 2024

That being said, it's important to note that having only a few file shards can result in overfitting if your job restarts frequently. I'd recommend using SaveCheckpointConfig.save_dataset = True if this is the case.
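
To make the restart concern concrete, here is a toy model in plain Python (not T5X or SeqIO code). It assumes a single shard of 10 examples and a job that is restarted every 3 steps; the function name `examples_seen` and all numbers are made up for illustration. Without restored reader state, every restart begins at the top of the shard, so the same few examples are read again and again. Restoring the position, which is what saving the dataset state is meant to achieve, spreads the reads over the whole shard.

```python
def examples_seen(num_examples, steps_per_run, num_runs, resume):
    """Count how often each example index is read across restarted runs.

    Toy model: one shard, read sequentially, no shuffling.
    """
    counts = [0] * num_examples
    position = 0
    for _ in range(num_runs):
        if not resume:
            # Reader state was not checkpointed: start from the top again.
            position = 0
        for _ in range(steps_per_run):
            counts[position % num_examples] += 1
            position += 1
    return counts


# Five runs of three steps each over a ten-example shard.
without_state = examples_seen(10, 3, 5, resume=False)
with_state = examples_seen(10, 3, 5, resume=True)

print("no saved dataset state:", without_state)
print("saved dataset state:   ", with_state)
```

In the first case only the first three examples are ever trained on, which is the overfitting risk described above. Real input pipelines shuffle and interleave shards, so the effect is less extreme, but it grows as the number of shards shrinks.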

stefan-it commented on April 28, 2024

Hi @gauravmishra and @adarob,

thanks for your hints! I've just written the following task, but I think something is wrong with the preprocessor (mapping):

import functools
import seqio
TaskRegistry = seqio.TaskRegistry

from t5.data import preprocessors

SPM_VOCAB = "/home/stefan/Repositories/seqio/t5-base-german-wikipedia/spiece.model"

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(SPM_VOCAB), add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(SPM_VOCAB), add_eos=True)
}

TaskRegistry.add(
    "secret_corpus",
    source=seqio.TextLineDataSource({"train": "/home/stefan/Repositories/seqio/train.txt",
                                     "validation": "/home/stefan/Repositories/seqio/validation.txt"}),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": None
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

The train and validation files are plain text files with one sentence per line.

But after using:

dataset = seqio.get_mixture_or_task("secret_corpus").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="validation",
    shuffle=True,
    num_epochs=1,
    #shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

no examples are returned:

for _, ex in zip(range(5), dataset.as_numpy_iterator()):
  print(ex)

Would be awesome if you have another hint 🤗

gauravmishra commented on April 28, 2024

Hi Stefan, I think the issue is with your rekey preprocessor. The key_map maps new keys to old keys, and in your case both inputs and targets are mapped to None, so there's no data left after this step. You should either point inputs and targets to an existing key from your data source, or remove the rekey preprocessor. Let me know how it goes!
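
The mapping direction can be illustrated with a small plain-Python stand-in. This is not the t5 implementation: `rekey_sketch` is a hypothetical helper that mimics the new-key-to-old-key behaviour described above, with a None entry producing an empty string, and the `"text"` key is an assumed name for wherever the raw line ends up in the example dict.

```python
def rekey_sketch(example, key_map=None):
    """Simplified stand-in for a rekey step: key_map is {new_key: old_key}.

    A falsy old_key (None or "") yields an empty string for that feature.
    """
    if not key_map:
        return example
    return {
        new_key: example[old_key] if old_key else ""
        for new_key, old_key in key_map.items()
    }


# Hypothetical example dict; "text" is an assumed key name.
example = {"text": "Ein Satz pro Zeile."}

# Both features mapped to None: every feature ends up empty.
broken = rekey_sketch(example, {"inputs": None, "targets": None})

# Targets point at an existing key: the text survives the step.
fixed = rekey_sketch(example, {"inputs": None, "targets": "text"})

print(broken)
print(fixed)
```

With both entries set to None, every example carries only empty features, which fits the observation that no examples come out of the pipeline. Pointing targets at a key that actually exists keeps the text available for tokenization and span corruption.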

stefan-it commented on April 28, 2024

I ended up using TensorFlow Datasets (https://www.tensorflow.org/datasets/add_dataset); writing a custom recipe for it is very easy. I uploaded the dataset to my GCP bucket and was able to train models with the T5 library, so I'm happy now and closing this issue 🤗
