Comments (7)

PhilipMay commented on April 28, 2024

I am also interested in doing what @stefan-it described.
Maybe @lmthang or @clarkkev could also help us here?

from seqio.

gauravmishra commented on April 28, 2024

Sorry for the late response. You don't necessarily need to convert your text data source into TFRecords; instead, you can create a SeqIO Task with a TextLineDataSource [1]. You can also set your custom vocabulary in the output_features dict when registering your SeqIO Task.

Hope this helps; happy to answer any follow-up questions!

[1] https://github.com/google/seqio/blob/main/seqio/dataset_providers.py#L549

stefan-it commented on April 28, 2024

Maybe @adarob could help here 😅

adarob commented on April 28, 2024

That being said, it's important to note that having only a few file shards can result in overfitting if your job restarts frequently. I'd recommend using SaveCheckpointConfig.save_dataset = True if this is the case.
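
The effect of frequent restarts can be illustrated with a toy, plain-Python simulation. This is not T5X or SeqIO code: `run_job` and all of its parameters are hypothetical names made up for this sketch. It assumes that a restart without saved dataset state begins reading the data from the start again, while a restart with saved state resumes where it left off.

```python
from collections import Counter


def run_job(examples, total_steps, steps_per_restart, save_dataset_state):
    """Simulate a training job that is interrupted every `steps_per_restart` steps.

    Returns a Counter of how many times each example was consumed.
    """
    seen = Counter()
    position = 0  # index of the next example to read
    for step in range(total_steps):
        restarted = step > 0 and step % steps_per_restart == 0
        if restarted and not save_dataset_state:
            # Assumption: without saved dataset state, reading starts over.
            position = 0
        seen[examples[position % len(examples)]] += 1
        position += 1
    return seen


examples = [f"line-{i}" for i in range(100)]
without_state = run_job(examples, total_steps=100, steps_per_restart=10,
                        save_dataset_state=False)
with_state = run_job(examples, total_steps=100, steps_per_restart=10,
                     save_dataset_state=True)

# Number of distinct examples the model ever saw in each setting.
print(len(without_state), len(with_state))
```

In this toy setup the job without saved state keeps cycling over the same first few examples, which is the kind of repetition that can lead to overfitting.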

stefan-it commented on April 28, 2024

Hi @gauravmishra and @adarob,

thanks for your hints! I've just written the following task, but I think something is wrong with the preprocessor (mapping):

import functools
import seqio
TaskRegistry = seqio.TaskRegistry

from t5.data import preprocessors

SPM_VOCAB = "/home/stefan/Repositories/seqio/t5-base-german-wikipedia/spiece.model"

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(SPM_VOCAB), add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=seqio.SentencePieceVocabulary(SPM_VOCAB), add_eos=True)
}

TaskRegistry.add(
    "secret_corpus",
    source=seqio.TextLineDataSource({"train": "/home/stefan/Repositories/seqio/train.txt",
                                     "validation": "/home/stefan/Repositories/seqio/validation.txt"}),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": None
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

The train and validation files are plain text files with one sentence per line.

But after using:

dataset = seqio.get_mixture_or_task("secret_corpus").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="validation",
    shuffle=True,
    num_epochs=1,
    #shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

no examples are returned:

for _, ex in zip(range(5), dataset.as_numpy_iterator()):
  print(ex)

Would be awesome if you have another hint 🤗

gauravmishra commented on April 28, 2024

Hi Stefan, I think the issue is with your rekey preprocessor. The key_map maps new_keys to old_keys, and in your case, both inputs and targets are being mapped to None, so there's no data left after this step. You should either point inputs and targets to an existing key from your data source, or remove the re-key preprocessor. Let me know how it goes!
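
The new_key to old_key direction of key_map can be sketched in plain Python. This is a stand-in for illustration only, not the library's implementation: `rekey_example` is a hypothetical helper, the `"text"` key is an assumed name for whatever field the data source actually provides, and treating a `None` old key as an empty value is an assumption about the behavior described above.

```python
def rekey_example(example, key_map):
    """Plain-Python stand-in for a rekey step; key_map is {new_key: old_key}."""
    return {
        # Assumption: an old_key of None yields an empty value for new_key.
        new_key: example[old_key] if old_key else ""
        for new_key, old_key in key_map.items()
    }


# Hypothetical example with an assumed "text" field.
example = {"text": "Ein Satz pro Zeile."}

# Both new keys point at None, so the original text is dropped.
broken = rekey_example(example, {"inputs": None, "targets": None})

# Pointing a new key at an existing old key keeps the text.
fixed = rekey_example(example, {"targets": "text"})

print(broken)
print(fixed)
```

With the first mapping nothing from the original line survives, so later steps have no text to work on; with the second, the text is carried over under the new key.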

stefan-it commented on April 28, 2024

I ended up using TensorFlow Datasets (https://www.tensorflow.org/datasets/add_dataset); writing a custom recipe for it is very easy. I uploaded it to my GCP bucket and was able to train models with the T5 library, so I'm happy now and am closing this issue 🤗
