
seqio's Issues

Add method to directly add tasks/mixtures.

Currently, MixtureRegistry and TaskRegistry have an `add` method that takes the arguments used to construct a Mixture / Task. This does not play well with subclassing Mixture or Task with a class that takes different constructor arguments and then adding instances to the registries. A concrete example of a Task subclass: turning a Mixture back into a Task so that it looks "atomic" when added into another Mixture. Would it be possible to have a method that adds an already-constructed object directly to the registries, without having to pass the arguments that the constructor uses?
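
A minimal sketch of the kind of registration being asked for, assuming a registry method that accepts an already-constructed provider (recent seqio versions expose TaskRegistry.add_provider; if yours does not, this is exactly the API being requested). The dataset function, vocabulary, and task name are placeholders:

import seqio
import tensorflow as tf


def dummy_dataset_fn(split, shuffle_files, seed=None):
  del split, shuffle_files, seed
  return tf.data.Dataset.from_tensor_slices({"targets": ["hello", "world"]})


# Build the Task (or a Task subclass with its own constructor) directly...
my_task = seqio.Task(
    "my_custom_task",
    source=seqio.FunctionDataSource(dummy_dataset_fn, splits=["train"]),
    preprocessors=[seqio.preprocessors.tokenize],
    output_features={
        "targets": seqio.Feature(vocabulary=seqio.PassThroughVocabulary(size=128))
    },
)

# ...and register the instance itself instead of its constructor arguments.
seqio.TaskRegistry.add_provider("my_custom_task", my_task)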

`tokenize_and_append_eos` needs another required input (`output_features`)

tokenize_and_append_eos needs another required input (output_features). How can I use this function as a preprocessor, and how do I pass the output features?

This is how I tried to use it:

preprocessors=[
    functools.partial(
        t5.data.preprocessors.parse_tsv,
        field_names=["input_text", "target_text"]),
    seqio.preprocessors.tokenize_and_append_eos,
],
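
For reference, a hedged sketch of how this usually works: when a preprocessor's signature declares an output_features (or sequence_length) argument, the seqio Task fills it in automatically from the task's own output_features, so no extra functools.partial binding is needed once the preprocessor is part of a registered Task; only standalone calls need the argument passed explicitly. Paths, the task name, and the vocabulary are placeholders:

import functools

import seqio
import t5.data

vocab = seqio.SentencePieceVocabulary(
    "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model")

seqio.TaskRegistry.add(
    "my_tsv_task",
    source=seqio.TextLineDataSource(
        split_to_filepattern={"train": "/path/to/train.tsv"}),
    preprocessors=[
        functools.partial(
            t5.data.preprocessors.parse_tsv,
            field_names=["input_text", "target_text"]),
        # output_features is injected by the Task; no partial binding needed here.
        seqio.preprocessors.tokenize_and_append_eos,
    ],
    output_features={
        "input_text": seqio.Feature(vocabulary=vocab),
        "target_text": seqio.Feature(vocabulary=vocab),
    },
)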

Possible ByteVocabulary Bug

464   def _decode_tf(self, ids):
465     """Decode in TensorFlow.
466 
467     Args:
468       ids: a 1d tf.Tensor with dtype tf.int32
469     Returns:
470       a tf Scalar with dtype tf.string
471     """
472     return tf.py_function(func=self.decode, inp=[ids], Tout=tf.string)

The `ids` param passed to _decode_tf above is a 1-d tf.Tensor, and on line 472 it is wrapped in a list. But when self.decode is called with [ids], it throws the error below from the list comprehension on line 100 (the method is shown further down):

File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/seqio/vocabularies.py", line 100, in
decode
for i in clean_ids

File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/seqio/vocabularies.py", line 100, in

for i in clean_ids

File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/tensorflow/python/framework/ops.py",
line 1007, in bool
return bool(self._numpy())

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

 92   def decode(self, ids: Iterable[int]):
 93     """Detokenizes int32 iterable to a string, up through first EOS."""
 94     clean_ids = list(ids)
 95 
 96     if self.unk_id is not None:
 97       vocab_size = self._base_vocab_size
 98       clean_ids = [
 99           self.unk_id if i >= vocab_size else i
100           for i in clean_ids
101       ]
102 
103     if self.eos_id is not None and self.eos_id in clean_ids:
104       clean_ids = clean_ids[:clean_ids.index(self.eos_id) + 1]
105 
106     return self._decode(clean_ids)
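
A workaround sketch (an assumption, not the library's fix): materialize the ids as plain Python ints before calling the pure-Python decode(), so the comparisons and the `in` check inside decode() never see numpy arrays.

import tensorflow as tf
import seqio

vocab = seqio.ByteVocabulary()

def decode_tf_fixed(ids):
  def _decode_py(ids_tensor):
    # .numpy().tolist() yields Python ints, avoiding ambiguous array truth values.
    return vocab.decode(ids_tensor.numpy().tolist())
  return tf.py_function(func=_decode_py, inp=[ids], Tout=tf.string)

print(decode_tf_fixed(tf.constant([107, 108, 1], dtype=tf.int32)))  # b'hi'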

HuggingFace Tokenizers compatibility

Hi, I have been trying to get SeqIO to work with HuggingFace's tokenizers for a bit but have been running into trouble with non-t5 based tokenizers. Specifically, it seems that, because they are not sentencepiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's SentencePieceVocabulary as they only have the vocab files:

{
  'vocab_file': 'vocab.json',
  'merges_file': 'merges.txt',
  'tokenizer_file': 'tokenizer.json'
}

Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?
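
There is no built-in bridge that I know of, so below is a sketch of the "own vocab class" route, assuming the abstract seqio.Vocabulary interface (_encode, _decode, _encode_tf, _decode_tf, eos_id, unk_id, _base_vocab_size); the GPT-2 checkpoint name is a placeholder. The TF-side methods go through tf.py_function, which works for tf.data input pipelines but not for graph serialization or cached/Beam jobs:

import tensorflow as tf
import seqio
from transformers import GPT2TokenizerFast


class HFVocabulary(seqio.Vocabulary):
  """Sketch: wraps a HuggingFace tokenizer in the seqio Vocabulary interface."""

  def __init__(self, name_or_path: str):
    self._tok = GPT2TokenizerFast.from_pretrained(name_or_path)
    super().__init__()

  @property
  def eos_id(self):
    return self._tok.eos_token_id

  @property
  def unk_id(self):
    return self._tok.unk_token_id

  @property
  def pad_id(self):
    # GPT-2 has no pad token; fall back to 0 for padding purposes.
    return self._tok.pad_token_id or 0

  @property
  def _base_vocab_size(self):
    return self._tok.vocab_size

  def _encode(self, s):
    return self._tok.encode(s)

  def _decode(self, ids):
    return self._tok.decode(list(ids))

  def _encode_tf(self, s):
    def _encode_py(t):
      return tf.constant(self._encode(t.numpy().decode("utf-8")), dtype=tf.int32)
    ids = tf.py_function(_encode_py, inp=[s], Tout=tf.int32)
    ids.set_shape([None])
    return ids

  def _decode_tf(self, ids):
    def _decode_py(t):
      return tf.constant(self._decode(t.numpy().tolist()))
    return tf.py_function(_decode_py, inp=[ids], Tout=tf.string)


vocab = HFVocabulary("gpt2")
print(vocab.encode("hello world"))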

seqio 0.0.13 cannot be installed on Apple Silicon due to transitive tensorflow dependency of clu

db4d4b0 added clu as a dependency of seqio.

With this change, we can no longer install seqio on Apple Silicon machines (e.g. M1, M2). This is because clu requires tensorflow (https://github.com/google/CommonLoopUtils/blob/85f9d28556f2684e2c5f2e412cbef5119d6682ba/setup.py#L54) but on Apple Silicon tensorflow should be installed as tensorflow-macos based on the instructions at https://developer.apple.com/metal/tensorflow-plugin/.

A simple fix is to update the clu tensorflow line in the setup.py to tensorflow; platform_machine == 'x86_64'. However, that project doesn't accept GitHub issues or contributions so I am creating an issue here.
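
For reference, a sketch of what that dependency specification could look like with PEP 508 environment markers (an assumption about how the clu maintainers might apply it, not an actual patch):

# Hypothetical install_requires entries in clu's setup.py:
install_requires = [
    "tensorflow; platform_machine == 'x86_64'",
    "tensorflow-macos; platform_machine == 'arm64' and sys_platform == 'darwin'",
]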

import seqio

Hello, why am I getting this warning just from importing seqio?

import seqio
2022-07-29 10:38:38.223245: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-29 10:38:38.223295: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

ValueError: mutable default <class 'seqio.vocabularies.PassThroughVocabulary'> for field vocabulary is not allowed: use default_factory

Traceback (most recent call last):
  File "/scenic/scenic/projects/vid2seq/vid2seq_test.py", line 13, in <module>
    from scenic.projects.vid2seq import trainer
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/vid2seq/trainer.py", line 26, in <module>
    from scenic.projects.t5 import model as t5_model
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/t5/model.py", line 29, in <module>
    from scenic.projects.t5 import layers
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/t5/layers.py", line 9, in <module>
    from t5x import decoding
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/__init__.py", line 17, in <module>
    import t5x.adafactor
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/adafactor.py", line 64, in <module>
    from t5x import utils
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/utils.py", line 44, in <module>
    import seqio
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/seqio/__init__.py", line 18, in <module>
    from seqio.dataset_providers import *
  File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/seqio/dataset_providers.py", line 60, in <module>
    @dataclasses.dataclass(frozen=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 1213, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'seqio.vocabularies.PassThroughVocabulary'> for field vocabulary is not allowed: use default_factory
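
For context, a minimal illustration of the Python 3.11 behaviour behind this traceback and the default_factory workaround it suggests; the field below is a stand-in, not the actual seqio definition (upgrading seqio/t5x to versions that support Python 3.11 is the more likely real fix):

import dataclasses
import seqio

@dataclasses.dataclass(frozen=True)
class FeatureLike:
  # On Python 3.11 an unhashable class instance as a default raises the error above:
  #   vocabulary: seqio.Vocabulary = seqio.PassThroughVocabulary(size=1000)
  # Wrapping it in default_factory avoids the check:
  vocabulary: seqio.Vocabulary = dataclasses.field(
      default_factory=lambda: seqio.PassThroughVocabulary(size=1000))
  add_eos: bool = True

print(FeatureLike().vocabulary)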

seqio_cache_tasks fails on DataflowRunner

When trying to cache a dataset that is too large for DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow without any requirements.txt, like

python -m seqio.scripts.cache_tasks_main \
 --module_import="..." \
 --tasks="${TASK_NAME}" \
 --output_cache_dir="${BUCKET}/cache" \
 --alsologtostderr \
 --pipeline_options="--runner=DataflowRunner,--project=$PROJECT,--region=$REGION,--job_name=$TASK_NAME,--staging_location=$BUCKET/binaries,--temp_location=$BUCKET/tmp,--experiments=shuffle_mode=appliance"

it fails with ModuleNotFoundError: No module named 'seqio'.

If seqio is added with

echo seqio > /tmp/beam_requirements.txt

# and run the same, adding to `--pipeline_options`
--requirements_file=/tmp/beam_requirements.txt

it fails with

subprocess.CalledProcessError: Command '['.../.venv/bin/python', '-m', 'pip', 'download', '--dest', '..../pip-tmp/dataflow-requirements-cache', '-r', '/tmp/beam_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.

Pip install failed for package: -r
Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)
ERROR: No matching distribution found for tensorflow-text"

This seems to be caused by seqio depending on tensorflow-text, which does not have any source release artifacts.

But the requirements cache in Apache Beam appears to be populated with --no-binary :all: before being made available to the workers.

Trying the same thing in a clean venv gives the same result:

pip3 install  --no-binary :all: --no-deps tensorflow-text==2.6.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.6.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-text==2.6.0

Am I doing something wrong, or how does everyone work around this? I would appreciate a hand here.
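
One workaround that is commonly used with Beam (an assumption, not an official seqio recipe) is to skip --requirements_file entirely and ship the dependencies through a setup.py passed with Beam's --setup_file, which lets the workers install binary wheels instead of rebuilding sources with --no-binary :all::

# setup.py (placed next to the launch script)
import setuptools

setuptools.setup(
    name="seqio_cache_job",
    version="0.0.1",
    install_requires=["seqio", "tensorflow-text"],
    packages=setuptools.find_packages(),
)

Then add --setup_file=./setup.py to --pipeline_options instead of --requirements_file.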

Tokenizer is not behaving as expected on special tokens (doesn't recognize `pad` and `eos` tokens)

Looks like tokens like eos and pad do not get tokenized correctly:

Repro:

In [1]: import seqio

In [2]: vocab = seqio.SentencePieceVocabulary('gs://t5-data/vocabs/cc_all.32000/sentencepiece.model')

In [3]: vocab.tokenizer.id_to_piece(0)
Out[3]: '<pad>'

In [4]: vocab.tokenizer.id_to_piece(1)
Out[4]: '</s>'

In [5]: vocab.encode(vocab.tokenizer.id_to_piece(1))
Out[5]: [3, 2, 87, 7, 3155]

In [6]: vocab.tokenizer.id_to_piece(vocab.tokenizer.encode(vocab.tokenizer.id_to_piece(1)))
Out[6]: ['▁', '<unk>', '/', 's', '>']

It breaks down the special tokens into wordpieces.

Unable to train mT5 from t5x using mixtures: ValueError: Dataset is missing an expected feature during input_validation validation: 'inputs'

Hey there,

I am currently pretraining an mT5 model on 23 different languages. But when I create a mixture and set the mixture name in the t5x .gin config file to train on the mixture, I get the following error.

ValueError: Dataset is missing an expected feature during input_validation validation: 'inputs'

However, when I run the independent tasks individually by setting them in the gin file, everything works fine.

The following is what my task.py file looks like:

import functools

import seqio
import tensorflow as tf
from datasets import load_dataset
from t5.data import preprocessors  # assumed source of span_corruption
from seqio import utils            # assumed source of map_over_dataset

TaskRegistry = seqio.TaskRegistry
vocabulary = seqio.SentencePieceVocabulary('gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model', extra_ids=0)


DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=vocabulary, add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=vocabulary, add_eos=True)
}



def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_path=None):
    dataset = load_dataset(dataset_path, streaming=True, use_auth_token=True)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]


def dataset_fn(split, shuffle_files, seed=None, dataset_path=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_path=dataset_path),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_path)
    )


@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map"""
    return {**key_map, target_key: x}


TaskRegistry.add(
    "urdu_span_curruption",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_path='StephennFernandes/ciil_mega_corpus_urdu'),
        splits=("train", "validation"),
        caching_permitted=False),
    preprocessors=[
        functools.partial(
            target_to_key, key_map={
                "inputs": None,
                "targets": None,
            }, target_key="targets"),
        seqio.preprocessors.tokenize,
        # seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption, 
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={"targets": DEFAULT_OUTPUT_FEATURES["targets"]},
    metric_fns=[]
)
 ### similar multiple languages are loaded here ### 


#seqio mixture 3.5 
seqio.MixtureRegistry.add(
  "ciil_mix_3.5",
  ["assamese_span_curruption", "bengali_span_curruption", 
  "bhisnupuriya_span_curruption", "bodo_span_curruption", 
  "divehi_span_curruption", "dogri_span_curruption", 
  "english_span_curruption", "gujarati_span_curruption",
  "hindi_span_curruption", "kannada_span_curruption", 
  "kashmiri_span_curruption", "konkani_span_curruption", 
  "maithili_span_curruption", "malayalam_span_curruption",
  "manipuri_span_curruption", "marathi_span_curruption",
  "nepali_span_curruption", "odia_span_curruption",
  "panjabi_span_curruption", "sanskrit_span_curruption",
  "tamil_span_curruption", "telugu_span_curruption",
   "urdu_span_curruption" ],
  default_rate=3.5
)

Upon running the mT5 model with the mixture name in the .gin file, I get the following error:

File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/train.py", line 744, in _main
    train_using_gin()
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/train.py", line 249, in train
    train_ds = get_dataset_fn(train_dataset_cfg, ds_shard_id, num_ds_shards,
  File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/utils.py", line 1366, in get_dataset
    return get_dataset_inner(cfg, shard_info, feature_converter_cls, seed,
  File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/utils.py", line 1387, in get_dataset_inner
    ds = seqio.get_dataset(
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1681, in get_dataset
    ds = feature_converter(
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/feature_converters.py", line 404, in __call__
    ds = self._validate_dataset(
  File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/feature_converters.py", line 294, in _validate_dataset
    raise ValueError("Dataset is missing an expected feature during "
ValueError: Dataset is missing an expected feature during input_validation validation: 'inputs'
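
Not a confirmed diagnosis, but one hedged guess: a Mixture only keeps features that are declared in every component task's output_features, so registering the tasks with only "targets" can drop the "inputs" feature that span_corruption produces, even though each task works when run on its own. The corresponding change would be to declare both features:

# Hedged guess, not verified against this setup: declare both features so the mixture
# does not strip the "inputs" key generated by span_corruption.
TaskRegistry.add(
    "urdu_span_curruption",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(
            dataset_fn, dataset_path='StephennFernandes/ciil_mega_corpus_urdu'),
        splits=("train", "validation"),
        caching_permitted=False),
    preprocessors=[
        functools.partial(
            target_to_key,
            key_map={"inputs": None, "targets": None},
            target_key="targets"),
        seqio.preprocessors.tokenize,
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,  # both "inputs" and "targets"
    metric_fns=[],
)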

Concatenating Tasks?

Is there a way to concatenate multiple Tasks? Mixtures sample from component Tasks until one of them runs out of examples. Is there a variant that uses all of the examples from both Tasks in each epoch?
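
There is no concatenation provider that I'm aware of, but one workaround sketch is to fetch each registered Task's dataset separately and concatenate them with tf.data, so every example of every Task appears exactly once per epoch; "task_a" / "task_b" are placeholder names and the tasks must produce the same features for concatenate to work:

import seqio
import tensorflow as tf

def concatenated_dataset(task_names, sequence_length, split="train"):
  datasets = [
      seqio.get_mixture_or_task(name).get_dataset(
          sequence_length=sequence_length,
          split=split,
          num_epochs=1,
          shuffle=False)
      for name in task_names
  ]
  ds = datasets[0]
  for other in datasets[1:]:
    ds = ds.concatenate(other)  # all examples of each task, back to back
  return ds

ds = concatenated_dataset(["task_a", "task_b"], {"inputs": 512, "targets": 512})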

import seqio: AttributeError: module 'typing' has no attribute 'get_origin'

import seqio
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/seqio/__init__.py", line 18, in <module>
    from seqio.dataset_providers import *
  File "/usr/local/lib/python3.7/dist-packages/seqio/dataset_providers.py", line 38, in <module>
    import pyglove as pg
  File "/usr/local/lib/python3.7/dist-packages/pyglove/__init__.py", line 30, in <module>
    from pyglove.core import *
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/__init__.py", line 56, in <module>
    from pyglove.core import symbolic
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/symbolic/__init__.py", line 93, in <module>
    from pyglove.core.symbolic.diff import diff
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/symbolic/diff.py", line 153, in <module>
    (pg_typing.StrKey(), pg_typing.Object(Diff), 'Child node.')
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/value_specs.py", line 1279, in __init__
    schema_or_field_list, allow_nonconst_keys=True)
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/class_schema.py", line 1179, in create_schema
    value = ValueSpec.from_annotation(maybe_value_spec, True)
  File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/value_specs.py", line 2131, in _from_annotation
    origin = typing.get_origin(annotation)
AttributeError: module 'typing' has no attribute 'get_origin'

FunctionDataSource does not allow function with 3 positional arguments thus shuffling does not work

During creation, FunctionDataSource checks that the function has only two positional arguments. For shuffling to work it should also accept a third argument, seed (or seeds); otherwise an exception is thrown when passing shuffle=True to get_dataset().

_validate_args(dataset_fn, ["split", "shuffle_files"])

Also, it only allows seed and not seeds later. But this never comes into effect, since the whole thing fails during creation.

_validate_args(self._dataset_fn, ["split", "shuffle_files", "seed"])
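
The signature that passes both checks in practice (and the one used in seqio's own examples, as far as I can tell) gives seed a default value, so the constructor-time validation of (split, shuffle_files) still matches while get_dataset(shuffle=True, seed=...) can forward the seed later; a minimal sketch:

import seqio
import tensorflow as tf

def dataset_fn(split, shuffle_files, seed=None):
  del shuffle_files
  data = {"train": ["a", "b", "c"], "validation": ["d"]}[split]
  ds = tf.data.Dataset.from_tensor_slices(data)
  return ds.shuffle(8, seed=seed) if seed is not None else ds

source = seqio.FunctionDataSource(
    dataset_fn=dataset_fn, splits=("train", "validation"))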

Dataset performance

I am having a difficult time getting my data pipeline to the throughput levels that I would like before starting training with the t5x library.

Initially I planned to use a mixture of ~40 tasks (1-2 TB of text) for training and started doing some benchmarking, following the general TPU and dataset performance guides.

All of my datasets/tasks are JSON-lines files (output from earlier Dataflow jobs), with 200 to 1000 files each.

I used Colab notebooks or an E2 32-vCPU instance for my benchmarking experiments, where I mounted the bucket that has all of the ~40 datasets I plan to use. I sampled 16 different files as training files for each task source, because it is recommended not to read too many files from GCS.

FileDataSource

I switched from FunctionDataSource to FileDataSource. This is mainly to use individual files during sharding without needing to read all the data, which I assume would be slower, especially for larger datasets.

import json

import tensorflow as tf

@tf.autograph.experimental.do_not_convert
def read_file_fn(file):
  """Reads a JSON-lines file and yields the 'text' field of each line."""
  def _read_json(file):
    # `file` arrives as bytes when passed through `args`, so decode it first.
    file = file.decode() if isinstance(file, bytes) else file
    with tf.io.gfile.GFile(file) as f:
      for line in f:
        yield json.loads(line)['text']

  return tf.data.Dataset.from_generator(
      _read_json, args=(file,),
      output_signature=tf.TensorSpec(shape=(), dtype=tf.string))

source = seqio.FileDataSource(
  read_file_fn = read_file_fn,
  split_to_filepattern=dict(train=train_files, validation=validation_files))

Here we can see the reading and deserialization performance of a single task source.

dataset = source.get_dataset("train", shard_info=seqio.ShardInfo(0,16))
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 1622.67 ex/sec (total: 10001 ex, 6.16 sec)
Examples/sec (First only) 0.95 ex/sec (total: 1 ex, 1.05 sec)
Examples/sec (First excluded) 1954.66 ex/sec (total: 10000 ex, 5.12 sec)

Single Task

Then I register my seqio tasks with the full pipeline (including preprocessors) and test the performance of a single task.

dataset = seqio.get_mixture_or_task('task').get_dataset(
                    sequence_length={"inputs": 512, "targets": 512},
                    split="train",
                    shuffle=False,
                    num_epochs=1,
                    shard_info=seqio.ShardInfo(index=0, num_shards=16),
                    use_cached=False,
                    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 485.21 ex/sec (total: 10001 ex, 20.61 sec)
Examples/sec (First only) 0.47 ex/sec (total: 1 ex, 2.11 sec)
Examples/sec (First excluded) 540.50 ex/sec (total: 10000 ex, 18.50 sec)

Mixture

When I benchmark the performance of the mixture it drops significantly (10x).

dataset = seqio.get_mixture_or_task("maana_version1.0_mixture").get_dataset(
                    sequence_length={"inputs": 512, "targets": 512},
                    split="train",
                    shuffle=False,
                    num_epochs=1,
                    shard_info=seqio.ShardInfo(index=0, num_shards=16),
                    use_cached=False,
                    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 140.55 ex/sec (total: 10001 ex, 71.16 sec)
Examples/sec (First only) 0.09 ex/sec (total: 1 ex, 11.49 sec)
Examples/sec (First excluded) 167.60 ex/sec (total: 10000 ex, 59.67 sec)

Follow Up Thoughts

Please let me know if you have any feedback regarding the following comments and questions:

  1. In my experiments, reading from GCS vs. local files didn't differ much, so streaming directly from GCS is probably the better option (not having to download TB-scale data), as long as the bucket is in the same zone as the TPU and the number of files is not too large. The docs suggest 10s to 100s of MB per file and 10s to 100s of files; in my case I have datasets with 200-1000 files in the 100 MB-1 GB range. Should I reduce the number of files, maybe by making each one 1 GB, and would this help pipeline performance?

  2. I also experimented with TFExampleDataSource vs. FileDataSource and didn't see any performance gain from TFExample compared to JSON. Is there an absolute best way to store data for seqio pipeline performance, e.g. would registering a TFDS dataset be better, as explained here? In my experience, Dataflow jobs output as many files as there are workers, so the count can be much higher than 100s. Is this OK, or should we keep the number of files in the 128-256 range?

  3. This is more of a T5X question, but it still might be related. My understanding is that when we get a dataset from a mixture, each task is iterated, and if shard info is specified that shard is returned as the data; later, the same sample_fn is used to sample from these task datasets at the given rates. I don't fully know how data parallelism plays together with model parallelism in t5x, and it may depend on the model size and the number of TPU cores. Is it correct to assume that each TPU core is a worker and the data gets distributed to them when sharding? If so, would it make sense to have the number of files be a multiple of the number of cores (e.g. 8x for a v3-8, 32x for a v3-32)? I also read that the batch is automatically distributed across TPU cores during computation, which I guess is why 8 x 128 is emphasized; does that mean we don't necessarily need to care about the number of files / sharding and can still use a single source file?

Notes from codelab:

The rule of thumb is to split your data across several (10s to 100s) large-ish files (10s to 100s of MB). If you have too many files, thousands of files for example, the time to access each file might start getting in the way. If you have too few files, like one or two, then you are not getting the benefits of streaming from multiple files in parallel.

Different preprocessors for each dataset split

Hi,

I'm working on an STS task, using a seqio.TfdsDataSource source and t5.data.preprocessors.stsb as the preprocessor. After looking at some generated examples, I realized that both the training and the eval data are preprocessed, so metrics calculated on the test split use a processed version of the target instead of the original values. Using t5.data.preprocessors.stsb as an example: during training a label value of 3.25 is converted to "3.2"; during eval steps, I'd like to just convert this value to a float without rounding ("3.25").

Is there a way to apply different preprocessors to each dataset split? It would be ideal for the evaluation metric functions to consume gold labels as close as possible to the original values. The postprocessing section of the README mentions the is_target argument for postprocessing functions, but I couldn't find a similar mechanism for preprocessor functions.

Thanks,
Marcos
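
One workaround sketch, assuming per-split preprocessors are not supported directly: register two tasks over the same TFDS source, restrict each to one split, and give the eval task its own unrounded label preprocessor. The TFDS version string, the task names, and stsb_eval_unrounded are hypothetical:

import seqio
import t5.data
import tensorflow as tf

vocab = t5.data.get_default_vocabulary()
OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
    "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
}

@seqio.map_over_dataset
def stsb_eval_unrounded(ex):
  # Hypothetical eval-time variant: keep the label's full precision as text.
  return {
      "inputs": tf.strings.join(
          ["stsb sentence1: ", ex["sentence1"], " sentence2: ", ex["sentence2"]]),
      "targets": tf.strings.as_string(ex["label"], precision=2),
  }

for name, split, stsb_preprocessor in [
    ("stsb_train", "train", t5.data.preprocessors.stsb),
    ("stsb_eval", "validation", stsb_eval_unrounded),
]:
  seqio.TaskRegistry.add(
      name,
      source=seqio.TfdsDataSource(tfds_name="glue/stsb:2.0.0", splits=[split]),
      preprocessors=[
          stsb_preprocessor,
          seqio.preprocessors.tokenize,
          seqio.preprocessors.append_eos_after_trim,
      ],
      output_features=OUTPUT_FEATURES,
  )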

How to just use the mixture functionality in seqio

Hey there, I've been wanting to pretrain MT5 on Huggingface training script as mentioned here: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py

But sadly the HuggingFace script doesn't support a mixture for pretraining mT5 in such a way that the model generalises well on low-resource as well as high-resource languages.

Hence I've been wanting to use the mixture functionality of seqio, but upon using it I have to tokenize the data into the T5 SentencePiece vocabulary, and the seqio tasks do all the preprocessing.

The HuggingFace trainer takes care of the preprocessing, mapping the dataset to the tokenizer, etc.

My question is: is there a way I could use just the mixture functionality of seqio without actually doing any preprocessing on the incoming datasets?

I was wondering if there is a way to feed in multiple datasets and get an output dataset (in text/str format) that is simply an appropriate mixture of the samples from all the datasets passed to the mixture function, which I could then pretrain on with the HF trainer, doing all the preprocessing there.
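
A sketch of that idea, under the assumption that you only want seqio's sampling and not its tokenization: register each corpus as a task whose single feature is raw text with a PassThroughVocabulary and no preprocessors, mix the tasks, and iterate the mixture to obtain plain strings for the HF pipeline. The dataset contents and names below are placeholders, and I have not verified this end to end:

import seqio
import tensorflow as tf

vocab = seqio.PassThroughVocabulary(size=0)
output_features = {
    "targets": seqio.Feature(vocabulary=vocab, add_eos=False, dtype=tf.string)
}

def make_dataset_fn(texts):
  def dataset_fn(split, shuffle_files, seed=None):
    del split, shuffle_files, seed
    return tf.data.Dataset.from_tensor_slices({"targets": texts})
  return dataset_fn

for name, texts in [("lang_a", ["a1", "a2"]), ("lang_b", ["b1"])]:
  seqio.TaskRegistry.add(
      name,
      source=seqio.FunctionDataSource(make_dataset_fn(texts), splits=["train"]),
      preprocessors=[],  # no tokenization
      output_features=output_features,
  )

seqio.MixtureRegistry.add("text_mix", [("lang_a", 2.0), ("lang_b", 1.0)])
ds = seqio.get_mixture_or_task("text_mix").get_dataset(
    sequence_length=None, split="train", shuffle=False)
for ex in ds.take(3):
  print(ex["targets"].numpy().decode())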

How to decide ideal mixture rates?

What is the best way to decide which mixture ratio is optimal?

In the mT5 paper, the alpha value 0.3 gave the best balance between performance on high- and low-resource languages.

However I am pretraining mT5 on Indian languages, and I have a diverse variety of indian multi-lingual corpus, where Hindi has 60M+ samples and Kashmiri has around 100k samples.

So I wanted to know whether I could somehow hyperparameter-tune this on t5x, or whether just using alpha=0.3 would work fine in my use case.
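
For reference, the mT5-style rule is just a temperature on the empirical language proportions (each language's share of the data raised to the power alpha), and the resulting values can be passed as (task_name, rate) pairs to seqio.MixtureRegistry.add. A small sketch with placeholder counts; whether alpha=0.3 is still optimal for a 60M-vs-100k skew is an empirical question, so sweeping a couple of alpha values is the safest way to tune it:

counts = {"hindi_span_curruption": 60_000_000, "kashmiri_span_curruption": 100_000}
alpha = 0.3

total = sum(counts.values())
rates = {name: (n / total) ** alpha for name, n in counts.items()}
print(rates)
# seqio.MixtureRegistry.add("my_mix", [(name, rate) for name, rate in rates.items()])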

Dataset seeking for restarting from a T5X crashed run using HuggingFace datasets

Re-opening here as suggested by @adarob in google-research/t5x#421 (comment).

I wrote some hacky support for HuggingFace datasets using seqio.FunctionDataSource, specifically for pretraining and further pretraining models using T5X.

def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:  # TODO: add for...loop over num_epochs
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )

dataset_name = 'NbAiLab/NCC'
dataset_params = {"path": dataset_name, "streaming": True}
dataset_shapes = {"train": 20830348, "validation": 473079}
source = seqio.FunctionDataSource(
    dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
    splits=("train", "validation"),
    caching_permitted=False,
    num_input_examples=dataset_shapes,
)

But unfortunately, as I face constant random crashes during training (google-research/t5x#366), I need a way to seek to the right dataset batch to properly continue training.

I see there's a continue_from_last_checkpoint variable in get_dataset(), but it seems it is not used for anything yet.

Is there a way to pass in the needed information to get_dataset_fn() so I can write the logic without using any hard-coded global variables?
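
Until something official exists, one hacky sketch (an assumption, not a t5x feature) is to thread a skip_examples count into the generator, computed outside from the restored checkpoint step times the global batch size, so the restarted run fast-forwards past what was already consumed; note this replays the same iteration order only if shuffling is seeded identically:

import functools

import tensorflow as tf
from datasets import load_dataset

def gen_dataset(split, shuffle=False, seed=None, column="text",
                dataset_params=None, skip_examples=0):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        dataset = dataset.shuffle(seed=seed) if seed else dataset.shuffle()
    seen = 0
    while True:
        for item in dataset[str(split)]:
            seen += 1
            if seen <= skip_examples:
                continue  # fast-forward past examples consumed before the crash
            yield item[column]

# e.g. skip_examples = restored_step * global_batch_size, bound via functools.partial
# when constructing the dataset_fn for seqio.FunctionDataSource.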

How to choose minimum sequence length while avoiding truncation

Hi,

I have a task that uses seqio.TfdsDataSource as its source and a pipeline with preprocessors final steps that looks like this: [..., seqio.preprocessors.tokenize, seqio.CacheDatasetPlaceholder(), seqio.preprocessors.append_eos_after_trim].

I have cached this task, so I know the maximum token lengths for both inputs and targets.

My question is: when training a model with t5.models.mesh_transformer_main using this task and providing gin bindings for utils.run.sequence_length, should I use the values I see on the cached stats, or should I add +1 to account for the EOS token? My goal is to avoid data truncation by specifying smaller sequence lengths than what my data requires.

(P.S.: I know this is also related to the t5 repository, but I opened the issue here because I think my question is related to the seqio.preprocessors.append_eos_after_trim function. If you think it would be more appropriate to open this issue in another repository, please let me know, and I can change it.)

Thanks in advance,
Marcos

TfdsDataProvider gives error with non-None tfds_data_dir

SeqIO provides access to TFDS through TfdsDataProvider, which takes tfds_data_dir as an argument. However, it is not currently possible to use a non-None tfds_data_dir with TfdsDataProvider.

The issue can be traced to LazyTfdsLoader, which uses tfds.load with the hardcoded setting try_gcs=True. As noted in the TFDS docs, this is equivalent to setting data_dir='gs://tfds-data/datasets'. Consequently, TFDS raises an error when passing try_gcs=True and a non-None data_dir to tfds.load, as would occur when using non-None tfds_data_dir with TfdsDataProvider.

I believe allowing a non-None tfds_data_dir would be helpful in many scenarios. For example, many large datasets available through TFDS are hosted in locations other than gs://tfds-data/datasets, and in formats other than tfrecords. When downloading and preprocessing such datasets on preemptible VMs, it is desirable to specify a data_dir so that the tfrecords can be saved to the cloud as detailed here. This allows users to avoid incurring the full download/processing delay on subsequent occasions: only some tfrecord shards need to be downloaded per host, and the downloads can overlap with model training. In this case, however, one must set try_gcs=False to avoid the TFDS error.

Rather than exposing the try_gcs option to the user, the implementation of LazyTfdsLoader can automatically set try_gcs=False when data_dir is not None. This way, it would be the user's responsibility to specify a tfds_data_dir or not when instantiating TfdsDataProvider, just like they are nominally able to do right now. The only downside is that if the data is available as tfrecords at gs://tfds-data/datasets and the user specifies their own data_dir, the try_gcs=False forces a potentially unnecessary download. However, a warning can be added to the docstring to mention this consequence of specifying tfds_data_dir.

Can we make this happen? I can open a PR with this simple change!
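
For clarity, a sketch of the suggested behaviour (the issue's proposal, not current seqio code): only request the public TFDS GCS mirror when no explicit data_dir is given.

import tensorflow_datasets as tfds

def load(name, split, data_dir=None):
  return tfds.load(
      name,
      split=split,
      data_dir=data_dir,
      # Proposed: avoid the try_gcs / data_dir conflict when a custom dir is set.
      try_gcs=(data_dir is None),
  )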

Using a registered task to add another

Suppose I have a task registered as follows:

seqio.TaskRegistry.add(
    "task_1",
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1", splits=["train", "validation"]),
    preprocessors=[
        preprocess1,
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocess2,
        preprocess3,
    ],
    output_features=...,
)

Is it possible to add another task that starts with the cached part of task_1, i.e., the part before seqio.CacheDatasetPlaceholder(), and only vary preprocess2 and preprocess3?

I'm looking for something like

seqio.TaskRegistry.add(
    "task_2",
    source=seqio.CachedTaskSource("task_1", ...),
    preprocessors=[
        preprocess2_modified,
        preprocess3_modified,
    ],
    ...
)
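
As far as I know there is no built-in CachedTaskSource, but a workaround sketch is to reuse the registered task's source and its preprocessors up to (and including) the CacheDatasetPlaceholder, then append the modified post-cache steps. Note that cached artifacts are stored per task name, so task_2 would still need its own cache run unless its cache directory is populated by copying task_1's:

import seqio

task_1 = seqio.TaskRegistry.get("task_1")
cache_index = next(
    i for i, p in enumerate(task_1.preprocessors)
    if isinstance(p, seqio.CacheDatasetPlaceholder))

seqio.TaskRegistry.add(
    "task_2",
    source=task_1.source,
    preprocessors=(
        list(task_1.preprocessors[:cache_index + 1]) +
        [preprocess2_modified, preprocess3_modified]  # placeholders from above
    ),
    output_features=task_1.output_features,
)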

How to apply the huggingface tokenizer in seqio.vocabulary

Hello.

I would like to use a HuggingFace tokenizer as a seqio.Vocabulary in t5x.

I subclassed seqio.Vocabulary and created my own BBPEVocabulary. However, the 'inputs' and 'targets' values cannot be accessed as text inside tf.data.Dataset.map, because the HuggingFace tokenizer expects a string but tf.data.Dataset provides a tf.Tensor like Tensor("args_0:0", shape=(), dtype=string).

Since the seqio SentencePiece vocabulary loads its model through the tf_text SentencePiece op (a .so file), I don't know how to handle this internally.

I would like to ask how to get and process a tf.Tensor as text, in order to use a HuggingFace tokenizer inside tf.data.Dataset.map.

I am attaching the code I used below.

Thank you:)

seqio/custom_task.py

from src.vocabularies import BBPEVocabulary
bbpe_vocab = BBPEVocabulary('custom_path')

seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": os.path.join('[MY_TF_RECORD_PATH]', "*train.tfrecord*")},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)}
    ),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=BBPE_OUTPUT_FEATURES,
    metric_fns=[])

seqio/preprocessors.py

def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
  tokenize_fn = functools.partial(
      tokenize_impl,
      output_features=output_features,
      copy_pretokenized=copy_pretokenized,
      with_eos=with_eos)
  return utils.map_over_dataset(fn=tokenize_fn)(dataset)

def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
  ret = {}
  for k, v in features.items():
    if k in output_features:
      if copy_pretokenized:
        ret[f'{k}_pretokenized'] = v
      vocab = output_features[k].vocabulary
      v = vocab.encode_tf(v) # In this line, the `v` value type is "tf.tensor", and I can't obtain text of `v`
      ...[omitted]...

    ret[k] = v
  print(f'tokenize_impl | complete | return : {ret}')
  return ret
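
The usual escape hatch (with the caveat that it runs outside the TF graph and therefore does not work for cached or Beam-based preprocessing) is tf.py_function: it materializes the string tensor as Python bytes that can be .decode()'d before calling the HuggingFace tokenizer. A small sketch with a placeholder model name:

import tensorflow as tf
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

def encode_tf(text_tensor: tf.Tensor) -> tf.Tensor:
  def _encode_py(t):
    # Inside py_function the tensor is eager, so .numpy().decode() gives a Python str.
    return tf.constant(hf_tok.encode(t.numpy().decode("utf-8")), dtype=tf.int32)
  ids = tf.py_function(_encode_py, inp=[text_tensor], Tout=tf.int32)
  ids.set_shape([None])
  return ids

ds = tf.data.Dataset.from_tensor_slices(["hello world"]).map(encode_tf)
print(next(iter(ds)))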

caching tasks goes out of memory due to apache beam

Trying to cache tasks from the magenta/MT3 repository: with only 200 examples, it takes around 30 GB of memory while caching, at the very end of processing.
Without caching, it trains just fine even with a 1000-example training dataset.

Using seqio for T5X Dataset Generation

Hi 🤗

I would like to pre-train a T5 Base model with T5X library.

If I understand the pre-training process correctly, I need TFRecords stored in a cloud bucket for that training (like it is done for BERT pre-training).

Now I have the following questions:

How is it possible to generate such a dataset from my own corpus? The corpus is a plain text file (each line = one sentence). I also have a T5-compatible vocab (SentencePiece model), because I don't want to use the existing T5 or mT5 vocabs.

Many thanks in advance!
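
A sketch of one way to do this without materializing TFRecords up front, assuming the standard span-corruption objective: point a seqio.TextLineDataSource at the plain-text files, wrap each line into a "targets" dict, and register a task that T5X can then consume by name (caching it with seqio_cache_tasks is optional). Bucket paths and the vocab location are placeholders:

import seqio
import t5.data.preprocessors as t5_preprocessors

vocab = seqio.SentencePieceVocabulary("gs://my-bucket/spm/my_vocab.model")
output_features = {
    "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
    "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
}

@seqio.map_over_dataset
def line_to_dict(line):
  # TextLineDataSource yields raw lines; span_corruption expects a "targets" key.
  return {"targets": line}

seqio.TaskRegistry.add(
    "my_pretraining_task",
    source=seqio.TextLineDataSource(
        split_to_filepattern={"train": "gs://my-bucket/corpus/train-*.txt"}),
    preprocessors=[
        line_to_dict,
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        t5_preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=output_features,
    metric_fns=[])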

Unimax sampler implementation?

I tried searching the code for seqio mixtures generated using the newly released unimax sampler.

I am trying to pretrain a custom umUL2 model. If I could perhaps learn how to implement UniMax, it would be of great help. Thanks.
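
I couldn't point to an official seqio implementation either, but as I understand the UniMax paper (Chung et al., 2023), the sampling distribution comes from a simple budget allocation: visit languages from smallest corpus to largest, give each an equal share of the remaining character budget capped at max_epochs passes over its own corpus, then normalize the allocated budgets into mixture rates. A hedged sketch with placeholder counts:

def unimax_rates(char_counts, total_budget, max_epochs=4):
  remaining = total_budget
  budgets = {}
  # Smallest corpora first, so low-resource languages are capped before the
  # leftover budget is redistributed to the high-resource ones.
  for i, (lang, size) in enumerate(sorted(char_counts.items(), key=lambda kv: kv[1])):
    langs_left = len(char_counts) - i
    fair_share = remaining / langs_left
    budgets[lang] = min(fair_share, max_epochs * size)
    remaining -= budgets[lang]
  total = sum(budgets.values())
  return {lang: b / total for lang, b in budgets.items()}

rates = unimax_rates({"hindi": 6e10, "kashmiri": 1e8}, total_budget=1e11)
print(rates)  # usable as (task_name, rate) pairs in seqio.MixtureRegistry.add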
