google / seqio
Task-based datasets, preprocessing, and evaluation for sequence models.
License: Apache License 2.0
Please include installation instructions on the README file
Hi, is the support for "Deterministic Pipelines"---as described in https://arxiv.org/abs/2203.17189 section 3.2---now available through the open-source seqio?
Thanks!
Currently MixtureRegistry and TaskRegistry have an add method that takes the arguments to construct a Mixture / Task. This does not seem to play well if one wants to subclass Mixture or Task with a class that takes different arguments and add instances of it to the registries. A concrete example of a subclass of Task: turning a Mixture back into a task so that it looks "atomic" when one tries to add it back into a Mixture. Would it be possible to have a method that adds an object directly to the registries, without having to pass the arguments that the constructor uses?
tokenize_and_append_eos needs another required input (output_features). How can I use this function as a preprocessor, and how do I pass the output features?
This is the way I tried to use it:
preprocessors=[
    functools.partial(
        t5.data.preprocessors.parse_tsv,
        field_names=["input_text", "target_text"]),
    seqio.preprocessors.tokenize_and_append_eos,
],
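For what it's worth, my understanding is that a seqio Task itself supplies output_features (and sequence_length) to any preprocessor whose signature declares them, so the function can be listed bare as above. The sketch below is not seqio code; it only illustrates that kind of signature-based injection in plain Python. run_pipeline, lowercase, tokenize_like, and the toy data are all hypothetical names introduced for illustration.

```python
import inspect


def run_pipeline(dataset, preprocessors, output_features, sequence_length=None):
    """Apply each preprocessor, passing only the optional kwargs it declares."""
    available = {
        "output_features": output_features,
        "sequence_length": sequence_length,
    }
    for fn in preprocessors:
        params = inspect.signature(fn).parameters
        extra = {k: v for k, v in available.items() if k in params}
        dataset = fn(dataset, **extra)
    return dataset


def lowercase(dataset):
    # A preprocessor that needs no extra arguments.
    return [{k: v.lower() for k, v in ex.items()} for ex in dataset]


def tokenize_like(dataset, output_features):
    # Stand-in for a tokenizer: splits only the declared output features.
    return [
        {k: (v.split() if k in output_features else v) for k, v in ex.items()}
        for ex in dataset
    ]


examples = [{"inputs": "Hello World", "targets": "Bonjour"}]
result = run_pipeline(
    examples,
    [lowercase, tokenize_like],
    output_features={"inputs": None, "targets": None},
)
print(result)
```

The caller never passes output_features to tokenize_like directly; the pipeline runner does it because the parameter name appears in the function signature.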
464   def _decode_tf(self, ids):
465     """Decode in TensorFlow.
466
467     Args:
468       ids: a 1d tf.Tensor with dtype tf.int32
469     Returns:
470       a tf Scalar with dtype tf.string
471     """
472     return tf.py_function(func=self.decode, inp=[ids], Tout=tf.string)
The param 'ids' passed to _decode_tf above is a 1d tf.Tensor, and on line 472 it is wrapped into a list, but when self.decode is called with the param [ids], it throws the error in the list comprehension on line 100 (shown below):
File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/seqio/vocabularies.py", line 100, in decode
    for i in clean_ids
File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/seqio/vocabularies.py", line 100, in
    for i in clean_ids
File "/home/ubuntu/anaconda3/envs/google_t5/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1007, in __bool__
    return bool(self._numpy())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
92   def decode(self, ids: Iterable[int]):
93     """Detokenizes int32 iterable to a string, up through first EOS."""
94     clean_ids = list(ids)
95
96     if self.unk_id is not None:
97       vocab_size = self._base_vocab_size
98       clean_ids = [
99           self.unk_id if i >= vocab_size else i
100          for i in clean_ids
101      ]
102
103    if self.eos_id is not None and self.eos_id in clean_ids:
104      clean_ids = clean_ids[:clean_ids.index(self.eos_id) + 1]
105
106    return self._decode(clean_ids)
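The traceback is consistent with decode iterating over elements that are themselves arrays (for example, when ids has more than one dimension): `i >= vocab_size` then produces an array, and its truth value is ambiguous. Below is a minimal pure-Python sketch of the same cleaning steps with a flattening step added first. clean_ids, flatten, and the sample values are hypothetical, and nested lists stand in for tensors.

```python
from typing import List, Optional


def flatten(ids) -> List[int]:
    """Flatten arbitrarily nested lists/tuples of ints into one flat list."""
    flat = []
    for item in ids:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(int(item))
    return flat


def clean_ids(ids, vocab_size: int, unk_id: Optional[int],
              eos_id: Optional[int]) -> List[int]:
    """Mirror the cleaning in decode(): map out-of-range ids to UNK, cut at EOS."""
    out = flatten(ids)
    if unk_id is not None:
        out = [unk_id if i >= vocab_size else i for i in out]
    if eos_id is not None and eos_id in out:
        out = out[:out.index(eos_id) + 1]
    return out


# A sequence accidentally wrapped in an extra list still cleans correctly.
cleaned = clean_ids([[5, 7, 40000, 1, 9]], vocab_size=32000, unk_id=2, eos_id=1)
print(cleaned)
```

Flattening only makes sense for a single sequence with a stray extra dimension; a real batch would need to be decoded row by row.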
Hi, I have been trying to get SeqIO to work with HuggingFace's tokenizers for a bit but have been running into trouble with non-T5-based tokenizers. Specifically, it seems that, because they are not SentencePiece tokenizers, tokenizers for models such as GPT-2 are incompatible with SeqIO's SentencePieceVocabulary, as they only have the vocab files:
{
    'vocab_file': 'vocab.json',
    'merges_file': 'merges.txt',
    'tokenizer_file': 'tokenizer.json'
}
Is there a currently supported way to use these tokenizers with SeqIO? Or would I need to make my own vocab class?
db4d4b0 added clu as a dependency of seqio.
With this change, we can no longer install seqio on Apple Silicon machines (e.g. M1, M2). This is because clu requires tensorflow (https://github.com/google/CommonLoopUtils/blob/85f9d28556f2684e2c5f2e412cbef5119d6682ba/setup.py#L54) but on Apple Silicon tensorflow should be installed as tensorflow-macos based on the instructions at https://developer.apple.com/metal/tensorflow-plugin/.
A simple fix is to update the clu tensorflow line in the setup.py to tensorflow; platform_machine == 'x86_64'. However, that project doesn't accept GitHub issues or contributions so I am creating an issue here.
Hello, why am I getting this warning just from importing seqio?
import seqio
2022-07-29 10:38:38.223245: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-29 10:38:38.223295: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "/scenic/scenic/projects/vid2seq/vid2seq_test.py", line 13, in
from scenic.projects.vid2seq import trainer
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/vid2seq/trainer.py", line 26, in
from scenic.projects.t5 import model as t5_model
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/t5/model.py", line 29, in
from scenic.projects.t5 import layers
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/scenic/projects/t5/layers.py", line 9, in
from t5x import decoding
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/__init__.py", line 17, in
import t5x.adafactor
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/adafactor.py", line 64, in
from t5x import utils
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/t5x/utils.py", line 44, in
import seqio
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/seqio/__init__.py", line 18, in
from seqio.dataset_providers import *
File "/anaconda3/envs/vid2seq/lib/python3.11/site-packages/seqio/dataset_providers.py", line 60, in
@dataclasses.dataclass(frozen=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 1213, in wrap
return _process_class(cls, init, repr, eq, order, unsafe_hash,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 958, in _process_class
cls_fields.append(_get_field(cls, name, type, kw_only))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda3/envs/vid2seq/lib/python3.11/dataclasses.py", line 815, in _get_field
raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'seqio.vocabularies.PassThroughVocabulary'> for field vocabulary is not allowed: use default_factory
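As I understand it, Python 3.11 rejects any unhashable object as a dataclass field default, and a class that defines __eq__ without __hash__ is unhashable. The error message itself points at the usual fix, default_factory. A minimal sketch with stand-in classes follows; PassThroughVocab and Feature here are hypothetical and are not the seqio definitions.

```python
import dataclasses


class PassThroughVocab:
    """Hypothetical stand-in: defines __eq__ but not __hash__, so it is unhashable."""

    def __init__(self, size: int = 0):
        self.size = size

    def __eq__(self, other):
        return isinstance(other, PassThroughVocab) and self.size == other.size


@dataclasses.dataclass(frozen=True)
class Feature:
    # default_factory builds a fresh default per instance instead of sharing
    # one object as a class-level default, which newer Pythons reject.
    vocabulary: PassThroughVocab = dataclasses.field(
        default_factory=lambda: PassThroughVocab(size=0))
    add_eos: bool = True


first = Feature()
second = Feature()
print(first.vocabulary == second.vocabulary, first.vocabulary is second.vocabulary)
```

Until a fix lands in the installed seqio version, running under an older Python release is the other obvious workaround.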
When trying to cache a dataset that does not fit DirectRunner (e.g. google-research/text-to-text-transfer-transformer#323 (comment)) on Cloud Dataflow without any requirements.txt, like
python -m seqio.scripts.cache_tasks_main \
--module_import="..." \
--tasks="${TASK_NAME}" \
--output_cache_dir="${BUCKET}/cache" \
--alsologtostderr \
--pipeline_options="--runner=DataflowRunner,--project=$PROJECT,--region=$REGION,--job_name=$TASK_NAME,--staging_location=$BUCKET/binaries,--temp_location=$BUCKET/tmp,--experiments=shuffle_mode=appliance"
it fails with ModuleNotFoundError: No module named 'seqio'.
If seqio is added with
echo seqio > /tmp/beam_requirements.txt
# and run the same, adding to `--pipeline_options`
--requirements_file=/tmp/beam_requirements.txt
it fails with
subprocess.CalledProcessError: Command '['.../.venv/bin/python', '-m', 'pip', 'download', '--dest', '..../pip-tmp/dataflow-requirements-cache', '-r', '/tmp/beam_requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
Pip install failed for package: -r
Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tensorflow-text (from versions: none)\nERROR: No matching distribution found for tensorflow-text
This seems to be caused by seqio depending on tensorflow-text, which does not have any source release artifacts.
But the requirements cache in Apache Beam seems to be populated with --no-binary :all: before it is made available to the workers.
Trying in a clean venv gives the same result:
pip3 install --no-binary :all: --no-deps tensorflow-text==2.6.0
ERROR: Could not find a version that satisfies the requirement tensorflow-text==2.6.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-text==2.6.0
Am I doing something wrong, or how does everyone work around this? I would appreciate a hand here.
Is the new UniMax sampling-based mixture implementation available?
It looks like tokens such as eos and pad do not get tokenized correctly:
Repro:
In [1]: import seqio
In [2]: vocab = seqio.SentencePieceVocabulary('gs://t5-data/vocabs/cc_all.32000/sentencepiece.model')
In [3]: vocab.tokenizer.id_to_piece(0)
Out[3]: '<pad>'
In [4]: vocab.tokenizer.id_to_piece(1)
Out[4]: '</s>'
In [5]: vocab.encode(vocab.tokenizer.id_to_piece(1))
Out[5]: [3, 2, 87, 7, 3155]
In [6]: vocab.tokenizer.id_to_piece(vocab.tokenizer.encode(vocab.tokenizer.id_to_piece(1)))
Out[6]: ['▁', '<unk>', '/', 's', '>']
It breaks down the special tokens into wordpieces.
Hey there,
I am currently pretraining an mT5 model on 23 different languages, but when I create a mixture and set the mixture name in the t5x .gin config file to train on it, I get the following error:
ValueError: Dataset is missing an expected feature during input_validation validation: 'inputs'
However, when I run the individual tasks by setting them in the gin file one at a time, everything works fine.
The following is what my task.py file looks like:
TaskRegistry = seqio.TaskRegistry
vocabulary = seqio.SentencePieceVocabulary('gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model', extra_ids=0)
DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=vocabulary, add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=vocabulary, add_eos=True)
}

def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_path=None):
    dataset = load_dataset(dataset_path, streaming=True, use_auth_token=True)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_path=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_path=dataset_path),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_path)
    )

@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map"""
    return {**key_map, target_key: x}

TaskRegistry.add(
    "urdu_span_curruption",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_path='StephennFernandes/ciil_mega_corpus_urdu'),
        splits=("train", "validation"),
        caching_permitted=False),
    preprocessors=[
        functools.partial(
            target_to_key, key_map={
                "inputs": None,
                "targets": None,
            }, target_key="targets"),
        seqio.preprocessors.tokenize,
        # seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={"targets": DEFAULT_OUTPUT_FEATURES["targets"]},
    metric_fns=[]
)

### similar multiple languages are loaded here ###

# seqio mixture 3.5
seqio.MixtureRegistry.add(
    "ciil_mix_3.5",
    ["assamese_span_curruption", "bengali_span_curruption",
     "bhisnupuriya_span_curruption", "bodo_span_curruption",
     "divehi_span_curruption", "dogri_span_curruption",
     "english_span_curruption", "gujarati_span_curruption",
     "hindi_span_curruption", "kannada_span_curruption",
     "kashmiri_span_curruption", "konkani_span_curruption",
     "maithili_span_curruption", "malayalam_span_curruption",
     "manipuri_span_curruption", "marathi_span_curruption",
     "nepali_span_curruption", "odia_span_curruption",
     "panjabi_span_curruption", "sanskrit_span_curruption",
     "tamil_span_curruption", "telugu_span_curruption",
     "urdu_span_curruption"],
    default_rate=3.5
)
Upon running the mT5 model with the mixture name in the .gin file, I get the following error:
File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/train.py", line 744, in _main
train_using_gin()
File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/home/stephen/anaconda3/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/train.py", line 249, in train
train_ds = get_dataset_fn(train_dataset_cfg, ds_shard_id, num_ds_shards,
File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/utils.py", line 1366, in get_dataset
return get_dataset_inner(cfg, shard_info, feature_converter_cls, seed,
File "/home/stephen/Desktop/t5x_final_test/t5x/t5x/utils.py", line 1387, in get_dataset_inner
ds = seqio.get_dataset(
File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1681, in get_dataset
ds = feature_converter(
File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/feature_converters.py", line 404, in __call__
ds = self._validate_dataset(
File "/home/stephen/anaconda3/lib/python3.9/site-packages/seqio/feature_converters.py", line 294, in _validate_dataset
raise ValueError("Dataset is missing an expected feature during "
ValueError: Dataset is missing an expected feature during input_validation validation: 'inputs'
Is there a way to concatenate multiple Tasks? Mixtures sample from component Tasks until one of them runs out of examples. Is there a variant that uses all of the examples from both Tasks in each epoch?
import seqio
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.7/dist-packages/seqio/__init__.py", line 18, in
from seqio.dataset_providers import *
File "/usr/local/lib/python3.7/dist-packages/seqio/dataset_providers.py", line 38, in
import pyglove as pg
File "/usr/local/lib/python3.7/dist-packages/pyglove/__init__.py", line 30, in
from pyglove.core import *
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/__init__.py", line 56, in
from pyglove.core import symbolic
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/symbolic/__init__.py", line 93, in
from pyglove.core.symbolic.diff import diff
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/symbolic/diff.py", line 153, in
(pg_typing.StrKey(), pg_typing.Object(Diff), 'Child node.')
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/value_specs.py", line 1279, in __init__
schema_or_field_list, allow_nonconst_keys=True)
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/class_schema.py", line 1179, in create_schema
value = ValueSpec.from_annotation(maybe_value_spec, True)
File "/usr/local/lib/python3.7/dist-packages/pyglove/core/typing/value_specs.py", line 2131, in _from_annotation
origin = typing.get_origin(annotation)
AttributeError: module 'typing' has no attribute 'get_origin'
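For context, typing.get_origin was added in Python 3.8, so this import chain cannot succeed on the Python 3.7 interpreter shown in the paths above. A small sketch of a pre-flight check one could run before importing the library; supports_get_origin and check_python are hypothetical helper names.

```python
import sys
import typing


def supports_get_origin() -> bool:
    """True if this interpreter's typing module provides get_origin."""
    return hasattr(typing, "get_origin")


def check_python(min_version=(3, 8)) -> str:
    """Return a short human-readable verdict for the running interpreter."""
    current = sys.version_info[:2]
    if current < min_version:
        return f"Python {current[0]}.{current[1]} is too old; need >= {min_version[0]}.{min_version[1]}"
    return f"Python {current[0]}.{current[1]} is new enough"


print(supports_get_origin(), check_python())
```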
During creation it checks that the function has only 2 positional arguments. For shuffling to be used, it should also accept a third argument, seed or seeds. Otherwise an exception is thrown when trying to pass shuffle=True to get_dataset().
seqio/seqio/dataset_providers.py
Line 341 in 71e47ac
Also, it only allows seed and not seeds later. But this never comes into effect since the whole thing fails during creation.
seqio/seqio/dataset_providers.py
Line 373 in 71e47ac
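To make the request concrete, here is a sketch of a more permissive signature check: it requires split and shuffle_files first, tolerates an optional seed, and allows any further arguments that have defaults. This is not the seqio implementation; accepts_seed and the sample functions are hypothetical.

```python
import functools
import inspect


def accepts_seed(dataset_fn) -> bool:
    """Validate a dataset_fn signature and report whether it takes `seed`."""
    params = inspect.signature(dataset_fn).parameters
    names = list(params)
    if names[:2] != ["split", "shuffle_files"]:
        raise ValueError(f"expected (split, shuffle_files, ...), got {names}")
    for name in names[2:]:
        p = params[name]
        is_var = p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
        # Anything beyond the first two must be `seed` or be optional.
        if name != "seed" and p.default is p.empty and not is_var:
            raise ValueError(f"unexpected required argument: {name!r}")
    return "seed" in params


def two_arg_fn(split, shuffle_files):
    return []


def seeded_fn(split, shuffle_files, seed=None, dataset_path=None):
    return []


bound = functools.partial(seeded_fn, dataset_path="some/path")
print(accepts_seed(two_arg_fn), accepts_seed(bound))
```

With a check like this, a source could refuse shuffle=True only when the function really cannot take a seed, instead of failing at construction time.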
I am having a difficult time getting my data pipeline to the throughput levels that I would like before starting training with the t5x library.
Initially I planned to use a mixture of ~40 tasks (1-2 TB of text) for training and started doing some benchmarking following general TPU and dataset performance tips. Here are some useful guides that I tried to follow:
All of my datasets/tasks are JSON Lines files (output from earlier Dataflow jobs), varying from 200 to 1000 files each.
I used Colab notebooks or an E2 32-CPU instance during my benchmarking experiments, where I mounted the bucket that holds all ~40 datasets I plan to use. I sampled 16 different files as training files for each task source because it is recommended not to read too many files from GCS.
I switched from FunctionDataSource to FileDataSource. This is mainly to use individual files during sharding without needing to read all the data, which I assume would be slower, especially for larger datasets.
import json

@tf.autograph.experimental.do_not_convert
def read_file_fn(file):
    """
    """
    def _read_json(file):
        # file = file.numpy().decode()
        with open(file) as f:
            for line in f:
                yield json.loads(line)['text']
    return tf.data.Dataset.from_generator(
        _read_json, args=(file,),
        output_signature=tf.TensorSpec(
            shape=(), dtype=tf.string, name=name)
    )

source = seqio.FileDataSource(
    read_file_fn=read_file_fn,
    split_to_filepattern=dict(train=train_files, validation=validation_files))
Here we can see the reading and deserialization performance of a single task source.
dataset = source.get_dataset("train", shard_info=seqio.ShardInfo(0,16))
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 1622.67 ex/sec (total: 10001 ex, 6.16 sec)
Examples/sec (First only) 0.95 ex/sec (total: 1 ex, 1.05 sec)
Examples/sec (First excluded) 1954.66 ex/sec (total: 10000 ex, 5.12 sec)
Then I register my seqio tasks with full pipeline (including preprocessors) and test the performance of a single task.
dataset = seqio.get_mixture_or_task('task').get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=False,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=16),
    use_cached=False,
    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 485.21 ex/sec (total: 10001 ex, 20.61 sec)
Examples/sec (First only) 0.47 ex/sec (total: 1 ex, 2.11 sec)
Examples/sec (First excluded) 540.50 ex/sec (total: 10000 ex, 18.50 sec)
When I benchmark the performance of the mixture it drops significantly (10x).
dataset = seqio.get_mixture_or_task("maana_version1.0_mixture").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=False,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=16),
    use_cached=False,
    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 140.55 ex/sec (total: 10001 ex, 71.16 sec)
Examples/sec (First only) 0.09 ex/sec (total: 1 ex, 11.49 sec)
Examples/sec (First excluded) 167.60 ex/sec (total: 10000 ex, 59.67 sec)
Please let me know if you have any feedback regarding the following comments and questions:
In my experiments, reading from GCS vs. local files didn't differ much. So streaming directly from GCS is probably the better option (no need to download TB-scale data), as long as the bucket is in the same zone as the TPU and the number of files is not too large. The documents recommend file sizes of 10s to 100s of MB and 10s to 100s of files; in my case I have datasets with 200-1000 files (in the 100 MB-1 GB range). Should I reduce the number of files, maybe by making each 1 GB? Would this help pipeline performance?
I also experimented with TFExampleDataSource vs FileDataSource and didn't see any performance gain from TFExample compared to JSON. Is there an absolute best way to store data for seqio pipeline performance, e.g. would registering a TFDS dataset be better, as explained here? In my experience Dataflow jobs output a number of files equal to the number of workers, so it can be much higher than 100s. Is this OK, or should we keep the number of files in the 128-256 range?
This is more of a T5X question but still might be related. My understanding is that when we get a dataset from a mixture, each task is iterated and, if shard info is specified, that shard is returned as data; later the same sample_fn is used for sampling from these task datasets with the given rates. I don't fully know how data parallelism plays together with model parallelism in t5x, and it might depend on the model size and the number of TPU cores we have. Is it correct to assume each TPU core is a worker and data gets distributed to them when sharding? If so, would it make sense to have as many files as a multiple of the number of cores (e.g. 8x for v3-8, 32x for v3-32)? I also read that the batch is automatically distributed across TPU cores when doing computation, which I guess is why 8 x 128 is emphasized. Does that mean we don't necessarily need to care about the number of files / sharding and can still use a single source file?
Notes from codelab:
The rule of thumb is to split your data across several (10s to 100s) large-ish files (10s to 100s of MB). If you have too many files, thousands of files for example, the time to access each file might start getting in the way. If you have too few files, like one or two, then you are not getting the benefits of streaming from multiple files in parallel.
Hi,
I'm working on an STS task, using a task with a seqio.TfdsDataSource and t5.data.preprocessors.stsb as the preprocessor. After seeing some generated examples I realized that both training and eval data are preprocessed, so metrics calculated on the test split use a processed version of the target instead of the original values. Using t5.data.preprocessors.stsb as an example, during training a label value of 3.25 is converted to "3.2"; during eval steps, I'd like to just convert this value to a float without rounding ("3.25").
Is there a way to apply different preprocessors for each dataset split? It would be ideal for the evaluation metric functions to consume gold labels as close as possible to the original values. The postprocessing section of the README mentions an is_target argument for postprocessing functions, but I couldn't find a similar instruction for preprocessor functions.
Thanks,
Marcos
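To illustrate the rounding described in this question: the "3.2" result matches rounding the label to the nearest 0.2 and printing one decimal, which is how I understand the stsb preprocessor to build its target string. The sketch below reproduces that in plain Python and shows a split-dependent variant. target_for_split and both helpers are hypothetical, not part of t5 or seqio.

```python
def stsb_train_target(label: float) -> str:
    """Round to the nearest 0.2 and format with one decimal place."""
    return f"{round(label * 5) / 5:.1f}"


def stsb_eval_target(label: float) -> str:
    """Keep the original value, only converting it to text."""
    return str(label)


def target_for_split(label: float, split: str) -> str:
    # Hypothetical split-aware preprocessing: lossy for train, exact otherwise.
    return stsb_train_target(label) if split == "train" else stsb_eval_target(label)


print(target_for_split(3.25, "train"), target_for_split(3.25, "validation"))
```

An alternative that needs no split-aware preprocessing is to carry the unrounded label through the pipeline in a separate feature and have the metric function read that.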
Hey there, I've been wanting to pretrain mT5 with the Hugging Face training script mentioned here: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py
Unfortunately, the Hugging Face script doesn't support a mixture for pretraining mT5 in such a way that the model generalises well on low-resource as well as high-resource languages.
Hence I've been wanting to use the mixture functionality of seqio, but when using it I have to tokenize the data with the T5 SentencePiece vocabulary, and the seqio tasks do all the preprocessing.
The Hugging Face trainer takes care of the preprocessing, mapping the dataset to the tokenizer, etc.
My question is: is there a way to use only the mixture functionality of seqio without actually doing any preprocessing on the incoming datasets?
I was wondering if there is a way to feed in multiple datasets and get an output dataset (in text str format) that is simply an appropriate mixture of the samples of all the datasets, as produced by the mixture function. I could then use it to pretrain with the HF trainer and do all the preprocessing there.
What is the best way to decide which mixture ratio is optimal?
In the mT5 paper, the alpha value 0.3 gave the best balance between performance on high- and low-resource languages.
However, I am pretraining mT5 on Indian languages, and I have a diverse multilingual Indian corpus in which Hindi has 60M+ samples and Kashmiri has around 100k samples.
So I wanted to know if I could h-param tune somehow on t5x, or would just using alpha=0.3 work fine in my use case?
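To get a feel for what alpha does before tuning it, one can compute the sampling rates directly. The mT5 paper samples language L with probability proportional to |L|^alpha; the sketch below applies that to the two sizes mentioned above. sampling_rates is a hypothetical helper, and the sizes are the rough figures from this question.

```python
def sampling_rates(sizes: dict, alpha: float) -> dict:
    """Temperature-style sampling: p(L) proportional to |L| ** alpha."""
    weights = {name: n ** alpha for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


sizes = {"hindi": 60_000_000, "kashmiri": 100_000}
for alpha in (1.0, 0.7, 0.3):
    rates = sampling_rates(sizes, alpha)
    print(alpha, {name: round(rate, 4) for name, rate in rates.items()})
```

Lower alpha flattens the distribution, so the small language is sampled more often and therefore repeated for more epochs. The resulting rates could be supplied as per-task rates in a mixture.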
Re-opening here as suggested by @adarob in google-research/t5x#421 (comment).
I wrote some hacky support for HuggingFace datasets using seqio.FunctionDataSource, specifically for pretraining and further pretraining models using T5X.
def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:  # TODO: add for...loop over num_epochs
        for item in dataset[str(split)]:
            yield item[column]

def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )

dataset_name = 'NbAiLab/NCC'
dataset_params = {"path": dataset_name, "streaming": True}
dataset_shapes = {"train": 20830348, "validation": 473079}
source = seqio.FunctionDataSource(
    dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
    splits=("train", "validation"),
    caching_permitted=False,
    num_input_examples=dataset_shapes,
)
But unfortunately, as I face constant random crashes during training (google-research/t5x#366), I need a way to seek to the right dataset batch to properly continue training.
I see there's a continue_from_last_checkpoint variable in get_dataset(), but it seems it is not used for anything yet.
Is there a way to pass in the needed information to get_dataset_fn()
so I can write the logic without using any hard-coded global variables?
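Independent of how the step number reaches the data pipeline, the seeking itself can be sketched as skipping the examples already consumed. This is a plain-Python illustration under the assumption that the stream is deterministic (same seed and order after restart). resume_stream and examples_consumed are hypothetical names.

```python
import itertools


def examples_consumed(step: int, batch_size: int) -> int:
    """Number of examples the trainer has already read at a given step."""
    return step * batch_size


def resume_stream(example_iter, skip: int):
    """Lazily drop the first `skip` examples and yield the rest."""
    return itertools.islice(example_iter, skip, None)


# Pretend the checkpoint was written at step 3 with batch size 8.
stream = iter(range(100))
resumed = resume_stream(stream, examples_consumed(step=3, batch_size=8))
first_after_resume = next(resumed)
print(first_after_resume)
```

Skipping still has to read through the dropped examples, so for a streaming source it costs time proportional to the number of examples skipped.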
Hi,
I have a task that uses seqio.TfdsDataSource as its source and a preprocessor pipeline whose final steps look like this: [..., seqio.preprocessors.tokenize, seqio.CacheDatasetPlaceholder(), seqio.preprocessors.append_eos_after_trim].
I have cached this task, so I know the maximum token lengths for both inputs and targets.
My question is: when training a model with t5.models.mesh_transformer_main using this task and providing gin bindings for utils.run.sequence_length, should I use the values I see in the cached stats, or should I add 1 to account for the EOS token? My goal is to avoid the data truncation caused by specifying smaller sequence lengths than my data requires.
(P.S.: I know this is also related to the t5 repository, but I opened the issue here because I think my question is related to the seqio.preprocessors.append_eos_after_trim function. If you think it would be more appropriate to open this issue in another repository, please let me know, and I can change it.)
Thanks in advance,
Marcos
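A small sketch of the arithmetic, assuming the preprocessor behaves as its name suggests: it trims to sequence_length - 1 tokens and then appends EOS, so the result never exceeds sequence_length. Under that assumption, the cached (pre-EOS) maximum length plus 1 is the smallest setting that avoids truncation. trim_then_append_eos is a hypothetical stand-in, not the seqio function.

```python
from typing import List, Optional


def trim_then_append_eos(ids: List[int], max_len: Optional[int],
                         eos_id: int = 1) -> List[int]:
    """Trim to max_len - 1 tokens, then append EOS so the result fits max_len."""
    if max_len is not None:
        ids = ids[:max_len - 1]
    return ids + [eos_id]


tokens = [11, 12, 13, 14, 15]  # cached length without EOS: 5

same_len = trim_then_append_eos(tokens, max_len=5)
plus_one = trim_then_append_eos(tokens, max_len=6)
print(same_len, plus_one)
```

With max_len equal to the cached length, the last real token is dropped to make room for EOS; with one extra position, everything survives.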
SeqIO provides access to TFDS through TfdsDataProvider, which takes tfds_data_dir as an argument. However, it is not currently possible to use a non-None tfds_data_dir with TfdsDataProvider.
The issue can be traced to LazyTfdsLoader, which uses tfds.load with the hardcoded setting try_gcs=True. As noted in the TFDS docs, this is equivalent to setting data_dir='gs://tfds-data/datasets'. Consequently, TFDS raises an error when passing try_gcs=True and a non-None data_dir to tfds.load, as would occur when using a non-None tfds_data_dir with TfdsDataProvider.
I believe allowing a non-None tfds_data_dir would be helpful in many scenarios. For example, many large datasets available through TFDS are hosted in locations other than gs://tfds-data/datasets, and in formats other than tfrecords. When downloading and preprocessing such datasets on preemptable VMs it is desirable to specify a data_dir to allow one to save tfrecords to the cloud as detailed here. This allows users to avoid incurring the full download/processing delay on subsequent occasions: only some tfrecord shards need to be downloaded per host, and the downloads can overlap with model training. In this case, however, one must set try_gcs=False to avoid the TFDS error.
Rather than exposing the try_gcs option to the user, the implementation of LazyTfdsLoader can automatically set try_gcs=False when data_dir is not None. This way, it would be the user's responsibility to specify a tfds_data_dir or not when instantiating TfdsDataProvider, just like they are nominally able to do right now. The only downside is that if the data is available as tfrecords at gs://tfds-data/datasets and the user specifies their own data_dir, try_gcs=False forces a potentially unnecessary download. However, a warning can be added to the docstring to mention this consequence of specifying tfds_data_dir.
Can we make this happen? I can open a PR with this simple change!
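The proposed change is small enough to sketch in isolation. The helper below only builds the keyword arguments that would be handed to tfds.load, so it runs without TFDS installed. tfds_load_kwargs is a hypothetical name and this is not the LazyTfdsLoader code.

```python
from typing import Any, Dict, Optional


def tfds_load_kwargs(name: str, data_dir: Optional[str] = None) -> Dict[str, Any]:
    """Build tfds.load kwargs, enabling try_gcs only when no data_dir is given."""
    return {
        "name": name,
        "data_dir": data_dir,
        # try_gcs and an explicit data_dir conflict, so derive one from the other.
        "try_gcs": data_dir is None,
    }


default_kwargs = tfds_load_kwargs("c4/en:3.0.1")
custom_kwargs = tfds_load_kwargs("c4/en:3.0.1", data_dir="gs://my-bucket/tfds")
print(default_kwargs["try_gcs"], custom_kwargs["try_gcs"])
```

The bucket path is a placeholder; the point is only that the user never sets try_gcs directly.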
Suppose I have a task registered as follows:
seqio.TaskRegistry.add(
    "task_1",
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1", splits=["train", "validation"]),
    preprocessors=[
        preprocess1,
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocess2,
        preprocess3,
    ],
    output_features=...
Is it possible to add another task that starts with the cached part of task_1, i.e., the part before seqio.CacheDatasetPlaceholder(), and only vary preprocess2 and preprocess3?
I'm looking for something like
seqio.TaskRegistry.add(
    "task_2",
    source=seqio.CachedTaskSource("task_1", ...),
    preprocessors=[
        preprocess2_modified,
        preprocess3_modified,
    ],
    ...
Hello.
I would like to use a Hugging Face tokenizer as a seqio.vocabulary in t5x.
I inherited from seqio.vocabulary and created my BBPEVocabulary. However, the 'inputs' and 'targets' values are not accessible as text inside tf.data.Dataset.map, because the Hugging Face tokenizer expects a string but tf.data.Dataset provides a tf.Tensor like Tensor("args_0:0", shape=(), dtype=string).
Since the seqio sentencepiece module loads its implementation from an .so file in tf_text.sentencepiece, I don't know how this is handled internally.
I would like to ask how to get and process a tf.Tensor as text in order to use a Hugging Face tokenizer in tf.data.Dataset.map.
I am attaching the code I used below.
Thank you :)
seqio/custom_task.py
from src.vocabularies import BBPEVocabulary

bbpe_vocab = BBPEVocabulary('custom_path')
seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": os.path.join('[MY_TF_RECORD_PATH]', "*train.tfrecord*")},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)}
    ),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=BBPE_OUTPUT_FEATURES,
    metric_fns=[])
seqio/preprocessors.py
def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
    tokenize_fn = functools.partial(
        tokenize_impl,
        output_features=output_features,
        copy_pretokenized=copy_pretokenized,
        with_eos=with_eos)
    return utils.map_over_dataset(fn=tokenize_fn)(dataset)

def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
    ret = {}
    for k, v in features.items():
        if k in output_features:
            if copy_pretokenized:
                ret[f'{k}_pretokenized'] = v
            vocab = output_features[k].vocabulary
            v = vocab.encode_tf(v)  # In this line, the `v` value type is "tf.tensor", and I can't obtain text of `v`
            ...[omitted]...
        ret[k] = v
    print(f'tokenize_impl | complete | return : {ret}')
    return ret
Would it be possible to add an example of how to run evaluation? The Evaluator section of the README is currently empty. Thank you!
When trying to cache tasks from the magenta/MT3 repository, even with only 200 examples it takes around 30GB of memory at the very end of processing.
Without caching, it trains just fine even with a 1000-example train dataset.
Hi,
I would like to pre-train a T5 Base model with the T5X library.
If I understand the pre-training process correctly, I need TFRecords stored in a cloud bucket for that training (as is done for BERT pre-training).
Now I have the following question:
How is it possible to generate such a dataset from my own corpus? The corpus is a plain text file (each line = one sentence). I also have a T5-compatible vocab (SentencePiece model), because I don't want to use the existing T5 or mT5 vocabs.
Many thanks in advance!
I tried searching the code for seqio mixtures generated using the newly released unimax sampler.
I am trying to pretrain a custom umUL2 model. if i could perhaps know how to implement unimax it would be of great help. Thanks
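In case it helps while waiting for an official implementation, here is a sketch of UniMax as I read the algorithm in the paper: visit languages from smallest to largest, give each an equal share of the remaining character budget, cap that share at max_epochs passes over the language's data, then normalize. This is my own reading, not seqio or paper code. unimax_rates and the toy counts are hypothetical.

```python
def unimax_rates(char_counts: dict, budget: float, max_epochs: float) -> dict:
    """Uniform budget split with a per-language cap of max_epochs repetitions."""
    langs = sorted(char_counts, key=char_counts.get)  # smallest corpus first
    remaining = budget
    allocation = {}
    for i, lang in enumerate(langs):
        share = remaining / (len(langs) - i)   # equal split of what is left
        cap = char_counts[lang] * max_epochs   # at most max_epochs passes
        allocation[lang] = min(share, cap)
        remaining -= allocation[lang]
    total = sum(allocation.values())
    return {lang: amount / total for lang, amount in allocation.items()}


# Toy character counts; the budget and epoch cap are arbitrary.
counts = {"hi": 1000, "ks": 10, "ur": 200}
rates = unimax_rates(counts, budget=900, max_epochs=4)
print({lang: round(rate, 4) for lang, rate in rates.items()})
```

In this toy run the smallest language is capped at 4 epochs (40 characters of budget) and the other two split the rest evenly. The resulting rates could then be passed as per-task rates when registering a mixture.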