Comments (8)
@versae hey there, did you make any progress on this? I'm also facing issues making SeqIO compatible with Hugging Face for pre-training LMs. If possible, could you also please send me a link to your HF pretraining script?
from seqio.
Hi @StephennFernandes, no, not really 😢 Right now we have all our code here, but we do a manual calculation (a guesstimate) of how many samples to skip at each restart.
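For reference, the "skip at restart" approach described above can be sketched in plain Python. This is a minimal illustration, not the actual code from the linked repo; the function name and the toy stream are made up:

```python
import itertools


def resume_stream(example_gen, num_seen):
    """Resume a data stream after a restart by skipping the examples
    already consumed in previous runs (num_seen is the guesstimate)."""
    return itertools.islice(example_gen, num_seen, None)


# Toy stream standing in for the real data pipeline.
stream = iter(range(10))
resumed = list(resume_stream(stream, 4))  # examples 4..9 remain
```

The obvious downside, and the reason the deterministic-pipeline docs below are preferable, is that the skip count is only an estimate and any shuffling must replay identically across restarts.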
Hey there, thanks a ton for replying. Sadly, the repo you linked isn't available; apparently it's private. Could you please make it public or otherwise accessible to me?
Sorry, can't. But here's a similar one I've also been working on: https://github.com/bertin-project/bertin-t5x
Hi, please take a look at https://github.com/google-research/t5x/blob/main/docs/usage/pretrain.md#deterministic-training-no-toc, which has instructions for making your data pipeline deterministic, i.e. reproducible and recoverable.
@gauravmishra, SeqIO's dataset_fn returns tf.data.Dataset.from_generator(...), but I need the output from SeqIO to be compatible with Hugging Face's Transformers training script. Is there a way to return it in some other format that's compatible with Hugging Face training? BTW, I would be making a mixture of multiple languages.
Hi Stephen, currently SeqIO only supports tf.data.Datasets as Task/Mixture outputs. The way to go would be to create a shim to convert tf.data.Datasets into a HF-compatible format (this may already exist, but I'm not sure).
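One minimal shim along those lines (a sketch, not an existing SeqIO API): wrap the numpy iterator and decode the byte strings so downstream Hugging Face code sees plain Python dicts of str. The function name, and the assumption that text features arrive as UTF-8 bytes, are mine:

```python
def hf_compatible_examples(numpy_iterator):
    """Convert examples from tf.data's as_numpy_iterator() (dicts whose
    string features are numpy bytes) into dicts of Python strings."""
    for example in numpy_iterator:
        yield {
            key: value.decode("utf-8") if isinstance(value, bytes) else value
            for key, value in example.items()
        }


# With a real SeqIO task you would pass
# seqio.get_mixture_or_task(...).get_dataset(...).as_numpy_iterator();
# here a plain list of byte dicts stands in for it.
fake_batch = [{"targets": b"some raw text"}]
decoded = list(hf_compatible_examples(fake_batch))
```

If a materialized HF dataset is needed rather than a generator, recent versions of the datasets library can consume such a generator via datasets.Dataset.from_generator.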
@gauravmishra, thanks for replying. Actually, I have built a hacky way of returning the output from seqio.get_mixture_or_task().get_dataset() as .as_numpy_iterator(), which gives me numpy values. The following is the code:
import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(
        vocabulary=t5.data.get_default_vocabulary(), add_eos=True,
        required=False),
    "targets": seqio.Feature(
        vocabulary=t5.data.get_default_vocabulary(), add_eos=True)
}


def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]


def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed,
                          dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )


@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map."""
    return {**key_map, target_key: x}


dataset_name = 'oscar-corpus/OSCAR-2109'
subset = 'mr'
dataset_params = {"path": dataset_name, "language": subset, "use_auth_token": True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key, key_map={
                "inputs": None,
                "targets": None,
            }, target_key="targets"),
        seqio.preprocessors.tokenize,
        # seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "targets": seqio.Feature(vocabulary=t5.data.get_default_vocabulary(), add_eos=True)
    },
    metric_fns=[]
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length={"inputs": 512, "targets": 512},
    split="train",
    shuffle=True,
    num_epochs=1,
    use_cached=False,
    seed=42
)

for _, ex in zip(range(5), dataset.as_numpy_iterator()):
    print(ex)
But the thing is, it returns the values as input IDs, after all the preprocessing has been applied to the dataset, whereas the Hugging Face T5 trainer takes care of all the preprocessing and other steps itself. I actually need the output as raw text strings, which I could then preprocess in the Hugging Face training script. I only want to use the mixture functionality from SeqIO, avoiding all the preprocessing, tokenization, etc.
In summary: I need a way to feed in raw text samples from multiple languages, use SeqIO's mixture, and get back an iterator that outputs samples mixed across all the languages (in raw text form).
Is there a way of actually obtaining that?
If not, do you know of any way I could get the mixture functionality without using SeqIO?
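On the last question, a raw-text mixture can be approximated without SeqIO using tf.data alone: tf.data.Dataset.sample_from_datasets interleaves several datasets according to sampling weights, which is the core of what a SeqIO mixture does. A sketch with made-up per-language toy corpora and rates:

```python
import tensorflow as tf

# Toy raw-text datasets standing in for per-language corpora.
marathi = tf.data.Dataset.from_tensor_slices(["mr sample"] * 100)
hindi = tf.data.Dataset.from_tensor_slices(["hi sample"] * 100)

# Interleave the languages at the given sampling rates,
# analogous to a SeqIO mixture over raw-text tasks.
mixture = tf.data.Dataset.sample_from_datasets(
    [marathi, hindi], weights=[0.5, 0.5], seed=42)

# The result is an iterator over raw text strings from both languages.
samples = [s.numpy().decode("utf-8") for s in mixture.take(10)]
```

This skips SeqIO's rate heuristics (e.g. temperature-scaled mixing), so the weights have to be computed by hand, but the output stays in raw text form for the Hugging Face trainer to preprocess.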