Comments (20)

NaxAlpha commented on June 17, 2024

Just to add here (it is a hypothesis, but): a TPU has a normal CPU (ARM- or x86-based) alongside the accelerator (the matrix-multiplication units), and the cloud console does not show how much the accelerator is being used, only how much the CPU is being used.

In the case of T5, the TF-DS graph is uploaded to the TPU; the TPU's CPU executes the non-deep-learning ops, like loading from GCS and preprocessing/tokenizing, and the accelerator then does the deep-learning training itself.

If you want to see how much the TPU's accelerator is being used, you can use TPU profiling. In my case the CPU was at <0.1% utilization but the accelerator was at ~45%.
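
For reference, here is a minimal sketch of grabbing a TPU trace programmatically; it assumes a TF 2.x environment, and the worker address and bucket below are placeholders, not anything specific to this setup:

    import tensorflow as tf

    # Placeholders: the TPU worker's profiler endpoint (port 8466) and a GCS
    # bucket you own for the profile logs.
    SERVICE_ADDR = "grpc://10.240.1.2:8466"
    LOGDIR = "gs://my-bucket/profile-logs"

    # Capture a 2-second trace; inspect the result in TensorBoard's Profile tab,
    # which breaks down time spent on the TPU cores vs. the host/input pipeline.
    tf.profiler.experimental.client.trace(SERVICE_ADDR, LOGDIR, duration_ms=2000)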

adarob commented on June 17, 2024

adarob commented on June 17, 2024

This means the model must be I/O bound, due in part to its small size. We do tokenization and packing on the fly by default. I have a TODO to add support for caching smaller datasets in memory post-tokenization. Let me see if I can get to it today and you can try it out.
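
Roughly, the idea is the standard tf.data pattern of caching right after the expensive, deterministic preprocessing. A minimal sketch (the function names here are illustrative, not the actual T5 pipeline):

    import tensorflow as tf

    def build_train_dataset(ds: tf.data.Dataset, tokenize_fn) -> tf.data.Dataset:
        # Tokenization is the expensive, deterministic step, so do it once...
        ds = ds.map(tokenize_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        # ...then cache the tokenized examples in memory: the first epoch pays
        # the preprocessing cost, later epochs read straight from RAM.
        ds = ds.cache()
        # Shuffling, packing, and batching stay after the cache so they still
        # vary from epoch to epoch.
        ds = ds.shuffle(10_000)
        ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
        return ds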

anatoly-khomenko commented on June 17, 2024

Thank you @adarob, let me know if I can help you implement something. I'm running T5 for one of my current work tasks and I'm eager to make it train faster.

adarob commented on June 17, 2024

Can you try using the latest commit to see if this improves things? It will cache the dataset on the first pass, so it will be much faster after that.

anatoly-khomenko commented on June 17, 2024

@adarob
Hi Adam,
Thank you for the update. I'm running it now (I had to update Python to 3.6 to be able to run the latest version, which took a good part of the day).
From what I can see, it did not improve: the timing is pretty much the same (global_step/sec: 0.401189, examples/sec: 821.635), but strangely the TPU load decreased to 1%.

[screenshot: TPU utilization]

t5-copybara commented on June 17, 2024

anatoly-khomenko commented on June 17, 2024

Hi @t5-copybara,
As far as I understand, it is here:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L654

I will comment out the condition and keep ds = ds.cache().

Please, let me know if this is the correct approach.

Thank you!

anatoly-khomenko commented on June 17, 2024

@t5-copybara,

I have found that I can specify use_cached in the command-line parameters like this:

--gin_param="mesh_train_dataset_fn.use_cached = True"

This is the complete line:

t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="dataset.gin" \
  --gin_param="mesh_train_dataset_fn.use_cached = True" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="MIXTURE_NAME = 'super_glue_boolq_v102'" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

When I run it though, I get the exception here:

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L602

Do you know where I can specify cache directories?

anatoly-khomenko commented on June 17, 2024

OK, I have found out that I can specify cache directories using:
--additional_task_cache_dirs="${CACHE_DIR}"
but in that case the cache still does not get created.

This is the message I get:

22:18:44.970715 140004609750016 utils.py:584] 'super_glue_boolq_v102' does not exist in any task cache directories (searched ['gs://uniquebucketname/t5-boolq-data-dir-cache/super_glue_boolq_v102']).

Giving up for now.

adarob commented on June 17, 2024

The offline use_cached functionality is only supported on our internal infrastructure for the time being. What I added for you is something that will do the caching on the fly. You should be able to explicitly enable it as you mentioned above ("I will comment out the condition and keep ds = ds.cache()"). Are you sure you're actually using this new code when you run, and not what's in the pip package?
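
To illustrate the effect (this is a self-contained toy, not the repo's actual code): forcing an unconditional ds.cache() means the expensive per-example preprocessing only runs on the first pass over the data.

    import tensorflow as tf

    calls = []

    def expensive_preprocess(x):
        # Stand-in for tokenization/packing; the Python side effect lets us
        # count how many times it actually runs.
        def _count(v):
            calls.append(1)
            return v
        return tf.py_function(_count, [x], x.dtype)

    ds = tf.data.Dataset.range(5).map(expensive_preprocess)
    ds = ds.cache()  # cache unconditionally, regardless of any size check

    for _ in range(3):  # three "epochs"
        list(ds.as_numpy_iterator())

    print(len(calls))  # 5, not 15: epochs 2 and 3 are served from the cache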

anatoly-khomenko commented on June 17, 2024

@adarob
Hi Adam,
I'll give it a try on another run. But I see several places with ds = ds.cache(); do I have to make the change in all of them?

I'm using the latest master, installed from source with the command:

pip install --upgrade -e ./text-to-text-transfer-transformer

I see all the recent fixes there, so I'm pretty sure I'm using the most recent version.
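
A quick way to double-check which copy is actually being imported (assuming the package is importable as t5):

    import t5

    # Should point into the editable checkout, not the site-packages copy.
    print(t5.__file__)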

Will let you know as soon as I run it again.

f-lng commented on June 17, 2024

I am currently fine-tuning the large model on a 6 GB TSV file and get a TPU usage of <1%. Anything new here?

anatoly-khomenko commented on June 17, 2024

@f-lng, @adarob,
I have ended up using the notebook provided here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb

It seems to use the TPU more effectively: training that took more than 12 hours with the script from this issue completes in less than 4 hours with the notebook.

The dataset is under 500 MB, though.

@adarob, thank you for providing the notebook!

f-lng commented on June 17, 2024

@adarob I was not aware that the size of the TSV file could be an issue; I assumed the code would just read chunks of it. Thank you for clarifying; I will try to pre-shard it.
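
Something like the following is what I have in mind for the pre-sharding: plain Python with placeholder file names, nothing T5-specific:

    # Split one large TSV into NUM_SHARDS smaller files, round-robin by line,
    # so each shard gets a similar amount of data. Paths are placeholders.
    NUM_SHARDS = 64

    with open("data/train.tsv", "r", encoding="utf-8") as src:
        shards = [
            open(f"data/train-{i:05d}-of-{NUM_SHARDS:05d}.tsv", "w", encoding="utf-8")
            for i in range(NUM_SHARDS)
        ]
        try:
            for line_num, line in enumerate(src):
                shards[line_num % NUM_SHARDS].write(line)
        finally:
            for shard in shards:
                shard.close()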

@anatoly-khomenko Thanks for letting me know; I will have a look at the notebook as well.

adarob commented on June 17, 2024

f-lng commented on June 17, 2024

@adarob I have now taken 7,000 examples from my dataset and precomputed 7 TFRecord files from them.

To create the TFRecords I directly used the T5 'Task' ( https://pastebin.com/36pG5ne4 ).

I then adjusted your _get_cached_dataset function (and made sure it's called) to load them ( https://pastebin.com/cvTzxEXh ). The debug print shows up, so the function is called and the in-memory caching is also working.

I am using code adapted from your notebook to train the model ( https://pastebin.com/8am5S5H2 ).

However, I am still getting a speed of ~50 examples/second and a TPU-CPU usage of <= 0.15% (sic) during training most of the time, with some spikes (1-2%).

[screenshot: TPU utilization]

(The naive approach of feeding my huge TSV to the command-line util gave me ~90 examples/sec.)

I do not have a lot of experience with the TF ecosystem apart from some hacking around in Tensor2Tensor, and none with TPUs, so perhaps I am missing something important?!

By the way, I just checked: the buckets, the TPU, and the VM are all in us-central1(-a), and it is a TPU v3-8.
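
For context, this is roughly how I am loading the precomputed shards; the file pattern and feature names here are simplified stand-ins, not the exact code from the pastebins above:

    import tensorflow as tf

    # Simplified stand-in for the TFRecord loading + in-memory caching described
    # above; the path and feature spec are placeholders.
    FILE_PATTERN = "gs://my-bucket/cache/train.tfrecord-*"

    feature_spec = {
        "inputs": tf.io.VarLenFeature(tf.int64),
        "targets": tf.io.VarLenFeature(tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        return {k: tf.sparse.to_dense(v) for k, v in parsed.items()}

    files = tf.io.gfile.glob(FILE_PATTERN)
    ds = tf.data.TFRecordDataset(
        files, num_parallel_reads=tf.data.experimental.AUTOTUNE)
    ds = ds.map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.cache()  # keep the parsed examples in memory after the first pass
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)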

f-lng commented on June 17, 2024

@adarob I did another experiment: I set tokens_per_batch to 1024^2 and trained on ~250k data points. The examples/second stayed at ~50. (Also note that I got OOM errors with such large batch sizes when training via the CLI, but did not get one this time.)

caffeinetoomuch commented on June 17, 2024

@adarob @anatoly-khomenko I am having a similar issue while trying to fine-tune on a GPU. When training, GPU usage is less than 7% and CPU usage is huge (it has to be I/O bound). I also ended up using the following notebook, even with the example data provided:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
Moreover, I even tried the parameter mesh_train_dataset_fn.use_cached = True.
Any suggestions or corrections on what I might be doing wrong?

adarob commented on June 17, 2024

use_cached=True won't work unless you have run the cache_tasks_main preprocessing (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/cache_tasks_main.py). You should also check that your data is being stored in the same region as your TPU/GPU. I'm not sure what else could be causing this issue.
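
One thing that may help narrow it down (a generic check, not specific to T5): iterate the input pipeline on its own, without the model, and measure its throughput; here ds would be whatever dataset your training job consumes. If that number is close to what you see during training, the job really is input-bound.

    import time
    import tensorflow as tf

    def benchmark_input_pipeline(ds: tf.data.Dataset, num_batches: int = 100) -> float:
        """Return batches per second when iterating ds without any model."""
        start = time.time()
        for _ in ds.take(num_batches):
            pass  # just pull data through the pipeline
        return num_batches / (time.time() - start)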
