Comments (20)
Just to add here (this is a hypothesis, but): a TPU has both a normal CPU (ARM- or x86-based) and the accelerator (matrix multiplication units), and the cloud console only shows how much the CPU is being used; it does not provide details about accelerator utilization.
In the case of T5, the TF-DS graph is uploaded to the TPU; it uses the TPU's CPU to execute non-deep-learning ops like loading from GCS and preprocessing/tokenizing, and then uses the accelerator for the deep-learning training itself.
If you want to see how much the TPU's accelerator is being used, you can use TPU profiling. In my case the CPU was being used at <0.1% but the accelerator at ~45%.
from text-to-text-transfer-transformer.
This means the model must be I/O bound, due in part to its small size. We do tokenization and packing on the fly by default. I have a TODO to add support for caching smaller datasets in memory post-tokenization. Let me see if I can get to it today, and you can try it out.
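The idea behind that TODO can be sketched in plain Python (this is an illustration of the concept, not the actual T5/tf.data code): without a cache, tokenization re-runs on every epoch; with a post-tokenization in-memory cache, it runs once per example.

```python
# Toy illustration (not the real T5 code) of caching post-tokenization.
calls = {"tokenize": 0}

def tokenize(text):
    calls["tokenize"] += 1  # stands in for expensive on-the-fly work
    return text.split()

raw = ["hello world", "text to text transfer"]

# Without caching: tokenization re-runs on every epoch (3 epochs here).
for _ in range(3):
    for text in raw:
        tokenize(text)
assert calls["tokenize"] == 6

# With an in-memory cache: tokenize once, replay the results each epoch.
calls["tokenize"] = 0
cache = [tokenize(text) for text in raw]  # first pass fills the cache
for _ in range(3):
    for example in cache:
        pass  # training would consume cached examples; no re-tokenization
assert calls["tokenize"] == 2
```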
Thank you @adarob , let me know if I can help you to implement something. I'm launching the T5 for one of my current working tasks and I'm eager to make it train faster.
Can you try using the latest commit to see if this improves things? It will cache the dataset on the first pass, so it will be much faster after that.
@adarob
Hi Adam,
Thank you for the update. I'm running it now (I had to update Python to 3.6 to be able to run the latest version, which took a good part of the day).
From what I see, it did not improve; the timing parameters are pretty much the same (global_step/sec: 0.401189, examples/sec: 821.635), but strangely the TPU load decreased to 1%.
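As a sanity check, the two reported rates are consistent with each other; dividing them gives the effective per-step batch size:

```python
# Dividing the two rates from the log gives examples per step,
# i.e. the effective batch size.
global_steps_per_sec = 0.401189
examples_per_sec = 821.635
examples_per_step = examples_per_sec / global_steps_per_sec
assert round(examples_per_step) == 2048  # examples per training step
```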
Hi @t5-copybara,
As far as I understand it is here:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/utils.py#L654
I will comment out the condition and keep ds = ds.cache().
Please, let me know if this is the correct approach.
Thank you!
I have found that I could specify use_cached in command line parameters like this:
--gin_param="mesh_train_dataset_fn.use_cached = True"
This is the complete command:
t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="dataset.gin" \
  --gin_param="mesh_train_dataset_fn.use_cached = True" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="MIXTURE_NAME = 'super_glue_boolq_v102'" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
When I run it, though, I get the exception here:
Do you know where I can specify cache directories?
OK, I have found out that I can specify cache directories using:
--additional_task_cache_dirs="${CACHE_DIR}"
but in this case the cache does not get created either.
This is the message I get:
22:18:44.970715 140004609750016 utils.py:584] 'super_glue_boolq_v102' does not exist in any task cache directories (searched ['gs://uniquebucketname/t5-boolq-data-dir-cache/super_glue_boolq_v102']).
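The error message suggests the lookup simply checks whether a directory named after the task already exists under each cache directory; passing --additional_task_cache_dirs only tells it where to look. A hypothetical sketch of that behavior (the helper name and signature are invented for illustration; see t5/data/utils.py for the real logic):

```python
import os
import tempfile

def find_task_cache_dir(task_name, cache_dirs):
    """Return the first cache dir containing <task_name>, else None.

    Hypothetical helper mirroring the lookup that produced the error above.
    """
    for cache_dir in cache_dirs:
        candidate = os.path.join(cache_dir, task_name)
        if os.path.isdir(candidate):
            return candidate
    return None

# The cache dir must already contain a directory named after the task;
# the flag alone does not create it.
with tempfile.TemporaryDirectory() as root:
    assert find_task_cache_dir("super_glue_boolq_v102", [root]) is None
    os.makedirs(os.path.join(root, "super_glue_boolq_v102"))
    assert find_task_cache_dir("super_glue_boolq_v102", [root]) is not None
```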
Giving up for now.
The offline use_cached stuff is only supported on our internal infrastructure for the time being. What I added for you is something that will do the caching on the fly. You should be able to explicitly enable it as you mentioned above ("I will comment out the condition and keep ds = ds.cache()"). Are you sure you're actually using this new code when you run, and not what's in the pip package?
@adarob
Hi Adam,
I'll give it a try on another run. But I see several places with ds = ds.cache(); do I have to make the change to all of them?
I'm using the latest master by installing from source with the command:
pip install --upgrade -e ./text-to-text-transfer-transformer
And I see all the recent fixes there, so I'm pretty sure I'm using the most recent version.
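One generic way to double-check which copy of the package Python will actually import (for an editable install, it should resolve to the source checkout, not to site-packages) is to inspect the module spec without importing it:

```python
import importlib.util

# Resolve where "t5" would be imported from, without importing it.
spec = importlib.util.find_spec("t5")
if spec is None:
    print("t5 is not installed in this environment")
else:
    # For a `pip install -e` checkout this should point into the cloned
    # repo directory, not into site-packages.
    print("t5 will be imported from:", spec.origin)
```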
Will let you know as soon as I run it again.
I am currently fine-tuning the large model on a 6 GB TSV file and get a TPU usage of <1%. Anything new here?
@f-lng , @adarob ,
I have ended up using the notebook provided here: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
It seems to use the TPU more effectively: training that took more than 12 hours using the script from this issue completes in less than 4 hours with the notebook.
The dataset is under 500 MB, though.
@adarob , thank you for providing the notebook!
@adarob I was not aware that the size of the TSV file could be an issue; I assumed the code would just read chunks of it. Thank you for clarifying, I will try to pre-shard it.
@anatoly-khomenko Thanks for letting me know, I will have a look at the notebook as well.
@adarob I have now taken 7,000 examples from my dataset and precomputed 7 TFRecord files from them.
To create the TFRecords I used the T5 'Task' directly (https://pastebin.com/36pG5ne4).
I then adjusted your _get_cached_dataset function (and made sure it's called) to load them (https://pastebin.com/cvTzxEXh). The debug print is showing, so the function is called and the in-memory caching is also working.
I am using code adapted from your notebook to train the model (https://pastebin.com/8am5S5H2).
However, I am still getting a speed of ~50 examples/second and a TPU CPU usage of <=0.15% (sic) during training most of the time, with some spikes (1-2%).
(The naive approach of feeding my huge TSV to the command-line util gave me ~90 examples/sec.)
I do not have a lot of experience with the TF ecosystem apart from some hacking around in Tensor2Tensor, and none with TPUs, so perhaps I am missing something important?
By the way, I just checked: the buckets, the TPU, and the VM are all in us-central1(-a), and it is a TPU v3-8.
@adarob I did another experiment: I set tokens_per_batch to 1024^2 and trained on ~250k datapoints. The examples/second stayed at ~50. (Note also that I got OOM errors with such high batch sizes when training using the CLI, but did not get one this time.)
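For scale, here is what that tokens_per_batch implies for examples per batch and seconds per step, assuming a packed sequence length of 512 (the sequence length is an assumption; substitute your own):

```python
# Rough arithmetic for the experiment above; sequence_length is assumed.
tokens_per_batch = 1024 ** 2   # value used in the experiment
sequence_length = 512          # assumed packed input length
examples_per_batch = tokens_per_batch // sequence_length
assert examples_per_batch == 2048

# At the observed ~50 examples/sec, each step would take about 41 s.
seconds_per_step = examples_per_batch / 50
assert 40 < seconds_per_step < 42
```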
@adarob @anatoly-khomenko I am having a similar issue trying to fine-tune on a GPU. When training, GPU usage is less than 7% and CPU usage is huge (it has to be I/O bound). I also ended up using the following notebook, even with the example data provided:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
Moreover, I even tried the parameter mesh_train_dataset_fn.use_cached = True.
Any suggestion or correction on what I might be doing wrong?
use_cached=True won't work unless you have run the cache_tasks_main preprocessing (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/cache_tasks_main.py). You should also check that your data is stored in the same region as your TPU/GPU. I'm not sure what else could be causing this issue.