t5-japanese's Introduction

t5-japanese's People

Contributors

shirayu

t5-japanese's Issues

Restart of a preemptible TPU sometimes does not work

Sometimes the training process (t5.models.mesh_transformer_main) running on a preemptible TPU neither finishes nor exits with an error code. Instead, it freezes.
The following is an example of the log.

I0902 03:33:36.516501 140070334211904 basic_session_run_hooks.py:260] loss = 1.109375, step = 488600 (45.410 sec)
INFO:tensorflow:global_step/sec: 2.20221
I0902 03:33:36.518121 140070334211904 tpu_estimator.py:2402] global_step/sec: 2.20221
INFO:tensorflow:examples/sec: 140.942
I0902 03:33:36.518576 140070334211904 tpu_estimator.py:2403] examples/sec: 140.942
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I0902 03:33:36.520152 140070334211904 tpu_estimator.py:616] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I0902 03:33:36.520488 140070334211904 tpu_estimator.py:620] Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (1862, 53)
I0902 03:34:01.018308 140066416998144 tpu_estimator.py:289] Outfeed finished for iteration (1862, 53)
INFO:tensorflow:ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
I0902 03:34:21.925864 140070334211904 session_support.py:391] ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
INFO:tensorflow:ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
I0902 03:34:21.941661 140070334211904 session_support.py:394] ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
INFO:tensorflow:No save on shutdown when there are user-defined CheckpointSaverHooks
I0902 03:34:21.942317 140070334211904 tpu_estimator.py:2370] No save on shutdown when there are user-defined CheckpointSaverHooks
INFO:tensorflow:Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
I0902 03:34:21.942646 140070334211904 session_support.py:150] Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
watchdog_config {
  timeout_ms: 60000
}
exit_code {
  exit_code: 42
}

I0902 03:34:21.943512 140070334211904 session_support.py:104] Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
watchdog_config {
  timeout_ms: 60000
}
exit_code {
  exit_code: 42
}

INFO:tensorflow:Waiting 70.00 seconds for worker shutdown.
I0902 03:34:21.945668 140070334211904 session_support.py:159] Waiting 70.00 seconds for worker shutdown.
INFO:tensorflow:Resetting coordinator.
I0902 03:35:32.017142 140070334211904 session_support.py:423] Resetting coordinator.
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
I0902 03:35:32.020745 140070334211904 monitored_session.py:1286] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
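
One possible workaround, not taken from this issue, is to wrap the training command in an external watchdog that kills and restarts it when the log goes quiet. The sketch below assumes that a frozen process stops writing log lines. The function names (`run_with_stall_watchdog`, `supervise`) and all parameters are hypothetical. In the log above, training steps are reported roughly 45 seconds apart, so any stall timeout would need to be well above that, with extra margin for checkpoint saves and session restarts.

```python
import subprocess
import sys
import threading
import time


def run_with_stall_watchdog(cmd, stall_timeout_sec, poll_sec=0.1):
    """Run cmd and kill it if it prints nothing for stall_timeout_sec seconds.

    Hypothetical helper: returns (exit_code, stalled).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    # Time of the most recent output line, shared with the reader thread.
    last_output = [time.monotonic()]

    def pump():
        # Forward the child's output and record when each line arrived.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    stalled = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > stall_timeout_sec:
            # Assumption: no output for this long means the process froze.
            stalled = True
            proc.kill()
            break
        time.sleep(poll_sec)
    proc.wait()
    reader.join(timeout=1.0)
    return proc.returncode, stalled


def supervise(cmd, stall_timeout_sec, max_restarts):
    """Rerun cmd after a stall or non-zero exit, up to max_restarts times.

    Returns the number of restarts that were needed.
    """
    for attempt in range(max_restarts + 1):
        code, stalled = run_with_stall_watchdog(cmd, stall_timeout_sec)
        if code == 0 and not stalled:
            return attempt
    raise RuntimeError("training command kept failing or stalling")
```

In use, `cmd` would be the full training command line as a list of arguments. Because the training process resumes from the latest checkpoint in the model directory, a restart should lose at most the steps since the last saved checkpoint, although that depends on the checkpoint settings in use.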
