Comments (2)
And dmesg:
[ +0.000844] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[ +0.000001] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[ +0.000008] pcieport 0000:00:03.0: AER: device recovery failed
[ +0.000001] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:03.0
[ +0.000004] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ +0.000898] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004020/00000000
[ +0.000889] pcieport 0000:00:03.0: [ 5] SDES
[ +0.000894] pcieport 0000:00:03.0: [14] CmpltTO (First)
[ +0.000921] nvidia 0000:03:00.0: AER: can't recover (no error_detected callback)
[ +0.000002] snd_hda_intel 0000:03:00.1: AER: can't recover (no error_detected callback)
[ +1.050439] pcieport 0000:00:03.0: AER: Root Port link has been reset (0)
[ +0.000041] pcieport 0000:00:03.0: AER: device recovery failed
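For reference, the `error status` word in the log above can be decoded by hand: `00004020` has bits 5 and 14 set, which matches the `[ 5] SDES` (Surprise Down Error) and `[14] CmpltTO` (Completion Timeout) lines printed by the kernel. Below is a minimal Python sketch of that decoding. The function name `decode_aer_status` and the table `AER_UNCOR_BITS` are made up for illustration (they are not part of RETURNN or any kernel interface), and the table is only a partial list of the uncorrectable-error bits, using the short names as the Linux kernel prints them.

```python
from typing import List, Tuple

# Partial table of PCIe AER "Uncorrectable Error Status" bits.
# Short names follow what the Linux kernel prints in dmesg; this is an
# illustrative subset, not a complete or authoritative list.
AER_UNCOR_BITS = {
    4: "DLP",         # Data Link Protocol Error
    5: "SDES",        # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
}


def decode_aer_status(status_hex: str) -> List[Tuple[int, str]]:
    """Return (bit, name) pairs for every bit set in a hex status word."""
    status = int(status_hex, 16)
    return [
        (bit, AER_UNCOR_BITS.get(bit, "unknown"))
        for bit in range(32)
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # Status value taken from the "error status/mask=00004020/00000000" line.
    print(decode_aer_status("00004020"))  # [(5, 'SDES'), (14, 'CmpltTO')]
```

Both set bits point at the link to the device behind the root port going away or not answering, which fits a hardware-level problem rather than something in the training code.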
I guess there is nothing we can do; it's just a hardware issue.