cgra4ml's People

Contributors

abarajithan11, ang037, awengz, blazecode2, raviduhm99, rck289, zhenghuama

cgra4ml's Issues

Error in model training

OS: RHEL8
GPU: A100
Software/packages: installed with conda
deepsocflow: installed following the instructions
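
The environment summary above does not list exact package versions, which often matter for Keras/TensorFlow errors like this one. A minimal sketch for collecting them with the standard library follows; the package names in `PACKAGES` are assumptions chosen for illustration and are not taken from the report.

```python
from importlib import metadata
import platform

# Hypothetical list of packages worth reporting; adjust to the actual environment.
PACKAGES = ["tensorflow", "keras", "qkeras", "numpy", "deepsocflow"]


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to the string "not installed".
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print("python", platform.python_version())
    for name, version in collect_versions(PACKAGES).items():
        print(name, version)
```

Running this inside the activated conda environment prints one line per package, which can be pasted next to the OS and GPU details.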

In deepsocflow/test/py/resnet18_bundle_api.ipynb, I enabled the model.fit() call and got an error. This is the error message:
(deepsocflow) [zpli@sdfampere022 py]$ python resnet18_bundle_api.py
2024-05-08 14:25:32.449639: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-08 14:25:32.449704: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-08 14:25:32.451415: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-08 14:25:32.459064: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-08 14:25:33.336913: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
y_train shape: (50000, 1)
(32, 32, 3)
2024-05-08 14:25:36.434910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38379 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0
2024-05-08 14:25:36.906015: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
Learning rate: 0.001
Learning rate: 0.001
Epoch 1/2
Traceback (most recent call last):
File "/sdf/home/z/zpli/deepsocflow/deepsocflow/test/py/resnet18_bundle_api.py", line 254, in <module>
model.fit(x_train, y_train,
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:

File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/training.py", line 1401, in train_function  *
    return step_function(self, iterator)
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/training.py", line 1384, in step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/training.py", line 1373, in run_step  **
    outputs = model.train_step(data)
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/training.py", line 1154, in train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 2271, in trainable_variables
    return self.trainable_weights
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/training.py", line 3010, in trainable_weights
    trainable_variables += trackable_obj.trainable_variables
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 2271, in trainable_variables
    return self.trainable_weights
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 1312, in trainable_weights
    children_weights = self._gather_children_attribute(
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 3298, in _gather_children_attribute
    return list(
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 3300, in <genexpr>
    getattr(layer, attribute) for layer in nested_layers
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 2271, in trainable_variables
    return self.trainable_weights
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 1312, in trainable_weights
    children_weights = self._gather_children_attribute(
File "/sdf/group/exo/zpli/conda/envs/deepsocflow/lib/python3.10/site-packages/keras/src/engine/base_layer.py", line 3298, in _gather_children_attribute
