abci-docs's People

Contributors

a-wada-rk, abci-fjse, ahama, e-kwsm, keichi, morikawa-pasc, nis8192, nwatab, ogawa, pasc-kouda, peaceiris, rymzt, s-yama, sakkabe, stakizawa, ttakayuki, u-kawasaki, y-tanimura, ykado

abci-docs's Issues

Unify terminology

/en/docs/02.md and other pages use "Usage Manager", while https://portal.abci.ai/docs/portal/en/03/ uses "users who are granted the user administrator authority". The expressions "Usage Manager" and "user administrator" are odd to begin with, so please unify them under a single term such as "Group Administrator (users who are granted the group administrator authority)".

Guide about Group area disk usage

Hi, thanks for your nice work.

I am now a user of the ABCI server. I found that each user has only 200 GB of disk space in the home directory. To run a large model on ABCI, I applied for 5 TB of group disk space, but I cannot find any guide in the ABCI documentation on how to use the group disk.

Could you give me some advice on how to mount or use the group disk?

Add a description of SSH access

  • Add appendix/ssh-access.md and write the description there.
  • Describe the feature itself; add examples such as torch.distributed under tips. The latter is not part of the milestone.

qrsh -inherit from an interactive node

1. Submit a batch job
[username@es1 ~]$ qsub -g grpname -l rt_F=2 run.sh
Your job 1000000 ("run.sh") has been submitted

2. Check the compute nodes assigned to the job
[username@es1 ~]$ qstat -j 1000000
(snip)
exec_host_list 1: g0001:80, g0002:80
(snip)

3. Set the environment variables
[username@es1 ~]$ export JOB_ID=1000000
[username@es1 ~]$ export SGE_TASK_ID=undefined

4. Run nvidia-smi on a compute node
[username@es1 ~]$ qrsh -inherit g0001 nvidia-smi
Wed Oct 21 16:01:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   31C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   29C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   30C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[username@es1 ~]$
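
The four steps above could be scripted roughly as follows. This is only a sketch: it assumes the `exec_host_list` line always has the `host:slots, host:slots` shape shown in the sample output, and the helper names `parse_exec_hosts` and `build_qrsh_inherit` are hypothetical, not part of any ABCI tool. It only builds the command and environment; it does not run anything.

```python
def parse_exec_hosts(qstat_output):
    """Return host names found on exec_host_list lines of `qstat -j` output.

    Assumes the format shown in the issue: "exec_host_list 1: g0001:80, g0002:80".
    """
    hosts = []
    for line in qstat_output.splitlines():
        if not line.strip().startswith("exec_host_list"):
            continue
        # Everything after the first colon is the "host:slots" list.
        _, _, entries = line.partition(":")
        for entry in entries.split(","):
            host = entry.strip().split(":")[0]
            if host:
                hosts.append(host)
    return hosts


def build_qrsh_inherit(job_id, host, command):
    """Return (extra environment, argv) mirroring steps 3 and 4 above."""
    env = {"JOB_ID": str(job_id), "SGE_TASK_ID": "undefined"}
    argv = ["qrsh", "-inherit", host] + list(command)
    return env, argv


sample = """(snip)
exec_host_list 1: g0001:80, g0002:80
(snip)"""

hosts = parse_exec_hosts(sample)
env, argv = build_qrsh_inherit(1000000, hosts[0], ["nvidia-smi"])
print(hosts)
print(env)
print(argv)
```

To actually launch the command from the interactive node, the result could be passed to `subprocess.run(argv, env={**os.environ, **env})`.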

Failed to load cupy

I loaded the modules with
module load python/3.6 cuda/9.2
then created a venv and installed cupy-cuda92, but importing cupy fails.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
$ pip3 freeze
absl-py==0.7.0
astor==0.7.1
chainer==5.2.0
cupy-cuda92==5.2.0
fastrlock==0.4
filelock==3.0.10
gast==0.2.2
grpcio==1.18.0
h5py==2.9.0
imageio==2.5.0
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
Markdown==3.0.1
numpy==1.16.1
Pillow==5.4.1
protobuf==3.6.1
PyYAML==3.13
scikit-learn==0.20.2
scipy==1.2.1
six==1.12.0
sklearn==0.0
tensorboard==1.12.2
tensorflow==1.12.0
termcolor==1.1.0
Werkzeug==0.14.1
$ python3
Python 3.6.5 (default, Jun  2 2018, 15:49:50) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
Traceback (most recent call last):
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 11, in <module>
    from cupy import core  # NOQA
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 32, in <module>
    six.reraise(ImportError, ImportError(msg), exc_info[2])
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 11, in <module>
    from cupy import core  # NOQA
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: CuPy is not correctly installed.

If you are using wheel distribution (cupy-cudaXX), make sure that the version of CuPy you installed matches with the version of CUDA on your host.
Also, confirm that only one CuPy package is installed:
  $ pip freeze

If you are building CuPy from source, please check your environment, uninstall CuPy and reinstall it with:
  $ pip install cupy --no-cache-dir -vvvv

Check the Installation Guide for details:
  https://docs-cupy.chainer.org/en/latest/install.html

original error: libcuda.so.1: cannot open shared object file: No such file or directory
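
The original error says that `libcuda.so.1` could not be opened. That library normally ships with the NVIDIA driver rather than with the CUDA toolkit, so one thing worth checking is whether the dynamic loader can find it on the node where Python is running. This is a suggested diagnostic, not a confirmed cause. The sketch below uses only the standard `ctypes` module; the helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


print("libcuda.so.1 loadable:", can_load("libcuda.so.1"))
```

If this prints `False` on one node and `True` inside a job on a compute node, the import failure likely depends on where Python is run rather than on the installed CuPy wheel.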
