aistabci / abci-docs

ABCI User Guide & Portal Guide

Home Page: https://docs.abci.ai/

Makefile 53.13% HTML 46.88%

abci ai supercomputer cloud mkdocs-sites

abci-docs's People

rymzt abci-fjse sakkabe stakizawa 3zn youyoshi a-wada-rk iketeru ahama pasc-kouda xysong1201 ttakayuki sonichn keichi u-kawasaki y-tanimura taguchi-k08 yosuke-291 aistabci montana

abci-docs's Issues

Update the software list

/ja/docs/01.md#software
/en/docs/01.md#software

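Singularity Endpoint (English version): remove warnings and other text made obsolete by the maintenance

Apply to the English version the changes that were applied to the Japanese version of the Singularity Endpoint page.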

#235

Update the index

/ja/mkdocs.yaml
/ja/docs/index.md
/en/mkdocs.yaml
/en/docs/index.md

Pages such as /en/docs/02.md use the term "Usage Manager", while https://portal.abci.ai/docs/portal/en/03/ uses "users who are granted the user administrator authority". Both "Usage Manager" and "user administrator" are odd wording to begin with, so please unify them under something like "Group Administrator (users who are granted the group administrator authority)".

Guide about Group area disk usage

Hi, thanks for your nice work.

I am an ABCI user. Each user has only 200 GB of disk space in the home directory. To run a large model on ABCI, I have applied for 5 TB of group disk space, but I cannot find any guide in the ABCI documentation on how to use the group disk.

Could you give me some advice on how to mount or use the group disk?

Availability of the GCC 7.3.0 and 7.4.0 environment modules is not reflected in the docs

As of 2019-11-26, module avail shows that gcc/4.8.5, gcc/7.3.0, and gcc/7.4.0 are provided, but the Tips page [1] does not appear to reflect this.

[1] https://docs.abci.ai/ja/tips/gcc-7.3.0/
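
A check like the one described in this issue can be scripted. The following is a minimal sketch that pulls gcc version strings out of module listing text so they can be compared with what the Tips page mentions. The function name gcc_module_versions and the sample string are hypothetical; the sample is assembled from the versions named in this issue and is not real module avail output.

```python
import re


def gcc_module_versions(module_avail_text):
    """Return the gcc versions found in module listing text, sorted numerically."""
    # Match entries such as "gcc/7.3.0" and capture only the version part.
    versions = set(re.findall(r"\bgcc/(\d+(?:\.\d+)*)", module_avail_text))
    # Sort by numeric components so that e.g. 10.x would come after 7.x.
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


# Hypothetical sample built from the versions listed in the issue.
sample = "gcc/4.8.5  gcc/7.3.0  gcc/7.4.0"
print(gcc_module_versions(sample))
```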

Merge existing PR (216)

#216

Merge existing PR (207)

#207

Update the list of CUDA Toolkit, cuDNN, and NCCL versions

/ja/docs/07.md
/en/docs/07.md

Add the system update history

/ja/docs/system-updates.md
/en/docs/system-updates.md

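Add a description of SSH access

Add appendix/ssh-access.md and write the description there.
Describe the feature itself; examples such as torch.distributed go under tips. The latter is not included in the milestone.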

qrsh -inherit from an interactive node

1. Submit a batch job
[username@es1 ~]$ qsub -g grpname -l rt_F=2 run.sh
Your job 1000000 ("run.sh") has been submitted

2. Check the compute nodes allocated to the job
[username@es1 ~]$ qstat -j 1000000
(snip)
exec_host_list 1: g0001:80, g0002:80
(snip)

3. Set the environment variables
[username@es1 ~]$ export JOB_ID=1000000
[username@es1 ~]$ export SGE_TASK_ID=undefined

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[username@es1 ~]$
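
Step 2 above reads the allocated hosts from the exec_host_list line by eye. As a minimal sketch, the same information could be extracted in a script as shown below. The function name parse_exec_hosts is hypothetical, and the parser assumes the line has the shape shown in this issue (host:slots entries separated by commas); other qstat output formats are not handled.

```python
import re


def parse_exec_hosts(qstat_output):
    """Return the host names on the exec_host_list line of `qstat -j` output."""
    for line in qstat_output.splitlines():
        # Expected shape (taken from the issue): "exec_host_list 1: g0001:80, g0002:80"
        match = re.match(r"\s*exec_host_list\s+\d+:\s*(.*)", line)
        if match:
            entries = [e.strip() for e in match.group(1).split(",")]
            # Each entry looks like "g0001:80"; keep the part before the colon.
            return [e.split(":")[0] for e in entries if e]
    # No exec_host_list line found (for example, the job is not running yet).
    return []


# Sample copied from the qstat excerpt in this issue.
sample = """(snip)
exec_host_list 1: g0001:80, g0002:80
(snip)"""
print(parse_exec_hosts(sample))
```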

Merge existing PR (209)

#209

Failed to load cupy

I loaded the modules with
module load python/3.6 cuda/9.2
then created a venv and installed cupy-cuda92, but importing cupy fails.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88

$ pip3 freeze
absl-py==0.7.0
astor==0.7.1
chainer==5.2.0
cupy-cuda92==5.2.0
fastrlock==0.4
filelock==3.0.10
gast==0.2.2
grpcio==1.18.0
h5py==2.9.0
imageio==2.5.0
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
Markdown==3.0.1
numpy==1.16.1
Pillow==5.4.1
protobuf==3.6.1
PyYAML==3.13
scikit-learn==0.20.2
scipy==1.2.1
six==1.12.0
sklearn==0.0
tensorboard==1.12.2
tensorflow==1.12.0
termcolor==1.1.0
Werkzeug==0.14.1

$ python3
Python 3.6.5 (default, Jun  2 2018, 15:49:50) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy
Traceback (most recent call last):
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 11, in <module>
    from cupy import core  # NOQA
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 32, in <module>
    six.reraise(ImportError, ImportError(msg), exc_info[2])
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/__init__.py", line 11, in <module>
    from cupy import core  # NOQA
  File "/fs3/home/aca10485tl/dl/lib/python3.6/site-packages/cupy/core/__init__.py", line 1, in <module>
    from cupy.core import core  # NOQA
ImportError: CuPy is not correctly installed.

If you are using wheel distribution (cupy-cudaXX), make sure that the version of CuPy you installed matches with the version of CUDA on your host.
Also, confirm that only one CuPy package is installed:
  $ pip freeze

If you are building CuPy from source, please check your environment, uninstall CuPy and reinstall it with:
  $ pip install cupy --no-cache-dir -vvvv

Check the Installation Guide for details:
  https://docs-cupy.chainer.org/en/latest/install.html

original error: libcuda.so.1: cannot open shared object file: No such file or directory
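
The original error above says the dynamic loader cannot find libcuda.so.1. One way to confirm this independently of CuPy is to try loading the library directly with Python's standard ctypes module, as in the sketch below. The helper name can_load_library is hypothetical. As general background not stated in this issue, libcuda.so.1 is normally installed with the NVIDIA driver rather than with the CUDA Toolkit, so running the check on both the node where the import failed and a GPU compute node may help narrow down the cause.

```python
import ctypes


def can_load_library(name="libcuda.so.1"):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False
    return True


if __name__ == "__main__":
    print("libcuda.so.1 loadable:", can_load_library())
```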

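Add a warning at the top of "9. Linux Containers" that Singularity 2.6.1 will be discontinued at the end of March

Simply add a warning stating that it is scheduled to be discontinued. No mention of the CVE and no explanation of workarounds is needed.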