p2enjoy / kohya_ss-docker

This is the companion repository for running the kohya_ss training web UI, ported to Linux, inside Docker. It uses the fork at the following link.

Home Page: https://github.com/P2Enjoy/kohya_ss

License: Apache License 2.0

Dockerfile 61.77% Shell 38.23%
colab-notebook dall-e docker kohya-webui linux locon lora midjourney stable-diffusion

kohya_ss-docker's People

Contributors

martinobettucci, swfsql


kohya_ss-docker's Issues

tensorflow-*.whl is not a valid wheel filename.

Hi, apologies for what is probably a silly question, but perhaps someone can help.
When running docker compose I get this error:

tensorflow-*.whl is not a valid wheel filename.

after

 => ERROR [10/17] RUN <<EOF (# tensorflow, torch, torchvision, torchaudio  2.2s
------
 > [10/17] RUN <<EOF (# tensorflow, torch, torchvision, torchaudio...):
#0 0.348 + /bin/bash /docker/install-container-dep.sh /docker/torch-1.13.1+cu117.with.pypi.cudnn-cp310-cp310-linux_x86_64.whl /docker/torchvision-0.14.1+cu117-cp310-cp310-linux_x86_64.whl '/docker/tensorflow-*.whl'
#0 1.316 WARNING: Requirement '/docker/tensorflow-*.whl' looks like a filename, but the file does not exist
#0 1.317 ERROR: tensorflow-*.whl is not a valid wheel filename.

even though I copied all the *.deb and *.whl files to the right place (I hope):

[loewe@fedora kohya_ss-docker]$ tree
.
├── 2chAI_LoRA_Dreambooth_guide_english.pdf
├── data
│   ├── tensorflow-2.11.0-cp310-cp310-linux_x86_64.whl
│   ├── torch-1.13.1+cu117.with.pypi.cudnn-cp310-cp310-linux_x86_64.whl
│   ├── torchvision-0.14.1+cu117-cp310-cp310-linux_x86_64.whl
│   └── xformers-0.0.14.dev0-cp310-cp310-linux_x86_64.whl
├── docker-compose.yml
├── kohya_ss
│   ├── data
│   │   ├── libs
│   │   │   ├── libnvinfer7_7.2.2-1+cuda11.1_amd64.deb
│   │   │   ├── libnvinfer-dev_7.2.2-1+cuda11.1_amd64.deb
│   │   │   ├── libnvinfer-plugin7_7.2.2-1+cuda11.1_amd64.deb
│   │   │   ├── libnvinfer-plugin-dev_7.2.2-1+cuda11.1_amd64.deb
│   │   │   └── Readme.md
│   │   ├── Readme.md
│   │   ├── tensorflow_cpu-2.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
│   │   ├── torch-1.13.1+cu117.with.pypi.cudnn-cp310-cp310-linux_x86_64.whl
│   │   ├── torchvision-0.14.1+cu117-cp310-cp310-linux_x86_64.whl
│   │   └── xformers-0.0.14.dev0-cp310-cp310-linux_x86_64.whl
│   ├── Dockerfile
│   └── scripts
│       ├── debug.sh
│       ├── install-container-dep.sh
│       ├── mount.sh
│       └── run.sh
├── LICENSE
├── README.md
├── tensorflow_gpu-2.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
└── xformer-1.0.1-py3-none-any.whl

6 directories, 25 files
[loewe@fedora kohya_ss-docker]$ 

I already pruned the Docker cache so that everything was re-downloaded, but I still get the same error.
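
The error message suggests the wildcard reaches pip unexpanded. As a hedged illustration (not this repository's actual script), a glob passed inside quotes stays a literal filename, while an unquoted glob is expanded by the shell:

```shell
# Minimal reproduction in a scratch directory (paths hypothetical):
rm -rf /tmp/glob-demo && mkdir -p /tmp/glob-demo && cd /tmp/glob-demo
touch tensorflow-2.11.0-cp310-cp310-linux_x86_64.whl

quoted='tensorflow-*.whl'           # single quotes: stays a literal string
expanded=$(echo tensorflow-*.whl)   # unquoted: the shell expands the glob

echo "$quoted"    # -> tensorflow-*.whl
echo "$expanded"  # -> tensorflow-2.11.0-cp310-cp310-linux_x86_64.whl
```

Also worth checking against the tree above: kohya_ss/data contains tensorflow_cpu-2.11.0-...whl, which the pattern tensorflow-*.whl would not match (underscore vs hyphen), while the plain tensorflow-...whl sits in data/ at the repository root.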

Heiko

Restart policy

Error:
[+] Running 0/0
⠋ Container kohya-docker-kohya-1 Creating 0.0s
Error response from daemon: invalid restart policy: maximum retry count can only be used with 'on-failure'


Add "condition: on-failure" or it will not start on Arch.

deploy:
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 10
    window: 120s
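
The same daemon rule applies to Compose's classic (non-swarm) restart key: a maximum retry count is only valid together with on-failure. A minimal sketch, with the service name assumed:

```yaml
services:
  kohya:
    # a retry count may only be combined with on-failure
    restart: on-failure:10
```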

Rebase https://github.com/P2Enjoy/kohya_ss ?

Your fork of kohya_ss is behind, could you rebase it? Issues are disabled on that repo.

For those of us who know what we're doing, it would be preferable to use it without Docker.

Also, I'm not sure why the original code used tkinter, as web browsers provide a file dialog.

I think removing that and diverging from upstream would be best.

AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.

# clone project
git clone https://github.com/P2Enjoy/kohya_ss-docker.git /root/kohya_ss-docker

# download whl files
cd /root/kohya_ss-docker/kohya_ss/data
wget https://files.pythonhosted.org/packages/b2/c3/668c91cc7074eed672691f130562c0f02d89aebf01f6e14f1741f7fb900b/tensorflow-2.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
wget https://files.pythonhosted.org/packages/ca/74/7342c7f21449557a8263c925071a55081edd7e9b641404cfe31d6fb71d3b/torch-1.12.1-cp310-cp310-manylinux1_x86_64.whl
wget https://files.pythonhosted.org/packages/03/bb/8d6aade7c39bbc55a75054fcbd6c3292f080d05c7d9cff092f2813a65c10/torchvision-0.13.1-cp310-cp310-manylinux1_x86_64.whl
wget https://files.pythonhosted.org/packages/37/62/dd1bce2506a3b52c175e633033c38f6501cb96020d26afe1046bc1856498/xformers-0.0.17.dev448-cp310-cp310-manylinux2014_x86_64.whl

# download deb file
cd /root/kohya_ss-docker/kohya_ss/data/libs
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.2.2-1+cuda11.1_amd64.deb
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer-dev_7.2.2-1+cuda11.1_amd64.deb
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer-plugin7_7.2.2-1+cuda11.1_amd64.deb
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer-plugin-dev_7.2.2-1+cuda11.1_amd64.deb

# docker build
cd /root/kohya_ss-docker/kohya_ss
docker compose --profile kohya up --build

Then I got this exception:

docker compose --profile kohya up --build
WARN[0000] The "DISPLAY" variable is not set. Defaulting to a blank string. 
[+] Building 43.2s (20/24)                                                                                                                                                                                  
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => => transferring context: 2B                                                                                                                                                                        0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                   0.0s
 => => transferring dockerfile: 32B                                                                                                                                                                    0.0s
 => resolve image config for docker.io/docker/dockerfile:1                                                                                                                                             1.6s
 => CACHED docker-image://docker.io/docker/dockerfile:1@sha256:39b85bbfa7536a5feceb7372a0817649ecb2724562a38360f4d6a7782a409b14                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                   0.0s
 => => transferring dockerfile: 32B                                                                                                                                                                    0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04                                                                                                                 0.0s
 => [internal] load build context                                                                                                                                                                      0.0s
 => => transferring context: 883B                                                                                                                                                                      0.0s
 => [ 1/16] FROM docker.io/nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04                                                                                                                                 0.0s
 => CACHED [ 2/16] RUN <<EOF (# apt for general container dependencies...)                                                                                                                             0.0s
 => CACHED [ 3/16] RUN <<EOF (# python...)                                                                                                                                                             0.0s
 => CACHED [ 4/16] COPY ./data/libs/*.deb /docker/                                                                                                                                                     0.0s
 => CACHED [ 5/16] RUN <<EOF (# cuda...)                                                                                                                                                               0.0s
 => CACHED [ 6/16] WORKDIR /koyah_ss                                                                                                                                                                   0.0s
 => CACHED [ 7/16] RUN <<EOF (git clone https://github.com/P2Enjoy/kohya_ss.git /koyah_ss...)                                                                                                          0.0s
 => CACHED [ 8/16] COPY ./data/*.whl /docker/                                                                                                                                                          0.0s
 => CACHED [ 9/16] COPY ./scripts/install-container-dep.sh ./data/*-requirements.txt /docker/                                                                                                          0.0s
 => CACHED [10/16] RUN <<EOF (# tensorflow, torch, torchvision, torchaudio...)                                                                                                                         0.0s
 => CACHED [11/16] RUN <<EOF (# xformers...)                                                                                                                                                           0.0s
 => ERROR [12/16] RUN <<EOF (# Build requirements...)                                                                                                                                                 41.3s
------                                                                                                                                                                                                      
 > [12/16] RUN <<EOF (# Build requirements...):                                                                                                                                                             
#0 0.477 + /bin/bash /docker/install-container-dep.sh --use-pep517 --upgrade -r /koyah_ss/requirements.txt                                                                                                  
#0 1.369 Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116                                                                                                                
#0 1.376 Processing /koyah_ss                                                                                                                                                                               
#0 1.379   Installing build dependencies: started                                                                                                                                                           
#0 6.086   Installing build dependencies: finished with status 'done'
#0 6.087   Getting requirements to build wheel: started
#0 6.269   Getting requirements to build wheel: finished with status 'done'
#0 6.272   Preparing metadata (pyproject.toml): started
#0 6.490   Preparing metadata (pyproject.toml): finished with status 'done'
#0 7.580 Collecting accelerate==0.15.0
#0 7.603   Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)
#0 7.610      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 191.5/191.5 kB 49.5 MB/s eta 0:00:00
#0 8.471 Collecting transformers==4.26.0
#0 8.479   Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
#0 8.517      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 181.8 MB/s eta 0:00:00
#0 9.361 Collecting ftfy==6.1.1
#0 9.365   Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
#0 9.368      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 190.9 MB/s eta 0:00:00
#0 9.594 Collecting albumentations==1.3.0
#0 9.600   Downloading albumentations-1.3.0-py3-none-any.whl (123 kB)
#0 9.603      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.5/123.5 kB 254.5 MB/s eta 0:00:00
#0 10.59 Collecting opencv-python==4.7.0.68
#0 10.60   Downloading opencv_python-4.7.0.68-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (61.8 MB)
#0 11.24      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.8/61.8 MB 104.6 MB/s eta 0:00:00
#0 12.24 Collecting einops==0.6.0
#0 12.24   Downloading einops-0.6.0-py3-none-any.whl (41 kB)
#0 12.25      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.6/41.6 kB 203.2 MB/s eta 0:00:00
#0 13.06 Collecting diffusers[torch]==0.10.2
#0 13.07   Downloading diffusers-0.10.2-py3-none-any.whl (503 kB)
#0 13.07      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 503.1/503.1 kB 229.2 MB/s eta 0:00:00
#0 13.94 Collecting pytorch-lightning==1.9.0
#0 13.97   Downloading pytorch_lightning-1.9.0-py3-none-any.whl (825 kB)
#0 13.97      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.8/825.8 kB 311.7 MB/s eta 0:00:00
#0 14.21 Collecting bitsandbytes==0.35.0
#0 14.22   Downloading bitsandbytes-0.35.0-py3-none-any.whl (62.5 MB)
#0 14.73      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.5/62.5 MB 147.9 MB/s eta 0:00:00
#0 14.93 Requirement already satisfied: tensorboard==2.10.1 in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 10)) (2.10.1)
#0 16.61 Collecting safetensors==0.2.6
#0 16.63   Downloading safetensors-0.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
#0 16.64      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 187.3 MB/s eta 0:00:00
#0 17.59 Collecting gradio==3.16.2
#0 17.60   Downloading gradio-3.16.2-py3-none-any.whl (14.2 MB)
#0 17.67      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.2/14.2 MB 241.7 MB/s eta 0:00:00
#0 18.51 Collecting altair==4.2.2
#0 18.52   Downloading altair-4.2.2-py3-none-any.whl (813 kB)
#0 18.52      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 813.6/813.6 kB 269.9 MB/s eta 0:00:00
#0 19.35 Collecting easygui==0.98.3
#0 19.35   Downloading easygui-0.98.3-py2.py3-none-any.whl (92 kB)
#0 19.36      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.7/92.7 kB 252.5 MB/s eta 0:00:00
#0 20.18 Collecting tk==0.1.0
#0 20.19   Downloading tk-0.1.0-py3-none-any.whl (3.9 kB)
#0 20.19 Requirement already satisfied: requests==2.28.2 in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 17)) (2.28.2)
#0 21.84 Collecting timm==0.6.12
#0 21.84   Downloading timm-0.6.12-py3-none-any.whl (549 kB)
#0 21.85      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 549.1/549.1 kB 270.9 MB/s eta 0:00:00
#0 22.65 Collecting fairscale==0.4.13
#0 22.65   Downloading fairscale-0.4.13.tar.gz (266 kB)
#0 22.66      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 266.3/266.3 kB 297.4 MB/s eta 0:00:00
#0 22.72   Installing build dependencies: started
#0 27.12   Installing build dependencies: finished with status 'done'
#0 27.12   Getting requirements to build wheel: started
#0 27.29   Getting requirements to build wheel: finished with status 'done'
#0 27.30   Installing backend dependencies: started
#0 29.49   Installing backend dependencies: finished with status 'done'
#0 29.49   Preparing metadata (pyproject.toml): started
#0 29.74   Preparing metadata (pyproject.toml): finished with status 'done'
#0 29.74 Requirement already satisfied: tensorflow==2.10.1 in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 22)) (2.10.1)
#0 31.55 Collecting huggingface-hub==0.12.0
#0 31.56   Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
#0 31.56      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 kB 298.4 MB/s eta 0:00:00
#0 31.57 Requirement already satisfied: torch in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 24)) (1.13.1+cu116)
#0 32.42 Requirement already satisfied: torchvision in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 25)) (0.13.1)
#0 33.53 Collecting torchvision
#0 33.54   Downloading https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp310-cp310-linux_x86_64.whl (24.2 MB)
#0 34.01      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.2/24.2 MB 60.0 MB/s eta 0:00:00
#0 34.09 Requirement already satisfied: xformers in ./kohya_venv/lib/python3.10/site-packages (from -r /koyah_ss/requirements.txt (line 26)) (0.0.17.dev448)
#0 35.87 Collecting deepspeed
#0 35.88   Downloading deepspeed-0.8.1.tar.gz (759 kB)
#0 35.89      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.6/759.6 kB 290.1 MB/s eta 0:00:00
#0 36.08   Installing build dependencies: started
#0 40.50   Installing build dependencies: finished with status 'done'
#0 40.50   Getting requirements to build wheel: started
#0 40.68   Getting requirements to build wheel: finished with status 'error'
#0 40.69   error: subprocess-exited-with-error
#0 40.69   
#0 40.69   × Getting requirements to build wheel did not run successfully.
#0 40.69   │ exit code: 1
#0 40.69   ╰─> [20 lines of output]
#0 40.69       [WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
#0 40.69        [WARNING]  unable to import torch, please install it if you want to pre-compile any deepspeed ops.
#0 40.69       DS_BUILD_OPS=1
#0 40.69       Traceback (most recent call last):
#0 40.69         File "/koyah_ss/kohya_venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
#0 40.69           main()
#0 40.69         File "/koyah_ss/kohya_venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
#0 40.69           json_out['return_val'] = hook(**hook_input['kwargs'])
#0 40.69         File "/koyah_ss/kohya_venv/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
#0 40.69           return hook(config_settings)
#0 40.69         File "/tmp/pip-build-env-nfj5b6uq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
#0 40.69           return self._get_build_requires(config_settings, requirements=['wheel'])
#0 40.69         File "/tmp/pip-build-env-nfj5b6uq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in _get_build_requires
#0 40.69           self.run_setup()
#0 40.69         File "/tmp/pip-build-env-nfj5b6uq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 484, in run_setup
#0 40.69           super(_BuildMetaLegacyBackend,
#0 40.69         File "/tmp/pip-build-env-nfj5b6uq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 335, in run_setup
#0 40.69           exec(code, locals())
#0 40.69         File "<string>", line 122, in <module>
#0 40.69       AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
#0 40.69       [end of output]
#0 40.69   
#0 40.69   note: This error originates from a subprocess, and is likely not a problem with pip.
#0 40.69 error: subprocess-exited-with-error
#0 40.69 
#0 40.69 × Getting requirements to build wheel did not run successfully.
#0 40.69 │ exit code: 1
#0 40.69 ╰─> See above for output.
#0 40.69 
#0 40.69 note: This error originates from a subprocess, and is likely not a problem with pip.
#0 41.05 
#0 41.05 [notice] A new release of pip is available: 23.0 -> 23.0.1
#0 41.05 [notice] To update, run: pip install --upgrade pip
------
failed to solve: failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to build LLB: executor failed running [/bin/bash -ceuxo pipefail # Build requirements
/bin/bash /docker/install-container-dep.sh --use-pep517 --upgrade -r ${ROOT}/requirements.txt
]: runc did not terminate sucessfully
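
A common workaround, sketched here with placeholder file names: deepspeed's setup.py runs "import torch" at build time, so torch must already be installed by an earlier, separate pip invocation rather than by the same requirements.txt.

```shell
# Hedged sketch: split the requirements so torch installs first
# (file names are placeholders, not from this repository).
printf 'torch\n' > /tmp/requirements-first.txt
printf 'deepspeed\n' > /tmp/requirements-second.txt

# The order of the two invocations is the point (commented out: needs network):
# pip install -r /tmp/requirements-first.txt
# pip install -r /tmp/requirements-second.txt
cat /tmp/requirements-first.txt /tmp/requirements-second.txt
```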

Cloud Gpu

Can I run kohya on a Linux cloud GPU machine, with no interface apart from a Jupyter notebook? It doesn't support the triton UI part where it asks for a file locally.

Multi GPU use

Hi,

It seems the GUI script is not ready for multi-GPU use.
I have a dual-3090 setup.
Running it with accelerate starts two servers:

Load CSS...
Load CSS...
Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
Running on local URL:  http://127.0.0.1:7862

To create a public link, set `share=True` in `launch()`.

Is there a way to use accelerate with the UI, or only with the train_network.py script?
Can you provide some guidance?
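
One possible explanation, sketched with an invented helper name: accelerate launch spawns one process per GPU and sets LOCAL_RANK for each, so every rank starts its own Gradio server unless guarded. A rank-0 guard could look roughly like this:

```shell
# Hedged sketch (helper name is hypothetical, not from the repo):
# only the process with LOCAL_RANK 0 should start the web UI.
should_launch_ui() {
  [ "${LOCAL_RANK:-0}" = "0" ]
}

LOCAL_RANK=0
should_launch_ui && echo "rank 0: start the web UI"
LOCAL_RANK=1
should_launch_ui || echo "rank 1: skip the web UI, train only"
```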

Packages do not match the requirements file

Greetings,
Thanks for creating this Docker wrapper for kohya_ss.

I am currently facing some issues setting it up.
I am running Ubuntu 22.10.
I downloaded the precompiled *.whl and *.deb files mentioned in the README rather than compiling them myself.

Basically I've downloaded these files:

From PyPI: (kohya_ss/data)

tensorflow-2.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl
torchvision-0.14.1-cp310-cp310-manylinux1_x86_64.whl
xformers-0.0.17-cp310-cp310-manylinux2014_x86_64.whl

From Nvidia (ubuntu2204): (kohya_ss/data/libs)

libnvinfer8_8.6.0.12-1+cuda12.0_amd64.deb
libnvinfer-dev_8.6.0.12-1+cuda12.0_amd64.deb
libnvinfer-plugin8_8.6.0.12-1+cuda12.0_amd64.deb
libnvinfer-plugin-dev_8.6.0.12-1+cuda12.0_amd64.deb

This is the output of nvidia-smi. I am downloading the *.cuda12.0 packages since my CUDA version is 12.
I had already installed nvidia-container-toolkit by following these instructions.

Thu Apr 13 00:43:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P0     6W /  50W |      6MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2580      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

However, I currently get an error when running docker compose --profile kohya up --build.

The process fails with a hash-mismatch error, which is very odd since I downloaded everything from official sources.

#0 125.3 ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
#0 125.3     nvidia-cudnn-cu11==8.5.0.96 from https://files.pythonhosted.org/packages/dc/30/66d4347d6e864334da5bb1c7571305e501dcb11b9155971421bb7bb5315f/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (from torch->triton==2.0.0):
#0 125.3         Expected sha256 402f40adfc6f418f9dae9ab402e773cfed9beae52333f6d86ae3107a1b9527e7
#0 125.3              Got        a1b11129590baa3e90cc040d742315cb75c12a113feff518f6338254b8a4d84c

I am by no means an expert on this topic, any guidance will be very well appreciated 🙇🏻‍♂️
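
One way to narrow this down, sketched with stand-in file contents: recompute the local artifact's sha256 and compare it by hand with the "Expected sha256" that pip printed; a mismatch usually means a corrupted download or a stale cached copy.

```shell
# Hedged sketch (file and contents are stand-ins for the real wheel):
printf 'stand-in wheel contents' > /tmp/pkg.whl
actual=$(sha256sum /tmp/pkg.whl | awk '{print $1}')
echo "local sha256: $actual"
# If this differs from pip's expected value, the cached copy may be stale:
# pip cache purge   # then retry the build
```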

yaml: unmarshal errors - mapping key "<<" already defined

I had an issue with the mapping keys in docker-compose.yaml.

I replaced

kohya: &kohya_service
  <<: *base_service
  <<: *gpu_service
  <<: *kohya_service_envs

with

kohya: &kohya_service
  <<: [*base_service, *gpu_service, *kohya_service_envs]

and

kohya_debug:
  <<: *kohya_service
  <<: *kohya_service_envs

with

kohya_debug:
  <<: [*kohya_service, *kohya_service_envs]

That seemed to resolve the issue.
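
For reference, a self-contained sketch of the list form of the YAML merge key that Compose's parser accepts (the anchor names mirror those above; the surrounding keys are invented for illustration):

```yaml
x-base: &base_service
  restart: unless-stopped
x-gpu: &gpu_service
  runtime: nvidia
x-envs: &kohya_service_envs
  environment:
    - TZ=UTC
services:
  kohya: &kohya_service
    <<: [*base_service, *gpu_service, *kohya_service_envs]
  kohya_debug:
    <<: [*kohya_service, *kohya_service_envs]
```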

Idea: Auto-install whl and deb files

This would be a useful feature!

Especially since the download sources are sometimes hard to find (screenshots omitted).

This can be done either by simply injecting this feature into the Dockerfile or by the method from https://github.com/AbdBarho/stable-diffusion-webui-docker with download profile:
https://github.com/AbdBarho/stable-diffusion-webui-docker/blob/master/docker-compose.yml#L21-L25
https://github.com/AbdBarho/stable-diffusion-webui-docker/tree/master/services/download

You could use conda (I recommend trying it) or pythonhosted.
You could also add automatic compilation of the necessary files.
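
Following the AbdBarho-style approach, a download profile could look roughly like this (a sketch with placeholder image, paths, and URL, not an implemented profile):

```yaml
services:
  download:
    profiles: ["download"]
    image: alpine:3
    volumes:
      - ./data:/data
    # placeholder URL; the real list would cover each required .whl/.deb
    entrypoint: ["wget", "-nc", "-P", "/data", "https://example.com/some.whl"]
```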

Question: Running without using docker hosts for X server

In readme:

Remember to allow docker hosts to clients to your X server.

# Unsecure way to do (I take no responsability if you do this way) !! 
# You should add the exact host but I leave that to you.
xhost +

It's not very safe or convenient. I would like to be able to specify paths instead.
For example, right now you can specify the path to the config, but you cannot load the config (screenshot omitted).

It would be more convenient to use Docker without this external manipulation :)
It would also work on Wayland.

Unable to run docker compose

I ran the command you documented to launch docker compose, after installing the NVIDIA extension for Docker, but the container returns the output below.

I also tried changing CUDA_DRIVER in the compose file to cuda-drivers-545, but I get the same problem.

46.44 Setting up libcuda1:amd64 (545.23.08-1) ...
46.53 Configuring libcuda1:amd64
46.53 --------------------------
46.53
46.53 Mismatching nvidia kernel module loaded
46.53
46.53 The NVIDIA driver that is being installed (version 545.23.08) does not
46.53 match the nvidia kernel module currently loaded (version 545.29.06).
46.53
46.53 The X server, OpenGL, and GPGPU applications may not work properly.
46.53
46.53 The easiest way to fix this is to reboot the machine once the installation
46.53 has finished. You can also stop the X server (usually by stopping the login
46.53 manager, e.g. gdm3, sddm, or xdm), manually unload the module ("modprobe -r
46.53 nvidia"), and restart the X server.
46.53
46.55 Setting up libnvidia-cfg1:amd64 (545.23.08-1) ...
46.57 Setting up nvidia-opencl-icd:amd64 (545.23.08-1) ...
46.58 Setting up libnvidia-allocator1:amd64 (545.23.08-1) ...
46.60 Setting up libglx-nvidia0:amd64 (545.23.08-1) ...
46.61 Setting up nvidia-kernel-support (545.23.08-1) ...
46.65 Setting up xserver-xorg-video-nvidia (545.23.08-1) ...
46.77 Setting up libnvidia-pkcs11:amd64 (545.23.08-1) ...
46.79 Setting up libnvoptix1:amd64 (545.23.08-1) ...
46.80 Setting up nvidia-vulkan-icd:amd64 (545.23.08-1) ...
46.82 Setting up libnvidia-fbc1:amd64 (545.23.08-1) ...
46.84 Setting up nvidia-vdpau-driver:amd64 (545.23.08-1) ...
46.85 Setting up libgl1-nvidia-glvnd-glx:amd64 (545.23.08-1) ...
46.87 Setting up libgles-nvidia1:amd64 (545.23.08-1) ...
46.88 Setting up libcudadebugger1:amd64 (545.23.08-1) ...
46.90 Setting up libegl-nvidia0:amd64 (545.23.08-1) ...
46.91 Setting up nvidia-settings (545.23.08-1) ...
46.93 Setting up nvidia-smi (545.23.08-1) ...
46.94 Setting up libgles-nvidia2:amd64 (545.23.08-1) ...
46.98 Setting up nvidia-driver-bin (545.23.08-1) ...
47.00 Setting up libnvcuvid1:amd64 (545.23.08-1) ...
47.02 Setting up nvidia-persistenced (545.23.08-1) ...
47.06 adduser: Warning: The home dir /var/run/nvpd/ you specified can't be accessed: No such file or directory
47.06 Adding system user nvpd' (UID 101) ... 47.06 Adding new group nvpd' (GID 107) ...
47.08 Adding new user nvpd' (UID 101) with group nvpd' ...
47.13 Not creating home directory `/var/run/nvpd/'.
47.14 invoke-rc.d: could not determine current runlevel
47.15 invoke-rc.d: policy-rc.d denied execution of start.
47.23 Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-persistenced.service → /lib/systemd/system/nvidia-persistenced.service.
47.24 Setting up libnvidia-opticalflow1:amd64 (545.23.08-1) ...
47.26 Setting up nvidia-egl-icd:amd64 (545.23.08-1) ...
47.27 Setting up libnvidia-encode1:amd64 (545.23.08-1) ...
47.29 Setting up nvidia-driver-libs:amd64 (545.23.08-1) ...
47.41 Processing triggers for nvidia-alternative (545.23.08-1) ...
47.48 update-alternatives: updating alternative /usr/lib/nvidia/current because link group nvidia has changed slave links
47.51 Setting up nvidia-kernel-dkms (545.23.08-1) ...
47.58 Loading new nvidia-current-545.23.08 DKMS files...
47.66 It is likely that 6.5.0-17-generic belongs to a chroot's host
47.66 Building for 6.1.0-18-amd64
47.69 Building initial module for 6.1.0-18-amd64
70.20 Error! Bad return status for module build on kernel: 6.1.0-18-amd64 (x86_64)
70.20 Consult /var/lib/dkms/nvidia-current/545.23.08/build/make.log for more information.
70.20 dpkg: error processing package nvidia-kernel-dkms (--configure):
70.20 installed nvidia-kernel-dkms package post-installation script subprocess returned error exit status 10
70.20 dpkg: dependency problems prevent configuration of nvidia-driver:
70.20 nvidia-driver depends on nvidia-kernel-dkms (= 545.23.08-1) | nvidia-kernel-545.23.08 | nvidia-kernel-open-dkms (= 545.23.08-1); however:
70.20 Package nvidia-kernel-dkms is not configured yet.
70.20 Package nvidia-kernel-545.23.08 is not installed.
70.20 Package nvidia-kernel-dkms which provides nvidia-kernel-545.23.08 is not configured yet.
70.20 Package nvidia-kernel-open-dkms is not installed.
70.20
70.20 dpkg: error processing package nvidia-driver (--configure):
70.20 dependency problems - leaving unconfigured
70.20 dpkg: dependency problems prevent configuration of cuda-drivers-545:
70.20 cuda-drivers-545 depends on nvidia-driver (>= 545.23.08); however:
70.20 Package nvidia-driver is not configured yet.
70.20
70.20 dpkg: error processing package cuda-drivers-545 (--configure):
70.20 dependency problems - leaving unconfigured
70.20 Processing triggers for libgdk-pixbuf-2.0-0:amd64 (2.42.10+dfsg-1+b1) ...
70.24 Processing triggers for libc-bin (2.36-9+deb12u4) ...
70.26 Processing triggers for update-glx (1.2.2) ...
70.29 Processing triggers for glx-alternative-nvidia (1.2.2) ...
70.35 update-alternatives: using /usr/lib/nvidia to provide /usr/lib/glx (glx) in auto mode
70.40 Processing triggers for glx-alternative-mesa (1.2.2) ...
70.44 Processing triggers for libc-bin (2.36-9+deb12u4) ...
70.49 Errors were encountered while processing:
70.49 nvidia-kernel-dkms
70.49 nvidia-driver
70.49 cuda-drivers-545
70.57 W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list:4 and /etc/apt/sources.list.d/debian.sources:1
70.57 W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list:4 and /etc/apt/sources.list.d/debian.sources:1
70.57 E: Sub-process /usr/bin/dpkg returned an error code (1)

failed to solve: process "/bin/bash -ceuxo pipefail # cuda driver update\napt-get update\napt-get -y install $CUDA_DRIVERS\napt-get -y install $CUDA_VERSION\n" did not complete successfully: exit code: 100
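The DKMS error above is a kernel-module build failing inside the container against mismatched kernel headers (note the log: it detects 6.5.0-17-generic as the chroot's host but builds for 6.1.0-18-amd64). Kernel drivers generally belong on the host; the container only needs the user-space CUDA libraries plus GPU access through the NVIDIA container runtime. A hedged compose fragment (the service name `kohya` is assumed from the build output):

```yaml
# docker-compose fragment (sketch): rely on the host's NVIDIA driver via
# the container runtime instead of installing cuda-drivers in the image.
services:
  kohya:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```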

Incompatible versions in requirements

https://github.com/P2Enjoy/kohya_ss/blob/master/requirements.txt
has
deepspeed
triton==2.0.0.dev20230208

but trying to install deepspeed (0.8.1) raises
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[WARNING] One can disable sparse_attn with DS_BUILD_SPARSE_ATTN=0
[ERROR] Unable to pre-compile sparse_attn
(compiles OK with export DS_BUILD_SPARSE_ATTN=0; pip3 install deepspeed)

Also, on my build, deepspeed didn't build when installed in one go.
I have to install torch first, and only then install deepspeed; otherwise the deepspeed install complains about missing torch when both come from the same requirements.txt.
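Putting those two observations together, a hedged Dockerfile sketch (versions taken from the report above; the exact layer split in this repo may differ) is to install torch in its own layer, then install deepspeed with the sparse_attn op disabled:

```dockerfile
# torch first, in its own layer, so deepspeed's setup can import it at build time
RUN pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
# skip the sparse_attn op, which fails to pre-compile against triton 2.x
ENV DS_BUILD_SPARSE_ATTN=0
RUN pip3 install deepspeed==0.8.1
```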

Runpod.io

Can't seem to make it run with runpod.io

Failed Install on Jetson AGX Orin 64GB Developer Kit

Greetings, I have been searching for a way to run kohya_ss on the Jetson AGX Orin within the NVIDIA container, so it will utilize the GPU. After cloning this repo and running the docker compose command, I received the following message.

user@ubuntu:~/kohya_ss-docker$ docker compose --profile kohya up --build
[+] Building 1.1s (22/27) docker:default
=> [kohya internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 5.35kB 0.0s
=> [kohya] resolve image config for docker.io/docker/dockerfile:1 0.3s
=> CACHED [kohya] docker-image://docker.io/docker/dockerfile:1@sha256:ac 0.0s
=> [kohya internal] load metadata for docker.io/library/python:3.10-slim 0.2s
=> [kohya internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [kohya internal] load build context 0.0s
=> => transferring context: 187B 0.0s
=> [kohya base 1/7] FROM docker.io/library/python:3.10-slim@sha256:4bd9a 0.0s
=> CACHED [kohya base 2/7] RUN <<EOF (# apt for general container depend 0.0s
=> CACHED [kohya base 3/7] RUN <<EOF (# apt for extensions/custom script 0.0s
=> CACHED [kohya base 4/7] RUN <<EOF (# apt configurations...) 0.0s
=> CACHED [kohya base 5/7] RUN <<EOF (# cuda configurations...) 0.0s
=> CACHED [kohya base 6/7] COPY ./scripts/install-container-dep.sh /dock 0.0s
=> CACHED [kohya base 7/7] RUN <<EOF (# cuda cudnn + cutlass + tensorrt. 0.0s
=> CACHED [kohya kohya_base 1/8] RUN <<EOF (git clone https://github.com 0.0s
=> CACHED [kohya kohya_base 2/8] WORKDIR /koyah_ss 0.0s
=> CACHED [kohya kohya_base 3/8] RUN <<EOF (# Build requirements...) 0.0s
=> CACHED [kohya kohya_base 4/8] RUN <<EOF (# tensorflow...) 0.0s
=> CACHED [kohya kohya_base 5/8] RUN <<EOF (# torch, torchvision, torcha 0.0s
=> CACHED [kohya kohya_base 6/8] RUN <<EOF (# xformers...) 0.0s
=> CACHED [kohya kohya_base 7/8] RUN <<EOF (# deepspeed...) 0.0s
=> CACHED [kohya kohya_base 8/8] RUN <<EOF (#jax/tpu...) 0.0s
=> ERROR [kohya kohya_cuda 1/2] RUN <<EOF (# Hotfix for libnvinfer7...) 0.3s

[kohya kohya_cuda 1/2] RUN <<EOF (# Hotfix for libnvinfer7...):
0.276 + ln -s /venv/lib/python3.10/site-packages/tensorrt/libnvinfer.so.8 /venv/lib/python3.10/site-packages/tensorrt/libnvinfer.so.7
0.278 ln: failed to create symbolic link '/venv/lib/python3.10/site-packages/tensorrt/libnvinfer.so.7': No such file or directory


failed to solve: process "/bin/bash -ceuxo pipefail # Hotfix for libnvinfer7\nln -s $TENSORRT_PATH/libnvinfer.so.8 $TENSORRT_PATH/libnvinfer.so.7\nln -s $TENSORRT_PATH/libnvinfer_plugin.so.8 $TENSORRT_PATH/libnvinfer_plugin.so.7\n" did not complete successfully: exit code: 1

Any thoughts on getting past this, so we can move on with some training?
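One way past this failure is to make the hotfix conditional, so the build either creates the symlinks or reports that the tensorrt libraries are absent (plausible on Jetson/aarch64, where the pip tensorrt wheel may not be published). A sketch, assuming the repo's TENSORRT_PATH layout:

```shell
# Create the libnvinfer.so.7 compatibility symlinks only if the .so.8
# libraries actually exist; otherwise report why, instead of failing on ln.
TENSORRT_PATH="${TENSORRT_PATH:-/venv/lib/python3.10/site-packages/tensorrt}"
for lib in libnvinfer libnvinfer_plugin; do
  if [ -e "$TENSORRT_PATH/$lib.so.8" ]; then
    ln -sf "$TENSORRT_PATH/$lib.so.8" "$TENSORRT_PATH/$lib.so.7"
  else
    echo "missing $TENSORRT_PATH/$lib.so.8 - is the tensorrt wheel installed for this arch?" >&2
  fi
done
```

This does not make TensorRT work on the Orin; it only turns a cryptic symlink failure into an actionable message about the missing wheel.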

Help - please forgive my dumb-userness

Firstly, thank you for preparing a Linux-based solution (who really does anything on Windows these days?).

Secondly, please help. Could you provide more specific instructions on how to set up the Docker image? For example:

  1. Let's say I want to take the easy way out and use pip to install the .whl files instead of building them myself. At what point in the process should I do this?
  2. Where do I find libnvinfer*7.deb? All my searching seems to indicate that I need to be a member of the NVIDIA developer program in order to download it.
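For question 1, one hedged sketch is to replace the local-wheel install step in the Dockerfile with a plain pip install from the PyTorch index, so no .whl files need to be placed in ./docker beforehand (the versions below are examples, not values pinned by this repo). For question 2, see the next issue: the NVIDIA repos now appear to ship only libnvinfer*8 packages.

```dockerfile
# Sketch: install published wheels instead of COPYing locally built ones.
RUN pip3 install torch==1.13.1+cu117 torchvision==0.14.1+cu117 \
    --extra-index-url https://download.pytorch.org/whl/cu117
```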

It is impossible to download libnvinfer7.deb packages

You should download and place the relevant libnvinfer*7.deb files here so they can be installed offline in your container.
For some strange reason, the CUDA package seems not to install these, although they are recommended by TensorFlow:

libnvinfer7_7.2.2-1+cuda11.1_amd64
libnvinfer-dev_7.2.2-1+cuda11.1_amd64
libnvinfer-plugin7_7.2.2-1+cuda11.1_amd64
libnvinfer-plugin-dev_7.2.2-1+cuda11.1_amd64

Use NVidia Debian Package website 1 and NVidia Debian Package website 2 to search for and download the libraries if you're not willing to compile them yourself, then put the *.deb files here and build the Docker image.

Neither deb-package repo includes libnvinfer*7 packages; only libnvinfer*8 packages are available.

The built image seems to be very huge

Hello!
Thanks for sharing this repo! I have a question.
Is it normal for the build to produce such a big image?
There are loads of layers, some of them very large; the total image size for me is 18 GB.

I'm not really familiar with the kohya app or Python, so I have no idea whether that's a normal size for something like this.

I've attached the export for each layer. Is there a way to reduce the size?

1 74.8 MB ADD file:209589a8bdb5a3788ee42ecdbccbbb561835dab96b0d8286bb5a2229d2f41be7 in /
2 0 B CMD ["bash"]
3 0 B ENV PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
4 0 B ENV LANG=C.UTF-8
5 9.2 MB RUN RUN set -eux; apt-get update; apt-get install -y --no-install-recommends ca-certificates netbase tzdata ; rm -rf ...
6 0 B ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D
7 0 B ENV PYTHON_VERSION=3.10.12
8 50.9 MB RUN RUN set -eux; savedAptMark="$(apt-mark showmanual)"; apt-get update; apt-get install -y --no-install-recommends dpkg-...
9 32 B RUN RUN set -eux; for src in idle3 pydoc3 python3 python3-config; do dst="$(echo "$src" | tr -d 3)"; [ -s "/usr/local/bin/...
10 0 B ENV PYTHON_PIP_VERSION=23.0.1
11 0 B ENV PYTHON_SETUPTOOLS_VERSION=65.5.1
12 0 B ENV PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/0d8570dc44796f4369b652222cf176b3db6ac70e/public/get-pip.py
13 0 B ENV PYTHON_GET_PIP_SHA256=96461deced5c2a487ddc65207ec5a9cffeca0d34e7af7ea1afc470ff0d746207
14 12.2 MB RUN RUN set -eux; savedAptMark="$(apt-mark showmanual)"; apt-get update; apt-get install -y --no-install-recommends wget; ...
15 0 B CMD ["python3"]
16 0 B SHELL [/bin/bash -ceuxo pipefail]
17 0 B ARG TORCH_CUDA_ARCH_LIST
18 0 B ARG CUDNN_VERSION
19 0 B ARG NVCC_FLAGS
20 0 B ARG pyver
21 0 B ARG PIP_REPOSITORY
22 0 B ARG CUDA_KEYRING
23 0 B ENV NVCC_FLAGS=--use_fast_math
24 0 B ENV TORCH_CUDA_ARCH_LIST=7.5+PTX
25 0 B ENV DEBIAN_FRONTEND=noninteractive
26 0 B ENV PIP_PREFER_BINARY=1
27 0 B ENV PIP_REPOSITORY="https://download.pytorch.org/whl/cu118"
28 0 B ENV PIP_NO_CACHE_DIR=1
29 0 B ENV NVIDIA_VISIBLE_DEVICES=all
30 0 B ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
31 343.4 MB RUN |6 TORCH_CUDA_ARCH_LIST=7.5+PTX CUDNN_VERSION="8.6.0.163" NVCC_FLAGS=--use_fast_math pyver=3.10 PIP_REPOSITORY="https://dow...
32 67.1 MB RUN |6 TORCH_CUDA_ARCH_LIST=7.5+PTX CUDNN_VERSION="8.6.0.163" NVCC_FLAGS=--use_fast_math pyver=3.10 PIP_REPOSITORY="https://dow...
33 351 kB RUN |6 TORCH_CUDA_ARCH_LIST=7.5+PTX CUDNN_VERSION="8.6.0.163" NVCC_FLAGS=--use_fast_math pyver=3.10 PIP_REPOSITORY="https://dow...
34 1.9 MB RUN |6 TORCH_CUDA_ARCH_LIST=7.5+PTX CUDNN_VERSION="8.6.0.163" NVCC_FLAGS=--use_fast_math pyver=3.10 PIP_REPOSITORY="https://dow...
35 293 B COPY ./scripts/install-container-dep.sh /docker/ # buildkit
36 3.1 GB RUN |6 TORCH_CUDA_ARCH_LIST=7.5+PTX CUDNN_VERSION="8.6.0.163" NVCC_FLAGS=--use_fast_math pyver=3.10 PIP_REPOSITORY="https://dow...
37 0 B SHELL [/bin/bash -ceuxo pipefail]
38 0 B ARG TORCH_COMMAND
39 0 B ARG XFORMERS_COMMAND
40 0 B ARG TENSORFLOW_COMMAND
41 0 B ARG DS_BUILD_OPS
42 0 B ARG pyver
43 0 B ARG DEEPSPEED
44 0 B ARG DEEPSPEED_VERSION
45 0 B ARG TORCH_CUDA_ARCH_LIST
46 0 B ARG TRITON_VERSION
47 0 B ARG JAX
48 0 B ARG TPU
49 0 B ENV pyver=3.10
50 0 B ENV ROOT=/koyah_ss
51 0 B ENV CUDART_PATH=/venv/lib/python3.10/site-packages/nvidia/cuda_runtime
52 0 B ENV CUDNN_PATH=/venv/lib/python3.10/site-packages/nvidia/cudnn
53 0 B ENV TENSORRT_PATH=/venv/lib/python3.10/site-packages/tensorrt
54 0 B ENV LD_LIBRARY_PATH=/venv/lib/python3.10/site-packages/tensorrt:/venv/lib/python3.10/site-packages/nvidia/cudnn/lib:/venv/lib/p...
55 7.8 MB RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
56 0 B WORKDIR /koyah_ss
57 3 MB RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
58 0 B RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
59 0 B RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
60 0 B RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
61 111 B RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
62 0 B ENV TPU_LIBRARY_PATH=/venv/lib/python3.10/site-packages/libtpu/
63 0 B RUN |11 TORCH_COMMAND=/bin/bash XFORMERS_COMMAND=/bin/bash TENSORFLOW_COMMAND=/bin/bash DS_BUILD_OPS=1 pyver=3.10 DEEPSPEED=Fal...
64 0 B ARG CUDA_VERSION
65 0 B ARG CUDA_DRIVERS
66 0 B ARG pyver
67 0 B ENV CUDA_HOME=/usr/local/cuda
68 0 B ENV PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda/bin
69 0 B ENV LD_LIBRARY_PATH=/usr/local/cuda/lib:/venv/lib/python3.10/site-packages/tensorrt:/venv/lib/python3.10/site-packages/nvidia/c...
70 0 B RUN |3 CUDA_VERSION=cuda-11-8 CUDA_DRIVERS=cuda-drivers-535 pyver=3.10 /bin/bash -ceuxo pipefail # Hotfix for libnvinfer7 ln -s...
71 8.9 GB RUN |3 CUDA_VERSION=cuda-11-8 CUDA_DRIVERS=cuda-drivers-535 pyver=3.10 /bin/bash -ceuxo pipefail # cuda driver update apt-get u...
72 0 B WORKDIR /koyah_ss
73 5.3 GB RUN /bin/bash -ceuxo pipefail # installing kohya trainer scripts git pull --rebase /bin/bash /docker/install-container-dep.sh -...
74 3.3 kB COPY ./scripts/*.sh /docker/ # buildkit
75 3 kB RUN /bin/bash -ceuxo pipefail chmod +x /docker/{run,mount,debug}.sh # buildkit
76 0 B EXPOSE map[7680/tcp:{}]
77 0 B EXPOSE map[6006/tcp:{}]
78 0 B ENTRYPOINT ["/bin/bash" "-ceuxo" "pipefail" "$RUNNER $RUN_ARGS"]
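For reference, layer sizes can also be inspected directly; the 8.9 GB CUDA driver layer and the 5.3 GB kohya dependency layer listed above are the expected offenders. The image tag below is an assumption (whatever docker compose produced locally):

```shell
# List the largest layers first; the fallback message keeps this runnable
# on machines where docker or the image is absent.
docker history --format '{{.Size}}\t{{.CreatedBy}}' kohya_ss-docker-kohya:latest 2>/dev/null \
  | sort -rh \
  | head -5 \
  || echo "docker not available or image not built yet"
```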

Launch with Gradio share

Hey, thanks for making this.

I was able to get it to run but is there any way to set share=True in the launch function? It doesn't appear to be among the args I can set in docker_compose.yaml, as kohya_gui.py doesn't normally accept it.
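A hedged sketch of one workaround, assuming kohya_gui.py parses its flags with argparse (the --share flag name and the launch wiring below are assumptions, not the repo's actual API): add the flag to the parser, then forward it to Gradio's launch().

```python
import argparse

# Hypothetical: add a --share flag to the GUI's argument parser so it
# could be passed through RUN_ARGS in docker-compose.yaml.
parser = argparse.ArgumentParser()
parser.add_argument("--share", action="store_true",
                    help="create a public Gradio link (share=True)")
args = parser.parse_args(["--share"])  # simulating `--share` on the CLI

# ...inside kohya_gui.py the value would then be forwarded, e.g.:
# interface.launch(share=args.share, server_name="0.0.0.0")
print(args.share)  # → True
```

If the upstream script hard-codes its argument list, this still requires a small patch to kohya_gui.py rather than a compose-only change.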

Not startable

Error:
kohya-1 | /venv/lib/python3.10/site-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Parallel'>: No known documentation group for module 'gradio.mix'
kohya-1 | warnings.warn(f"Could not get documentation group for {cls}: {exc}")
kohya-1 | /venv/lib/python3.10/site-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Series'>: No known documentation group for module 'gradio.mix'
kohya-1 | warnings.warn(f"Could not get documentation group for {cls}: {exc}")

whl and deb links

Most of the specific versions of the .whl and .deb files mentioned in the dot files and readme respectively don't seem to be on the websites mentioned in the readme.

So I'm just going to use the ones from pytorch.org, developer.download.nvidia.com and pythonhosted.org. Some versions I couldn't find at all, so am trying the closest ones available on pypi.org.

Unable to find GPU

I was able to build and run the application offline on my own hardware. However, I was met with:

Torch reports CUDA not available,
and I cannot train anything with FP16 precision, instead having to train using the CPU exclusively.

About my system:

I installed the NVIDIA Container Toolkit following these instructions, configured it, and restarted the Docker daemon:
[https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html]

Output for driver and CUDA from nvidia-smi:

NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4

System information:
Linux Mint 20 x86_64
5.15.0-83-generic

GPU information:
RTX 3070
registered as dev/nvidia0

Any ideas what might be causing this?

Edit:
After the CPU training finished, I got another error that may shed light on the cause:

kohya-docker-kohya-1 | CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
kohya-docker-kohya-1 | CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
kohya-docker-kohya-1 | CUDA exception! Error code: forward compatibility was attempted on non supported HW
kohya-docker-kohya-1 | CUDA exception! Error code: initialization error
kohya-docker-kohya-1 | CUDA SETUP: Highest compute capability among GPUs detected: None
kohya-docker-kohya-1 | CUDA exception! Error code: forward compatibility was attempted on non supported HW
kohya-docker-kohya-1 | CUDA SETUP: CUDA version lower than 11 are currenlty not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
kohya-docker-kohya-1 | CUDA SETUP: Detected CUDA version 00
kohya-docker-kohya-1 | CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda00_nocublaslt.so
kohya-docker-kohya-1 | CUDA SETUP: Defaulting to libbitsandbytes.so...
kohya-docker-kohya-1 | CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
kohya-docker-kohya-1 | CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.

My current CUDA version:
Cuda compilation tools, release 10.1, V10.1.243

I'm guessing my CUDA version is too old and not configured properly, since it isn't found on the path. I'll try reinstalling version 11 or higher and rebuilding the Docker image.
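A minimal in-container diagnostic can separate "venv is broken" from "runtime/driver mismatch" before a long training run. The only assumption is that torch lives in the container's Python environment; note that a 470.x host driver tops out at CUDA 11.4, which matches the "forward compatibility was attempted on non supported HW" error above when pairing it with cu118 wheels.

```python
import importlib.util

# If torch isn't importable, the venv itself is broken; if it imports but
# CUDA is unavailable, suspect a host-driver vs. wheel CUDA-build mismatch.
has_torch = importlib.util.find_spec("torch") is not None
cuda_ok = False
if has_torch:
    import torch
    cuda_ok = torch.cuda.is_available()
    print("torch:", torch.__version__, "| cuda available:", cuda_ok)
else:
    print("torch is not installed in this environment")
```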
