crispycrafter / cdeep3m-docker
A docker container for cdeep3m
Home Page: https://github.com/CRBS/cdeep3m
License: Other
First of all, thanks for sharing the docker build.
I've been testing cdeep3m with docker.
Training (and retraining with a pre-trained model) on my own dataset worked fine, but I hit an octave error when predicting the boundary map with the trained model.
This is what the program said:
[error msg]
$ docker-compose up
Creating network "cdeep3m-docker_default" with the default driver
Creating cdeep3m-docker_cdeep3m_1 ... done
Attaching to cdeep3m-docker_cdeep3m_1
cdeep3m_1 | octave: X11 DISPLAY environment variable not set
cdeep3m_1 | octave: disabling GUI features
cdeep3m_1 | Starting Image Augmentation
cdeep3m_1 | Check image size of:
cdeep3m_1 | /data/images/roi9
cdeep3m_1 | Reading file: /data/images/roi9/roi09_0001.png
cdeep3m_1 | z_blocks =
cdeep3m_1 |
cdeep3m_1 | 1 64
cdeep3m_1 |
cdeep3m_1 | panic: panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | panic: attempted clean up apparently failed -- aborting...
cdeep3m_1 | Segmentation fault -- stopping myself...
cdeep3m_1 | attempting to save variables to 'octave-workspace'...
cdeep3m_1 | /home/cdeep3m/runprediction.sh: line 124: 13 Aborted (core dumped) DefDataPackages.m "$images" "$augimages"
cdeep3m_1 | ERROR, a non-zero exit code (134) was received from: DefDataPackages.m "/data/images/roi9" "/data/predictout/my_25k/roi9/augimages"
cdeep3m-docker_cdeep3m_1 exited with c
I googled it, and the error seems to come from octave.
As far as I can tell, DefDataPackages.m did its job properly, but octave crashed after DefDataPackages finished.
Has anybody run into the same error, and how can I solve it?
Thanks.
I installed all the CUDA drivers and nvidia-docker, but I still get this error:
/home/cdeep3m/trainworker.sh: line 99: nvidia-smi: command not found cdeep3m_1 | ERROR unable to get count of GPU(s). Is nvidia-smi working? cdeep3m_1 | ERROR, a non-zero exit code (4) was received from: trainworker.sh --numiterations 10000
Is it because I have CUDA 10 and Ubuntu 18.04 installed on my system?
Running sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
shows that nvidia-docker is working successfully.
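Since the plain nvidia/cuda test works, a next diagnostic step (a sketch; the image tag cdeep3m:v0.0.1 is assumed from other reports here) is to run nvidia-smi inside the cdeep3m image itself with the nvidia runtime selected. nvidia-docker only injects the driver tools such as nvidia-smi when the nvidia runtime is active, so "nvidia-smi: command not found" usually means the job was launched without it:

```shell
# Hypothetical diagnostic: run nvidia-smi inside the cdeep3m image the same
# way the working nvidia/cuda test was run. If this fails while the
# nvidia/cuda test succeeds, the training job is likely being launched
# without --runtime=nvidia (e.g. via docker-compose with no runtime set).
if command -v docker >/dev/null 2>&1; then
  sudo docker run --runtime=nvidia --rm --entrypoint nvidia-smi cdeep3m:v0.0.1 \
    || echo "nvidia runtime not reachable from this image"
else
  echo "docker is not installed on this machine"
fi
```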
Issue found when running the following command. Any advice?
docker run -it cdeep3m:v0.0.1 /train/sbem/mitochrondria/xy5.9nm40nmz/30000iterations_train_out /home/cdeep3m/cdeep3m-1.6.2/mito_testsample/testset/ /train/predictout30k
octave: X11 DISPLAY environment variable not set
octave: disabling GUI features
Starting Image Augmentation
Check image size of:
/home/cdeep3m/cdeep3m-1.6.2/mito_testsample/testset/
Reading file: /home/cdeep3m/cdeep3m-1.6.2/mito_testsample/testset/images.081.png
z_blocks =
1 5
Start up worker to generate packages to process
Start up worker to run prediction on packages
Start up worker to run post processing on packages
To see progress run the following command in another window:
tail -f /train/predictout30k/logs/*.log
octave: X11 DISPLAY environment variable not set
octave: disabling GUI features
/train/predictout30k/1fm not a directory
Please use: EnsemblePredictions ./inputdir1 ./inputdir2 ./inputdir3 ./outputdir
ERROR file found. Something went wrong
ERROR, a non-zero exit code (127) received from PreprocessPackage.m 001 01 1fm 1
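Exit code 127 from a shell means "command not found", so a useful check (a sketch; the entrypoint override and image tag are assumptions) is whether the scripts the pipeline calls are actually on PATH inside the container:

```shell
# Exit status 127 = "command not found". Check whether PreprocessPackage.m
# and octave are visible on PATH inside the image (tag assumed):
if command -v docker >/dev/null 2>&1; then
  docker run --rm --entrypoint bash cdeep3m:v0.0.1 \
    -lc 'command -v PreprocessPackage.m; command -v octave' \
    || echo "lookup failed: scripts not on PATH inside the container"
else
  echo "docker is not installed on this machine"
fi
```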
runtraining.sh needs to locate a folder with augmented data (augdata) and builds the trained model in another folder (trainout). When I include the paths to these folders as follows:
sudo docker run -it cdeep:v0.0.1 --numiterations 10000 --gpu 0 ~/cdeep3m-docker/augdata ~/cdeep3m-docker/trainout
I get the following error:
./runtraining.sh: line 127: CreateTrainJob.m: command not found Error, a non-zero exit code (127) was received from: CreateTrainJob.m "/home/jurgen/cdeep3m-docker/augdata" "/home/jurgen/cdeep3m-docker/trainout" "/home/jurgen/cdeep3m-docker/augdata"
Am I just specifying the path to the training data incorrectly?
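One thing worth ruling out (a sketch, not a confirmed fix): paths like ~/cdeep3m-docker/augdata exist on the host, not inside the container, so they need to be bind-mounted, and the container-side paths passed to the script. The mount points /augdata and /trainout below are arbitrary names:

```shell
# Bind-mount the host folders into the container and pass the
# container-side paths (mount points /augdata and /trainout are made up):
if command -v docker >/dev/null 2>&1; then
  sudo docker run -it \
    -v "$HOME/cdeep3m-docker/augdata:/augdata" \
    -v "$HOME/cdeep3m-docker/trainout:/trainout" \
    cdeep3m:v0.0.1 --numiterations 10000 --gpu 0 /augdata /trainout \
    || echo "container run failed"
else
  echo "docker is not installed on this machine"
fi
```

Note that the error itself is exit code 127 ("command not found") for CreateTrainJob.m, so it may also be that the image's PATH does not include the cdeep3m scripts; the mount sketch only addresses the host-path question asked.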
Hi, I am new to Cdeep3m-docker and have run into an error quite early. Attempting to run docker-compose build leads to the error "ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services.cdeep3m: 'runtime'"
Any help would be appreciated.
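For context: the runtime key is only understood by Compose file format 2.3/2.4 together with docker-compose 1.19 or later; under a version "3" file, classic docker-compose rejects it with exactly this message. A minimal sketch, assuming the service is named cdeep3m as in the logs above:

```yaml
# Sketch of a compose file in which classic docker-compose accepts "runtime".
version: "2.3"        # 2.3/2.4 are the formats that support the runtime key
services:
  cdeep3m:
    runtime: nvidia   # requires nvidia-docker2 installed on the host
```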
It seems we have run out of memory on the GPU.
How do we set the training batch size?
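CDeep3M's training runs through Caffe, and Caffe reads the batch size from the batch_size field of the data layers in train_val.prototxt. A sketch of the edit follows; the trainout/1fm path and the stand-in prototxt are created here purely for illustration. On a real run, edit the prototxt files that runtraining.sh generated under your training output folder:

```shell
# Create a stand-in prototxt just to demonstrate the edit (illustrative path):
mkdir -p trainout/1fm
printf 'layer {\n  data_param {\n    batch_size: 4\n  }\n}\n' \
  > trainout/1fm/train_val.prototxt

# Shrink the batch to reduce GPU memory use (batch_size: 1 is the safest):
sed -i 's/batch_size: *[0-9][0-9]*/batch_size: 1/' trainout/1fm/train_val.prototxt

grep batch_size trainout/1fm/train_val.prototxt
```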
Hi,
I am testing the software but got stuck at the training phase: the machine completely hangs after allocating all the memory (128 GB of RAM, 32 cores, and one M60).
After restricting the memory in docker-compose to 90 GB it seems to work, but when it crashes it throws a warning like:
Warning: unable to close filehandle properly: Cannot allocate memory during global destruction.
And after a while this:
cdeep3m_1 | ERROR: caffe had a non zero exit code: 134
cdeep3m_1 | /home/cdeep3m/caffetrain.sh: line 166: 100 Aborted (core dumped) GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
cdeep3m_1 | ERROR: caffe had a non zero exit code: 137
cdeep3m_1 | /home/cdeep3m/caffetrain.sh: line 166: 127 Killed GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
GPU looks like:
nvidia-smi
Mon Apr 8 14:01:47 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.107 Driver Version: 410.107 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:06:00.0 Off | Off |
| 32% 36C P0 36W / 120W | 262MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:07:00.0 Off | Off |
| 32% 27C P8 14W / 120W | 11MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 30767 C caffe.bin 109MiB |
| 0 30793 C caffe.bin 109MiB |
+-----------------------------------------------------------------------------+
Apr 8 14:01:06 opskvm01 kernel: Memory cgroup stats for /docker/6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d: cache:93406836KB rss:1472KB rss_huge:0KB mapped_file:88703160KB swap:393296KB inactive_anon:4703640KB active_anon:88704632KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 8 14:01:06 opskvm01 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 8 14:01:06 opskvm01 kernel: [29973] 23446 29973 4545 322 14 99 0 runtraining.sh
Apr 8 14:01:06 opskvm01 kernel: [30186] 23446 30186 4516 324 14 71 0 trainworker.sh
Apr 8 14:01:06 opskvm01 kernel: [30203] 23446 30203 11475 703 27 3271 0 perl
Apr 8 14:01:06 opskvm01 kernel: [30256] 23446 30256 4546 325 14 101 0 caffetrain.sh
Apr 8 14:01:06 opskvm01 kernel: [30281] 23446 30281 40152923 10540653 23224 47801 0 caffe.bin
Apr 8 14:01:06 opskvm01 kernel: [30292] 23446 30292 4546 325 14 101 0 caffetrain.sh
Apr 8 14:01:06 opskvm01 kernel: [30314] 23446 30314 40153011 11681879 23151 46889 0 caffe.bin
Apr 8 14:01:06 opskvm01 kernel: [30697] 23446 30697 4570 498 14 0 0 bash
Apr 8 14:01:06 opskvm01 kernel: Memory cgroup out of memory: Kill process 30319 (caffe.bin) score 478 or sacrifice child
Apr 8 14:01:06 opskvm01 kernel: Killed process 30314 (caffe.bin) total-vm:160612044kB, anon-rss:0kB, file-rss:93580kB, shmem-rss:46633936kB
Apr 8 14:01:16 opskvm01 kernel: ___slab_alloc: 42 callbacks suppressed
Apr 8 14:01:16 opskvm01 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x80d0)
Apr 8 14:01:16 opskvm01 kernel: cache: taskstats(4:6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d), object size: 328, buffer size: 328, default order: 2, min order: 0
Is anyone else having issues similar to this?
Cheers.
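For what it's worth, the OOM killer report above attributes most of the killed caffe.bin process to shared memory (shmem-rss around 46 GB), so besides a total memory cap it may help to bound /dev/shm as well. A sketch only: 90g is the value already tried above, and 32g is an arbitrary guess to tune:

```yaml
# Sketch: cap total memory (as already tried) and shared memory as well.
version: "2.3"
services:
  cdeep3m:
    runtime: nvidia
    mem_limit: 90g     # value already tried above
    shm_size: 32g      # arbitrary; tune to your data size
```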
I didn't have this problem before, but now, after changing the entry point in the Dockerfile and updating the commands to run the runprediction.sh script, I get the following error:
ERROR, a non-zero exit code (127) received from PreprocessPackage.m 001 01 1fm 1
cdeep3m_1
Has this happened to you before?