Comments (17)
Ah, you are using two GPUs (2x 24GB). That may be the cause. GPU usage does indicate that PyTorch is using the GPUs. I haven't tested the code (EPE training / MSeg generation) in a multi-GPU setup yet, but I remember from past PyTorch projects that if something was not designed well for multi-GPU, some steps (communication, combining results, even part of the gradient computation, copying things back and forth, some merging ops on CPU) took a lot of time and could be less efficient than on one GPU. What I would try in your case:
- try running the code on one GPU and check (maybe that will solve your issue)
- if that does not help, create a new conda env and install all dependencies with GPU support
- check which MSeg configuration you are using
Also, if your kNN generation step is slow, that means faiss is using the CPU. On GPU faiss is blazing fast; the whole step took 1-2 seconds for 500k samples in my case.
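A quick way to try the single-GPU suggestion without touching the repo code is to hide the second GPU from CUDA through the CUDA_VISIBLE_DEVICES environment variable. This is only a minimal sketch: the helper names single_gpu_env and run_on_single_gpu are made up for illustration and are not part of EPE or MSeg.

```python
import os
import subprocess


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA-based libraries only see the devices listed in this variable.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_single_gpu(argv, gpu_index=0):
    """Run a command (list of strings) with only one GPU visible; return its exit code."""
    return subprocess.run(argv, env=single_gpu_env(gpu_index), check=False).returncode
```

For example, run_on_single_gpu(["python", "your_script.py"]) starts the script with only GPU 0 visible; if the run gets noticeably faster, the multi-GPU path is the likely culprit.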
is it okay to use another segmentation network for this photorealism enhancement?
Msegs and EPE are different networks, you probably mean conda envs. Everything should be running fine in the same environment (as it is working for me), but you can use different conda environments, it does not matter at all.
PS. what is your input image size for Mseg step?
from photorealismenhancement.
o ya, also i dont understand that much about the batch universal demo from MSeg, because i thought with batch, MSeg runs on batches of images, but in the command line above it is like 1 batch consists of only 1 image, so there are 2500 batches.
ah, I am going only with universal_demo.py
i have error when run the training:
You have at least two errors. One says that some file is missing. The other one, num_samples=0, happens when the dataloader sees no data at all, usually because of wrong paths, a wrong input file structure, or a wrong input file path.
When preparing the EPE data you must take extra care. Go through all the preparation steps carefully. I recommend printing the results of every step below (values, means) to check whether the tensors look OK (no NaNs, no infs, etc.).
these are my scripts, where images are 4k (hence 2160 3840 and -c 60)
python epe/matching/feature_based/collect_crops.py FirstRealVids /path/real.txt -n 8 -c 60
python epe/matching/feature_based/collect_crops.py Coffing /path/fake.txt -n 8 -c 60
python epe/matching/feature_based/find_knn.py crop_Coffing.npz crop_FirstRealVids.npz knn_Coffing-FirstRealVids.npz -k 15
python epe/matching/filter.py knn_Coffing-FirstRealVids.npz crop_Coffing.csv crop_FirstRealVids.csv 1.0 matched_crops_Coffing-FirstRealVids.csv
python epe/matching/compute_weights.py matched_crops_Coffing-FirstRealVids.csv 2160 3840 crop_weights_Coffing-FirstRealVids.npz
take a note of correct order for /path/real.txt (images, msegs)
and for /path/fake.txt (images, msegs, gbuffers_npz, gt_stencils)
To verify, it's good to check whether the matched crops look OK:
python ./epe/matching/feature_based/sample_matches.py /path/fake.txt crop_Coffing.csv /path/real.txt crop_FirstRealVids.csv knn_Coffing-FirstRealVids.npz
Also, make sure all your input color images are RGB, 3-channel (not RGBA). Robust maps (MSeg) and stencils (GT masks) are 8-bit. Your NPZs should have the same structure as the fake NPZs ('data' key in the numpy dict, float16). If the channel dimension differs from 32 (32 == total number of G-buffer channels), modify the code accordingly; it should be in one place.
Have you tried also to implement and train the EPE? because till now i still can't bring it for training.
After solving some trivial issues, everything is training fine for me. Results are ... shortly speaking ... breathtaking. I may rewrite the entire training pipeline for the latest PyTorch, in a form similar to how I work nowadays, so that it is easier for me to modify the EPE baseline architecture further, support batches > 1, add logging, etc.
from photorealismenhancement.
My setup:
- Ubuntu 20.04.6 LTS
- core i9-10980XE @ 3GHz
- graphic (nvidia-smi): NVIDIA RTX A5000 (48 GB)
- Cuda 11.7
- Pytorch version : 2.0.0
- Cuda at: /usr/local/cuda*
from photorealismenhancement.
Sounds like you are running on CPU and not on GPU. I am running on a 24GB GPU, and with the "default_config_1080_ms.yaml" config it took a good few seconds for one full HD input image. Check which GPU is set in your config under test_gpu (e.g. test_gpu: [0]). Also make sure PyTorch is picking the NVIDIA GPU as the device and not the CPU. It is often the case that something is badly installed in your environment. The log lines should indicate this, or you can write a super simple PyTorch script to print the device name (search for how to check that PyTorch is using the GPU). If that is the case, I would create a new conda environment and install PyTorch for GPU/CUDA (the pip install command might be a bit different for GPU support).
About the grayscale output: it is probably what you want. One channel (8-bit gray) is enough to store up to 256 classes, and MSeg produces fewer than that. This way of storing the info is efficient on disk; a 4k image is barely 112 KB or so.
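The "super simple PyTorch script" could look roughly like the sketch below. The function name describe_torch_device is made up for illustration; the torch calls used (torch.cuda.is_available, torch.cuda.device_count, torch.cuda.get_device_name) are standard PyTorch.

```python
def describe_torch_device():
    """Return a one-line summary of what PyTorch can see in this environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        # Typical for a CPU-only wheel or a driver / CUDA mismatch.
        return f"PyTorch {torch.__version__} sees no CUDA device (CPU only)"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, GPUs: {names}"


print(describe_torch_device())
```

Run it inside the same conda environment you use for MSeg / EPE. If it reports CPU only, reinstalling PyTorch with GPU support in a fresh environment is the first thing to try.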
from photorealismenhancement.
ok, thank you for the reply.. ok, i will make new virtual environment just for the MSeg. about the cpu and gpu, when running the script, i also check the nvidia-smi and it is on 100% load
is it okay to use another segmentation network for this photorealism enhancement?
from photorealismenhancement.
okay, so the MSeg script i used is the universal demo inference or the universal demo for batch; so the command is like:
python -u ~/mseg-semantic/mseg_semantic/tool/universal_demo_batched.py
--config=mseg_semantic/config/test/480/default_config_batched_ss.yaml
model_name mseg-3m
model_path ~/mseg-semantic/mseg-3m-480p.pth
input_file /home/luda1013/PfD/image/01_images/
the input file 01_images is a folder containing 2500 images from the Playing for Data dataset
each image has 1914 pixel x 1052 pixel (Width x Height)
here are some screenshots on running MSeg:
o ya, also i dont understand that much about the batch universal demo from MSeg, because i thought with batch, MSeg runs on batches of images, but in the command line above it is like 1 batch consists of only 1 image, so there are 2500 batches.
And before, i tried to use the universal_demo.py and at the first image i break the running and i think these are why it takes long time just to infer 1 image. But i dont yet investigate/debug line-per-line further.
because took too long for me for the mseg, now i have MSeg-BW-images for 1,5-2 folders of PfD: images_01 and images_02.
Now i am using it to implements the EPE, 1 folder as fake , 2nd folder as real dataset.
Have you tried also to implement and train the EPE? because till now i still can't bring it for training.
i have error when run the training:
now i am still at debug line per line where this num_samples = 0 came..
Thank you so much @czero69 for your help btw :)
from photorealismenhancement.
ya, this skipped entries is found at script epe.datasets.utils.py in function read_filelist.
here is my text file looks like for 1 line / 1 image:
rendered.txt:
/home/luda1013/PfD_test/image/02_images/images/02501.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/02501_gray.jpg,/home/luda1013/PEcon2/gbuffer_02_images_rendered/02501.npz,/home/luda1013/PfD_test/label/02_labels/labels/02501.png
real.txt:
/home/luda1013/PfD_test/image/01_images/images/00001.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/00001.png
so i think with the text files i do it correctly. and i already check the NaN.
o ya, at compute_weights.py, before i have NaN number when i run the script on terminal, tried to run at jupyter notebook, it gave me number.. idk why.. but even after solving the NaN number in compute_weights; i still get the same error and skipped entries at training.
The other input like RGB 3 Channel (not RGBA), how many bits, i dont check it yet.
i tell you again as soon i try all of the suggestions you made.
With the MSeg, i will try it later with 1 GPU after i can bring this EPE to train with these 2 folders of PfD-dataset that i have XD
from photorealismenhancement.
in compute_weights.py, note that the size argument is given as H, W (and not W, H), so e.g. for full HD it will be 1080 1920; that was the reason for my NaNs
The other input like RGB 3 Channel (not RGBA), how many bits, i dont check it yet.
It should be 24 bits; check one random image for fake and real. Not everywhere in the code is there a [:,:3,:,:], so RGBA will raise a 4 != 3 mismatch in tensor sizes.
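For PNG inputs, the bit depth and color type can be read straight from the file header without any imaging library. This is a minimal sketch that assumes the images are PNG files; png_info and is_rgb24 are hypothetical helper names, not part of the EPE code.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Color type codes defined by the PNG format.
COLOR_TYPES = {0: "grayscale", 2: "RGB", 3: "palette", 4: "grayscale+alpha", 6: "RGBA"}


def png_info(path):
    """Read width, height, bit depth and color type from a PNG's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(26)
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color": COLOR_TYPES.get(color_type, f"unknown({color_type})"),
    }


def is_rgb24(path):
    """True for 3-channel, 8 bits per channel (24-bit) RGB PNGs."""
    info = png_info(path)
    return info["color"] == "RGB" and info["bit_depth"] == 8
```

Calling is_rgb24 on one fake and one real image is enough for a first check; an "RGBA" result means the alpha channel has to be dropped before training.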
num_samples=0
Almost certainly the paths are wrong. Take one file from each of your .txt and .csv files and run stat on it in the terminal:
stat /path/to.png
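To do the same check for every row instead of one file at a time, something like the sketch below can help. It assumes the file lists are plain comma-separated paths, one sample per line, as in the rendered.txt / real.txt examples in this thread; check_filelist is a hypothetical helper and does not reproduce what read_filelist in the repo does internally.

```python
import csv
import os


def check_filelist(list_path, expected_columns):
    """Return (line_number, problem) tuples for rows with a wrong column count or missing files."""
    problems = []
    with open(list_path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if not row:
                continue  # ignore empty lines
            paths = [p.strip() for p in row]
            if len(paths) != expected_columns:
                problems.append(
                    (line_no, f"expected {expected_columns} paths, found {len(paths)}")
                )
            for p in paths:
                if not os.path.isfile(p):
                    problems.append((line_no, f"missing file: {p}"))
    return problems
```

Use expected_columns=4 for the fake list (image, mseg, gbuffer npz, gt stencil) and 2 for the real list (image, mseg). An empty result means every listed path exists, so the problem is elsewhere (e.g. in the config paths).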
from photorealismenhancement.
oww.. so the txt file i created is false? i thought is correct because i followed the instructions
" Each line should contain paths to image, robust label map, gbuffer file, and ground truth label map, all separated by commas."
sorry but i also dont quite understand with the stat in the terminal, so u suggest to verify the location of the image using stat?
maybe could help u too look better, here is the part of screen shot of my rendered.txt:
from photorealismenhancement.
oww.. so the txt file i created is false?
At least the order looks OK. I have this order too: ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"]
The stat is just to make sure the .txt/.csv files have correct paths. The one you've printed is correct, indicating the file does exist on disk. Rendered.txt looks OK too at first sight. But num_samples == 0 indicates that the EPE pipeline does not see something. Check all 4 paths in some random row of rendered.txt. Check the paths in your citysample_ie2.config (or whatever your EPE training config is called). Also check what is going on with the missing files; probably some paths point to non-existing files.
from photorealismenhancement.
ahh okay, thank you, @czero69 for the stat tips to check the location of the images.
ya when i run the training, the num_samples = 0 come from the skipped entries.
and this comes from epe.datasets.utils, so from the function read_filelist()
ya, for the config i use the train_pfd2cs.yaml from github, i just modify the basic like path, etc.. i even keep the name same. pfd and cs, just to avoid unnecessary error
the number 355 skipped entries are for validation and 1066 are for training, in the val.txt there are 355 lines/ images and 1066 images in train.txt
i guess the script cannot see/ work the lines of paths
from photorealismenhancement.
Hello @luda1013,
I noticed you are using CUDA version 11.7, were you able to train the model properly?
I was also using this version of CUDA but the training step takes a lot of time, I supposed because pytorch is not using the GPU (maybe because of incompatible CUDA?)
Thank you in advance for any help
from photorealismenhancement.
Hi @vace17, not yet, i am still encountering some problems, now i encountered this problem:
seems that i need to squeeze or reshape the image first into (H,W) instead (H,W,C)
in your case, in the script they always made the used device is cuda, u can also then check when u are on training, open your terminal and check with nvidia-smi
u should then know if your gpu is loaded or not
Could you please help me then with training? till now i cannot bring it to train.
may i know your dataset for fake and real, the config and the step by step? do u just train or did u also modify the scripts? thank you very much :)
update:
i solved the ValueError: could not broadcast input array from shape (..) to shape (..)
i tried to convert the label map from PfD (which is still RGB or CMYK) to gray
but i have another issue now, and maybe it is the same with you @vace17 :
from photorealismenhancement.
@luda1013 I have a different issue since I don't encounter this specific error and the training process is running but it takes a lot of time
from photorealismenhancement.
@czero69 can I ask what is your specific setup?
Mine is :
- Windows 10
- Graphic card: NVIDIA GEFORCE RTX 3090
- Cuda 11.7
- Python version 3.8.16
I checked from the terminal the current usage of the GPU using the command nvidia-smi during the run of the training process. It seems to me that the GPU is currently used but the percentage of usage of it continues changing between low values (5-10%) and 50-60%.
How much time does it take to train the framework with your current setup?
Thank you in advance for any help
from photorealismenhancement.
hey, I have tried two set-ups so far:
- win10, NVIDIA RTX 3090 it trained 75k steps / day
- linux, NVIDIA A6000 (48GB), ~150k steps / day
my entire epoch would be around 1M steps (batch == 1), 196x196 single crop
Authors mentioned somewhere here in the issue space that for them it was around 200k steps / day too and they were using 1x3090
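To put those throughput numbers in perspective, here is the back-of-the-envelope arithmetic for one epoch of about 1M steps, using only the figures quoted above (days_per_epoch is just an illustrative helper):

```python
def days_per_epoch(steps_per_epoch, steps_per_day):
    """Wall-clock days needed for one epoch at a given training throughput."""
    return steps_per_epoch / steps_per_day


for name, rate in [("win10, RTX 3090", 75_000), ("linux, A6000", 150_000)]:
    print(f"{name}: {days_per_epoch(1_000_000, rate):.1f} days per epoch")
```

That works out to roughly 13.3 days on the 3090 setup and 6.7 days on the A6000 setup, so a slow-looking run is not necessarily a sign that the GPU is unused.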
from photorealismenhancement.