
Comments (17)

czero69 avatar czero69 commented on July 27, 2024 1

Ah, you are using two GPUs (2x 24 GB). That may be the cause. The GPU usage does indicate that PyTorch is using the GPUs. I haven't tested the code (EPE training / MSeg generation) in a multi-GPU setup yet, but I remember from past PyTorch projects that if something was not designed well for multi-GPU, some steps (communication, combining results, even part of the gradient computation, copying things back and forth, some merge ops on the CPU) took a long time and could be less efficient than on one GPU. What I would try in your case:

  • try running the code on one GPU and check (maybe that will solve your issue)
  • if that does not help, create a new conda env and install all dependencies with GPU support
  • check which MSeg configuration you are using
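
One way to try the single-GPU run without touching the project code is to hide the second card from the process. A minimal sketch, assuming the standard `CUDA_VISIBLE_DEVICES` environment variable, which has to be set before CUDA is initialized; the helper name `restrict_to_gpu` is made up for illustration:

```python
import os


def restrict_to_gpu(index: int = 0) -> str:
    """Expose only one GPU to this process (hypothetical helper)."""
    # Must run before torch (or any other CUDA library) initializes CUDA,
    # otherwise the setting has no effect for this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


restrict_to_gpu(0)
```

The same effect can be had from the shell by prefixing the command with `CUDA_VISIBLE_DEVICES=0`.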

Also, if your kNN generation step is slow, that means faiss is using the CPU. On GPU faiss is blazing fast; the whole thing took 1-2 seconds for 500k samples in my case.
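
A quick way to see whether the installed faiss build can see a GPU at all is sketched below. It assumes the `faiss.get_num_gpus()` call from the faiss Python bindings; the helper `faiss_gpu_count` is a made-up name, and the check says nothing about whether find_knn.py itself requests a GPU index.

```python
def faiss_gpu_count():
    """Return the number of GPUs faiss can use, or None if faiss is missing."""
    try:
        import faiss
    except ImportError:
        return None
    # CPU-only builds may not expose get_num_gpus at all; treat that as 0.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    return get_num_gpus() if get_num_gpus is not None else 0


count = faiss_gpu_count()
if count is None:
    print("faiss is not installed in this environment")
elif count == 0:
    print("faiss sees no GPU, so kNN search would run on the CPU")
else:
    print(f"faiss sees {count} GPU(s)")
```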

is it okay to use another segmentation network for this photorealism enhancement?

MSeg and EPE are different networks; you probably mean conda envs. Everything should run fine in the same environment (it works for me), but you can use different conda environments; it does not matter at all.

PS. What is your input image size for the MSeg step?

from photorealismenhancement.

czero69 avatar czero69 commented on July 27, 2024 1

o ya, also i dont understand that much about the batched universal demo from MSeg, because i thought with batch, MSeg runs on batches of images, but in the command line above 1 batch consists of only 1 image, so there are 2500 batches.

Ah, I am only using universal_demo.py.

i have an error when running the training:

You have at least two errors. One says some file is missing. The other, num_samples=0, happens when the dataloader sees no data at all, usually because of wrong paths or a wrong input file structure.

For preparing EPE data, you must take extra care. Go through all preparation steps carefully. I recommend printing the results of every step below (values, means) to check that the tensors look OK (no NaNs, no infs, etc.).

These are my scripts, where the images are 4K (hence 2160 3840 and -c 60):

python epe/matching/feature_based/collect_crops.py FirstRealVids /path/real.txt -n 8 -c 60
python epe/matching/feature_based/collect_crops.py Coffing /path/fake.txt -n 8 -c 60
python epe/matching/feature_based/find_knn.py crop_Coffing.npz crop_FirstRealVids.npz knn_Coffing-FirstRealVids.npz -k 15
python epe/matching/filter.py knn_Coffing-FirstRealVids.npz crop_Coffing.csv crop_FirstRealVids.csv 1.0 matched_crops_Coffing-FirstRealVids.csv
python epe/matching/compute_weights.py matched_crops_Coffing-FirstRealVids.csv 2160 3840 crop_weights_Coffing-FirstRealVids.npz

Note the correct column order for /path/real.txt (images, msegs)
and for /path/fake.txt (images, msegs, gbuffers_npz, gt_stencils).

To verify a bit, it's good to check whether the matched crops look OK:

python ./epe/matching/feature_based/sample_matches.py  /path/fake.txt crop_Coffing.csv /path/real.txt crop_FirstRealVids.csv knn_Coffing-FirstRealVids.npz

Also, make sure all your input color images are RGB, 3-channel (not RGBA). Robust maps (MSeg) and stencils (GT masks) are 8-bit. Your NPZs should have the same structure as the fake NPZs ('data' key in the NumPy dict, float16). If the channel dimension differs from 32 (32 == total number of G-buffer channels), modify the code accordingly; it should be in one place.
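
For PNG inputs, the RGB vs. RGBA and bit-depth question can be answered from the file header alone, without loading the image. A minimal sketch using only the standard library; it relies on the IHDR layout from the PNG specification, handles PNG only (JPEG files need a different check), and the function names are made up for illustration:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Color type codes defined by the PNG specification.
COLOR_TYPES = {0: "gray", 2: "RGB", 3: "palette", 4: "gray+alpha", 6: "RGBA"}


def parse_png_header(header: bytes):
    """Return (width, height, bit_depth, color_type) from the first 29 bytes of a PNG."""
    if len(header) < 29 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: width, height (4 bytes each),
    # then bit depth and color type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return width, height, bit_depth, COLOR_TYPES.get(color_type, "unknown")


def read_png_header(path: str):
    """Read and parse the header of a PNG file on disk."""
    with open(path, "rb") as f:
        return parse_png_header(f.read(29))
```

A color image that reports 8 bits per channel and "RGB" is the 24-bit case; "RGBA" would need converting before use, and an 8-bit "gray" result is what the robust maps and stencils should show.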

Have you also tried to implement and train the EPE? because till now i still can't get it to train.

After solving some trivial issues, everything is training fine for me. The results are, in short, breathtaking. I may rewrite the entire training pipeline for the latest PyTorch, in a style closer to how I work nowadays, so it would be easier for me to modify the EPE baseline architecture further, support batches > 1, logging, etc.


luda1013 avatar luda1013 commented on July 27, 2024

My setup:

  • Ubuntu 20.04.6 LTS
  • core i9-10980XE @ 3GHz
  • graphic (nvidia-smi): NVIDIA RTX A5000 (48 GB)
  • Cuda 11.7
  • Pytorch version : 2.0.0
  • Cuda at: /usr/local/cuda*


czero69 avatar czero69 commented on July 27, 2024

Sounds like you are running on the CPU and not on the GPU. I am running on a 24 GB GPU, and with the "default_config_1080_ms.yaml" config it took a good few seconds for one Full HD input image. Check in your config which GPU is set in test_gpu: [0]. Also make sure PyTorch is picking the NVIDIA GPU as the device and not the CPU. It is often the case that something is badly installed in your environment. The log lines should tell you, or you can write a super simple PyTorch script to print the device name (how to check that PyTorch is using the GPU). If that is the case, I would create a new conda environment and install PyTorch for GPU/CUDA (the pip install command might be a bit different for GPU support).
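
The "super simple PyTorch script" could look roughly like this. It is a sketch that only uses standard `torch.cuda` calls; the function name `describe_torch_devices` is made up, and it reports what PyTorch can see in the current environment, not what the MSeg or EPE scripts actually select.

```python
def describe_torch_devices():
    """Return human-readable lines describing the devices PyTorch can see."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed in this environment"]
    # torch.version.cuda is None for CPU-only builds.
    lines = [f"torch {torch.__version__}, built for CUDA: {torch.version.cuda}"]
    if not torch.cuda.is_available():
        lines.append("CUDA is NOT available, so work would run on the CPU")
        return lines
    for i in range(torch.cuda.device_count()):
        lines.append(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
    return lines


print("\n".join(describe_torch_devices()))
```

If the output says CUDA is not available even though nvidia-smi lists the card, a CPU-only PyTorch build in the environment is a likely suspect.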

About the grayscale output: it is probably what you want. One channel (8-bit gray) is more than enough to store 256 classes, and MSeg produces fewer than that. This way of storing the info is efficient on disk; a 4K image is barely 112 KB or so.


luda1013 avatar luda1013 commented on July 27, 2024

ok, thank you for the reply.. i will make a new virtual environment just for MSeg. about the cpu and gpu: when running the script, i also checked nvidia-smi and it is at 100% load
Nvidia when MSeg active

is it okay to use another segmentation network for this photorealism enhancement?


luda1013 avatar luda1013 commented on July 27, 2024

okay, so the MSeg script i used is the universal demo inference or the batched universal demo; the command is like:
python -u ~/mseg-semantic/mseg_semantic/tool/universal_demo_batched.py \
  --config=mseg_semantic/config/test/480/default_config_batched_ss.yaml \
  model_name mseg-3m \
  model_path ~/mseg-semantic/mseg-3m-480p.pth \
  input_file /home/luda1013/PfD/image/01_images/

the input file 01_images is a folder containing 2500 images from the Playing for Data dataset
each image is 1914 x 1052 pixels (width x height)
here are some screenshots of running MSeg:
image
o ya, also i dont understand that much about the batched universal demo from MSeg, because i thought with batch, MSeg runs on batches of images, but in the command line above 1 batch consists of only 1 image, so there are 2500 batches.

Before that, i tried universal_demo.py and interrupted the run at the first image; i think the log lines below show why it takes so long to infer just 1 image. But i haven't investigated/debugged it line by line yet.
MSeg-log_took long time here
MSeg-log_took long time here_2

because MSeg took too long for me, i now have MSeg B/W images for only 1.5-2 folders of PfD: images_01 and images_02.
Now i am using them to run EPE, with 1 folder as the fake and the 2nd folder as the real dataset.
Have you also tried to implement and train the EPE? because till now i still can't get it to train.
i have an error when running the training:
image

image

now i am still debugging line by line to find where this num_samples = 0 comes from..

Thank you so much @czero69 for your help btw :)


luda1013 avatar luda1013 commented on July 27, 2024

ya, these skipped entries come from epe.datasets.utils.py, in the function read_filelist.
here is what my text files look like (1 line per image):
rendered.txt:
/home/luda1013/PfD_test/image/02_images/images/02501.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/02501_gray.jpg,/home/luda1013/PEcon2/gbuffer_02_images_rendered/02501.npz,/home/luda1013/PfD_test/label/02_labels/labels/02501.png

real.txt:
/home/luda1013/PfD_test/image/01_images/images/00001.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/00001.png

so i think i did the text files correctly, and i already checked for NaN.
o ya, in compute_weights.py, i got NaN before when i ran the script in the terminal; when i ran it in a jupyter notebook, it gave me numbers.. idk why.. but even after solving the NaN in compute_weights, i still get the same error and skipped entries at training.

The other inputs, like RGB 3-channel (not RGBA) and how many bits, i haven't checked yet.

i will report back as soon as i have tried all of your suggestions.
As for MSeg, i will try it later with 1 GPU, once i can get EPE to train with these 2 folders of the PfD dataset that i have XD


czero69 avatar czero69 commented on July 27, 2024

In compute_weights.py, note that the size arguments are H, W (not W, H), so e.g. for Full HD it will be 1080 1920; that was the reason for my NaN.

The other inputs, like RGB 3-channel (not RGBA) and how many bits, i haven't checked yet.

It should be 24 bits; check one random image each for fake & real. Not everywhere in the code is there a [:,:3,:,:] slice, so RGBA will raise a 4 != 3 mismatch in tensor sizes.

num_samples=0

Almost certainly the paths are wrong. Take one file from each of your .txt and .csv files and run stat on it in the terminal:

stat /path/to.png
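
To run the same existence check over every row instead of one file, something like the sketch below could help. It is an independent check, not the project's read_filelist; it assumes comma-separated paths per line as described above, and `find_missing_paths` is a made-up name.

```python
import csv
import os


def find_missing_paths(filelist_path: str, expected_columns: int):
    """Return (line_number, problem) tuples for every bad row in a file list."""
    problems = []
    with open(filelist_path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if not row:
                continue  # ignore blank lines
            paths = [p.strip() for p in row]
            if len(paths) != expected_columns:
                problems.append(
                    (line_no, f"expected {expected_columns} columns, got {len(paths)}")
                )
                continue
            for p in paths:
                if not os.path.isfile(p):
                    problems.append((line_no, f"missing file: {p}"))
    return problems
```

For example, `find_missing_paths("/path/fake.txt", 4)` and `find_missing_paths("/path/real.txt", 2)` should both return an empty list if every referenced file exists.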


luda1013 avatar luda1013 commented on July 27, 2024

oww.. so the txt file i created is wrong? i thought it was correct because i followed the instructions:
"Each line should contain paths to image, robust label map, gbuffer file, and ground truth label map, all separated by commas."

sorry, but i also don't quite understand the stat in the terminal; so u suggest verifying the location of the image using stat?
image

maybe this helps u see it better; here is part of a screenshot of my rendered.txt:
image


czero69 avatar czero69 commented on July 27, 2024

oww.. so the txt file i created is wrong?

At least the order looks OK. I have this order too: ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"]

The stat is just to make sure the .txt/.csv files have correct paths. The one you've printed is correct, indicating the file does exist on disk. rendered.txt looks OK too at first sight. But num_samples == 0 indicates the EPE pipeline does not see something. Check all 4 paths in some random row of rendered.txt. Check the paths in your citysample_ie2.config (or whatever your EPE training config is called). Also check what is going on with the missing files; probably some paths point to non-existing files.


luda1013 avatar luda1013 commented on July 27, 2024

ahh okay, thank you @czero69 for the stat tip to check the location of the images.

ya, when i run the training, the num_samples = 0 comes from the skipped entries.
image
and this comes from epe.datasets.utils, i.e. from the function read_filelist()

ya, for the config i use train_pfd2cs.yaml from github; i only modified the basics like paths etc. i even kept the names the same, pfd and cs, just to avoid unnecessary errors

the 355 skipped entries are for validation and the 1066 are for training; val.txt has 355 lines/images and train.txt has 1066
i guess the script cannot see/process the path lines


vace17 avatar vace17 commented on July 27, 2024

Hello @luda1013,
I noticed you are using CUDA version 11.7; were you able to train the model properly?
I was also using this version of CUDA, but the training step takes a lot of time, I suppose because PyTorch is not using the GPU (maybe because of an incompatible CUDA version?)
Thank you in advance for any help


luda1013 avatar luda1013 commented on July 27, 2024

Hi @vace17, not yet; i am still running into problems. now i ran into this one:
error_train
it seems that i first need to squeeze or reshape the image to (H,W) instead of (H,W,C)

in your case: the scripts always set the device to cuda; u can also check during training by opening a terminal and running nvidia-smi
u should then see whether your gpu is loaded or not

Could you please help me with the training then? till now i cannot get it to train.
may i know your datasets for fake and real, the config and the step by step? did u just train, or did u also modify the scripts? thank you very much :)

update:
i solved the ValueError: could not broadcast input array from shape (..) to shape (..)
i converted the label maps from PfD (which were still RGB or CMYK) to gray

but i have another issue now, and maybe it is the same as yours @vace17:
image


vace17 avatar vace17 commented on July 27, 2024

@luda1013 I have a different issue: I don't encounter this specific error and the training process runs, but it takes a lot of time.


vace17 avatar vace17 commented on July 27, 2024

@czero69 can I ask what your specific setup is?
Mine is:

  • Windows 10
  • Graphic card: NVIDIA GEFORCE RTX 3090
  • Cuda 11.7
  • Python version 3.8.16

Using nvidia-smi in the terminal, I checked the GPU usage while training was running. The GPU does seem to be in use, but its utilization keeps fluctuating between low values (5-10%) and 50-60%.
How long does it take to train the framework with your current setup?
Thank you in advance for any help


czero69 avatar czero69 commented on July 27, 2024

hey, I have tried two setups so far:

  • win10, NVIDIA RTX 3090: it trained 75k steps / day
  • linux, NVIDIA A6000 (48GB): ~150k steps / day

my entire epoch would be around 1M steps (batch == 1), 196x196 single crop

The authors mentioned somewhere here in the issues that for them it was around 200k steps / day, and they were using 1x 3090
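
For a rough sense of what those throughputs mean, the figures quoted in this comment convert to days per 1M-step epoch as follows. This is plain arithmetic on the numbers above; nothing here is measured, and `days_for_steps` is just an illustrative helper.

```python
def days_for_steps(total_steps: int, steps_per_day: int) -> float:
    """Days needed to run total_steps at a given daily throughput."""
    return total_steps / steps_per_day


rates = [
    ("RTX 3090 (win10)", 75_000),
    ("A6000 (linux)", 150_000),
    ("authors, 1x 3090", 200_000),
]
for name, rate in rates:
    print(f"{name}: {days_for_steps(1_000_000, rate):.1f} days")
# RTX 3090 (win10): 13.3 days
# A6000 (linux): 6.7 days
# authors, 1x 3090: 5.0 days
```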

