
HNeRV: A Hybrid Neural Representation for Videos (CVPR 2023)

Hao Chen, Matthew Gwilliam, Ser-Nam Lim, Abhinav Shrivastava
This is the official implementation of the paper "HNeRV: A Hybrid Neural Representation for Videos".

TODO

  • [x] Video inpainting
  • [x] Fast loading from video checkpoints
  • [ ] Upload results and checkpoints for UVG

Method overview

Get started

We run with Python 3.8. You can set up a conda environment and install all dependencies like so:

pip install -r requirements.txt 

High-Level structure

The code is organized as follows:

  • train_nerv_all.py includes a generic training routine.
  • model_all.py contains the dataloader and neural network architecture.
  • data/ directory for video/image datasets; we provide bunny frames here.
  • checkpoints/ directory for model weights and quantized video checkpoints; we provide both for bunny here.
  • Log files (tensorboard, txt, state_dict, etc.) will be saved in the output directory (specified by --outf).
  • We provide numerical results for distortion-compression trade-offs at uvg_results and per_video_results.

Reproducing experiments

Training HNeRV

A 1.5M-parameter HNeRV is specified with '--modelsize 1.5', and we balance parameters across layers with '--ks 0_1_5 --reduce 1.2'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
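
For intuition, the encoder strides determine the embedding's spatial size: each frame is downsampled by the product of '--enc_strds', so a 640x1280 crop becomes a tiny 2x4 per-frame embedding. A quick sanity check (illustrative only, not part of the repo):

# Illustrative only: compute the HNeRV embedding's spatial size from the encoder strides.
from math import prod

crop_h, crop_w = 640, 1280      # --crop_list 640_1280
enc_strds = [5, 4, 4, 2, 2]     # --enc_strds 5 4 4 2 2
total_stride = prod(enc_strds)  # 320

embed_h, embed_w = crop_h // total_stride, crop_w // total_stride
print(embed_h, embed_w)         # 2 4 -> each frame embedding is spatially 2x4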

NeRV baseline

The NeRV baseline is specified with '--embed pe_1.25_80 --fc_hw 8_16', with imbalanced parameters '--ks 0_3_3 --reduce 2'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
   --resize_list -1 --loss L2   --embed pe_1.25_80 --fc_hw 8_16 \
    --dec_strds 5 4 2 2 --ks 0_3_3 --reduce 2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
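
Here '--embed pe_1.25_80' follows NeRV-style positional encoding of the normalized frame index t, with base b = 1.25 and 80 sin/cos frequency pairs. A minimal sketch of our reading of that flag (not the repo's exact code):

import math
import torch

def nerv_positional_encoding(t: float, b: float = 1.25, l: int = 80) -> torch.Tensor:
    """Expand a normalized frame index t in [0, 1] into a 2*l-dim embedding:
    [sin(b^i * pi * t), cos(b^i * pi * t)] for i = 0..l-1, as in NeRV."""
    freqs = torch.pow(b, torch.arange(l, dtype=torch.float32))  # b^0 ... b^(l-1)
    angles = math.pi * t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])

print(nerv_positional_encoding(0.5).shape)  # torch.Size([160])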

Evaluation & dump images and videos

To evaluate a pre-trained model, use '--eval_only --weight [CKT_PATH]' to specify the checkpoint path.
For model and embedding quantization, use '--quant_model_bit 8 --quant_embed_bit 6'.
To dump images or videos, use '--dump_images --dump_videos'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2  \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001 \
   --eval_only --weight checkpoints/hnerv-1.5m-e300.pth \
   --quant_model_bit 8 --quant_embed_bit 6 \
    --dump_images --dump_videos
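
Conceptually, '--quant_model_bit 8' maps each weight tensor onto 2^8 uniform levels between its minimum and maximum. A minimal per-tensor sketch of such uniform quantization (an illustration of the idea, not the repo's exact quantizer):

import torch

def uniform_quant(x: torch.Tensor, bits: int = 8):
    """Quantize a tensor to 2**bits uniform levels between its min and max;
    return the dequantized tensor and the max absolute quantization error."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = torch.round((x - lo) / scale)  # integer codes in [0, 2**bits - 1]
    x_hat = q * scale + lo             # dequantized weights
    return x_hat, (x - x_hat).abs().max()

w = torch.randn(64, 64)
w_hat, err = uniform_quant(w, bits=8)
print(err)  # small for 8 bits; grows as bits decrease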

Video inpainting

The inpainting task is specified with '--vid bunny_inpaint_50', where '50' is the mask size.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny_inpaint_50   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
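
For a sense of what the mask does, here is a toy version that zeroes out a 50x50 box at the frame center; the box position and count here are hypothetical, and the actual mask layout for 'bunny_inpaint_50' is defined in the repo's dataloader:

import torch

def apply_center_mask(frame: torch.Tensor, mask_size: int = 50) -> torch.Tensor:
    """Toy illustration: zero out a mask_size x mask_size box at the frame center."""
    _, h, w = frame.shape  # frame: (C, H, W)
    top, left = (h - mask_size) // 2, (w - mask_size) // 2
    masked = frame.clone()
    masked[:, top:top + mask_size, left:left + mask_size] = 0.0
    return masked

frame = torch.rand(3, 640, 1280)
print(apply_center_mask(frame).shape)  # torch.Size([3, 640, 1280])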

Efficient video loading

We can load a video efficiently from a tiny checkpoint.
Specify the decoder and video checkpoint with '--decoder [Decoder_path] --ckt [Video checkpoint]', and the output directory and number of frames with '--dump_dir [out_dir] --frames [frame_num]'.

python efficient_nvloader.py --frames 16
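
A rough sketch of what such a loader does, assuming the decoder and the quantized frame embeddings are stored as separate checkpoints (the paths and the 'embeddings' key below are hypothetical; see efficient_nvloader.py for the real logic):

import torch

# Hypothetical paths and keys, for illustration only.
decoder = torch.load('checkpoints/decoder.pth', map_location='cpu')
video_ckt = torch.load('checkpoints/quant_vid.pth', map_location='cpu')

decoder.eval()
with torch.no_grad():
    embeds = video_ckt['embeddings'][:16]  # first --frames 16 embeddings
    frames = decoder(embeds)               # reconstruct frames from tiny embeddings
print(frames.shape)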

Citation

If you find our work useful in your research, please cite:

@InProceedings{chen2023hnerv,
      title={{HNeRV}: A Hybrid Neural Representation for Videos},
      author={Hao Chen and Matthew Gwilliam and Ser-Nam Lim and Abhinav Shrivastava},
      year={2023},
      booktitle={CVPR},
}

Contact

If you have any questions, please feel free to email the authors: [email protected]


hnerv's Issues

About results of Video regression at resolution 480×960

I cannot reproduce your video compression results at resolution 480×960. Could you provide more details to help me reproduce the 480×960 results on the UVG dataset?
I use the command python train_nerv_all.py --outf 1120 --data_path data/dance-twirl --vid dance_3M_resize --conv_type convnext pshuffel --act gelu --norm none --crop_list 960_1920 --resize_list 480_960 --loss L2 --enc_strds 5 4 3 2 2 --enc_dim 64_16 --dec_strds 5 4 3 2 2 --ks 0_1_5 --reduce 1.2 --modelsize 3 -e 300 --eval_freq 30 --lower_width 12 -b 2 --lr 0.001 --quant_model_bit -1 --dump_images

Architecture

Hello, I noticed that the HNeRV in your code seems to differ slightly from the architecture in the paper.
In the paper, a ConvNeXt-based encoder first produces a learned embedding, which is then fed to the decoder.
But in the code, I found that you use optional positional encoding for embeddings, and then use ConvNeXt for the encoder, followed by the decoder. Why is there a line of code img_embed = self.encoder(input)? Which part of the code produces the learned small embeddings?

License

Hello, is it intentional that you did not specify a license for the project?

About the ppp metric computation

Thanks for your great work in representing the video.

I notice that you use pixel-for-pixel (PPP) to measure the compactness of the model; however, as I'm new to this area, I'm confused about how to compute the PPP. Would you mind explaining more about this metric, or directly updating the code to compute it? I would appreciate it if you could.

Thanks for your time again.

PSNR

Dear author @haochen-rye,

I would like to extend my sincere appreciation for the exceptional work you have accomplished. I am writing to seek clarification on the utilization of PSNR metrics within the model evaluation process. My understanding of this metric is that it is typically calculated using the formula 20 * torch.log10(255.0 / torch.sqrt(mse)). However, I noticed that in your code, the calculation for PSNR is represented as psnr = -10 * torch.log10(mse). Could you kindly provide an explanation for this discrepancy? Your insights would be greatly appreciated.

Thank you for your attention to this matter.

Yours sincerely,
Charles

Update: Because of the preprocessing step, the model normalizes the tensor to the range [0, 1], so MAX = 1 and the formula simplifies to the one used in the code.
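
The equivalence is easy to verify numerically, since 20 * log10(1 / sqrt(mse)) = -10 * log10(mse) when MAX = 1. For example:

import torch

mse = torch.tensor(1e-3)
psnr_full = 20 * torch.log10(1.0 / torch.sqrt(mse))  # 20*log10(MAX/sqrt(mse)), MAX = 1
psnr_code = -10 * torch.log10(mse)                   # form used in the repo
print(psnr_full.item(), psnr_code.item())            # both 30.0 dB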

De-synchronized frames after quantization and decoding

I have tried encoding and decoding a video using the reference software, and it seems that, in the generated comparisons, the original and quantized decoded frames are not synchronized. This happens when decoding the video 'bunny' using the provided weights as well; the comparison image for the first frame is named "pred_0000_13.83.png".

I have run the following command, which is the one reported in the README:

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
    --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2  \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001 \
    --eval_only --weight checkpoints/hnerv-1.5m-e300.pth \
    --quant_model_bit 8 --quant_embed_bit 6 \
    --dump_images --dump_videos

The GIF file is not synchronized either. This problem does not seem to affect the unquantized predictions. What could the problem be? I have installed the required dependencies using the provided file.

Hardware specifications:

GPU: Tesla K80
Driver Version: 470.141.03
CUDA Version: 11.4

About the internal generalization result.

Hello, thanks for the nice work. But I have some questions about several small details in the paper. Regarding the internal generalization test results, at what size and for how many epochs are all the models trained? Could you give comparison details?

About the crop

Thanks for your great work.
It seems that you crop the images before feeding them into the network, so the output images are also cropped. Is there a way to restore the original image size?
The current code seems able to process only videos with a 1:2 aspect ratio.

Raw or MP4 format?

Dear @haochen-rye,

I hope this letter finds you well. First and foremost, I would like to extend my appreciation for the outstanding work you have been doing. 🤗

I am writing to inquire about the UVG dataset and its testing procedure. Specifically, I would like to know whether you run tests on raw YUV videos or on MP4 videos. As far as I understand, the UVG dataset is provided both raw and compressed. However, I am aware that when reading video files, the matrices representing the channel data will be the same, regardless of the file format.

In my search for NeRV-based papers, I have been unable to locate any code files that demonstrate how to read raw video files. Could you kindly provide clarification on this matter or direct me to relevant resources?

Thank you for your time and attention to this matter. I genuinely appreciate any insights you can share. Wishing you a wonderful day and looking forward to hearing from you soon.

Sincerely,
Charles

The difference between HNeRV and AutoEncoder

Hi~ @haochen-rye
Thanks for sharing your nice work. After reading the paper, I find that the network structure and design of HNeRV seem similar to an auto-encoder (AE). Although original AEs are mainly used for supervised/unsupervised learning, applying them to data fitting/compression is also a direct and valid idea. For classical NeRF (or NeRV, from your other work), one can use a coordinate to query the corresponding pixel or frame values. But for HNeRV, the input is actually the video/frame itself rather than a coordinate, which means one cannot query data from an explicit coordinate; instead, one must have the image embedding from the encoder beforehand to query the image.

I think this should be the main difference between HNeRV and conventional NeRF, NeRV, and E-NeRV. Have I misunderstood something? And what's your opinion on the difference?

BTW, I wonder how long it takes to train NeRV and HNeRV; I didn't find the absolute training time in the paper. Thanks.

Pruning of Trained Model

Hello, thanks for the nice work.

One question: I believe the code and process for model pruning are not included in the current training and eval scripts. Is this correct?

Thanks!
