
HNeRV: A Hybrid Neural Representation for Videos (CVPR 2023)

Hao Chen, Matthew Gwilliam, Ser-Nam Lim, Abhinav Shrivastava
This is the official implementation of the paper "HNeRV: A Hybrid Neural Representation for Videos".

TODO

  • [x] Video inpainting
  • [x] Fast loading from video checkpoints
  • [ ] Upload results and checkpoints for UVG

Method overview

Get started

We run with Python 3.8. You can set up a conda environment and install all dependencies like so:

pip install -r requirements.txt 

High-Level structure

The code is organized as follows:

  • train_nerv_all.py includes a generic training routine.
  • model_all.py contains the dataloader and neural network architecture.
  • data/ directory for video/image datasets; we provide bunny frames here.
  • checkpoints/ directory for model weights and quantized video checkpoints; we provide both for bunny here.
  • Log files (tensorboard, txt, state_dict, etc.) will be saved in the output directory (specified by --outf).
  • We provide numerical results for distortion-compression trade-offs at uvg_results and per_video_results.

Reproducing experiments

Training HNeRV

A 1.5M-parameter HNeRV is specified with '--modelsize 1.5', and we balance parameters across layers with '--ks 0_1_5 --reduce 1.2'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
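
For intuition, the encoder strides determine the embedding's spatial size: each frame is downsampled by the product of '--enc_strds', so a 640x1280 crop becomes a tiny 2x4 per-frame embedding. A quick sanity check (illustrative only, not part of the repo):

# Illustrative only: compute the HNeRV embedding's spatial size from the encoder strides.
from math import prod

crop_h, crop_w = 640, 1280      # --crop_list 640_1280
enc_strds = [5, 4, 4, 2, 2]     # --enc_strds 5 4 4 2 2
total_stride = prod(enc_strds)  # 320

embed_h, embed_w = crop_h // total_stride, crop_w // total_stride
print(embed_h, embed_w)         # 2 4 -> each frame embedding is spatially 2x4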

NeRV baseline

The NeRV baseline is specified with '--embed pe_1.25_80 --fc_hw 8_16', with imbalanced parameters '--ks 0_3_3 --reduce 2'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
   --resize_list -1 --loss L2   --embed pe_1.25_80 --fc_hw 8_16 \
    --dec_strds 5 4 2 2 --ks 0_3_3 --reduce 2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
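
Here '--embed pe_1.25_80' follows NeRV-style positional encoding of the normalized frame index t, with base b = 1.25 and 80 sin/cos frequency pairs. A minimal sketch of our reading of that flag (not the repo's exact code):

import math
import torch

def nerv_positional_encoding(t: float, b: float = 1.25, l: int = 80) -> torch.Tensor:
    """Expand a normalized frame index t in [0, 1] into a 2*l-dim embedding:
    [sin(b^i * pi * t), cos(b^i * pi * t)] for i = 0..l-1, as in NeRV."""
    freqs = torch.pow(b, torch.arange(l, dtype=torch.float32))  # b^0 ... b^(l-1)
    angles = math.pi * t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])

print(nerv_positional_encoding(0.5).shape)  # torch.Size([160])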

Evaluation & dump images and videos

To evaluate a pre-trained model, use '--eval_only --weight [CKT_PATH]' to specify the checkpoint path.
For model and embedding quantization, use '--quant_model_bit 8 --quant_embed_bit 6'.
To dump images or videos, use '--dump_images --dump_videos'.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2  \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001 \
   --eval_only --weight checkpoints/hnerv-1.5m-e300.pth \
   --quant_model_bit 8 --quant_embed_bit 6 \
    --dump_images --dump_videos
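
Conceptually, '--quant_model_bit 8' maps each weight tensor onto 2^8 uniform levels between its minimum and maximum. A minimal per-tensor sketch of such uniform quantization (an illustration of the idea, not the repo's exact quantizer):

import torch

def uniform_quant(x: torch.Tensor, bits: int = 8):
    """Quantize a tensor to 2**bits uniform levels between its min and max;
    return the dequantized tensor and the max absolute quantization error."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = torch.round((x - lo) / scale)  # integer codes in [0, 2**bits - 1]
    x_hat = q * scale + lo             # dequantized weights
    return x_hat, (x - x_hat).abs().max()

w = torch.randn(64, 64)
w_hat, err = uniform_quant(w, bits=8)
print(err)  # small for 8 bits; grows as bits decrease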

Video inpainting

The inpainting task is specified with '--vid bunny_inpaint_50', where '50' is the mask size.

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny_inpaint_50   \
   --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2   \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001
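
For a sense of what the mask does, here is a toy version that zeroes out a 50x50 box at the frame center; the box position and count here are hypothetical, and the actual mask layout for 'bunny_inpaint_50' is defined in the repo's dataloader:

import torch

def apply_center_mask(frame: torch.Tensor, mask_size: int = 50) -> torch.Tensor:
    """Toy illustration: zero out a mask_size x mask_size box at the frame center."""
    _, h, w = frame.shape  # frame: (C, H, W)
    top, left = (h - mask_size) // 2, (w - mask_size) // 2
    masked = frame.clone()
    masked[:, top:top + mask_size, left:left + mask_size] = 0.0
    return masked

frame = torch.rand(3, 640, 1280)
print(apply_center_mask(frame).shape)  # torch.Size([3, 640, 1280])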

Efficient video loading

We can load a video efficiently from a tiny checkpoint.
Specify the decoder and video checkpoint with '--decoder [Decoder_path] --ckt [Video checkpoint]', and the output directory and number of frames with '--dump_dir [out_dir] --frames [frame_num]'.

python efficient_nvloader.py --frames 16
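
A rough sketch of what such a loader does, assuming the decoder and the quantized frame embeddings are stored as separate checkpoints (the paths and the 'embeddings' key below are hypothetical; see efficient_nvloader.py for the real logic):

import torch

# Hypothetical paths and keys, for illustration only.
decoder = torch.load('checkpoints/decoder.pth', map_location='cpu')
video_ckt = torch.load('checkpoints/quant_vid.pth', map_location='cpu')

decoder.eval()
with torch.no_grad():
    embeds = video_ckt['embeddings'][:16]  # first --frames 16 embeddings
    frames = decoder(embeds)               # reconstruct frames from tiny embeddings
print(frames.shape)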

Citation

If you find our work useful in your research, please cite:

@InProceedings{chen2023hnerv,
      title={{HNeRV}: A Hybrid Neural Representation for Videos},
      author={Hao Chen and Matthew Gwilliam and Ser-Nam Lim and Abhinav Shrivastava},
      year={2023},
      booktitle={CVPR},
}

Contact

If you have any questions, please feel free to email the authors: [email protected]


hnerv's Issues

About results of Video regression at resolution 480×960

I cannot reproduce your video compression results at resolution 480×960. Could you provide more details to help me reproduce the 480×960 results on the UVG dataset?
I use the command python train_nerv_all.py --outf 1120 --data_path data/dance-twirl --vid dance_3M_resize --conv_type convnext pshuffel --act gelu --norm none --crop_list 960_1920 --resize_list 480_960 --loss L2 --enc_strds 5 4 3 2 2 --enc_dim 64_16 --dec_strds 5 4 3 2 2 --ks 0_1_5 --reduce 1.2 --modelsize 3 -e 300 --eval_freq 30 --lower_width 12 -b 2 --lr 0.001 --quant_model_bit -1 --dump_images

Architecture

Hello, I noticed that the HNeRV in your code seems to differ slightly from the architecture in the paper.
In the paper, a ConvNeXt-based encoder first produces a learned embedding, which is then fed to the decoder.
But in the code, I found that you use optional positional encoding for embeddings, and then use ConvNeXt for the encoder, followed by the decoder. Why is there a line of code img_embed = self.encoder(input)? Which part of the code produces the learned small embeddings?

License

Hello, is it intentional that you did not specify a license for the project?

About the ppp metric computation

Thanks for your great work in representing the video.

I notice that you use pixel-for-pixel (PPP) to measure the compactness of the model; however, as I'm new to this area, I'm confused about how to compute the PPP. Would you mind explaining more about this metric, or directly updating the code to compute it? I would appreciate it if you could.

Thanks for your time again.

PSNR

Dear author @haochen-rye,

I would like to extend my sincere appreciation for the exceptional work you have accomplished. I am writing to seek clarification on the utilization of PSNR metrics within the model evaluation process. My understanding of this metric is that it is typically calculated using the formula 20 * torch.log10(255.0 / torch.sqrt(mse)). However, I noticed that in your code, the calculation for PSNR is represented as psnr = -10 * torch.log10(mse). Could you kindly provide an explanation for this discrepancy? Your insights would be greatly appreciated.

Thank you for your attention to this matter.

Yours sincerely,
Charles

Update: Because of the preprocessing step, the model normalizes the tensor to the range [0, 1], so MAX = 1 and the formula simplifies to the one used in the code.
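
The equivalence is easy to verify numerically, since 20 * log10(1 / sqrt(mse)) = -10 * log10(mse) when MAX = 1. For example:

import torch

mse = torch.tensor(1e-3)
psnr_full = 20 * torch.log10(1.0 / torch.sqrt(mse))  # 20*log10(MAX/sqrt(mse)), MAX = 1
psnr_code = -10 * torch.log10(mse)                   # form used in the repo
print(psnr_full.item(), psnr_code.item())            # both 30.0 dB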

De-synchronized frames after quantization and decoding

I have tried encoding and decoding a video using the reference software, and it seems that, in the generated comparisons, the original and quantized decoded frames are not synchronized. This happens when decoding the video 'bunny' using the provided weights as well; the comparison image for the first frame is named "pred_0000_13.83.png".

I have run the following command, which is the one reported in the README:

python train_nerv_all.py  --outf 1120  --data_path data/bunny --vid bunny   \
    --conv_type convnext pshuffel --act gelu --norm none  --crop_list 640_1280  \
    --resize_list -1 --loss L2  --enc_strds 5 4 4 2 2 --enc_dim 64_16 \
    --dec_strds 5 4 4 2 2 --ks 0_1_5 --reduce 1.2  \
    --modelsize 1.5  -e 300 --eval_freq 30  --lower_width 12 -b 2 --lr 0.001 \
    --eval_only --weight checkpoints/hnerv-1.5m-e300.pth \
    --quant_model_bit 8 --quant_embed_bit 6 \
    --dump_images --dump_videos

The GIF file is not synchronized either. This problem does not seem to affect the unquantized predictions. What could the problem be? I have installed the required dependencies using the provided file.

Hardware specifications:

GPU: Tesla K80
Driver Version: 470.141.03
CUDA Version: 11.4

About the internal generalization result.

Hello, thanks for the nice work. But I have some questions about several small details in the paper. Regarding the internal generalization test results, at what size and for how many epochs are all the models trained? Could you give comparison details?

About the crop

Thanks for your great work.
It seems that you crop the images before feeding them into the network, so the output images are also cropped. Is there a way to restore the original image size?
The current code seems able to process only videos with a 1:2 aspect ratio.

Raw or MP4 format?

Dear @haochen-rye,

I hope this letter finds you well. First and foremost, I would like to extend my appreciation for the outstanding work you have been doing. 🤗

I am writing to inquire about the UVG dataset and its testing procedure. Specifically, I would like to know whether you run tests on raw YUV videos or on MP4 videos. As far as I understand, the UVG dataset is provided both raw and compressed. However, I am aware that when reading video files, the matrices representing the channel data will be the same, regardless of the file format.

In my search for NeRV-based papers, I have been unable to locate any code files that demonstrate how to read raw video files. Could you kindly provide clarification on this matter or direct me to relevant resources?

Thank you for your time and attention to this matter. I genuinely appreciate any insights you can share. Wishing you a wonderful day and looking forward to hearing from you soon.

Sincerely,
Charles

The difference between HNeRV and AutoEncoder

Hi~ @haochen-rye
Thanks for sharing your nice work. After reading the paper, I find that the network structure and design of HNeRV seem similar to an auto-encoder (AE). Although original AEs are mainly used for supervised/unsupervised learning, applying them to data fitting/compression is also a direct and valid idea. For classical NeRF (or NeRV, from your other work), one can use a coordinate to query the corresponding pixel or frame values. But for HNeRV, the input is actually the video/frame itself rather than a coordinate, which means one cannot query data from an explicit coordinate; instead, one must have the image embedding from the encoder beforehand to query the image.

I think this should be the main difference between HNeRV and conventional NeRF, NeRV, and E-NeRV. Have I misunderstood something? And what's your opinion on the difference?

BTW, I wonder how long it takes to train NeRV and HNeRV; I didn't find the absolute training time in the paper. Thanks.

Pruning of Trained Model

Hello, thanks for the nice work.

One question: I believe the code and process for model pruning are not included in the current training and eval scripts. Is this correct?

Thanks!
