
gpujpeg's Introduction

GPUJPEG

JPEG encoder and decoder library and console application for NVIDIA GPUs, aimed at high-performance image encoding and decoding. The software also runs on AMD GPUs using ZLUDA (see ZLUDA.md).

This document provides an introduction to the library and how to use it. You can also look at FAQ.md for additional information. To see the latest changes, display the file NEWS.md.


Authors

  • Martin Srom, CESNET z.s.p.o
  • Jan Brothánek
  • Petr Holub
  • Martin Jirman
  • Jiri Matela
  • Martin Pulec
  • Lukáš Ručka

Features

  • uses NVIDIA CUDA platform
  • baseline Huffman 8-bit coding
  • use of the JFIF file format by default; Adobe and SPIFF are supported as well (used by the encoder if the JPEG internal color space is not representable in JFIF, e.g. limited-range YCbCr BT.709 or RGB)
  • use of restart markers that allow fast parallel encoding/decoding
  • The encoder creates a non-interleaved stream by default; optionally it can produce an interleaved stream (all components in one scan) and/or a subsampled stream.
  • support for color transformations and coding RGB JPEG
  • The decoder can decompress JPEG codestreams that can be generated by the encoder. If a scan contains restart markers, the decoder can use parallelism for fast decoding.
  • command-line tool with support for encoding/decoding raw images as well as PNM/PAM or Y4M

Overview

Encoding/decoding of a JPEG codestream is divided into the following phases:

 Encoding:                       Decoding
 1) Input data loading           1) Input data loading
 2) Preprocessing                2) Parsing codestream
 3) Forward DCT                  3) Huffman decoder
 4) Huffman encoder              4) Inverse DCT
 5) Formatting codestream        5) Postprocessing

and they are implemented on the CPU and/or GPU as follows:

  • CPU:
    • Input data loading
    • Parsing codestream
    • Huffman encoder/decoder (when restart markers are disabled)
    • Output data formatting
  • GPU:
    • Preprocessing/Postprocessing (color component parsing, color transformation RGB <-> YCbCr)
    • Forward/Inverse DCT (discrete cosine transform)
    • Huffman encoder/decoder (when restart markers are enabled)

Performance

The source 16K (DCI) image (8, 9) was cropped to 15360x8640+0+0 (1920x1080 multiplied by 8 in both dimensions) and downscaled for lower resolutions. Encoding was done with default values with RGB input (quality 75, non-interleaved, restart interval 24-36; average of 99 measurements, excluding the first iteration) using the following command:

gpujpegtool -v -e mediadivision_frame_<res>.pnm mediadivision_frame_<res>.jpg -n 100 [-q <Q>]

Encoding

 GPU \ resolution             HD (2 Mpix)   4K (8 Mpix)   8K (33 Mpix)   16K (132 Mpix)
 RTX 3080                     0.54 ms       1.71 ms       6.20 ms        24.48 ms
 RTX 2080 Ti                  0.82 ms       2.89 ms       11.15 ms       46.23 ms
 GTX 1060M                    1.36 ms       4.55 ms       17.34 ms       (low mem)
 GTX 580                      2.38 ms       8.68 ms       (low mem)      (low mem)
 AMD Radeon RX 7600 [ZLUDA]   0.88 ms       3.16 ms       13.09 ms       50.52 ms

Note: The first iteration took 233 ms for 8K on the RTX 3080 and scales roughly proportionally with resolution.

Further measurements were performed on the RTX 3080 only:

 quality                            10     20     30     40     50     60     70     80     90     100
 duration HD (ms)                   0.48   0.49   0.50   0.51   0.51   0.53   0.54   0.57   0.60   0.82
 duration 4K (ms)                   1.61   1.65   1.66   1.67   1.69   1.68   1.70   1.72   1.79   2.44
 duration 8K (ms)                   6.02   6.04   6.09   6.14   6.12   6.17   6.21   6.24   6.47   8.56
 duration 8K (ms, w/o PCIe xfers)   2.13   2.14   2.18   2.24   2.23   2.25   2.28   2.33   2.50   5.01

Decoding

The decoded images were those encoded in the previous section; averaging was done similarly, taking 99 samples and excluding the first one. The command used:

gpujpegtool -v mediadivision_frame_<res>.jpg output.pnm -n 100

 GPU \ resolution             HD (2 Mpix)   4K (8 Mpix)   8K (33 Mpix)   16K (132 Mpix)
 RTX 3080                     0.75 ms       1.94 ms       6.76 ms        31.50 ms
 RTX 2080 Ti                  1.02 ms       1.07 ms       11.29 ms       44.42 ms
 GTX 1060M                    1.68 ms       4.81 ms       17.56 ms       (low mem)
 GTX 580                      2.61 ms       7.96 ms       (low mem)      (low mem)
 AMD Radeon RX 7600 [ZLUDA]   1.00 ms       3.02 ms       11.25 ms       45.06 ms

Note: (low mem) above means that the card didn't have sufficient memory to encode or decode the picture.

The following measurements were performed on the RTX 3080 only:

 quality                            10     20     30     40     50     60     70     80     90     100
 duration HD (ms)                   0.58   0.60   0.63   0.65   0.67   0.69   0.73   0.78   0.89   1.58
 duration 4K (ms)                   1.77   1.80   1.83   1.84   1.87   1.89   1.92   1.95   2.11   3.69
 duration 8K (ms)                   6.85   6.88   6.90   6.92   6.98   6.70   6.74   6.84   7.17   12.43
 duration 8K (ms, w/o PCIe xfers)   2.14   2.18   2.21   2.24   2.27   2.29   2.34   2.42   2.71   7.27

Quality

The following table summarizes encoding quality and file size using an NVIDIA GTX 580 for a non-interleaved, non-subsampled stream with different quality settings (PSNR and encoded-size values are averages over several images, each encoded multiple times):

 quality   PSNR 4K¹   size 4K      PSNR HD²   size HD
 10        29.33 dB   539.30 kB    27.41 dB   145.90 kB
 20        32.70 dB   697.20 kB    30.32 dB   198.30 kB
 30        34.63 dB   850.60 kB    31.92 dB   243.60 kB
 40        35.97 dB   958.90 kB    32.99 dB   282.20 kB
 50        36.94 dB   1073.30 kB   33.82 dB   319.10 kB
 60        37.96 dB   1217.10 kB   34.65 dB   360.00 kB
 70        39.22 dB   1399.20 kB   35.71 dB   422.10 kB
 80        40.67 dB   1710.00 kB   37.15 dB   526.70 kB
 90        42.83 dB   2441.40 kB   39.84 dB   768.40 kB
 100       47.09 dB   7798.70 kB   47.21 dB   2499.60 kB

¹ 4096x2160, ² 1920x1080

Compile

To build the console application, check Requirements, go to the gpujpeg directory (where the README.md and COPYING files are located) and run cmake:

cmake -DCMAKE_BUILD_TYPE=Release -Bbuild .
cmake --build build

You can also use autotools to create a build recipe for the library and the application, or a plain old Makefile.bkp. However, cmake is recommended.
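
After a successful build, you can optionally install the library and the tool system-wide. This is the standard CMake install step rather than anything GPUJPEG-specific; the prefix is just an example:

sudo cmake --install build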

Usage

libgpujpeg library

To build the libgpujpeg library, check Compile.

To use the library in your project, include its header in your sources and link the shared library object into your executable:

#include <libgpujpeg/gpujpeg.h>
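
For example, compiling and linking against an installed GPUJPEG could look like this (a minimal sketch; the source file name is a placeholder and the include/library paths depend on where GPUJPEG was installed):

cc -o my_app my_app.c -lgpujpeg

or, with a non-default install prefix:

cc -o my_app my_app.c -I/usr/local/include -L/usr/local/lib -lgpujpeg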

For simple library usage examples, look into the examples subdirectory.

Encoding

To encode with the libgpujpeg library, you have to declare two structures and set proper values in them. The first is the definition of encoding/decoding parameters, and the second is a structure with the parameters of the input image:

struct gpujpeg_parameters param;
gpujpeg_set_default_parameters(&param);
param.quality = 80;          // (default value is 75)
param.restart_interval = 16; // (default value is 8)
param.interleaved = 1;       // (default value is 0)

struct gpujpeg_image_parameters param_image;
gpujpeg_image_set_default_parameters(&param_image);
param_image.width = 1920;
param_image.height = 1080;
param_image.color_space = GPUJPEG_RGB; // input color space (GPUJPEG_RGB is
                                       // the default), can also be
                                       // e.g. GPUJPEG_YCBCR_JPEG
param_image.pixel_format = GPUJPEG_444_U8_P012; // or e.g. GPUJPEG_U8 for grayscale
                                                // (default value is GPUJPEG_444_U8_P012)

If you want to use subsampling in the JPEG format, call the following function, which will set the default sampling factors (2x2 for Y, 1x1 for Cb and Cr):

// Use 4:2:0 subsampling
gpujpeg_parameters_chroma_subsampling(&param, GPUJPEG_SUBSAMPLING_420);

Or define sampling factors by hand:

// Use custom sampling factors
gpujpeg_parameters_chroma_subsampling(&param, MK_SUBSAMPLING(4, 4, 1, 2, 2, 1, 0, 0));

Next you can initialize the CUDA device (if not called, the default CUDA device will be used):

if ( gpujpeg_init_device(device_id, 0) )
    return -1;

where the first parameter is the CUDA device id (e.g. device_id = 0) and the second parameter is a flag specifying whether verbose output should be used (0 or GPUJPEG_VERBOSE). The next step is to create the encoder:

struct gpujpeg_encoder* encoder = gpujpeg_encoder_create(0);
if ( encoder == NULL )
    return -1;

When creating the encoder, the library allocates all device buffers that will be needed for image encoding, so when you encode a concrete image they are already allocated and the encoder will reuse them for every image. Now we need raw image data that the encoder can encode; for example, we can load it from a file:

size_t image_size = 0;
uint8_t* input_image = NULL;
if ( gpujpeg_image_load_from_file("input_image.rgb", &input_image,
         &image_size) != 0 )
    return -1;

The next step is to encode the uncompressed image data to JPEG compressed data with the encoder:

struct gpujpeg_encoder_input encoder_input;
gpujpeg_encoder_input_set_image(&encoder_input, input_image);

uint8_t* image_compressed = NULL;
int image_compressed_size = 0;
if ( gpujpeg_encoder_encode(encoder, &encoder_input, &image_compressed,
         &image_compressed_size) != 0 )
    return -1;

The compressed data is placed in an internal encoder buffer, so we have to save it somewhere else before we start encoding the next image; for example, we can save it to a file:

if ( gpujpeg_image_save_to_file("output_image.jpg", image_compressed,
         image_compressed_size, NULL) != 0 )
    return -1;

Now we can load, encode and save the next image, or finish and clean up: destroy the loaded image and destroy the encoder.

gpujpeg_image_destroy(input_image);
gpujpeg_encoder_destroy(encoder);
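
Putting the snippets above together, a minimal complete encoding program could look like the following sketch. It simply combines the calls shown above; error handling is kept terse and the file names are placeholders.

#include <stdlib.h>
#include <stdint.h>
#include <libgpujpeg/gpujpeg.h>

int main(void)
{
    // Coder and input image parameters (1920x1080 packed RGB, quality 80).
    // NOTE: how these structures are handed to the encoder differs between
    // GPUJPEG versions; the calls below mirror the snippets in this README.
    struct gpujpeg_parameters param;
    gpujpeg_set_default_parameters(&param);
    param.quality = 80;

    struct gpujpeg_image_parameters param_image;
    gpujpeg_image_set_default_parameters(&param_image);
    param_image.width = 1920;
    param_image.height = 1080;
    param_image.color_space = GPUJPEG_RGB;
    param_image.pixel_format = GPUJPEG_444_U8_P012;

    // Use the default CUDA device and create the encoder
    if ( gpujpeg_init_device(0, 0) != 0 )
        return EXIT_FAILURE;
    struct gpujpeg_encoder* encoder = gpujpeg_encoder_create(0);
    if ( encoder == NULL )
        return EXIT_FAILURE;

    // Load the raw input image
    size_t image_size = 0;
    uint8_t* input_image = NULL;
    if ( gpujpeg_image_load_from_file("input_image.rgb", &input_image,
             &image_size) != 0 )
        return EXIT_FAILURE;

    // Encode it
    struct gpujpeg_encoder_input encoder_input;
    gpujpeg_encoder_input_set_image(&encoder_input, input_image);

    uint8_t* image_compressed = NULL;
    int image_compressed_size = 0;
    if ( gpujpeg_encoder_encode(encoder, &encoder_input, &image_compressed,
             &image_compressed_size) != 0 )
        return EXIT_FAILURE;

    // Save the JPEG (the compressed data lives in an internal encoder buffer)
    if ( gpujpeg_image_save_to_file("output_image.jpg", image_compressed,
             image_compressed_size, NULL) != 0 )
        return EXIT_FAILURE;

    // Clean up
    gpujpeg_image_destroy(input_image);
    gpujpeg_encoder_destroy(encoder);
    return EXIT_SUCCESS;
}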

Decoding

For decoding we don't need to initialize the two parameter structures. We only have to initialize the CUDA device (if we haven't initialized it yet) and create the decoder:

if ( gpujpeg_init_device(device_id, 0) )
    return -1;

struct gpujpeg_decoder* decoder = gpujpeg_decoder_create(0);
if ( decoder == NULL )
    return -1;

Now we have two options. The first is to do nothing and let the decoder postpone buffer allocations until the first image is decoded, when it determines the proper image size and all other parameters (recommended). The second option is to provide the input image size and other parameters (restart interval, interleaving), in which case the decoder allocates all buffers up front and is fully ready even when decoding the first image:

// you can skip this code below and let the decoder initialize automatically
struct gpujpeg_parameters param;
gpujpeg_set_default_parameters(&param);
param.restart_interval = 16;
param.interleaved = 1;

struct gpujpeg_image_parameters param_image;
gpujpeg_image_set_default_parameters(&param_image);
param_image.width = 1920;
param_image.height = 1080;
param_image.color_space = GPUJPEG_RGB;
param_image.pixel_format = GPUJPEG_444_U8_P012;

// Pre initialize decoder before decoding
gpujpeg_decoder_init(decoder, &param, &param_image);

If you didn't initialize the decoder with gpujpeg_decoder_init but want to specify the output image color space and subsampling factor, you can use the following code:

gpujpeg_decoder_set_output_format(decoder, GPUJPEG_RGB,
                GPUJPEG_444_U8_P012);
// or eg. GPUJPEG_YCBCR_JPEG and GPUJPEG_422_U8_P1020

If not called, RGB or grayscale is output depending on JPEG channel count.

Next we have to load the JPEG image data from a file and decode it to raw image data:

size_t image_size = 0;
uint8_t* image = NULL;
if ( gpujpeg_image_load_from_file("input_image.jpg", &image,
         &image_size) != 0 )
    return -1;

struct gpujpeg_decoder_output decoder_output;
gpujpeg_decoder_output_set_default(&decoder_output);
if ( gpujpeg_decoder_decode(decoder, image, image_size,
         &decoder_output) != 0 )
    return -1;

Now we can save the decoded raw image data to a file and perform cleanup:

if ( gpujpeg_image_save_to_file("output_image.pnm", decoder_output.data,
         decoder_output.data_size, &decoder_output.param_image) != 0 )
    return -1;

gpujpeg_image_destroy(image);
gpujpeg_decoder_destroy(decoder);
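
Again, putting the snippets together, a minimal complete decoding program could look like this sketch (same assumptions as the encoding sketch above; file names are placeholders):

#include <stdlib.h>
#include <stdint.h>
#include <libgpujpeg/gpujpeg.h>

int main(void)
{
    // Use the default CUDA device and create the decoder
    if ( gpujpeg_init_device(0, 0) != 0 )
        return EXIT_FAILURE;
    struct gpujpeg_decoder* decoder = gpujpeg_decoder_create(0);
    if ( decoder == NULL )
        return EXIT_FAILURE;

    // Optionally request packed RGB output (otherwise RGB or grayscale is
    // chosen automatically according to the JPEG channel count)
    gpujpeg_decoder_set_output_format(decoder, GPUJPEG_RGB, GPUJPEG_444_U8_P012);

    // Load the JPEG data and decode it
    size_t image_size = 0;
    uint8_t* image = NULL;
    if ( gpujpeg_image_load_from_file("input_image.jpg", &image,
             &image_size) != 0 )
        return EXIT_FAILURE;

    struct gpujpeg_decoder_output decoder_output;
    gpujpeg_decoder_output_set_default(&decoder_output);
    if ( gpujpeg_decoder_decode(decoder, image, image_size,
             &decoder_output) != 0 )
        return EXIT_FAILURE;

    // Save the raw output and clean up
    if ( gpujpeg_image_save_to_file("output_image.pnm", decoder_output.data,
             decoder_output.data_size, &decoder_output.param_image) != 0 )
        return EXIT_FAILURE;

    gpujpeg_image_destroy(image);
    gpujpeg_decoder_destroy(decoder);
    return EXIT_SUCCESS;
}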

GPUJPEG console application

The console application gpujpegtool uses the libgpujpeg library to demonstrate its functions. To build the console application, check Compile.

To encode a raw RGB image file into a JPEG image file, use the following command:

gpujpegtool --encode --size=WIDTHxHEIGHT --quality=QUALITY \
        INPUT_IMAGE.rgb OUTPUT_IMAGE.jpg

You must specify the input image size with the --size=WIDTHxHEIGHT parameter. Optionally you can specify the desired output quality with the --quality=QUALITY parameter, which accepts values 0-100. The console application accepts a few more parameters; you can list them with the following command:

gpujpegtool --help

To decode a JPEG image file into a raw RGB image file, use the following command:

gpujpegtool --decode OUTPUT_IMAGE.jpg INPUT_IMAGE.rgb

You can also encode and decode an image in one run to test the console application:

gpujpegtool --encode --decode --size=WIDTHxHEIGHT --quality=QUALITY \
        INPUT_IMAGE.rgb OUTPUT_IMAGE.jpg

The decoder will create a new decoded file OUTPUT_IMAGE.jpg.decoded.rgb and will not overwrite your INPUT_IMAGE.rgb file.

The console application is able to load raw RGB image data from *.rgb files and raw YUV and YUV422 data from *.yuv files. For YUV422 you must specify a *.yuv file and use the --sampling-factor=4:2:2 parameter, as in the example below.
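
For instance, encoding a 1920x1080 YUV422 file might look like this (a sketch based only on the flags described above; the file names and size are placeholders):

gpujpegtool --encode --size=1920x1080 --sampling-factor=4:2:2 \
        INPUT_IMAGE.yuv OUTPUT_IMAGE.jpg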

All the parameters supported by the console application are the following:

--help
    Prints console application help
--size=1920x1080
    Input image size in pixels, e.g. 1920x1080
--pixel-format=444-u8-p012
    Input/output image pixel format ('u8', '444-u8-p012', '444-u8-p012z',
    '444-u8-p0p1p2', '422-u8-p1020', '422-u8-p0p1p2' or '420-u8-p0p1p2')
--colorspace=rgb
    Input image colorspace (supported are 'rgb', 'yuv' and 'ycbcr-jpeg',
    where 'yuv' means YCbCr ITU-R BT.601), when *.yuv file is specified,
    instead of default 'rgb', automatically the colorspace 'yuv' is used
--quality
    Set output quality level 0-100 (default 75)
--restart=8
    Set restart interval for encoder, number of MCUs between
    restart markers
--subsampled
    Produce chroma subsampled JPEG stream
--interleaved
    Produce interleaved stream
--encode
    Encode images
--decode
    Decode images
--device=0
    By using this parameter you can specify CUDA device id which will
    be used for encoding/decoding.

The restart interval is important for parallel Huffman encoding and decoding. When --restart=N is used (the default is 8), the coder can process each run of N MCUs independently and can therefore code them in parallel. When --restart=0 is specified, the restart interval is disabled and the coder must use the CPU version of the Huffman coder (on the GPU only a single thread would run, which is very slow).
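
The same trade-off applies when using the library directly; a minimal sketch using the gpujpeg_parameters structure from the Encoding section above:

struct gpujpeg_parameters param;
gpujpeg_set_default_parameters(&param);
param.restart_interval = 16;   // restart markers every 16 MCUs -> parallel (GPU) Huffman coding
// param.restart_interval = 0; // no restart markers -> CPU Huffman coder (single-threaded, slow)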

The console application can encode/decode multiple images with the following command:

gpujpegtool ARGUMENTS INPUT_IMAGE_1.rgb OUTPUT_IMAGE_1.jpg \
        INPUT_IMAGE_2.rgb OUTPUT_IMAGE_2.jpg ...

Requirements

To build and run the libgpujpeg library and the gpujpeg console application, you need:

  1. NVIDIA CUDA Toolkit
  2. C/C++ compiler + CMake
  3. CUDA-enabled NVIDIA GPU (compute capability >= 2.0; older may or may not work) with NVIDIA drivers, or an AMD GPU with ZLUDA (see ZLUDA.md)
  4. optional OpenGL support:
    • GLEW, OpenGL (usually present in Windows, may need headers installation in Linux)
    • GLFW or GLX (Linux only) for context creation
    • GLUT for OpenGL tests

License

  • See file COPYING.
  • This software contains source code provided by NVIDIA Corporation.
  • This software source code is based on SiGenGPU [3].

References

  1. ITU-T Rec T.81
  2. ILG
  3. SiGenGPU (currently defunct)
  4. ECMA TR/098 (JFIF)
  5. ITU-T Rec T.84 (SPIFF)
  6. SPIFF File Format Summary (FileFormat.Info)
  7. Component ID registration

gpujpeg's Issues

Encoded Image Not Displaying Correctly on macOS

Hello,

I have been using this library for a while now and everything seems to be working correctly, except for one use case. Images encoded with GPUJPEG do not seem to display correctly on macOS in the Apple "Preview" application. The images are encoded on the Windows platform and then copied to the Mac. We have tried multiple formats and settings and do not see any change in behavior. We are seeing the issue with our implementation as well as with your sample.

The Mac we are testing on is El Capitan, with Preview application 8.1 (877.7). Please note that the thumbnails on Mac display correctly, the issue only presents itself when the jpg is opened by the Preview application. We have not seen the issue with any other application on Mac or Windows.

I have attached a screenshot of what the image looks like in the Preview application as well as the encoded jpg. The input image used was camera_bt709_422.yuv.

Any guidance you can provide on this issue would be appreciated.

Thanks,
Mike
[attached: the encoded JPEG (mike_test) and a screenshot of how it renders in Preview]

Input sampling factor 4:2:0

I'm working with YUV 4:2:0 frames that I want to compress to JPEG.
The supported factors for param_image.sampling_factor are only GPUJPEG_4_4_4 and GPUJPEG_4_2_2.
How difficult would it be to add support for 4:2:0 sampling?
The first and easy step I did was to modify the int gpujpeg_image_calculate_size(struct gpujpeg_image_parameters* param) function. Then things start to get a little bit more complicated. So before I go on with this, I wanted to ask whether this is feasible to do, or maybe not so much?
If yes, could you please provide some guidelines? Thanks a lot!
Alex Kh.

Merge decoder-gltex branch into master?

Is there a reason that decoder-gltex branch is maintained as a separate branch? The use of streams for encoding/decoding helps immensely when GPUJPEG is not the exclusive user of the GPU.

Can not get the Decoding time to under 278ms

Hi, thanks for the interesting project.

I was comparing my results with the benchmark in the README and I have a hard time getting below ~278 ms decode time. I am running my code on a 2070 Super GPU. I was wondering whether I am doing anything wrong, or whether the numbers in the benchmark refer only to the in-GPU processing?

Decode Image GPU: 23.01 ms (only in-GPU processing)
Decode Image: 278.78 ms

Thanks,
Sina

Potential leak ? Not Sure ...

Hi,

I am running gpujpeg to extract 4K frames from a video, but it seems I have a memory leak. Here is the code I am currently using in order to load an .rgb file and save it as .jpg. Initially I used a modified gpujpeg in order to load the images from a uint8_t*, but now I am using the unmodified gpujpeg and I have the same behavior.
Maybe I am missing something. Thanks for your feedback.

void save(int device_id, int quality, const char* output, const char* input, int w, int h)
{
    int flags = 0;

    struct gpujpeg_parameters param;
    struct gpujpeg_image_parameters param_image;
    struct gpujpeg_encoder* encoder;

    gpujpeg_set_default_parameters(&param);
    gpujpeg_image_set_default_parameters(&param_image);

    param.quality = quality;
    param.restart_interval = 16;
    param.interleaved = 0;
    param_image.color_space = GPUJPEG_NONE;

    memset(&param_image, 0, sizeof param_image);
    param_image.width = w;
    param_image.height = h;
    param_image.comp_count = 3;
    param_image.color_space = GPUJPEG_RGB;
    param_image.pixel_format = GPUJPEG_444_U8_P012;

    if (gpujpeg_init_device(device_id, flags) != 0)
        printf("[ERROR] Failed init gpugpeg device\n");

    encoder = gpujpeg_encoder_create(NULL);
    if (encoder == nullptr) {
        printf("[ERROR] Failed to create encoder!\n");
    }

    int image_size = gpujpeg_image_calculate_size(&param_image);
    uint8_t* image = NULL;
    if ( gpujpeg_image_load_from_file(input, &image, &image_size) != 0 ) {
        printf("[ERROR] Failed to load image [%s]!\n", input);
    }

    //if (gpujpeg_image_load_from_data(ptr, size) != 0) {
    //    printf("[ERROR] Failed to load data!\n");
    //}

    struct gpujpeg_encoder_input encoder_input;

    gpujpeg_encoder_input_set_image(&encoder_input, image);

    uint8_t* image_compressed = NULL;
    int image_compressed_size = 0;

    if (gpujpeg_encoder_encode(encoder, &param, &param_image, &encoder_input, &image_compressed, &image_compressed_size) != 0) {
        printf("[ERROR] Failed to encode image!\n");
    }

    if (gpujpeg_image_save_to_file(output, image_compressed, image_compressed_size) != 0) {
        printf("[ERROR] Failed to save image [%s]!\n", output);
    }

    gpujpeg_encoder_destroy(encoder);
}

BTW, the code I added in order to load from a pointer and not from a file was really small:

int
gpujpeg_image_load_from_data(uint8_t* image, int image_size)
{
    cudaMallocHost((void**)&image, image_size * sizeof(uint8_t));
    gpujpeg_cuda_check_error("Initialize CUDA host buffer", return -1);
    return 0;
}

Encoding using multiple GPUs

Hi and thanks for this awesome project.

Is there any way to do encoding on two GPUs simultaneously? I have a use case where I need to compress data to jpeg as quickly as possible, and using just a single GPU has turned out to be too slow (at least in my preliminary tests). I'm feeding images to GPUJPEG through a queue, so I would ideally like to have two threads, each with its own GPUJPEG encoder, grabbing data from the queue, then compressing and writing the image to disk.

example command output error

I used the following command but the output is not as expected, please help:
(the test.jpg is already put under the dir)

GPUJPEG-master$ ./gpujpeg --decode test.jpg test.rgb
gpujpeg [options] input.rgb output.jpg [input2.rgb output2.jpg ...]
   -h, --help             print help
   -v, --verbose          verbose output
   -D, --device           set cuda device id (default 0)
       --device-list      list cuda devices

   -s, --size             set input image size in pixels, e.g. 1920x1080
   -f, --pixel-format     set input/output image pixel format, one of the
                          following:
                          u8               422-u8-p1020
                          444-u8-p012      422-u8-p0p1p2
                          444-u8-p012z     420-u8-p0p1p2
                          444-u8-p0p1p2

   -c, --colorspace       set input/output image colorspace, e.g. rgb,
                          ycbcr, ycbcr-jpeg, ycbcr-bt601, ycbcr-bt709

   -q, --quality          set JPEG encoder quality level 0-100 (default 75)
   -r, --restart          set JPEG encoder restart interval (default 8)
       --subsampled       set JPEG encoder to use chroma subsampling
   -i  --interleaved      set JPEG encoder to use interleaved stream
   -g  --segment-info     set JPEG encoder to use segment info in stream
                          for fast decoding

   -e, --encode           perform JPEG encoding
   -d, --decode           perform JPEG decoding
       --convert          convert input image to output image (change
                          color space and/or sampling factor)
       --component-range  show samples range for each component in image

   -n  --iterate          perform encoding/decoding in specified number of
                          iterations for each image
   -o  --use-opengl       use an OpenGL texture as input/output
   -I  --info             print JPEG file info
   -R  --rgb              create RGB JPEG

SOF0 marker component 0 id should be 82 but 1 was presented

I am facing an issue in my code, but I have difficulty understanding what triggers it.

I am just trying to decompress a JPG; I have various information regarding the texture:

[INFO] Size : 7077888
[INFO] Compressed : 1 - Size : 1755787 <- std::vector<uint8_t>
[INFO] Width : 1536
[INFO] Height : 1536
[INFO] Channels : 3
[INFO] Format : RGB8

But for some reason I received this error:
[GPUJPEG] [Error] SOF0 marker component 0 id should be 82 but 1 was presented!
[GPUJPEG] [Error] Decoder failed when decoding image data!

Can you tell me when this message can be triggered?

Thanks

Great library but I would like to add a method.

Could you tell me how I can use your library to convert pixels from YUV 420 to RGB on the GPU?

I'm programming in Golang and can use the .so lib and .h file.

I have big problems converting YUV 420 to RGB.

I'm using this C function:

image load_image_from_memory_stb_v6(char *puc_y, char *puc_u, char *puc_v, int width_y, int height_y)
{
    image im = make_image(width_y, height_y, 3);
    int R,G,B,Y,U,V;
    int nWidth = width_y >> 1;
    int y,x;
    int pix = 0;
    for(y=0; y < height_y; y++)
    {
        for(x=0; x < width_y; x++)
        {
            Y = (uint8_t)*(puc_y + y*width_y + x);
            U = (uint8_t)*(puc_u + (y >> 1)*nWidth + (x>>1));
            V = (uint8_t)*(puc_v + (y >> 1)*nWidth + (x>>1));
    
            R = Y + 1.402*(V-128);
            G = Y - 0.34414*(U-128) - 0.71414*(V-128);
            B = Y + 1.772*(U-128);
            if(R > 255) R = 255;
            if(R < 0) R = 0;
            if(G > 255) G = 255;
            if(G < 0) G = 0;
            if(B > 255) B = 255;
            if(B < 0) B = 0;
            if (width_y*width_y*3 < width_y*height_y*2+pix) {
                printf("Cannot load image from memory empty buff %u < %u \n", width_y*width_y*3 , width_y*height_y*2+pix);
                exit(0);
            }                                                                                                                    
            im.data[pix] = ((float)R)/255.;   //R                                                                                
            im.data[width_y*height_y+pix] = ((float)G)/255.;   //G                                                               
            im.data[width_y*height_y*2+pix] = ((float)B)/255.;   //B                                                             
            pix++;
        }
    }
    return im;
}

It is really slow, 30-40 ms per 1080p frame.

I see that you have this implementation on the GPU,
but I cannot understand how to use it with my data; could you help me with this?
Thank you for your work.

Multiple GPU Issue

Hi @MartinPulec,

I am facing an issue with multiple GPUs: I want to have two processes using GPUJPEG, each of them using one of my GPUs.

Here is my current setup:

OSX 10.13.6
There are 2 devices supporting CUDA:

Device #0: "GeForce GTX 1080 Ti"
  Compute capability: 6.1
  Total amount of global memory: 11534144 kB
  Total amount of constant memory: 64 kB
  Total amount of shared memory per block: 48 kB
  Total number of registers available per block: 65536
  Multiprocessors: 28

Device #1: "GeForce GTX 1080"
  Compute capability: 6.1
  Total amount of global memory: 8388416 kB
  Total amount of constant memory: 64 kB
  Total amount of shared memory per block: 48 kB
  Total number of registers available per block: 65536
  Multiprocessors: 20
CUDA driver version:   10.1
CUDA runtime version:  10.1
Using Device #0:       GeForce GTX 1080 Ti (c.c. 6.1)

As you can see, your function gpujpeg_print_devices_info is working, but if I call gpujpeg_init_device(0) and gpujpeg_init_device(1) in different processes, it fails with this error:

[GPUJPEG] [Error]  src/gpujpeg_common.c (line 98): Cannot get number of CUDA devices: no CUDA-capable device is detected.

Also, if I use only one process with GPU ID 1, it seems to be running on 0.

Thanks

Tony

Compiling Error on Ubuntu 16.04

Under Ubuntu 16.04, I get this error when running the make command.

/usr/include/string.h: In function ‘void* __mempcpy_inline(void*, const void*, size_t)’:
/usr/include/string.h:652:42: error: ‘memcpy’ was not declared in this scope
   return (char *) memcpy (__dest, __src, __n) + __n;
                                          ^
Makefile:1356: recipe for target 'src/gpujpeg_huffman_gpu_encoder.cu.o' failed
make[2]: *** [src/gpujpeg_huffman_gpu_encoder.cu.o] Error 1
make[2]: Leaving directory '/home/andrew/git_code/GPUJPEG'
Makefile:887: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/andrew/git_code/GPUJPEG'
Makefile:491: recipe for target 'all' failed
make: *** [all] Error 2

gpujpeg_common.c (line 878): Coder component copy: invalid argument

Hi, I'm trying to compress an image obtained from the camera using OpenCV. The camera is a RealSense, so I have to use the RealSense API, though that is not important. The data is RGB.

The code looks a little like this:

// encoder
    cudaStream_t stream;
    GPUJPEG_CHECK(cudaStreamCreate(&stream), return -1);
    struct gpujpeg_parameters param;
    gpujpeg_set_default_parameters(&param);
    param.quality = 80;
    param.restart_interval = 16;
    param.interleaved = 1;

    struct gpujpeg_image_parameters param_image;
    gpujpeg_image_set_default_parameters(&param_image);
    param_image.width = 1280;
    param_image.height = 720;
    param_image.comp_count = 3;
    param_image.color_space = GPUJPEG_RGB;
    param_image.pixel_format = GPUJPEG_444_U8_P012;

    gpujpeg_parameters_chroma_subsampling_420(&param);

//    init device
    if ( gpujpeg_init_device(0, 0) )
        return -1;

//    init encoder
    struct gpujpeg_encoder* encoder = gpujpeg_encoder_create(&stream);
    if ( encoder == NULL )
        return -1;

    int image_size = gpujpeg_image_calculate_size(&param_image);
    struct gpujpeg_encoder_input encoder_input;

    while (waitKey(1) < 0 && getWindowProperty(window_name, WND_PROP_AUTOSIZE) >= 0)
    {
        rs2::frameset data = pipe.wait_for_frames(); // Wait for next set of frames from the camera
        rs2::frame depth = data.get_depth_frame().apply_filter(color_map);
        rs2::frame color = data.get_color_frame();

        // Query frame size (width and height)
        const int w = color.as<rs2::video_frame>().get_width();
        const int h = color.as<rs2::video_frame>().get_height();

        // Create OpenCV matrix of size (w,h) from the colorized depth data
        Mat image(Size(w, h), CV_8UC3, (void*)color.get_data(), Mat::AUTO_STEP);
        uint8_t* image_data = NULL;
        image_data = reinterpret_cast<uint8_t*>(const_cast<void *>(color.get_data()));
        gpujpeg_encoder_input_set_image(&encoder_input, image_data);
        uint8_t* image_compressed = NULL;
        int image_compressed_size = 0;

        if ( gpujpeg_encoder_encode(encoder, &param, &param_image, &encoder_input, &image_compressed, &image_compressed_size) != 0 )
            return -1;


        cv::cvtColor(image, image, CV_BGR2RGB);

        // Update the window with new data
        imshow(window_name, image);
        gpujpeg_image_destroy(image_data);
        gpujpeg_image_destroy(image_compressed);
    }
    gpujpeg_encoder_destroy(encoder);
    return EXIT_SUCCESS;

The full code

I am able to show the video using OpenCV, but the compression code is not working. I think maybe there is some error in my config. I have tried many options, but it is not working. Can you help me get the encoder running? If you need me to run anything to help you reproduce, please feel free to ask.

I have some other questions regarding your library:

  1. I see that you suggest using gpujpeg_encoder_input_set_image over cudaMallocHost in a PR.
    If the data is from a CPU buffer, how can the GPU use the data to do the encoding?

  2. What does param.restart_interval mean?

  3. Is there any documentation to help understand your code?

  4. I have a picture; I used the command gpujpeg --encode --size=1920x1080 --quality=80 2k_wild.ppm output_test.jpg and it produces a weird image. Its color channels look like they were converted from RGB to BGR. The left padding looks like it got a piece of the picture from the right side of the image.

gpujpeg --decode output_test.jpg raw.rgb produces a file that I cannot open. I hope you can help me explain how to fix it.

How to reduce encoded jpeg size

Hi,

I found that, using the same-size raw file, GPUJPEG can produce a file two times larger than ffmpeg at the same quality option.

my input command is
--encode --verbose --pixel-format=420-u8-p0p1p2 --quality=100 --size=1440x1080 avatar.r avatar.jpg

GPUJPEG vs NVJPEG

Hello @MartinPulec, how are you?

I am going back to playing with JPEG over the GPU and I wanted to compare performance between different options.

I generated a random texture and ran some tests for various resolutions from 512x512 to 8192x8192. The decoding for such an image is faster with GPUJPEG:
(The time includes some IO, not just GPU, but for both, and the time is measured after a first run for warmup.)

 GPU JPEG TOTAL TIME '8192' : 911.242 [ms]
 NV JPEG TOTAL TIME '8192' : 1608.21 [ms]

But when using a portrait image, so typically a face with a uniform background, my results change to the opposite:

[WARNING] GPU JPEG TOTAL TIME '8192' : 620.884 [ms]
[WARNING] NV JPEG TOTAL TIME '8192' : 441.444 [ms]

Different number of segment using gpujpeg_encoder_input_set_gpu_image

Hi @MartinPulec

I am currently working with torch tensors (CUDA) and I use them with GPUJPEG in order to save the JPEG.

Your API seems to be easy to use, just calling gpujpeg_encoder_input_set_gpu_image with the GPU memory pointer.

It works, I think; at least the JPEG looks correct, but surprisingly the number of segments changes:

If I take the tensor back to the CPU and save it with gpujpeg, my number of segments is 10800; if I use the GPU tensor directly, it is 2700.

Also I can see a difference in the size of the file:
CPU version: 421 KB
GPU version: 392 KB

Any idea why this difference? I was suspecting a problem in my param.quality, but it seems to be the same.

Thanks

Tony

Error: CUDA driver version is insufficient for CUDA runtime version

Hi Martin,

I'm trying to use GPUJPEG on my MacBook Pro.
I made it a static lib which is linked with the CUDA libraries statically.
Once I try to run the code I get this error:

[GPUJPEG] [Error] /gpujpeg/src/gpujpeg_common.cpp (line 122): Cannot get number of CUDA devices!: CUDA driver version is insufficient for CUDA runtime version.

any idea?

My Configuration is:

MacBook Pro (Retina, 15-inch, Mid 2014)
MacOS v10.11.6
Processor: 2.5GHz Intel Core i7
Memory: 16GB
Graphics: NVIDIA GeForce GT 750M 2048MB (PCIe)
Intel Iris Pro 1536MB (Built-In)
CUDA Driver v8.0.63
NVIDIA GPU Driver v10.10.14 310.42.25f02

Floating point exception (core dumped)

When building a recent master branch from source under Linux, I was able to convert a small 16x16 image from .jpg into .rgb, but cannot convert the produced image (created by gpujpeg itself) from .rgb back into .jpg. The error messages are:

audrius@leo:~/catkin_ws/src/gpujpeg/GPUJPEG$ ./gpujpeg  sv.rgb sv.jpg -e
CUDA driver version:   9.1
CUDA runtime version:  9.0
Using Device #0:       GeForce GTX 1080 Ti (c.c. 6.1)

Encoding Image [sv.rgb]
Load Image:                0.43 ms
Floating point exception (core dumped)

gdb provides the following information

[New LWP 29663]
[New LWP 29678]
[New LWP 29677]
[New LWP 29679]
Core was generated by `/home/audrius/catkin_ws/src/gpujpeg/GPUJPEG/.libs/lt-gpujpeg sv.rgb sv.jpg -e'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x00007f19896d1dbb in ?? ()
[Current thread is 1 (LWP 29663)]

JPEG data unexpected ended while reading SOS marker!

  1. Hi Martin, it's me again.

I am using Python's ctypes library to import functions into Python, and I get the error "JPEG data unexpected ended while reading SOS marker!".

My goal is to do the encode and decode operations in C++ while passing the data using Python. The data format in Python that I chose to save the image data from C++ is a numpy array.

I think the encode operation is OK for now because I can see the saved jpg file. However, when passing data from the numpy array to the decoder, I encounter this error.

My file
Python
C++ File

The command to run the file:

nvcc -Xcompiler -fPIC -shared -o libdecode.so decode.cpp -lgpujpeg `pkg-config --cflags --libs opencv`
python test.py

The test image

  1. I have another question about the performance.
    In this issue, the author uses a Jetson Nano to encode and decode an image of size 2MB with performance around 6ms.
    I am using a Jetson Nano to run the code to encode and decode a 1920x1080 image (raw size 6MB) but get performance around 40-60ms, which is slower than doing the decode on the CPU. I have tweaked param.restart_interval with 8 and 16, but still the speed is not improved.
    The question is: is that speed correct for the Jetson Nano platform with my current config?
    Thanks.

Speed degradation

Thank you for the hard work developing and maintaining the project.

I used the library several years ago, and decided to upgrade to current master yesterday.
To my surprise I found that encoding has become ~30% slower compared to the previous version.
(I use the same 1080Ti board and can test previous and new versions side by side).

After a quick check of the source code, I see that quite a lot of initializations/allocations are now included in gpujpeg_encoder_encode.

For example, one of the heaviest functions is gpujpeg_coder_init_image. A huge part of this function executes on each call to gpujpeg_encoder_encode, even if all of the encoder's buffers were properly pre-allocated.

Moreover, if the user wants to pre-allocate the internals of the encoder (to avoid re-allocations in gpujpeg_encoder_encode) by using the function gpujpeg_encoder_allocate, it doesn't improve anything.

That is because gpujpeg_encoder_allocate completely ignores the input image size supplied by the user. Instead it just tries to guess the width and height from some abstract number of pixels:

    // Set image size from pixels
    struct gpujpeg_image_parameters tmp_param_image;
    tmp_param_image = *param_image;
    tmp_param_image.width = (int) sqrt((float) pixels);
    tmp_param_image.height = (pixels + tmp_param_image.width - 1) / tmp_param_image.width;

Naturally, after this kind of "pre-allocation" for an incorrect image size, the internal buffers of the encoder must be re-allocated properly once gpujpeg_encoder_encode is called.

So basically, there is now no way to make all the pre-allocations beforehand and then just supply the input image to the encoder for (fast) compression, as it was before.

This is a (huge) speed regression from the lightweight architecture of older versions of GPUJPEG.
Please consider giving the user the possibility to pre-allocate everything once (e.g. on encoder creation).

By simply changing the code of gpujpeg_encoder_allocate (to use the user's image size) and by removing gpujpeg_coder_init_image from gpujpeg_encoder_encode, I got an immediate speed boost of ~15-20%.
Still slower than the previous version(s), but this clearly demonstrates that the current approach leads to speed degradation.

Please make gpujpeg_encoder_encode as lightweight as possible, with all the pre-allocations done at the time the encoder is created (as it was before).

[GPUJPEG] [Error] src/gpujpeg_common.c (line 361): Device info getting: invalid argument.

I used your library to compress bitmap24 images in a real-time application, so it is possible to send 4K images at 30 fps over TCP to a server. For this, I wrote a JPEG encoder function which uses your API by default. Before image compression, a processing algorithm implemented in CUDA is executed on the GPU, therefore the GPU is already initialized. When I call the function gpujpeg_init_device(device_id, 0) I get the error message: GPU device is busy.

So I removed this function call and the image compression now works fine, except that now the error message described in the title appears. Since the images are compressed correctly, I would like to suppress this error message.

Is it necessary to change the source code of your library on my device? Or can the error message be prevented in some other way?

The source code of the encoder function is shown below:
void CudaBase::encodeBmpToJpeg(unsigned char* idata, unsigned char* odata, int* p_jpeg_size, int width, int height)
{
static int image_num = -1;
char image_dir[256];
cudaStream_t stream;
uint8_t* image = NULL;

CUDA_CHECK(cudaStreamCreate(&stream));
image_num++;
struct gpujpeg_encoder_input encoder_input;
struct gpujpeg_parameters param;
struct gpujpeg_image_parameters param_image;
struct gpujpeg_encoder* encoder = gpujpeg_encoder_create(&stream);

sprintf(image_dir, "/home/niclas/SoftwareProjekte/Cuda/PerformanceComparsion/results/img/streaming/jpeg/CH%d_%d.jpg", image_num%4, image_num/4);
gpujpeg_set_default_parameters(&param);
param.quality = 80;
param.restart_interval = 16;
param.interleaved = 1;

gpujpeg_image_set_default_parameters(&param_image);
param_image.width = width;
param_image.height = height;
param_image.comp_count = 3;
param_image.color_space = GPUJPEG_RGB;
param_image.pixel_format = GPUJPEG_444_U8_P012;

// Use default sampling factors
gpujpeg_parameters_chroma_subsampling_422(&param);
if ( encoder == NULL )
	encoder = gpujpeg_encoder_create(&stream);

gpujpeg_encoder_input_set_image(&encoder_input, idata);
gpujpeg_encoder_encode(encoder, &param, &param_image, &encoder_input, &image, p_jpeg_size);
gpujpeg_image_save_to_file(image_dir, image, *p_jpeg_size);

odata = (unsigned char*) malloc(*p_jpeg_size);
memcpy(odata, image, *p_jpeg_size);

CUDA_CHECK(cudaStreamDestroy(stream));
gpujpeg_image_destroy(image);
gpujpeg_encoder_destroy(encoder);

}

GPUJPEG vs NVJPEG

Hi @MartinPulec, how are you?
I am back playing with JPEG over the GPU, and I wanted to compare the timing between nvJPEG and GPUJPEG.

I first did a run using a completely random image at various resolutions from 512x512 to 8192x8192; in this case my timing is better using GPUJPEG:

GPU JPEG TOTAL TIME '8192' : 911.242 [ms]
NV JPEG TOTAL TIME '8192' : 1608.21 [ms]

But if I use a more standard JPG, a portrait face with uniform background, the timing changes to the opposite:

GPU JPEG TOTAL TIME '8192' : 620.884 [ms]
NV JPEG TOTAL TIME '8192' : 441.444 [ms]

Do you have any idea why that is?

% ../install/bin/gpujpeg -I nvjpeg_encode_4096.jpg
GPUJPEG rev 2842002
width: 4096
height: 4096
component count: 3
color space: YCbCr BT.601 256 Levels (YCbCr JPEG)
internal representation: 444-u8-p012
segment count: 1

% ../install/bin/gpujpeg -I gpujpeg_encode_4096.jpg
GPUJPEG rev 2842002
width: 4096
height: 4096
component count: 3
color space: YCbCr BT.601 256 Levels (YCbCr JPEG)
internal representation: 444-u8-p012
segment count: 16384

Ubuntu 20.04 : stack smashing detected

Hi @MartinPulec,

Running gpujpeg in various configurations, I am facing a weird issue with GPUJPEG on Ubuntu 20.04.

The call to either of these two functions triggers a crash:

        _decoder = gpujpeg_decoder_create(0);
        _encoder = gpujpeg_encoder_create(0);
*** stack smashing detected ***: terminated
Aborted (core dumped)

[EDIT] I rebuilt gpujpeg adding these flags:

++SET(CMAKE_C_FLAGS "-fPIC -fno-stack-protector")
++SET(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} --compiler-options=-fno-stack-protector")

I no longer get the crash, but maybe I am just hiding the problem :(

JPEG decode error when using CPU-based huffman decoder

Hi,
It seems there is a bug in the CPU-based Huffman decoding procedure.
When I decode a JPEG file like this one: bird.jpg [image attached]

the output .rgb file is not correct with the CPU-based Huffman decoder (without restart markers):
gpujpeg -f 444-u8-p012 -v -d original.jpg /tmp/test_out.rgb
[screenshot of the incorrect output attached]

After putting restart markers by jpegtran (jpegtran -restart 1b original.jpg >original_rst_1b.jpg)

The decode result becomes correct:
gpujpeg -f 444-u8-p012 -v -d original_rst_1b.jpg /tmp/test_out.rgb
[screenshot of the correct output attached]

I use the following script to show the .rgb file in a Python notebook:

import numpy as np
from matplotlib.pyplot import imshow
import os
def read_rgb_444i(fname, width,height):
    with open(fname, 'rb') as f:
        content = f.read()
    assert(width*height*3 <= len(content))
    image = np.zeros(width * height * 3, dtype=np.uint8)  # exactly w*h*3 bytes so the reshape below works
    image = image.reshape((height, width, 3))
    count = 0
    for h in range(height):
        for w in range(width):
            for ch in range(3):
                image[h][w][ch] = content[count]
                count +=1
    return image
img = read_rgb_444i("/tmp/test_out.rgb", 500,335)
imshow(img)

Please correct me if I'm wrong.

How to use BT.709

I have RGB raw data, and I would like to generate a JPG.

But the range is incorrect when I encode the frame, because I need to use raw images that are not full range (BT.709).

I tried various parameters but I am not able to get the correct colors in the JPG.

When I load the raw RGB (raw24), how can I specify not to use the full range?

Here is the command I am using to extract the raw RGB, and then the GPUJPEG command for just the standard conversion:

$ ffmpeg -i test.mp4 -frames:v 100 -pix_fmt rgb24 -color_primaries bt709 -t 10 %05d.raw
$ for i in {00001..00100}; do gpujpeg -e -g -q 100 -s 3072x3072 -R $i.raw $i.jpg; done

Question about performance

When I run the tester to test decoding performance, something like the following showed up:

Decoding Image [input.jpg]
Load Image: 0.28 ms
[GPUJPEG] [Warning] JPEG data contains not supported APP1 marker
[GPUJPEG] [Warning] JPEG data contains not supported APP2 marker
[GPUJPEG] [Warning] JPEG data contains not supported APP2 marker
Perform huffman decoding on GPU!
Decode Image GPU: 6.36 ms (only in-GPU processing)
Decode Image: 37.80 ms

The "Decode Image" time is almost 6x longer than "Decode Image GPU".

Why does this happen? What does "Decode Image" mean compared to the one with "GPU"?

Help with GPUJPEG and Torch Tensor

Hi @MartinPulec,

I am facing an issue using GPUJPEG and torch: I want to copy the decoded output data of the JPG to a CUDA tensor. I implemented a copy function in CUDA in order to copy the decoded data directly to the tensor.
My CUDA code splits the three channels and normalizes them. I have a tool to convert the tensor into an image to check if it is working. If I put some solid color it is OK, but if I use &coder->d_data_raw[0] as the source, everything seems to be working fine except for one part of the image.

I am checking whether I made a mistake in my code, but I would like to know if I am using the correct approach. I am using coder->d_data_raw in order to access the GPU memory, and I am assuming that the pointer contains the data as RGBRGBRGB with the same width and height. Am I correct?

Original image saved with GPUJPEG: [image attached]

After loading into a tensor and saving again: [image attached]

gpujpeg_decoder_decode hanging.

Hi @MartinPulec ,

I am facing another new issue; it could be trickier to reproduce because it happens only with a specific configuration. I am using SWIG to generate a Python interface for GPUJPEG.

One of my functions does a decode from a numpy array (JPEG data) to a torch tensor.

bool JPGUtil::loadFromMemory(int64_t tensorptr, const uint32_t csize, uint8_t* ptr)
{
    if (_decoder == nullptr) {
        Log::error("JPGUtil::load(): invalid decoder.\n");
        return false;
    }

    Log::warning("AL --> loadFromMemory()\n");
    
    gpujpeg_set_device(_gpu_id);

    gpujpeg_decoder_set_output_format(_decoder, _param_image.color_space, _param_image.pixel_format);

    // set decoder default output destination
    struct gpujpeg_decoder_output decoder_output;
    gpujpeg_decoder_output_set_custom_cuda(&decoder_output, (uint8_t*)tensorptr);

    Log::warning("AL --> loadFromMemory() decode %d\n", csize);
    
    if (gpujpeg_decoder_decode(_decoder, ptr, csize, &decoder_output) != 0) {
        return false;
    }

    Log::warning("AL --> loadFromMemory() finish\n");
    
    return true;
}

Sometimes (not all the time) it seems that gpujpeg_decoder_decode blocks and hangs; I see my GPU process going crazy and my code just gets stuck there.
Do you have any idea why the decode function could go into some kind of infinite loop?

Weird crash with 1 channel texture

Hi @MartinPulec, I am facing an issue with GPUJPEG,

I have my own class Image using its own CUDA pointer; to create the pointer I use cudaMalloc and the size of the memory is W * H * C.

When using 1 channel I get some weird crashes coming from GPUJPEG. I feel I may not be using the API correctly, but here is the function I am using for decoding into my own CUDA memory.

For 1 channel my param_image options are:
_param_image.pixel_format = GPUJPEG_U8;
_param_image.color_space = GPUJPEG_RGB;

I attach the code I am using; I had a feeling it was a sync problem, but after adding cudaDeviceSynchronize I am still having the issue.

bool load(int64_t tensorptr, const std::string& filename)
{
    if (_decoder == nullptr) {
        Log::error("JPGUtil::load(): invalid decoder.\n");
        return false;
    }
    
   cudaSetDevice(_gpu_id);

    gpujpeg_decoder_set_output_format(_decoder, _param_image.color_space, _param_image.pixel_format);

    // load image
    uint8_t* input_image = NULL;
    int input_image_size = 0;
    if (gpujpeg_image_load_from_file(filename.c_str(), &input_image, &input_image_size) != 0) {
        return false;
    }

    // set decoder default output destination
    struct gpujpeg_decoder_output decoder_output;
    gpujpeg_decoder_output_set_custom_cuda(&decoder_output, (uint8_t*)tensorptr);

    if (gpujpeg_decoder_decode(_decoder, input_image, input_image_size, &decoder_output) != 0) {
        return false;
    }
  
    // May be better to synchronize
    cudaDeviceSynchronize();

    gpujpeg_image_destroy(input_image);
    return true;
}

Cannot get number of CUDA devices: CUDA driver version is insufficient for CUDA runtime version.

Hello Guys,
I compiled inside a container (docker-nvidia) with CUDA 10.2. I confirmed the runtime is 10.2, as well as the headers residing in the classical place:

lrwxrwxrwx 1 root root 50 Jun 27 05:32 /usr/local/cuda/include -> /usr/local/cuda-10.2/targets/aarch64-linux/include

I overrode some build directives because CMake was not able to determine my CUDA architecture and preferences for OpenGL:

cmake -D CMAKE_CUDA_ARCHITECTURES=62 -DOpenGL_GL_PREFERENCE=GLVND ..

While trying the freshly built gpujpeg, I was surprised to see this:

root@cec5af656a1e:/data/GPUJPEG/build# ./gpujpeg --device-list 
[GPUJPEG] [Error] /data/GPUJPEG/src/gpujpeg_common.c (line 127): Cannot get number of CUDA devices: CUDA driver version is insufficient for CUDA runtime version.
[GPUJPEG] [Error] /data/GPUJPEG/src/gpujpeg_common.c (line 127): 

From the container perspective I have enabled the --privileged flag, but that is not relevant (I guess) for this specific problem.

I used the master branch for compilation and I already made it work with a previous version of GPUJPEG/CUDA (10.0) on the same architecture (TX2).

Any advice?

Is there GPUJPEG_BGR color space support?

Hi,

I have a BGR input image instead of an RGB input. Can I set the color_space for JPEG encoding to "GPUJPEG_BGR" instead of "GPUJPEG_RGB"? If not, where should I start modifying to add support for BGR?

Right now, if I pass a BGR image and say it is "GPUJPEG_RGB", it compresses fine, but when I decompress it back the colors are wrong when displayed.

Thanks,
Keerthi

need help with poor performance

GPUJPEG is a great library and I would like to use it in my BMP->JPEG routine.
When tested with the following code, the performance is not as expected:

int RGB888_To_JPEG_GPU(unsigned char *src, unsigned char **dst,
		unsigned int width, unsigned int height, int quality)
{
    // Default coder parameters
    struct gpujpeg_parameters param;
    gpujpeg_set_default_parameters(&param);
    param.quality = 75;
    param.segment_info = 0 ;
    param.interleaved = 0;
    param.verbose = 0;
    param.restart_interval = 4;
    param.color_space_internal = GPUJPEG_YCBCR_JPEG;

    // Default image parameters
    struct gpujpeg_image_parameters param_image;
    gpujpeg_image_set_default_parameters(&param_image);
    param_image.width = width;
    param_image.height = height;
    param_image.color_space = GPUJPEG_RGB;
    param_image.pixel_format = GPUJPEG_444_U8_P012;
    param_image.comp_count = 3;

    // Other parameters
    int device_id = 0;

    // Flags
    int restart_interval_default = 1; //GPU parallel!!!!!!!!!!!!!!!!!
    int chroma_subsampled = 0;

    int rc;

    if ( gpujpeg_init_device(device_id, 0) != 0 )
        return -1;

#if 1 //todo judge? 
    // Adjust restart interval
    if ( restart_interval_default == 1 ) {
        // when chroma subsampling and interleaving is enabled, the restart interval should be smaller
        if ( chroma_subsampled == 1 && param.interleaved == 1 ) {
            param.restart_interval = 2;
        }
        else {
            // Adjust according to Mpix count
            double coefficient = ((double)param_image.width * param_image.height * param_image.comp_count) / (1000000.0 * 3.0);
            if ( coefficient < 1.0 ) {
                param.restart_interval = 4;
            } else if ( coefficient < 3.0 ) {
                param.restart_interval = 8;
            } else if ( coefficient < 9.0 ) {
                param.restart_interval = 10;
            } else {
                param.restart_interval = 12;
            }
        }
        if ( param.verbose ) {
            printf("Auto-adjusting restart interval to %d for better performance.", param.restart_interval);
        }
    }
#endif


    // Create encoder
    struct gpujpeg_encoder* encoder = gpujpeg_encoder_create(NULL);
    if ( encoder == NULL ) {
        LOGE(TAG, "Failed to create encoder!");
        return -1;
    }
    
    // Load image
    uint8_t* image = NULL;
    int image_size = 0;
    cudaMallocHost((void**)&image, width*height*3);
    memcpy(image, src, width*height*3);

    // Prepare encoder input
    struct gpujpeg_encoder_input encoder_input;
    gpujpeg_encoder_input_set_image(&encoder_input, image);

    // Encode image
    uint8_t* image_compressed = NULL;
    int image_compressed_size = 0;
    rc = gpujpeg_encoder_encode(encoder, &param, &param_image, &encoder_input, &image_compressed, &image_compressed_size);
    if ( rc != GPUJPEG_NOERR ) {
        if ( rc == GPUJPEG_ERR_WRONG_SUBSAMPLING ) {
            LOGE(TAG, "Consider using '--subsampling' optionn");
        }
        LOGE(TAG, "Failed to encode image !");
        return -1;
    }


    *dst = (unsigned char*)malloc(image_compressed_size);
    memcpy(*dst, image_compressed, image_compressed_size);

    // Destroy image
    gpujpeg_image_destroy(image); //call cudaFreeHost

    // Destroy encoder
    gpujpeg_encoder_destroy(encoder);

    return image_compressed_size;
}


int BMP_To_JPEG_GPU(unsigned char *src, unsigned char **dst,
		unsigned int width, unsigned int height, int quality)
{
    return RGB888_To_JPEG_GPU(src + 54, dst, width, height, quality);
}

BMP_To_JPEG_GPU and RGB888_To_JPEG_GPU are the APIs to be exposed to the user; the user test code is as follows:

void gpu_jpeg_test()
{

	printf("test start\n");
	char *filename="1.bmp";
	char *buffer;
	int size;
	File_Read_Binary(filename, &buffer, &size); //read bmp to buffer with 2592x2048

	char *dst;
	int res;
	while(1) {
		res = BMP_To_JPEG_GPU(buffer, &dst, 2592, 2048, 75);
		printf("."); fflush(stdout);
		free(dst);
	}

	printf("test end\n");
}

The result shows the speed is around 3 fps with 89% CPU usage, which is much slower than libjpeg-turbo (11 fps) on the same platform.

Could you help point out some places that could help accelerate the conversion? Thanks very much!

Runtime Error on Jetson TX2

Hey,

I installed GPUJPEG on a Jetson TX2 board. However, when I try to use the library in my own code the following error is thrown:

[GPUJPEG] [Error] Failed to initialize CUDA device.
terminate called after throwing an instance of 'std::runtime_error'
  what():  
Aborted (core dumped)

The cuda samples are running perfectly fine, so CUDA should be set up correctly. Any suggestions?

Cannot reproduce the same performance

Hi, I am using gpujpeg to load images larger than 4K (6144x3456), but my timings are not great on my 1080 Ti: 6144x3456 takes 241 ms.
I tried a smaller resolution, but it is still not great: 2160x2160 takes 62 ms.

I have a feeling I am missing something, because your benchmarks give much better results. The JPEGs I am using were extracted from a video using OpenCV (with quality 80).
I am using OpenGL to render the frames. All the code is wrapped with SWIG so it can be used from Python; I checked the overhead there and it is minimal.
The wrapper only takes the filename and returns the GL texture id. As an optimization, I create param_image only for the first frame and then reuse it for the other frames of the video.

Should I do something different regarding how the JPEG is encoded?
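
Two things commonly affect decode timings in a setup like this. First, JPEGs written by OpenCV typically contain no restart markers, so the Huffman decoding cannot run in parallel on the GPU and falls back to the CPU. Second, the decoder, like the encoder, should be created once and reused across frames. A rough sketch of reusing one decoder is shown below; read_file is a hypothetical helper, and passing NULL to gpujpeg_decoder_create assumes a GPUJPEG version whose create function takes a CUDA stream argument:

#include <libgpujpeg/gpujpeg.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper that reads a whole file into memory and returns its size.
uint8_t* read_file(const char* path, size_t* size);

int decode_frames(const char* const* paths, int count)
{
    // NULL selects the default CUDA stream (older versions take no argument).
    struct gpujpeg_decoder* decoder = gpujpeg_decoder_create(NULL);
    if ( decoder == NULL )
        return -1;

    struct gpujpeg_decoder_output decoder_output;
    gpujpeg_decoder_output_set_default(&decoder_output);

    for ( int i = 0; i < count; i++ ) {
        size_t size = 0;
        uint8_t* data = read_file(paths[i], &size);
        if ( gpujpeg_decoder_decode(decoder, data, size, &decoder_output) != 0 ) {
            free(data);
            break;
        }
        // decoder_output.data / decoder_output.data_size now hold the raw frame.
        free(data);
    }
    gpujpeg_decoder_destroy(decoder);
    return 0;
}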

OpenCv GpuMat to gpujpeg_encoder_input_set_gpu_image

Hello guys, and thanks for this helpful contribution!

I'm trying to pass a cv::cuda::GpuMat g_img to gpujpeg_encoder_input_set_gpu_image using g_img.data. No errors are raised by GPUJPEG; however, the image looks weird, as shown below:

(screenshot of the corrupted output omitted)

Please note: if I download the GpuMat g_img to the host into a regular cv::Mat h_img, allocate with cudaMalloc, and pass h_img.data back to gpujpeg_encoder_input_set_image, it works...

...but downloading to the host is costly, which reduces the benefit of this library and the performance of our pipeline, as everybody can understand.

The GpuMat is CV_8UC3, produced by demosaicing with COLOR_BayerBG2RGB plus a multiply (debayering a raw image).

Any advice is welcome!

Working on TX2 / CUDA 9 / OpenCV 4.0.1 / AArch64 with unified memory.
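
One frequent cause of output like this is that a cv::cuda::GpuMat is pitch-allocated: each row is padded to g_img.step bytes, whereas gpujpeg_encoder_input_set_gpu_image expects a tightly packed buffer. A hedged sketch (assuming a CV_8UC3 RGB GpuMat; pack_gpumat and d_packed are illustrative names) that packs the rows into a contiguous device buffer first:

#include <opencv2/core/cuda.hpp>
#include <cuda_runtime.h>
#include <libgpujpeg/gpujpeg.h>

// Pack a pitched GpuMat into a contiguous device buffer that GPUJPEG can consume.
uint8_t* pack_gpumat(const cv::cuda::GpuMat& g_img)
{
    const size_t row_bytes = g_img.cols * g_img.elemSize();   // 3 bytes/pixel for CV_8UC3
    uint8_t* d_packed = nullptr;
    cudaMalloc(&d_packed, row_bytes * g_img.rows);
    // Copy row by row, dropping the padding implied by g_img.step.
    cudaMemcpy2D(d_packed, row_bytes, g_img.data, g_img.step,
                 row_bytes, g_img.rows, cudaMemcpyDeviceToDevice);
    return d_packed;
}

// Usage sketch:
//   uint8_t* d_packed = pack_gpumat(g_img);
//   struct gpujpeg_encoder_input encoder_input;
//   gpujpeg_encoder_input_set_gpu_image(&encoder_input, d_packed);
//   ... gpujpeg_encoder_encode(...) ...
//   cudaFree(d_packed);

If g_img.isContinuous() already returns true (step equals cols * elemSize), the data pointer can be passed directly; it is also worth double-checking that the channel order produced by COLOR_BayerBG2RGB matches the color space and pixel format set in param_image.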

Build failure

Hello,

I would like to report that the build fails with the latest commits from yesterday, May 11th.

mv -f src/.deps/libgpujpeg_la-gpujpeg_decoder.Tpo src/.deps/libgpujpeg_la-gpujpeg_decoder.Plo
 /bin/bash ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.    -std=c99 -fPIC  -O2 -I. -I/usr/local/cuda/include -g -O2 -MT src/libgpujpeg_la-gpujpeg_encoder.lo -MD -MP -MF src/.deps/libgpujpeg_la-gpujpeg_encoder.Tpo -c -o src/libgpujpeg_la-gpujpeg_encoder.lo `test -f 'src/gpujpeg_encoder.c' || echo './'`src/gpujpeg_encoder.c
 libtool: compile:  gcc -DHAVE_CONFIG_H -I. -std=c99 -fPIC -O2 -I. -I/usr/local/cuda/include -g -O2 -MT src/libgpujpeg_la-gpujpeg_encoder.lo -MD -MP -MF src/.deps/libgpujpeg_la-gpujpeg_encoder.Tpo -c src/gpujpeg_encoder.c  -fPIC -DPIC -o src/.libs/libgpujpeg_la-gpujpeg_encoder.o
 In file included from src/gpujpeg_encoder.c:39:0:
 src/gpujpeg_huffman_gpu_encoder.h:34:10: fatal error: libgpujpeg/gpujpeg_encoder_internal.h: No such file or directory
  #include <libgpujpeg/gpujpeg_encoder_internal.h>
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 compilation terminated.
 Makefile:704: recipe for target 'src/libgpujpeg_la-gpujpeg_encoder.lo' failed
 make[1]: Leaving directory '/GPUJPEG'
 make[1]: *** [src/libgpujpeg_la-gpujpeg_encoder.lo] Error 1
 Makefile:839: recipe for target 'install-recursive' failed
 make: *** [install-recursive] Error 1

[Error] src/gpujpeg_common.cpp (line 159): Cannot get number of CUDA devices: operation not supported

./gpujpeg --decode ../poster_7088×10630.jpg output.rgb

[GPUJPEG] [Error] src/gpujpeg_common.cpp (line 159): Cannot get number of CUDA devices: operation not supported.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

lspci | grep -i VGA

19:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
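
The message comes from cudaGetDeviceCount() failing inside GPUJPEG rather than from GPUJPEG's own logic; a driver/runtime mismatch is a common cause. A minimal standalone check (plain CUDA runtime API, independent of GPUJPEG) that reports the same information:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if ( err != cudaSuccess ) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    printf("%d CUDA device(s), driver %d, runtime %d\n", count, driver, runtime);
    return 0;
}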

Fail to pre-allocate encoding

I am using GPUJPEG to do image compression.
I compress RGB images of many different sizes, and I found that I have to call gpujpeg_encoder_allocate every time before compressing an image, otherwise it fails the second time.
I call it as in the following code:

struct gpujpeg_encoder_input encoder_input;
gpujpeg_encoder_input_set_image(&encoder_input, pImageInput);

uint8_t* image_compressed = NULL;
int image_compressed_size = 0;

gpujpeg_encoder_allocate(m_jpegEncoder, &param, &param_image, GPUJPEG_ENCODER_INPUT_IMAGE, 16);
if ( gpujpeg_encoder_encode(m_jpegEncoder, &param, &param_image, &encoder_input, &image_compressed,
        &image_compressed_size) != 0 )
{
    return;
}

The program works well, but it often prints error messages in the console:

[GPUJPEG][Error] ..\src\gpujpeg_common.cpp (line 844): Coder component copy: invalid argument.
[GPUJPEG][Error] Fail to pre-allocate encoding!

The program still works despite this, but I want to know the reason for the error. I also think that without it the program would run faster.

Can anybody tell me the reason?
Thanks!

encoder failed on raw file: format:u8 size:2586x2048

Hi guys, I found a problem with some files. It seems to be related to the width of the file, because the exact same file type with a smaller width works.

Input: 8-bit (grayscale) 2586x2048, no padding.
The output JPEG is broken (the color is OK, but the width calculations are wrong).

I also tried the gpujpeg.exe tool that comes with the library, but it produces the same result file.
I'm using the latest version of gpujpeg and the latest NVIDIA CUDA libs (10.2) on an NVIDIA GeForce 730.

This is the JPEG produced by the library: (broken image "badpixels" omitted)

Here are the raw pixels: badpixels.zip

This is how the image should look (as in the GIMP raw image display): (reference image "correct" omitted)

Best Regards
Bruno

GPUJPEG failed with DRI marker, added glfw in windows for gpujpeg_opengl_init

Hi,

I am trying to run GPUJPEG on Windows with GLFW. The code can only be compiled as 32-bit in Visual Studio 2015 (on 64-bit Windows). It fails with the command line "-d Hutton_in_the_Forest_4K.jpg output.yuv --verbose -n 10", with the following result:

[GPUJPEG] [Error] gpujpeg_opengl_init not implemented in current OS! <<<< this is only a warning, since the library does not implement this on Windows; I manually added GLFW on Windows to bypass it.
CUDA driver version: 11.0
CUDA runtime version: 11.0
Using Device #0: GeForce GTX 1660 SUPER (c.c. 7.5)

Decoding Image [Hutton_in_the_Forest_4K.jpg]
Load Image: 4.50 ms
[GPUJPEG] [Warning] JPEG data contains not supported APP1 marker
[GPUJPEG] [Warning] APP13 marker (segment info) scan index should be 0 but 80 was presented! (marker not a segment info?)
[GPUJPEG] [Warning] JPEG data contains not supported APP1 marker
[GPUJPEG] [Warning] JPEG data contains not supported APP2 marker
[GPUJPEG] [Error] DRI marker can't redefine restart interval (16 to 480)!
This may be caused when more DRI markers are presented which is not supported!
[GPUJPEG] [Error] Decoder failed when decoding image data!
Failed to decode image [Hutton_in_the_Forest_4K.jpg]!
Press any key to continue . . .

The image is 3840x2160. Since it fails at the line "DRI marker can't redefine restart interval", it sounds like this was developed with Linux in mind. Is it possible to get this fixed for Windows?

Thanks

Hubert Yang

Distribution package requirements.

On my machine, I've installed the CUDA toolkit (version 8.0) to be able to build the gpujpeg project.
Now, in my application, I'm linking against gpujpeg.lib and depending on gpujpeg.dll, which in turn depends on cudart_32_80.dll. So I guess that, at a minimum, I have to include those DLLs in my package. Should I include anything else?

And what about the system requirements? Of course, an NVIDIA CUDA-enabled GPU is needed. Anything else?

Thanks!

YUV422 Interleaved to RGB conversion and then to JPEG

Hi,

I would like to convert the "YUV422 interleaved" format to RGB and then convert that to JPEG using this library. Where should I start to add a function that converts YUV422 to RGB?

YUV422 interleaved is explained in section 3.1.5 of http://www.ti.com/lit/an/sprab77a/sprab77a.pdf
It basically has a separate Y sample for each pixel, but U and V samples shared between two pixels. I think this library expects 3 channels per pixel (correct me if I am wrong).

Thanks,
Keerthi
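
For reference, a straightforward CPU-side conversion from packed YUYV (assuming the byte order Y0 U Y1 V, which depends on the actual source format) to packed RGB888, using the full-range BT.601 equations that JPEG uses, could look like the sketch below; the resulting RGB buffer can then be fed to the encoder as in the examples above:

#include <stdint.h>

static uint8_t clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

// Convert packed YUYV 4:2:2 (2 pixels per 4 bytes) to packed RGB888.
// Assumes an even width and full-range (JPEG) YCbCr.
void yuyv_to_rgb888(const uint8_t* yuyv, uint8_t* rgb, int width, int height)
{
    for ( int i = 0; i < width * height / 2; i++ ) {
        int y0 = yuyv[4 * i + 0];
        int u  = yuyv[4 * i + 1] - 128;
        int y1 = yuyv[4 * i + 2];
        int v  = yuyv[4 * i + 3] - 128;
        for ( int p = 0; p < 2; p++ ) {
            int y = (p == 0) ? y0 : y1;
            rgb[6 * i + 3 * p + 0] = clamp_u8((int)(y + 1.402 * v));
            rgb[6 * i + 3 * p + 1] = clamp_u8((int)(y - 0.344136 * u - 0.714136 * v));
            rgb[6 * i + 3 * p + 2] = clamp_u8((int)(y + 1.772 * u));
        }
    }
}

Depending on the GPUJPEG version, a packed 4:2:2 input pixel format may also be accepted by the encoder directly, which would avoid this conversion entirely; the sketch above is just the simplest way to get started.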

Mac os x: How to compile in release mode for best performance?

Hi,
I've built the project (with cmake and make) as a static library and linked it to my application.
I see that for 1920x1080 frames at 95% quality, the average encoding time is ~10 ms.
This is a little too high. With software (libjpeg-turbo) I'm getting the same (~10 ms).
In addition, I've noticed that drastically reducing the quality to 10% doesn't affect the encoding time (I do see the quality reduction). I've tried to explicitly set the NDEBUG flag, but it didn't help.
My questions are: how do I properly compile the project in "release" mode, and do you think I should be getting better performance than what I've described?

Thanks!

My setup:
OS X El Capitan (v. 10.11.6)
MacBook Pro (Retina, 15-inch, Mid 2014)
Processor 2.5 GHz Intel Core i7
Memory 16 GB 1600 MHz DDR3
Graphics NVIDIA GeForce GT 750M 2048 MB
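
If no build type was specified, CMake may have produced an unoptimized build. A generic out-of-source Release configuration (plain CMake usage, not specific to GPUJPEG's own build options) looks like this:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

A Release build passes optimization flags and defines NDEBUG for the host compiler, which is usually the intended "release mode".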
