The picture in PDF has alpha value. So when I extract png from PDF, I get the xref fir

[Extract PNG From PDF] the extracted picture has some pixels with different color from what we see in PDF about pymupdf-utilities HOT 9 CLOSED

pymupdf commented on August 11, 2024

[Extract PNG From PDF] the extracted picture has some pixels with different color from what we see in PDF

from pymupdf-utilities.

Comments (9)

JorjMcKie commented on August 11, 2024

Ok, thanks for the hint concerning wand (never heard of it before).
I'll have a look into it.

YongxinLai commented on August 11, 2024

In addition, if I don’t set the alpha, the wrong colors don’t appear at those positions. So I would like to know what happens when we use the setAlpha method, and why the samples value is used to set the alpha values.

JorjMcKie commented on August 11, 2024

I found there are several pixel which has different value and color from what we see in PDF

Hard to tell without seeing an example PDF and the code snippet extracting the PNG.

In general, transparent images in PDF are stored as two separate sub-images: one contains the color pixels, e.g. (R, G, B) for an RGB image, and the other contains the alphas (transparency information).
If you extract such an image and make a pixmap of it, it will contain the 4-tuple (R, G, B, A) for each pixel. All 4 values are integers from 0 to 255. The alpha value A is interpreted as the degree of opacity, e.g. 128 => 128/255 ~ 50%.
The setAlpha method changes all the alpha values in a pixmap, based on an array of integers. If all values are set to 255, the image will no longer be transparent. If all are set to 0, it will be fully transparent and you won't see anything any more, etc.
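
The (R, G, B, A) layout described above can be illustrated with a small pure-Python sketch. This is not PyMuPDF's implementation; merge_alpha is a hypothetical helper that assumes 8-bit RGB samples and exactly one alpha byte per pixel. If the base image is not RGB (for example gray or CMYK), the length check fails, which may be one reason a conversion to RGB is needed before a mask can be applied.

```python
def merge_alpha(rgb: bytes, alpha: bytes) -> bytes:
    """Interleave RGB samples with one alpha byte per pixel (hypothetical helper)."""
    if len(rgb) != 3 * len(alpha):
        # Assumption: 3 color bytes per pixel, 1 alpha byte per pixel.
        raise ValueError("expected exactly 3 color bytes per alpha byte")
    rgba = bytearray(4 * len(alpha))
    rgba[0::4] = rgb[0::3]  # R
    rgba[1::4] = rgb[1::3]  # G
    rgba[2::4] = rgb[2::3]  # B
    rgba[3::4] = alpha      # A: 255 = opaque, 0 = fully transparent
    return bytes(rgba)


# Two pixels: opaque red, then roughly half-transparent blue.
rgb = bytes([255, 0, 0, 0, 0, 255])
alpha = bytes([255, 128])
rgba = merge_alpha(rgb, alpha)
print(list(rgba))  # [255, 0, 0, 255, 0, 0, 255, 128]
```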

JorjMcKie commented on August 11, 2024

Closing for lack of reaction.

meruiden commented on August 11, 2024

I am having the same issue. However, some images get fixed when I wrap the pixmap with fitz.Pixmap(fitz.csRGB, fitz.Pixmap(pix1)) right before setting the alpha. But when I do that, some other images get these weird artifacts again.

This is how the PDF looks:

This is without the conversion before applying the alpha:

This is with the fitz.csRGB conversion:

You can reproduce this by changing the recoverpix function.

This is the code I used (it's basically a simple version of recoverpix):

import fitz  # PyMuPDF


def recoverpix(doc, item):
    x = item[0]  # xref of PDF image
    s = item[1]  # xref of its /SMask

    pix1 = fitz.Pixmap(doc, x)
    if s == 0:  # no /SMask: only convert to RGB
        return fitz.Pixmap(fitz.csRGB, pix1)

    pix2 = fitz.Pixmap(doc, s)  # pixmap of the mask image

    # pix3 = fitz.Pixmap(fitz.csRGB, fitz.Pixmap(pix1))  # with conversion
    pix3 = fitz.Pixmap(pix1)  # without conversion

    pix3.setAlpha(pix2.samples)  # use the mask samples as alpha values

    return fitz.Pixmap(fitz.csRGB, pix3)

What would be perfect is if I could somehow detect whether the conversion is needed.

Any idea, @JorjMcKie?

Edit:

It's not very nice, but for now I managed to fix it by applying the mask using the wand library rather than the setAlpha function:

import fitz  # PyMuPDF
from wand.color import Color
from wand.image import Image


def apply_mask(image, mask, invert=False):
    image.alpha_channel = True
    if invert:
        mask.negate()
    with Image(width=image.width, height=image.height, background=Color("transparent")) as alpha_image:
        alpha_image.composite_channel(
            "alpha",
            mask,
            "copy_alpha",
            0, 0)
        image.composite_channel(
            "alpha",
            alpha_image,
            "multiply",
            0, 0)


def recoverpix(doc, item):
    x = item[0]  # xref of PDF image
    s = item[1]  # xref of its /SMask

    pix1 = fitz.Pixmap(doc, x)
    if s == 0:
        return fitz.Pixmap(fitz.csRGB, pix1), None

    pix2 = fitz.Pixmap(doc, s)

    pix3 = fitz.Pixmap(fitz.csRGB, fitz.Pixmap(pix1))

    return pix3, pix2

pix, pix_alpha = recoverpix(self.fitz_pdf, img)

with Image(blob=pix.getPNGData()) as image:
    with image.clone() as image:
        if pix_alpha is not None:
            with Image(blob=pix_alpha.getPNGData()) as image_a:
                with image_a.clone() as image_a:
                    apply_mask(image, image_a)
        image.save(filename=path)

meruiden commented on August 11, 2024

Ok, thanks for the hint concerning wand (never heard of it before).
I'll have a look into it.

No problem. I made some small changes to the code to get it a little cleaner, but the functionality is still the same (I edited the original post). I hope this helps in finding the issue with the random pixels.

JorjMcKie commented on August 11, 2024

@meruiden
My approach to recovering the original, transparency-laden image is simply too naive.
MuPDF actually does provide the necessary functionality for recovering that original. I erroneously thought I could take this type of shortcut.

What I will do is rework the doc.extractImage() method so that it uses MuPDF more consistently, and either hide any smask xrefs in the response dictionary (e.g. by always setting them to 0), and/or change the example scripts containing the cited type of code.

JorjMcKie commented on August 11, 2024

@meruiden - could you please send me one of those problem examples?
Made a few changes and would like to test ...

JorjMcKie commented on August 11, 2024

@meruiden - in the meantime I have changed my position on this: I gave up resolving issues around transparent images using MuPDF. There simply is no reliable way to do it (or I am too dumb to figure it out).
Their own code also has defects around the same topic. If you apply page.apply_redactions() to areas overlapping transparent images, those images will be updated ignoring their /SMask. Similarly, if you extract page text to HTML, transparent images are not processed correctly, etc.

Instead I am using PIL/Pillow. The image extraction scripts in the examples directory are already updated accordingly. I hope they are worry-free to use now.
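
For readers wondering what a Pillow-based approach might look like, here is a minimal sketch. It is not the code from the examples directory: apply_smask is a hypothetical helper, and the two tiny images are synthetic stand-ins for an extracted image and its /SMask. It assumes the base image can be converted to RGB and that the mask can be read as an 8-bit grayscale image.

```python
from PIL import Image


def apply_smask(base: Image.Image, mask: Image.Image) -> Image.Image:
    """Return an RGBA copy of `base` that uses `mask` as its alpha channel."""
    rgba = base.convert("RGB")   # normalise gray/CMYK/palette images to RGB
    alpha = mask.convert("L")    # one 8-bit alpha value per pixel
    if alpha.size != rgba.size:
        # Assumption: a mask stored at another resolution is simply rescaled.
        alpha = alpha.resize(rgba.size)
    rgba.putalpha(alpha)         # converts the copy to RGBA in place
    return rgba


# Synthetic stand-ins: a 2x1 orange image and a mask with two alpha levels.
base = Image.new("RGB", (2, 1), (200, 100, 50))
mask = Image.new("L", (2, 1), 255)
mask.putpixel((1, 0), 128)

result = apply_smask(base, mask)
print(result.mode, result.getpixel((0, 0)), result.getpixel((1, 0)))
```

With real data, base and mask would instead be opened from the extracted image bytes, for example via Image.open(io.BytesIO(...)).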
