
Comments (18)

scanny commented on June 18, 2024

You should be able to do this for the simple case with this code:

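def delete_paragraph(paragraph):
    # Detach the paragraph's XML element from its parent element.
    p = paragraph._element
    p.getparent().remove(p)
    # Clear the proxy's references so any later use raises AttributeError.
    paragraph._p = paragraph._element = None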

Any subsequent access to the "deleted" paragraph object will raise AttributeError, so you should be careful not to keep the reference hanging around, including as a member of a stored value of Document.paragraphs.

The reason it's not in the library yet is that the general case is much trickier; in particular, it needs to detect and handle the variety of linked items that can be present in a paragraph, such as a picture, a hyperlink, or a chart.

But if you know for sure none of those are present, these few lines should get the job done.

from python-docx.

scanny commented on June 18, 2024

It depends a little on what you mean by link, but deleting is not so much a problem in practice as copying is.

If you have a hyperlink in a paragraph, for example, the hyperlink element in the XML contains a relationship reference (like "rId7") to a Relationship element in the .rels "file" associated with the part containing the paragraph (most commonly the document part). That Relationship element contains the URL of the hyperlink, and that's the extent of the relationship (a so-called "external" relationship). If you delete the paragraph but don't delete the Relationship element in the .rels collection, that Relationship element will hang around and be saved with the document. This shouldn't cause a problem, and I don't believe that by itself it represents a file "corruption" that might give rise to a so-called "repair error" when opening the file.

If you have something "bigger", say an image embedded in the paragraph (a so-called inline shape), and you delete the paragraph without attending to the now-dangling relationship, then both the Relationship element in the .rels and the image part it refers to will be retained in the document. That bloats the file a little but, again, shouldn't cause a problem, although it may or may not give rise to a "repair error" on opening the document. You'd have to experiment, and behavior might vary by client; maybe Word doesn't complain but LibreOffice does, or vice versa.

So deleting a paragraph is worth trying if you don't mind a little wasted space.
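
To see whether a saved file carries leftover relationships of this kind, the package can be inspected with the standard library alone. This is a rough sketch under stated assumptions: the function name unreferenced_relationships is hypothetical, it only looks at the main document part (word/document.xml, not headers, footers, or notes), and it only reports hyperlink and image relationships, because other relationships in the same .rels file (styles, settings, theme, and so on) are normally not referenced by rId from the body.

```python
import xml.etree.ElementTree as ET
import zipfile

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
# Relationship types that are expected to be referenced by rId from the body XML.
EXPLICIT_TYPES = {R_NS + "/hyperlink", R_NS + "/image"}


def unreferenced_relationships(docx_file):
    """Return (rId, target) pairs for hyperlink/image relationships that
    nothing in word/document.xml refers to any more."""
    with zipfile.ZipFile(docx_file) as package:
        rels_root = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(package.read("word/document.xml"))

    # Collect the value of every attribute in the r: namespace (r:id, r:embed, ...).
    used = set()
    for element in doc_root.iter():
        for name, value in element.attrib.items():
            if name.startswith("{" + R_NS + "}"):
                used.add(value)

    leftovers = []
    for rel in rels_root.findall("{" + RELS_NS + "}Relationship"):
        if rel.get("Type") in EXPLICIT_TYPES and rel.get("Id") not in used:
            leftovers.append((rel.get("Id"), rel.get("Target")))
    return leftovers
```

The argument can be a path to a .docx file or an open binary file object, since zipfile.ZipFile accepts both.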

But if you copy a paragraph into another document and don't re-establish its relationships (which may need a different "name", e.g. "rId7" -> "rId9") and also copy over the target part(s) (e.g. the image in the example above), then that will definitely trigger a repair error on loading the document, because Word can't find the image to render in that paragraph.

jeffreinhart commented on June 18, 2024

Would like to see this available for python-docx. It would be very useful in populating a document full of placeholders given that it would allow the placeholder paragraph to be deleted if the value to populate the placeholder is None.

jeffreinhart commented on June 18, 2024

That works! Thank you!!

scanny commented on June 18, 2024

Glad it worked out Jeff :)

waynerth commented on June 18, 2024

Steve, thanks so much. I was having trouble after merging cells in a table, which left extra empty paragraphs. I used your function and it worked great, letting the cells shrink back by getting rid of the empty space. I used it in a nested loop as follows:

    delete_paragraph(table.rows[rx].cells[cx].paragraphs[-1])

thanks - wayne (retired HW designer, having fun with python while hopefully helping out the non-profit I volunteer for)

zooyf commented on June 18, 2024

Hi @scanny
Why not implement the feature and close the issue?

zooyf commented on June 18, 2024

@scanny What's the difference between your delete_paragraph function above and this solution?

def delete_element(el):
    el._element.getparent().remove(el._element)

scanny commented on June 18, 2024

Well, in fact, on review, there was an error in the code as I originally posted it: its last line assigned to p rather than to paragraph. The last line should be:

paragraph._p = paragraph._element = None

But as for the rest of it:

  1. delete_element and el are misleading name choices in my view. A Paragraph object is an element-proxy object which composes an element object; it is not itself an element. So in general we reserve the name element and its derivatives for the XML element objects themselves.

  2. The core code is essentially the first two lines combined into one, so that's a matter of taste; the operation is the same. I would personally probably choose something like yours in my own code, but for someone learning, sometimes breaking things down more step-by-step eases figuring out what the underlying process is, like first get the element from the proxy, then do this thing with the element, etc.

  3. The (previously incorrect) last line is setting the _p and _element attributes of the "host" Paragraph proxy object to None so the now-deleted (or actually only orphaned) element is not accidentally accessed in later code and also is freed up for garbage collection. Removing an element in lxml does not delete it, it only breaks its relationship with its parent. So the original Paragraph object could still make changes to it and the user might puzzle for quite a while to figure out why their code wasn't working but wasn't raising an error. So you can think of it as preventative medicine.

abubelinha commented on June 18, 2024

Thanks for this @scanny
I suggest you edit the original, previously incorrect last line, because that is the answer you still link to from Stack Overflow.

mrufsvold commented on June 18, 2024

I have the same problem @waynerth described above (extra empty paragraphs left behind after merging table cells). However, when I use the delete_paragraph function with the corrected last line, Word reports an error when opening the resulting document: "Word found unreadable content in document_name.docx. Do you want to recover the contents of this document?" Clicking Yes opens the document, but I'm trying to figure out why deleting the paragraphs causes this problem.

I think it might be related to the fact that this paragraph exists in a merged cell, but it sounds like @waynerth didn't experience this problem.

Any thoughts?

Thanks for your work on this @scanny!

scanny commented on June 18, 2024

@mrufsvold each cell must contain at least one block item, so a paragraph or a table. If you get rid of all the paragraphs, that leaves the cell in an invalid state. You might want to delete paragraphs[1:] or something like that, just be sure there's at least one left.

mrufsvold commented on June 18, 2024

@scanny That makes complete sense! Thanks for your quick reply. I'll give that a shot when I get back to that project!

mrufsvold commented on June 18, 2024

It worked!

scanny commented on June 18, 2024

Glad you got it working @mrufsvold :)

abubelinha commented on June 18, 2024

@scanny wrote: "The reason it's not in the library yet is because the general case is much trickier, in particular needing to detect and handle the variety of linked items that can be present in a paragraph; things like a picture, a hyperlink, or chart etc."

@scanny Does that mean that if I delete a paragraph containing a link, my document will or might fail to open because the linked content is still kept or referenced somewhere else in the document, or something like that?

abubelinha commented on June 18, 2024

I think deleting is working for me, at least for the tests I made with many small controlled documents.

Now with a big document (where I do lots of things, not just deleting paragraphs) I am getting errors when opening it.
Word gives the chance to correct them and save the document, but I wonder if I have any chances of finding out the error source:

  • Do you know of any way to make Word report where the "unreadable content" is?
    I tried opc-diag but the output is so huge I can't really see anything there (BTW, no diff colours, just black and white interface: probably not designed for my Windows 7 machine?)
  • Reading your last comment again, I wonder what exactly you mean by copying a paragraph. Could you post a simple code example? (Maybe I am doing it without realizing, since I reuse quite a few functions written by other people.)

Thanks @scanny

zhangxingyang commented on June 18, 2024

Wow, thank you. It works!!!
