I'm doing a lot of work with existing docx (creating many docx from a template). I hac

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Feature: Find / replace text and generally *modify* docx files,about python-openxml/python-docx

Comments (32)

Animal-Machine commented on June 17, 2024 1

For those still looking for a solution, here is the python module you're looking for: https://github.com/elapouya/python-docx-template

from python-docx.

scanny commented on June 17, 2024

Hi Marcus, can you say a little more about what your use case is? I'll be looking to add a generalized solution and that comes from understanding the variety of typical use cases out there.

This approach of substituting in individual text elements is okay, but can run into missed matches when a search phrase is split across multiple <w:t> elements, as it surprisingly often seems to be.

A general solution would be tricky to specify. I expect once specified the implementation would actually turn out to take less time than the specification :)

from python-docx.

scanny commented on June 17, 2024

By the way, if what you are using the search/replace for is to build up a document from template substitution, there are much better ways than using search/replace. Happy to share those if that's your use case.

from python-docx.

scanny commented on June 17, 2024

Closing this one out since no response. Feel free to re-open Marcus if you get back to this.

from python-docx.

TheBlueCasket commented on June 17, 2024

Hi scanny,
i really like to know, how i could build up a document from a template.
Can you please provide an indication of doing this.
Thanks!

from python-docx.

scanny commented on June 17, 2024

Just a caution going in here that this is an advanced topic. To make this work you'll need to understand a lot more about the structure of Word documents than you do to use the rest of the library. That said, here you go:

The general gist is that you create a template for the document.xml part, render it using something like Jinja, then inject the resulting XML into the document to replace whatever was there before. The middle part is very much like rendering a web page, except instead of HTML, the base content is WordprocessingML.

Something like this would be a start:

from docx.opc.oxml import oxml_fromstring

new_document_xml = Template(template_xml).render(context)
document_part = document._document_part
new_document_element = oxml_fromstring(new_document_xml)
document_part._element = new_document_element

There are a couple challenges you'll need to meet in order to get this to work:

You're very likely to want the expanded versions of some existing document as a starting place for your template XML file(s). I recommend checking out opc-diag and its extract sub-command as a handy way to get those from a .docx file: http://opc-diag.readthedocs.org/
Your rendered document.xml file must be valid WordprocessingML, otherwise the resulting document won't open cleanly with Word and possibly won't open at all. For most folks, just creating a Word file that does what you want and copying and pasting the resulting XML into your template will be the quickest way to success. Again, opc-diag is your friend for doing before vs. after diffs so you can narrow down just the bits you need and then substitute in template placeholders and logic where needed. The Open XML spec is handy when you run into challenges, you can find it here: http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html; look for ISO/IEC 29500. Part 1 is the place to start.

Let me know how you go :)

from python-docx.

sneakyglibc commented on June 17, 2024

Hi,
I am using your library for templates docx. My purpose is to replace, in the templates docx, keys by values. For the moment I replace of the text by the text and by the table. But I do not manage to replace of the text by a picture because the picture needs relationships = relationshiplist () to be able to be to create and to be to add in the docx. My question is: what is that there is a way to add an image in a file template or how get back the relationships of the docx.

Thanks.

from python-docx.

scanny commented on June 17, 2024

There is currently not a direct way to do this. However, if you're open to making some adaptations, the method docx.shape.InlineShape.new_picture() should give you most of what you need. Something roughly like this should do the trick:

document = Document(template_docx_path)
...
run = run_containing_image_key  # however you manage to get that
document_part = document._document_part
image_part, rId = document_part.get_or_add_image_part(image_file_path)
shape_id = document_part.next_id
picture = InlineShape.new_picture(run, image_part, rId, shape_id)
picture.width = ...
picture.height = ...

Best if you trace through the code to see why that works.

from python-docx.

PradeepSmg commented on June 17, 2024

Hi scanny i want to know how i could build a template from a word document in order to generate many such documents for different contents in future.

from python-docx.

scanny commented on June 17, 2024

@PradeepSmg you'll need to provide more details about your use case. There are a lot of options.

from python-docx.

PradeepSmg commented on June 17, 2024

@scanny Here is my use case i have a word document which has some variable contents, i need to replace those variable contents with Place holders.
Note:Same variable contents may present many places in the document i do not want to replace all the occurrences i want to replace variables at particular Location\Line Number
ex:
I have document like this

My name Is Pradeep,
I am Working at Pradeep & co.

I want to replace Only First occurrence of Pradeep with Place holder like this:

My name Is $name,
I am Working at Pradeep & co.

from python-docx.

scanny commented on June 17, 2024

Are you familiar with Django templates?

One way I've had good success with is using a curated template containing placeholders at the right spots in the XML and then feeding that XML through the template engine with a dictionary retrieved from a database.

python-docx allows you to open an expanded .docx file from a directory. I recommend using opc-diag for that job. The extract sub-command automatically reformats all the XML as well as extracting it to a directory structure.

The important limitation is that the XML must remain parseable because python-docx will parse it on load, before you are able to get to it using the API. A consequence of this is that you can't embed for-loop structures in the XML because they are not parseable.

Let me know if you're interested in this approach and I can identify some of the internals you'll need to get it done.

from python-docx.

aprasanth90 commented on June 17, 2024

@scanny
Hello Scanny,

I am currently trying replicated exactly what Pradeep tried to do.

My name Is Mr.X,
I am Working at Company x.

I want to replace multiple place holders on the word document.

My name Is $name,

I am Working at $company

I have template with pre-filled variable that I would like to replace with the value I provide in my script.

I have worked with Django and I was reading on your suggestion on curated template.

Can you please expand on it.

Thank you for time

from python-docx.

scanny commented on June 17, 2024

@aprasanth90, @holli-holzer The basic idea is this:

get a template from somewhere
render it into final XML using Django Templating or Mako or whatever
inject that XML into the ._blob field of the target part, likely the DocumentPart object.

The template part can come from anywhere, but the last time I did it I stored it directly in the starting document as its document.xml part. I might like to do that differently in future because it limits what templating features you can use, is not well suited to version control if kept in its .docx (Zip) format, and is prone to corruption if edited in Word.

The template itself is based on the XML from one of the document 'parts'. The main document part is 'word/document.xml' when you unzip the .docx file. The opc-diag package is a handy way to get that neatly unzipped and pretty-printed ready for templating.

One aspect of this overall approach is you end up having to understand the WordprocessingML well enough to accomplish what you want. Usually that's easy enough by creating a document with Word that looks like what you want and then just finding the bits you want to change and replacing them with templating placeholders. But it does generally place it outside what's possible for end-users to do for themselves.

The rendering part I expect you understand well enough.

Once you have the final XML, you inject it into the DocumentPart something like this. Note the XML must be bytes before you inject it, not unicode. So depending on what you get back from your templating engine you may have to encode it as 'utf-8', e.g. xml_bytes = xml_unicode.encode('utf-8'):

document = Document()
document_part = document._document_part
document_part._blob = xml_bytes_rendered_from_template

If you store the template in the starting document, you access it something like this:

document = Document('my-template.docx')
document_part = document._document_part
template = document_part._blob

Note that python-docx supports opening a document from an expanded .docx zip archive. If your template .docx was expanded into the directory 'my-template', loading it would look like this:

document = Document('my-template')
document_part = document._document_part
template = document_part._blob

I like that approach better than using a .docx file because you can keep the entire expanded template directory tree under version control with the rest of your source. This is what I referred to above as a "curated" template.

That should give you a start :)

from python-docx.

holli-holzer commented on June 17, 2024

I have added a templating proof of concept with support for django and jinja engines.

Here's a diff: http://hastebin.com/cakuxusaci.diff

Usage is as easy as

###################################################

import os, sys

sys.path.append("..")

from docx.api import Document

c = {
    'scalar': 'The FSM', 
    'liste': [
        {'name':'foo', 'data':[1,2,3], "baz":"abc", "date":"112"},
        {'name':'bar', 'data':['three', 'four', 'five'], "baz":"def", "date":"1132"},        
    ]
}

#
# loop over all test files and process them
#
for p in ["test_template_edge_case_1.docx", "test_template_edge_case_2.docx", "test_template_features.docx"]:
    for e in ["django", "jinja"]:
        i = os.path.join("test_files", e, "in", p)
        o = os.path.join("test_files", e, "out", p)
        d = Document(i)
        d.save(o, context=c, engine=e)

###################################################

from python-docx.

holli-holzer commented on June 17, 2024

Oh, I guess I should mention, that this solution supports templates to be written in word (or ooffice or whatever) itself as well as handcrafted templates. I have created a fork for the time being.

https://github.com/holli-holzer/python-docx

from python-docx.

Epithumia commented on June 17, 2024

I've been trying to replace the _part (I assume that's what _document_part is supposed to be) of my document, but upon saving, it just saves the original version. render(...) calls jinja2 on a jinja2 template which is the original _blob which I saved to a file then turned into a jinja2 template.

Here's what I was trying:

from docx import Document
doc = Document('modele.docx')
xml_unicode = render('/templates/my_doc.jinja2', {'data': some_data})
doc._part._blob = xml_unicode.encode('utf-8')
doc.write('output.docx')

On the other hand, using zipfile to flat out replace the document.xml and then repacking the file works.

from python-docx.

scanny commented on June 17, 2024

DocumentPart._blob is not used to write the document part. DocumentPart._element is serialized to get a new blob based on changes made to the XML.

So you'll need to replace DocumentPart._element, something like this:

from docx.oxml import parse_xml
doc._part._element = parse_xml(xml_unicode.encode('utf-8'))
doc.save('output.docx')

You're on your own for any external relationships that might get put out of sync, such as for hyperlinks or images; but if you can avoid or accommodate that this should work fine :)

from python-docx.

Epithumia commented on June 17, 2024

Right, this works now. Thanks.

from python-docx.

backbohne commented on June 17, 2024

I wrote a docx XSL tranformation module based on holli-holzer basic ideas: https://github.com/backbohne/docx-xslt

from python-docx.

karelv commented on June 17, 2024

@holli-holzer, It sounds like a great feature you added.
Do you have any plans to make a pull request?

from python-docx.

holli-holzer commented on June 17, 2024

@karelv no. i have uploaded the fork and if steve wants to he can easily pull it in.

from python-docx.

ullrichc commented on June 17, 2024

@scanny Any plans to pull it in?

from python-docx.

scanny commented on June 17, 2024

No; no plans to pull this fork in. There are no tests and minimally the structure would need to be adjusted to be consistent with the rest of the library. I expect there are a number of other work items that would be involved as well.

I believe @holli-holzer intended it as a proof-of-concept, which it does nicely as far as I can tell. But the heavy lifting of incorporating it into the broader code base remains to be done and I have no current plans to undertake it.

from python-docx.

mrufsvold commented on June 17, 2024

I have a question related to this conversation.

I need to edit the text of the footer of an existing document that relies on a custom template. However, doc.sections[0].footer.paragraphs returns an empty list.

When I unzipped the underlying xml, I find that the text that is currently in the document's footer, is stored in a file called footer2.xml. I assume that it is broken out like this because of the way the original document is constructed by the custom template which is why I'm posting in this topic.

After reading all of your solutions above, @scanny, I think I'm having a mental block because I don't know how the various docx and xml functions you're using map on to reading the actual xml files as they exist. So if I want to access an element in a specific file and change its contents, I don't know how to find it. But I think that mental block is mostly because I'm thinking about the problem in the wrong way.

Any advice you can give would be much appreciated!

from python-docx.

scanny commented on June 17, 2024

The "separate files" you mention are known as "parts" (like "document parts") in the Open Packaging Convention (OPC). Things can be broken up for various reasons, but yes, headers and footers are separate parts. All document parts are loaded at the Document(...) call time. A part is "connected" to another part by a relationship (sometimes rel or rels in the code). The code for part base classes is in docx.opc.part. The classes for custom parts (most of the ones you see) are in docx.part.

The user (programmer) doesn't generally interact directly with a part object. Instead the main element of the XML in the part is used to instantiate an element-proxy object (like Document, Header, etc.) and the API is the methods on that object.

When you call, for example, section.header, that relationship is traversed, the header part (HeaderPart object) is accessed, and the Header object you get back is a proxy for the w:hdr element at the root of the XML stored in the header part.

So you need to get to the part by traversing whatever relationships are involved and then access the part contents from there.

What part and what contents were you after in particular?

from python-docx.

mrufsvold commented on June 17, 2024

I'd like to update the date in the footer of the document when my program runs:

There is a date in the header that I'm able to access and edit from Document('mydoc.docx').sections[0].header.paragraph[0].runs. For some reason, I can't access the footer using the same method (replacing header with footer of course).

If it is helpful, I searched the XML and found the text I'd like to change in the part called footer2.xml. Screensheet for context --

I hope that is the context you need.

from python-docx.

scanny commented on June 17, 2024

check in .first_page_footer and .even_page_footer as well to see if you find it there. .footer returns the default footer (which also serves as the odd-page footer when .even_page_footer is defined).

from python-docx.

mrufsvold commented on June 17, 2024

Both of these attributes exist and .paragraphs return a list with one paragraph. The paragraph.text attribute returns an empty list for both. Only difference is that .footer.paragraphs returns an empty list with no paragraph object.

from python-docx.

scanny commented on June 17, 2024

What about .tables. Sometimes folks like the idea of placing footer items in a table as an easy way to keep the page number distinct from whatever is in the text part.

from python-docx.

mrufsvold commented on June 17, 2024

No dice! .tables is empty. I think I may just have the analysts using this document manually update the footer when they go in 🤷
Edit: all types of footer you listed above have empty .table attributes.

from python-docx.

scanny commented on June 17, 2024

If you want to pursue, then look in the .rels file/part for the document and see which relationship points to footer2.xml. Then find the rId of that relationship and find it in the XML for the document. Is it possible there is more than one section in the document and you're inspecting the wrong one?

from python-docx.

Feature: Find / replace text and generally modify docx files about python-docx HOT 32 CLOSED

Comments (32)

I am Working at $company

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent