python-openxml / python-docx

Create and modify Word documents with Python

License: MIT License

Makefile 0.13% Python 93.05% Gherkin 6.82%

python-docx's Introduction

python-docx

python-docx is a Python library for reading, creating, and updating Microsoft Word 2007+ (.docx) files.

Installation

pip install python-docx

Example

>>> from docx import Document

>>> document = Document()
>>> document.add_paragraph("It was a dark and stormy night.")
<docx.text.paragraph.Paragraph object at 0x10f19e760>
>>> document.save("dark-and-stormy.docx")

>>> document = Document("dark-and-stormy.docx")
>>> document.paragraphs[0].text
'It was a dark and stormy night.'

More information is available in the python-docx documentation


python-docx's Issues

feature: Run.text interprets line breaks as '\n'

When a table cell contains text with a line break, the line break is missing when I parse the file.
The cell seems to have only one paragraph (which is correct), but without the line break.
The document is created with python-docx, and the line break is created automatically when I assign text containing "\n" in the middle to the text member of the cell.
I then load the document in Office and LibreOffice, resave it, and reload it in Python.
In Office and LibreOffice the line break is correct. It is missing only when parsing the file again with python-docx.
If I replace the "\n" with "\r", the table disappears and I'm left with a list of paragraphs.
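The behavior requested in the title can be sketched with the standard library alone. This is only an illustration of the mapping (a `<w:br/>` child read back as "\n"), not python-docx's implementation; the helper name `run_text` and the sample XML are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical run: two text elements separated by a line break
RUN_XML = f"""
<w:r xmlns:w="{W}">
  <w:t>first line</w:t>
  <w:br/>
  <w:t>second line</w:t>
</w:r>
"""


def run_text(r):
    """Concatenate the run's text, reading each <w:br/> as a newline."""
    parts = []
    for child in r:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}br":
            parts.append("\n")
    return "".join(parts)


text = run_text(ET.fromstring(RUN_XML))
```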

feature: insert picture in table cell

Hi scanny!

I searched the Cell class API and found that it doesn't support this yet.

Could you tell me whether it will be supported in the future?

Table too wide for document

I'm creating a docx containing a table with three columns, some of them containing 80 characters of text.

When I open the doc, the first column of the table is wider than the page. When I select table properties and set the width to relative and 100%, it fits the whole table nicely and wraps the text where necessary.

Is there a way to specify the width of the created table to 100% relative?

At the moment I'm digging around in docx/oxml/table.py and using this http://www.docx4java.org/forums/pdf-output-f27/pdf-conversion-table-width-t1233.html as a hint, but pointers would be greatly appreciated!

can't add emf or eps images

It would be nice to have the ability to insert *.emf images.

about font and size

How can I set the font and font size of a paragraph? Is there any way to do it?

feature: Paragraph.alignment

Request to add support for paragraph alignment:
left, right, both, center...

Should it support True, False, and None?

Is this the correct enumeration: http://msdn.microsoft.com/en-us/library/office/ff835817(v=office.15).aspx

feature: _Cell.add_table()

I have a use case in which a cell within a table contains another table. I can extract the paragraphs of the cell but not the sub-table. I am able to work around this by traversing the element tree and searching for sub-rows.

Can search and replace functions be added to python-docx?

It is very easy to create a docx file with python-docx, but I would like to search for specific words and count the number of times they occur. How can I do that in python-docx? I know this can be done in mikemaccana/python-docx, but its code syntax is different from that of python-openxml / python-docx, and I would rather not switch to mikemaccana/python-docx.

feature: BlockItemContainer.iter_block_items()

I'd like to iterate over the elements of the document in the order they appear in it. For example, if there is a paragraph, a table, and then a paragraph again, I want to get them in that order. AFAIK there are currently two properties on Document, paragraphs and tables, but they have no notion of ordering between them.

New Objects as duplicate of existing object

Hello,

It would be nice if an object could be added to a document as a duplicate of an existing object. Am I just not seeing how this is done, or is it not yet supported? Is there any chance this feature will be implemented?

Example:

tblList = wdoc.tables
t = tblList[2]
t2 = wdoc.add_table(t)
...

Markdown Conversion

Hi! Thanks for this great library. After looking at some of the XML docs, I really see the pain in creating it ;)

I am developing a web application that takes user input from a form and generates a docx file from it. Among other things, some fields are formatted in Markdown. I am planning to take the Markdown fields, convert them to XML (with pandoc or Python-Markdown), and put them into the document via your low-level API.

Is there a better or easier way to do this, or are there any plans to implement Markdown support directly in python-docx?

Greetings

Example of how to merge two word documents into one

Hello

Does anyone have an example of how to merge two Word documents into one file?

Thanks,
Greg

Support for footnote.

A footnote consists of two things:

In document.xml, at the place of the reference:

<w:r>
  <w:rPr>
    <w:rStyle w:val="FootnoteReference"/>
  </w:rPr>
  <w:footnoteReference w:id="1"/>
</w:r>

In footnotes.xml, the content of the footnote:

<w:footnote w:id="1">
  <w:p w:rsidRDefault="00D935D7" w:rsidR="00D935D7">
    <w:pPr>
      <w:pStyle w:val="FootnoteText"/>
    </w:pPr>
    <w:r>
      <w:rPr>
        <w:rStyle w:val="FootnoteReference"/>
      </w:rPr>
      <w:footnoteRef/>
    </w:r>
    <w:r>
      <w:t xml:space="preserve"> Note</w:t>
    </w:r>
  </w:p>
</w:footnote>
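To make the structure above concrete, here is a small standard-library sketch that reads the footnote id and text from such a fragment. It does not use python-docx; the namespace declaration is added so the fragment parses on its own, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The footnote element from the issue, with the namespace declared
FOOTNOTE_XML = f"""
<w:footnote xmlns:w="{W}" w:id="1">
  <w:p>
    <w:pPr><w:pStyle w:val="FootnoteText"/></w:pPr>
    <w:r>
      <w:rPr><w:rStyle w:val="FootnoteReference"/></w:rPr>
      <w:footnoteRef/>
    </w:r>
    <w:r><w:t xml:space="preserve"> Note</w:t></w:r>
  </w:p>
</w:footnote>
"""

footnote = ET.fromstring(FOOTNOTE_XML)

# The id links this footnote to <w:footnoteReference w:id="1"/> in the body
footnote_id = int(footnote.get(f"{{{W}}}id"))

# The footnote text is the concatenation of all <w:t> descendants
footnote_text = "".join(t.text or "" for t in footnote.iter(f"{{{W}}}t"))
```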

UnicodeDecodeError with setup.py

Using setup.py install throws the following error for me in Windows 7.

C:\Users\efredericksen\Documents\GitHub\python-docx>python setup.py install
Traceback (most recent call last):
  File "setup.py", line 24, in <module>
    LICENSE = open(license).read()
  File "C:\Users\efredericksen\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 231: character maps to <undefined>

I replaced the “ characters in the LICENSE file with " and the installation ran fine afterwards.
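The failure can be reproduced without python-docx: a UTF-8 encoded right curly quote ends in byte 0x9d, which cp1252 leaves undefined. The sketch below assumes the LICENSE file was UTF-8 encoded (not confirmed in the report) and shows that passing an explicit encoding to `open()` avoids depending on the platform default. The file contents are made up for the example.

```python
import os
import tempfile

# Sample text with curly quotes, standing in for the LICENSE contents
license_text = "The \u201cSoftware\u201d is provided as is."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LICENSE")
    with open(path, "w", encoding="utf-8") as f:
        f.write(license_text)

    # Reading as cp1252 (the Windows default in the report) fails on 0x9d
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with an explicit UTF-8 encoding works on every platform
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()
```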

feature: _Cell.add_picture()

As I understand it, I cannot call something like table_cell[0].add_picture() the way I can with document. Is that right?

Can you implement this feature, or shouldn't I want this? :-)

"Table of Contents" Feature

I just discovered this great project and I wonder if there is a feature to add a Table of Contents to a document that I create with python-docx.
I need to generate a .docx file for a customer and he wants to have a TOC in it.

feature: Paragraph.add_text()

Right now, the quickest way to add text to the last run of paragraph p (preserving formatting/styles, versus p.add_run()) is through p.runs[-1].add_text(), which doesn't look particularly clean. I know that Texts aren't children of Paragraphs, but p.add_text() is intuitive and I suspect that many people will try to call it when they first start using the library.

Non-numeric id_str 'fix' for function "next_id" in /docx/parts/document.py

Hi all,

Firstly, great work on this project.

I believe I've found a bug in /docx/parts/document.py, in function "next_id".

In some of the documents that I have been using python-docx with, it turns out that some of the IDs are non-numeric. For example, inserting a "print(id_str_lst)" at line 90 in the aforementioned file gives me:

['4', '_x0000_t202', 'Text Box 5', '7', 'Text Box 9', '9', 'Text Box 7', '8', 'Text Box 11', '6', 'Text Box 6', '10', '0', '1', '3', 'Group 4', 'AutoShape 3', '5', '0', '12', '0', '26', '0', '25', '0', '2', '0', '13', '1', '14', '1', '15', '1', '16', '1', '39', '0', '40', '0', '35', '0', '21', '21', '22', '22', '20', '0', '18', '1']

Thus, I would get a ValueError as soon as the second element in the list was processed with "int(id_str)".

I have implemented a workaround by modifying the code for the "next_id" function to the following, to perform a quick check to ensure the id is numeric prior to adding to the list of used IDs:

def next_id(self):
    """
    The next available positive integer id value in this document. Gaps
    in id sequence are filled. The id attribute value is unique in the
    document, without regard to the element type it appears on.
    """
    id_str_lst = self._element.xpath('//@id')
    used_ids = []
    for id_str in id_str_lst:
        if id_str.isdigit():
            used_ids.append(int(id_str))
    for n in range(1, len(used_ids)+2):
        if n not in used_ids:
            return n

This appears to fix the problem for me.

This is the first time I've ever had input to an open source project, so I am not certain how to go about officially submitting this 'fix' to the repository, and surely a better programmer than I will have a more efficient fix. :-)

Thanks again, and I hope this helps.

Kind regards,
Mike Nye
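The same idea can be written as a standalone function, which makes it easy to try against the id list quoted above. This is a sketch, not the project's fix: the function name is hypothetical, it takes the id strings as an argument instead of running the XPath query, and it uses a set so membership tests stay cheap.

```python
def next_available_id(id_strs):
    """Return the lowest positive integer not used as a numeric id.

    Non-numeric ids such as 'Text Box 5' are ignored.
    """
    # isdecimal() only accepts strings that int() can convert
    used_ids = {int(s) for s in id_strs if s.isdecimal()}
    # Among len(used_ids) + 1 candidates, at least one must be free
    for n in range(1, len(used_ids) + 2):
        if n not in used_ids:
            return n


# A shortened version of the id list from the report
sample_ids = ["4", "_x0000_t202", "Text Box 5", "7", "9", "0", "1"]
result = next_available_id(sample_ids)
```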

feature: document lxml element of proxy classes for advanced users

Proxy classes such as Document, Paragraph, and Table each hold a private reference to the lxml element they correspond to, <w:document>, <w:p>, and <w:tbl> respectively. With these elements, advanced users can call the underlying lxml API directly to develop customized solutions the existing API does not yet support.

Add documentation so advanced users can readily access these elements without consulting the source code.

page orientation change mid-document

I see section breaks are discussed in the Analysis section. Does it mean that it would be added sometime soon? I would need a feature where I can switch page-orientation mid-document. Is there a way to achieve this?

feature: Run.add_picture()

Thanks for python-docx!

I need to be able to add an image in the middle of a document, so document.add_picture doesn't work for my purposes.

To be precise, I have a template .docx which contains the text [$signature], and I need to be able to replace that text wherever it appears with a signature image. Ideally, there would be an add_picture method on the Run class. Would you be willing to accept a pull request that added this?

Here's how I'm currently doing this:

d = Document('template_doc.docx')
# p = paragraph to append image to
# ...
image_part, r_id = d.inline_shapes.part.get_or_add_image_part('sig.png')
shape_id = d.inline_shapes.part.next_id
r=p.add_run()

InlineShape.new_picture(r._r, image_part, r_id, shape_id)

can't open existing file using "Document" object

I can't open an existing docx using Document

Feature: Find / replace text and generally modify docx files

I'm doing a lot of work with existing docx files (creating many docx files from a template). I hacked this together, but I think there are better ways. Are there any plans to natively increase support for modifying docx files? XPath? This is my main use case.

import logging
import re


def replace(document, search, replace):
    """Walk the tree down to the w:t XML and update its text node."""
    searchre = re.compile(search)
    count = 0

    # Loop over all paragraphs in the document
    for para in document.paragraphs:
        # Loop over all runs in the paragraph
        for run in para.runs:
            # A run with more than one w:t element is not handled here
            if len(run._r.t_lst) > 1:
                raise ValueError("run contains more than one w:t element")
            if len(run._r.t_lst) == 1:
                element_wt = run._r.t_lst[0]
                this_text = element_wt.text
                if searchre.search(this_text):
                    # count is the number of runs changed, not of matches
                    element_wt.text = searchre.sub(replace, this_text)
                    count += 1

    logging.debug("Replaced {} with {} {} times".format(search, replace, count))

feature: _Column.width

Really liking the new api, but I have a need to set column widths for a table and am unable to. I would appreciate this module being enhanced to allow for setting column and row properties such as width and height.

Extracting data from tables

Hi,

Is it possible to extract data from tables using the docx module?

It would be useful to have more examples with regards to learning to use this library.

Regards,

Ben

feature: section page size

Thanks for sharing the docs! I have a question: does docx support page orientation? I think this feature (landscape and portrait) is quite useful for common usage. If it isn't supported at the moment, is it scheduled?

feature: Core Properties (read/write)

It seems, from the discussion at http://stackoverflow.com/questions/22625022/reading-coreproperties-keywords-from-docx-file-with-python-docx
that python-docx can write keywords but not read them.

Could a method/function/etc for reading them be added?

Document can't import

I tried to import the Document function from docx after installation, but it throws an import error.
When I import docx alone, it works.

"from docx import Document" throws an error.

docx version - 0.2.4
Python 2.7.3

Document.add_picture Fails on Subsequent Pictures

When inserting pictures into a document, the first picture works just fine, as shown by the example, etc. This problem arises when you go to add a second picture. The second picture addition triggers the sha1 function in the ImagePart class in docx/parts/image.py. This function is currently:

    @property
    def sha1(self):
        """
        SHA1 hash digest of the blob of this image part.
        """
        raise NotImplementedError

Which is clearly not very helpful. The fix is quite simple, just add the same sha1 functionality from the Image class in the same file. The resulting routine is:

    @property
    def sha1(self):
        """
        SHA1 hash digest of the blob of this image part.
        """
        return hashlib.sha1(self.blob).hexdigest()

Note that this should also probably be a lazyproperty instead of property, but either will work.

I experienced this error under Python 3.3.0 and 3.3.3, but it seems it will happen under any version.

Regards,

Steve
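The role of the hash can be shown without python-docx. In the sketch below, `ImagePartSketch` and `get_by_sha1` are hypothetical stand-ins for the library's classes; `functools.cached_property` plays the part of the lazyproperty mentioned above, so the digest is computed once per part.

```python
import hashlib
from functools import cached_property


class ImagePartSketch:
    """Hypothetical stand-in for an image part holding raw image bytes."""

    def __init__(self, blob):
        self.blob = blob

    @cached_property
    def sha1(self):
        """SHA1 hash digest of the blob, computed on first access."""
        return hashlib.sha1(self.blob).hexdigest()


def get_by_sha1(image_parts, sha1):
    """Return the part whose content hash matches, or None."""
    for part in image_parts:
        if part.sha1 == sha1:
            return part
    return None


parts = [ImagePartSketch(b"first image bytes")]

# Adding the same bytes again should find the existing part
duplicate = get_by_sha1(parts, hashlib.sha1(b"first image bytes").hexdigest())
# Different bytes should not match anything
missing = get_by_sha1(parts, hashlib.sha1(b"second image bytes").hexdigest())
```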

How to create numbered structured text and numbered figure?

I'd like to use python-docx to create structured text like this:

chapter 1
1.1. section 1
1.2. section 2
chapter 2
2.1. section 1
2.2. section 2
2.3. chapter 3
2.3.1. subsection 1
2.3.2. subsection 2

This can also be edited in MS Office: when I delete "2.2. section 2", "2.3. chapter 3" automatically becomes "2.2. chapter 3", and its subsection numbers change automatically too, that is, "2.3.1. subsection 1" becomes "2.2.1. subsection 1" and "2.3.2. subsection 2" becomes "2.2.2. subsection 2".

In fact, the structured text in MS Word format comes from an .xmind file created by XMind 3.4.1, so I wonder whether it can be created by python-docx.

A similar question concerns numbered figures: how can figure numbers change automatically? For example, when I delete a figure, the numbers of the figures after it should decrease by 1 automatically.

feature: _Cell.width (read/write)

After reading the documentation, I cannot find a way to change the size of a cell in a table, much like mentioned here: https://stackoverflow.com/questions/15688389/cell-spanning-multiple-columns-in-table-using-python-docx Is there a way to do this?

feature: superscript/subscript

I'd like to add a feature request to support superscripts and subscripts. Thank you.

features: merge/concatenate two documents

Would it be possible to have a function to merge/concatenate two docx files, with all images, paragraphs, etc.?

feature: add field

Feature Suggestion/Request

Support for inserting/adding Field Codes in a word document. They are a handy feature for report generation type applications (originally intended for automatic mailout merges I believe).

In Office, they make it easy to add dynamic features to a document without getting your fingers all slimy with macros/VBA (although, if designed properly, they are accessible from VBA using custom DocProperties and clever references). You work with them using text "markups" that form a restricted scripting framework (hit Ctrl+F9 to get started within Word).

They can easily work with DocProperties, named elements (tables, lists, headings), and external text documents that will drive the dynamic content. You can still use styles to drive a document, but with field codes you can adjust/apply styles conditionally. But field codes are admittedly awkward to work with (odd syntax, updating, poor UI tools). That's where python-docx needs to come in.

I think with a little love, working with field codes could actually be neat, organized, readable, and very functional. They would slip in just like your other elements...

document.add_fcode("ASK", "chap1_caption", "Type in caption for Chaper 1")

Nesting field codes arbitrarily would be an important requirement.

Instead of making custom routines in python-docx that help a user hack together structured portions of a document or specific output patterns, let us use field codes to define those structures explicitly. And then when we export back into an Office-driven workflow, all of the glue is still intact and fully functional.

The main weakness of field codes is that they are typically hidden/disabled by default in Word, but in that regard, it's not much different from shipping a document with embedded macros: it is understood that you know how to interact with the extended features.

character style

I understand how to "style" a complete paragraph, but how do I apply a named style to a part of a paragraph (a "run")? I can "style" it with 'bold' or 'italics' but how about a named character style like "emphasis" or "link"?

feature: character style

Hi,Scanny,

Is there any way to change font, size, color or style of a paragraph or a character?

lxml

It looks like you want the user to work entirely through python-docx, as Etree elements are abstracted away through wrapper classes. If that's the case, what are you planning with regards to methods such as iter(), find(), xpath expressions etc.? I know that for simpler documents, statements like document.add_paragraph() are sufficient, but I've found lxml methods like the ones I mentioned above to be invaluable for more involved Docx scripting.

feature: Document.paragraphs includes paragraphs nested in <w:ins> elements

Document.paragraphs shall contain the sequence of paragraphs corresponding to the "Final" view of the document. Inserted paragraphs appear in the sequence. Deleted paragraphs do not. Moved paragraphs appear in their new location.

Adding elements nonsequentially

It would be nice, especially when using loaded documents, to be able to manipulate a document in some manner beyond appending elements to their parent element. Two possibilities that leap to mind:

Add an index kwarg to the various add methods so that they can be used in lieu of insert. Possibly include a delete method as well.
Subclass the paragraphs and runs properties of the Document and Paragraph classes, respectively. Overwrite the various list methods to appropriately modify the underlying Etree elements. For instance, d.paragraphs[3] = "Hello world" would replace the 4th paragraph with a new hello world paragraph. This could be powerful and flexible, but it also feels hackish. I'm not really sure.

Thoughts?

feature: underline property on Run

Provide an underline property on Run with semantics similar to .bold and .italic, allowing simple underline formatting to be applied to a run, but not precluding the broader set of possible underline styles, such as dashed, wavy, and double-underline.

Restart numbering of an ordered list in document.

We can easily add ordered list with document.add_paragraph(style='ListNumber') code. But how can we restart its numbering?

Legacy getdocumenttext

With this version, how can I get only the text from a docx, as getdocumenttext did?

Support for word docx templates

Does this package support usage of word-docx templates?
If so how?

can't add several pictures?

Hi scanny.

I need to insert several pictures, but doing so throws an exception.

The code is below.

from docx import Document
from docx.shared import Inches

document = Document()

document.add_heading('Document Title', 0)

p = document.add_paragraph('A plain paragraph having some ')

document.add_picture('amazon.png', width=Inches(1.25))
document.add_picture('web_report.png', width=Inches(1.25))

table = document.add_table(rows=1, cols=3)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Qty'
hdr_cells[1].text = 'Id'
hdr_cells[2].text = 'Desc'

document.add_page_break()

document.save('demo.docx')

and Exception is

Traceback (most recent call last):
  File "test_add_picture.py", line 11, in <module>
    document.add_picture('web_report.png', width=Inches(1.25))
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\api.py", line 83, in add_picture
    picture = self.inline_shapes.add_picture(image_path_or_stream)
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\parts\document.py", line 207, in add_picture
    image_part, rId = self.part.get_or_add_image_part(image_descriptor)
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\parts\document.py", line 64, in get_or_add_image_part
    image_part = image_parts.get_or_add_image_part(image_descriptor)
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\package.py", line 76, in get_or_add_image_part
    matching_image_part = self._get_by_sha1(image.sha1)
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\package.py", line 97, in _get_by_sha1
    if image_part.sha1 == sha1:
  File "C:\Python27\lib\site-packages\python_docx-0.3.0a1-py2.7.egg\docx\parts\image.py", line 269, in sha1
    raise NotImplementedError
NotImplementedError

inserting xml-snippet into docx using the python-docx api

We need to change header text-orientation of tables. We are aware that this may not be possible with the current state of the API. We identify the xml-snippet to be inserted using opc-diag as suggested elsewhere. Can we use xml-snippet insertion to achieve this? If yes what is the API-command to do the xml-insertion at a specific point of the docx?
-- sub

Center Text

Is there a way to center text?

document.add_picture(img, width=Inches(7))
fileName,extension=os.path.splitext(img)
capt='Figure %d,Meter number %s' % (figureNum, fileName)
c=document.add_paragraph(capt, style='Caption')

Now I would like the caption to be centered on the page?

<w:pStyle w:val="Caption"/>
<w:jc w:val="center"/>

docs: Working with Tables user guide page

Hi, in your documentation, you detail how to add a table. How about detailing how to read a table from an existing document?

feature: Paragraph.delete()

In order to modify an existing document
As a developer using python-docx
I need a way to delete a paragraph

Need to account for the possibility that the paragraph contains the last reference to a relationship, as a hyperlink or inline picture might.

How do i set table style

Hi scanny.

How do I set the table style?

When I use

document.add_table(rows=len(table_data), cols=len(table_data[0]),style="TableGrid")

I searched the API but didn't find a method to do it. I need to set the table size, width, height, color, and so on.

The Cell class also doesn't seem to have any method that supports it.

feature: translate embedded '\t' chars to <w:tab> elements

in Run.add_text maybe, or might be better to do it closer to the API level, for add_paragraph('string\totherstring') and Paragraph.add_run('text\tseparated\tby\ttabs').
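The translation step itself can be sketched independently of where it ends up in the API. The function below is hypothetical: it splits a string into a sequence of tokens that a caller could turn into `<w:t>`, `<w:tab/>`, and `<w:br/>` children of a run. Handling "\n" as well as "\t" is an addition beyond what this issue asks for.

```python
import re


def split_run_text(text):
    """Split text into ("t", str), ("tab", None), and ("br", None) tokens."""
    tokens = []
    # The capturing group keeps the separators in the result
    for piece in re.split(r"(\t|\n)", text):
        if piece == "\t":
            tokens.append(("tab", None))
        elif piece == "\n":
            tokens.append(("br", None))
        elif piece:  # skip empty strings between adjacent separators
            tokens.append(("t", piece))
    return tokens


tokens = split_run_text("text\tseparated\tby\ttabs")
```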