Comments (43)
Ok got it working now! Changed it to:
if isinstance(parent, Document):
    parent_elm = parent.element.body
Thanks for the help both of you!
from python-docx.
The imports are tricky to get right, so here you go. This should work for the latest version.
from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)

doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)
from python-docx.
This workaround should work for anyone who can't wait for the Document.iter_block_items() feature to be implemented. I haven't tested it, so please provide feedback if it gives any trouble or you get it to work.
It can accept either a document or a table cell for its parent argument.
from docx.api import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text import CT_P
from docx.table import _Cell, Table
from docx.text import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child)
        elif isinstance(child, CT_Tbl):
            yield Table(child)
from python-docx.
Hi @scanny
Your iter_block_items() parses the paragraphs and tables in a docx file.
I'm a beginner in Python.
Could you please update it to also parse headings (Heading 1, Heading 2, etc.), along with paragraphs and tables?
Thank you.
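For the heading request above: in python-docx a heading is an ordinary Paragraph whose style is one of the heading styles, so the blocks yielded by iter_block_items() already include headings. A minimal sketch for telling them apart, assuming the document uses the built-in English style names ("Heading 1", "Heading 2", ...); heading_level is a hypothetical helper name, not part of the library:

```python
import re

# Assumption: built-in English heading style names such as "Heading 1".
_HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(paragraph):
    """Return the heading level (1, 2, ...) of *paragraph*, or None.

    Only the `style.name` attribute of the paragraph is inspected, so any
    object exposing that attribute works.
    """
    style = getattr(paragraph, "style", None)
    name = getattr(style, "name", None) or ""
    match = _HEADING_RE.match(name)
    return int(match.group(1)) if match else None
```

While looping over the blocks, heading_level(block) can be called on each Paragraph; a result of None means it is body text rather than a heading. Localized or custom style names would need a different pattern.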
from python-docx.
An updated snippet that should do the trick and is consistent with the latest internals would look like this. I haven't had time to test it, so if it gives you trouble let me know and I'll help fix :)
from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
from python-docx.
nested tables should be easy to handle with recursion
def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
from python-docx.
Not related so not a good use of this thread.
That topic comes up from time to time, so search should be your first stop. Google knows many things about python-docx :)
from python-docx.
Well, a solution for the general case would yield a proxy object (e.g. Paragraph, Table) for each element encountered so the developer could operate on the object without having to go down to the XML level. This gets a little tricky because there is a surprisingly large array of types that can possibly appear within a block context or inline context, and not nearly all of them have proxy objects yet. Things like the <w:del> and <w:ins> elements that have to do with revision tracking, for example.
One solution would be to return a proxy object when you could and then a generic NotImplementedObject or something when no suitable proxy class existed for the item.
Note also that there are two main contexts one might want to iterate over, a block context and an inline context. An element like the <w:body> element of a document part contains block-level objects like Paragraph and Table. A Paragraph itself is an inline context and contains things like Run, inline pictures, hyperlinks, etc.
This issue was originally about block-level items, but a corresponding method for iterating over inline objects would also be handy.
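The "proxy when available, generic placeholder otherwise" idea can be sketched without python-docx at all. The following uses only the standard library to walk the children of a <w:body> element and wraps each one in a proxy when a class is registered for its tag, or in a placeholder when none is. ParagraphProxy, TableProxy, UnsupportedElement, and iter_block_proxies are hypothetical names for illustration, not python-docx classes, and the sample XML is heavily abbreviated:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class ParagraphProxy:
    def __init__(self, element):
        self.element = element

    @property
    def text(self):
        # Concatenate every <w:t> descendant of the paragraph.
        return "".join(t.text or "" for t in self.element.iter(f"{{{W_NS}}}t"))


class TableProxy:
    def __init__(self, element):
        self.element = element


class UnsupportedElement:
    """Placeholder for elements that have no proxy class yet."""

    def __init__(self, element):
        self.element = element
        self.tag = element.tag.split("}")[-1]  # local name, e.g. "sectPr"


# Registry of tags that have a dedicated proxy class.
_PROXY_CLASSES = {
    f"{{{W_NS}}}p": ParagraphProxy,
    f"{{{W_NS}}}tbl": TableProxy,
}


def iter_block_proxies(body):
    """Yield one proxy per direct child of *body*, in document order."""
    for child in body:
        proxy_cls = _PROXY_CLASSES.get(child.tag, UnsupportedElement)
        yield proxy_cls(child)


SAMPLE = (
    f'<w:body xmlns:w="{W_NS}">'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:tbl/>"
    "<w:sectPr/>"
    "</w:body>"
)

kinds = [type(p).__name__ for p in iter_block_proxies(ET.fromstring(SAMPLE))]
```

The caller never has to special-case unknown tags: anything without a registered proxy still shows up in document order and can be inspected through its element attribute or skipped.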
from python-docx.
@LucianoMan Something like this should do the trick:
def iter_visible_row_cells(row: Row) -> Iterator[_Cell]:
    """Generate only "concrete" cells, those with a `tc` element.
    Vertically spanned cells have a `tc` element but are skipped.
    """
    yield from (_Cell(tc, row) for tc in row._tr.tc_lst if tc.vMerge != "continue")
from python-docx.
This is a feature that would be useful for data mining. In my use case, the primary data to be extracted is within tables. The related secondary data comes from paragraphs that either precede or straddle the table.
from python-docx.
Awesome. The workaround works great for my use case. Thanks again.
from python-docx.
Glad to hear it Paul :)
I'll leave this issue open as the feature request.
from python-docx.
Had to make these changes to the code to get the function to work
if isinstance(child, CT_P):
    yield Paragraph(child, parent_elm)
elif isinstance(child, CT_Tbl):
    yield Table(child, parent_elm)
from python-docx.
I think None would probably be better than parent_elm. The parent parameter, which was added to the Paragraph and Table constructors since this issue opened, expects the parent proxy object like _Body or (table) _Cell, not the lxml parent element (e.g. <w:body>).
These are only used when an upward reference is required, such as when inserting a picture, so depending on the use case, using None might work well enough to get the job done. Using parent and making sure it was a reference to _Body or _Cell would be better.
In any case, this hack is due for a proper solution once I can get back to it. Been very busy on python-pptx just lately getting chart functionality going there :)
UPDATE:
On later reflection, it became clear the new parameter should simply be parent, as provided as an original call argument to iter_block_items. An updated full version is a couple comments down.
from python-docx.
Thanks for the help. I am also interested to know how you would go about this function, and more specifically how you would want to handle inline images, charts, and mathematical equations when they appear in the text. I am thinking of just returning the XML in the case of charts or equations, and returning the image when there is an image in the run.
from python-docx.
@scanny, yes, this works perfectly. Do you want me to wrap this up, and send in a PR?
from python-docx.
The tests will be the key outstanding components for this one. If you want to take a crack at it, by all means :)
from python-docx.
Is there a way of doing this after the recent changes to docx.Document?
from python-docx.
Not yet; this feature is still in the backlog. The last release focused on styles support.
from python-docx.
Oh, I think I misinterpreted your question. Some of the imports have to change due to recent refactoring:
from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph
Is that what you were asking @cez81 ?
I've updated the example.
from python-docx.
Yes I think it was. Unfortunately I can't get it to work tho... I get an error creating the Document instance:
"TypeError: __init__() missing 1 required positional argument: 'part'". I'm guessing it has to do with the first line importing the wrong Document class?
from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

doc = Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)
from python-docx.
Ah, right. If you do this that should fix your case where you have them both in the same module and need to use both the docx.document.Document class and the docx.Document factory function:
import docx

doc = docx.Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)
from python-docx.
Hi, I think @cez81 is right: there seems to be something more that changed in your code lately. To make your example work again with 0.8.5 I had to change it like this:
from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
In particular, I had to change the following line:
parent_elm = parent._document_part.body._body
to this:
parent_elm = parent.element
from python-docx.
Ah, yes, I see what you're saying. The way to get a reference to the <w:body> element changed as well. I think what you want is this though:
if isinstance(parent, Document):
    parent_elm = parent.element._body
... because parent.element is the <w:document> element if I'm reading the code correctly.
Apologies I don't have time to test this right now, but hope that helps. Such are the wages of workaround functions because they rely on internals that aren't guaranteed to be stable between releases.
from python-docx.
In my case, I have a two-column layout in my paragraph. I want to extract that as well, along with any image in the paragraph.
If I use
for child in parent_elm.iterchildren():
    if isinstance(child, CT_P):
        yield Paragraph(child, parent)
    elif isinstance(child, CT_Tbl):
        yield Table(child, parent)
it treats the two-column text as a single paragraph. Please help me detect the two-column layout in my paragraph as well.
from python-docx.
how to get text then from yielded Table object?
from python-docx.
Hi @scanny ,
I tried your code, but I'm facing a problem in the section shown below.
for block in iter_block_items(doc):
    print(block.text)
Here I'm getting the error "AttributeError: 'Table' object has no attribute 'text'".
Can anyone please help me to solve this issue?
Thanks in advance.
from python-docx.
how to get text then from yielded Table object?
Hey @igorsavinkin , this worked for me.
for block in iter_block_items(doc):
    if isinstance(block, Table):
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))
from python-docx.
Hello everyone,
It helped me a lot when parsing docx files.
Could you tell me whether it is possible to refactor this function so that it also finds InlineShapes?
from python-docx.
Hello everyone.
I have a question: how can I use Python to read the pictures in a Word document in order?
from python-docx.
I want to know how to read pictures or charts like function "iter_block_items"
from python-docx.
I want to know how to read pictures or charts like function "iter_block_items"
def read_item_block(parent):
    '''
    Read the Word content in document order.
    :param parent: the document
    :return: paragraph/table
    '''
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            count = 1
            count_flase = 0
            res = Paragraph(child, parent)
            if res.text != '':
                yield (res, count_flase)
            else:
                try:
                    # Try to get the inline elements
                    from xml.dom.minidom import parseString
                    DOMTree = parseString(child.xml)
                    data = DOMTree.documentElement
                    nodelist = data.getElementsByTagName('pic:blipFill')
                    print('*' * 9, nodelist)
                    if len(nodelist) < 1:
                        yield (res, count_flase)
                    else:
                        yield (res, count)
                except Exception as e:
                    print('*' * 9, e)
                    yield (res, count_flase)
        elif isinstance(child, CT_Tbl):
            yield (Table(child, parent),)
This is how I read pictures.
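A somewhat lighter way to detect inline pictures than looking up 'pic:blipFill' by prefixed name is to search for <a:blip> elements, whose r:embed attribute holds the relationship id of the image. A minimal sketch using only the standard library on a paragraph's XML string; embedded_image_rids is a hypothetical helper name, and the sample XML is abbreviated (a real drawing has several more wrapper elements around the blip):

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def embedded_image_rids(paragraph_xml):
    """Return the r:embed relationship ids of all pictures in the paragraph."""
    root = ET.fromstring(paragraph_xml)
    rids = []
    # iter() searches descendants at any depth, so the wrapper elements
    # between <w:drawing> and <a:blip> do not need to be spelled out.
    for blip in root.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid is not None:
            rids.append(rid)
    return rids


# Abbreviated sample paragraph, for illustration only.
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    '<w:r><w:drawing><a:blip r:embed="rId5"/></w:drawing></w:r>'
    "</w:p>"
)

rids = embedded_image_rids(SAMPLE)
```

In the function above, the input would be child.xml. I believe the image bytes can then be reached through the document part's related parts keyed by that relationship id, but treat that as an assumption to check against your installed version.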
from python-docx.
Having this error in your code:
Traceback (most recent call last):
File "C:/Users/home/PycharmProjects/Sentiment_analysis/yup.py", line 46, in
for cell in row.cells:
File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 401, in cells
return tuple(self.table.row_cells(self._index))
File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 106, in row_cells
return self._cells[start:end]
File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 173, in _cells
cells.append(cells[-col_count])
IndexError: list index out of range
from python-docx.
Ok got it working now! Changed it to:
if isinstance(parent, Document):
    parent_elm = parent.element.body
Thanks for the help both of you!
Thank you. This works for me!
from python-docx.
How can I read paragraphs, tables, and shapes all in one place? Kindly help ASAP.
from python-docx.
Hi admin,
Do you have any idea what can I change in the code?
This is the current code I have in iterating tables and paragraphs:
def iterate_tables_and_paragraphs(
    parent: Union[DocxDocument, _Cell]
) -> Iterator[Union[DocxParagraph, DocxTable]]:
    if isinstance(parent, DocxDocument):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("Invalid type parameter, expected DocxDocument or _Cell")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield DocxParagraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield DocxTable(child, parent)
There were no problems in running the code but the output was incorrect.
The problem is that my document has two tables, one on top of the other, with different numbers of columns: table 1 has 18 columns and table 2 has 20. This code sets the number of columns to the maximum, 20, so when I read the data, table 1 produced incorrect results: it looped 20 times, and some data from the next row was included as table headers.
Please help. Thanks.
from python-docx.
@cyrillkuettel Thanks a lot for sharing!
from python-docx.
Added BlockItemContainer.iter_inner_content() in v.1.0.2. Document, Header, Footer, and (table) _Cell are all block-item containers. The behavior is to generate Paragraph | Table in document-order from within that container. Contrast with Section.iter_inner_content() which does the same but only within a single section.
It is not recursive, so you'll need to take care of that aspect if you want it (not everyone will, which is why it's not implemented here).
Maybe something like:
def recursively_iter_block_items(blkcntnr: BlockItemContainer) -> Iterator[Paragraph | Table]:
    for item in blkcntnr.iter_inner_content():
        if isinstance(item, Paragraph):
            yield item
        elif isinstance(item, Table):
            for row in item.rows:
                for cell in row.cells:
                    yield from recursively_iter_block_items(cell)
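For code that has to run against both old and new releases, one option is to prefer the public method and only fall back to the internals-based workaround from earlier in this thread when it is missing. A minimal sketch; iter_blocks is a hypothetical name, and the fallback branch simply repeats the workaround with the imports posted above, which may not match every old release:

```python
def iter_blocks(container):
    """Yield Paragraph and Table objects from *container* in document order.

    Prefers the public iter_inner_content() method and falls back to the
    internals-based workaround only when that method is not available.
    """
    public_iter = getattr(container, "iter_inner_content", None)
    if callable(public_iter):
        yield from public_iter()
        return

    # Fallback for older releases. Imported lazily so that the public path
    # does not depend on these (unstable) module locations.
    from docx.document import Document
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import Table, _Cell
    from docx.text.paragraph import Paragraph

    if isinstance(container, Document):
        parent_elm = container.element.body
    elif isinstance(container, _Cell):
        parent_elm = container._tc
    else:
        raise ValueError("expected a Document or _Cell")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, container)
        elif isinstance(child, CT_Tbl):
            yield Table(child, container)
```

The feature check is done with getattr rather than a version comparison, so the function keeps working if the method is backported or the version string changes format.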
from python-docx.
The imports are tricky to get right, so here you go. This should work for the latest version.
from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)

doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)
This works wonderfully; however, it seems to repeat text that is in tables. Does anyone know how to stop the code from doing this? I attempted to use sets, but that got rid of repeated text that I need.
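The repeated text most likely comes from merged cells: as far as I can tell, row.cells reports a merged cell once for every grid position it covers, so its paragraphs are visited several times. A minimal sketch that skips cells whose underlying <w:tc> element has already been seen, instead of de-duplicating by text; iter_unique_cells is a hypothetical helper, and it relies on the private _tc attribute already used in this thread:

```python
def iter_unique_cells(table):
    """Yield each distinct cell of *table* once, in row-major order.

    Cells are compared by the identity of their underlying element
    (`cell._tc`), so two different cells with the same text are both kept.
    """
    seen = {}  # id(tc) -> tc; holding on to tc keeps its id() stable
    for row in table.rows:
        for cell in row.cells:
            tc = cell._tc
            if id(tc) in seen:
                continue  # same merged cell reported at another position
            seen[id(tc)] = tc
            yield cell
```

In the recursive function above, the two nested loops over table.rows and row.cells would be replaced by a single loop over iter_unique_cells(table). Because the comparison is by element identity rather than by text, legitimately repeated text in different cells is preserved.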
from python-docx.
@scanny would you happen to know how to append one word document to another?
from python-docx.
I apologize for the off topic question but my friend and I looked everywhere and could not find anything useful if you know a link it would be appreciated :'(.
from python-docx.
If you have pandoc you can run
pandoc -s document1.docx document2.docx -o merged.docx
This can work for simple cases.
from python-docx.