Comments (43)

cez81 avatar cez81 commented on June 26, 2024 6

Ok got it working now! Changed it to:

if isinstance(parent, Document):
    parent_elm = parent.element.body

Thanks for the help both of you!

from python-docx.

cyrillkuettel avatar cyrillkuettel commented on June 26, 2024 5

The imports are tricky to get right, so here you go. This should work for the latest version.

from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
                    
doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)

scanny avatar scanny commented on June 26, 2024 3

This workaround should work for anyone who can't wait for the Document.iter_block_items() feature to be implemented. I haven't tested it, so please provide feedback if it gives any trouble or you get it to work.

It can accept either a document or a table cell for its parent argument.

from docx.api import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text import CT_P
from docx.table import _Cell, Table
from docx.text import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child)
        elif isinstance(child, CT_Tbl):
            yield Table(child)

nguyen1110935 avatar nguyen1110935 commented on June 26, 2024 3

Hi @scanny

Your iter_block_items() parses paragraphs and tables in a docx file.
I'm a beginner in Python.
Could you please update it to parse all headings (Heading 1, Heading 2, etc.), paragraphs, and tables?

Thank you.

scanny avatar scanny commented on June 26, 2024 2

An updated snippet that should do the trick and is consistent with the latest internals would look like this. I haven't had time to test it, so if it gives you trouble let me know and I'll help fix :)

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

aistellar avatar aistellar commented on June 26, 2024 2

Nested tables should be easy to handle with recursion:

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)

scanny avatar scanny commented on June 26, 2024 2

Not related so not a good use of this thread.

That topic comes up from time to time, search should be your first stop. Google knows many things about python-docx :)

scanny avatar scanny commented on June 26, 2024 1

Well, a solution for the general case would yield a proxy object (e.g. Paragraph, Table) for each element encountered so the developer could operate on the object without having to go down to the XML level. This gets a little tricky because there is a surprisingly large array of types that can appear within a block context or inline context, and not nearly all of them have proxy objects yet: the <w:del> and <w:ins> elements that have to do with revision tracking, for example.

One solution would be to return a proxy object when you could and then a generic NotImplementedObject or something when no suitable proxy class existed for the item.

Note also that there are two main contexts one might want to iterate over, a block context and an inline context. An element like the <w:body> element of a document part contains block-level objects like Paragraph and Table. A Paragraph itself is an inline context and contains things like Run, and inline pictures, hyperlinks, etc.

This issue was originally about block-level items, but a corresponding method for iterating over inline objects would also be handy.
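
To make the "proxy when you can, placeholder otherwise" idea concrete, here is a minimal sketch that works on the raw XML with only the standard library. It is not python-docx code: the names `iter_block_kinds` and `KNOWN_BLOCKS` and the "not-implemented" label are made up for illustration, and string labels stand in for real proxy objects.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping from block-level tag to a label. A real implementation
# would return proxy objects (Paragraph, Table) instead of strings.
KNOWN_BLOCKS = {
    f"{{{W_NS}}}p": "paragraph",
    f"{{{W_NS}}}tbl": "table",
}


def iter_block_kinds(body):
    """Yield (kind, element) for each direct child of a <w:body> element, in document order."""
    for child in body:
        # Anything without a known proxy gets a generic placeholder label.
        yield KNOWN_BLOCKS.get(child.tag, "not-implemented"), child


# Tiny hand-written body: paragraph, table, an element with no proxy, paragraph.
SAMPLE = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p/>
    <w:tbl/>
    <w:sdt/>
    <w:p/>
  </w:body>
</w:document>
"""

body = ET.fromstring(SAMPLE).find(f"{{{W_NS}}}body")
kinds = [kind for kind, _ in iter_block_kinds(body)]
print(kinds)
```

The same loop shape would apply to an inline context by iterating the children of a <w:p> element instead of <w:body>.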

igorsavinkin avatar igorsavinkin commented on June 26, 2024 1

scanny avatar scanny commented on June 26, 2024 1

@LucianoMan Something like this should do the trick:

from typing import Iterator

from docx.table import _Cell, _Row

def iter_visible_row_cells(row: _Row) -> Iterator[_Cell]:
    """Generate only "concrete" cells, those with a `tc` element.

    Vertically spanned cells have a `tc` element but are skipped.
    """
    yield from (_Cell(tc, row) for tc in row._tr.tc_lst if tc.vMerge != "continue")

pmagsino avatar pmagsino commented on June 26, 2024

This is a feature that would be useful for data mining. In my use case, the primary data to be extracted is within tables. The related secondary data comes from paragraphs that either precede or straddle the table.

pmagsino avatar pmagsino commented on June 26, 2024

Awesome. The workaround works great for my use case. Thanks again.

scanny avatar scanny commented on June 26, 2024

Glad to hear it Paul :)

I'll leave this issue open as the feature request.

 avatar commented on June 26, 2024

Had to make these changes to the code to get the function to work

if isinstance(child, CT_P):
    yield Paragraph(child,parent_elm)
elif isinstance(child, CT_Tbl):
    yield Table(child,parent_elm)

scanny avatar scanny commented on June 26, 2024

I think None would probably be better than parent_elm. The parent parameter which was added to the Paragraph and Table constructors since this issue opened expects the parent proxy object like _Body or (table)_Cell, not the lxml parent element (e.g. <w:body>).

These are only used when an upward reference is required, such as when inserting a picture, so depending on the use case, using None might work well enough to get the job done. Using parent and making sure it was a reference to _Body or _Cell would be better.

In any case, this hack is due for a proper solution once I can get back to it. Been very busy on python-pptx just lately getting chart functionality going there :)

UPDATE:
On later reflection, it became clear the new parameter should simply be parent as provided as an original call argument to iter_block_items. An updated full version is a couple comments down.

 avatar commented on June 26, 2024

Thanks for the help. I am also interested to know how you would go about this function, more specifically how you would want to handle inline images, charts, and mathematical equations when they come up in the text. I am thinking of just returning the XML in the case of charts or equations, and returning the image when there is an image in the run.

danmilon avatar danmilon commented on June 26, 2024

@scanny, yes, this works perfectly. Do you want me to wrap this up, and send in a PR?

scanny avatar scanny commented on June 26, 2024

The tests will be the key outstanding components for this one. If you want to take a crack at it, by all means :)

cez81 avatar cez81 commented on June 26, 2024

Is there a way of doing this after the recent changes to docx.Document?

scanny avatar scanny commented on June 26, 2024

Not yet; this feature is still in the backlog. The last release focused on styles support.

scanny avatar scanny commented on June 26, 2024

Oh, I think I misinterpreted your question. Some of the imports have to change due to recent refactoring:

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

Is that what you were asking @cez81 ?

I've updated the example.

cez81 avatar cez81 commented on June 26, 2024

Yes I think it was. Unfortunately I can't get it to work tho... I get an error creating the Document instance:
"TypeError: __init__() missing 1 required positional argument: 'part'". I'm guessing it has to do with the first line importing the wrong Document class?

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


doc = Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

scanny avatar scanny commented on June 26, 2024

Ah, right. If you do this that should fix your case where you have them both in the same module and need to use both the docx.document.Document class and the docx.Document factory function:

import docx

doc = docx.Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

pdelsante avatar pdelsante commented on June 26, 2024

Hi, I think @cez81 is right: there seems to be something more that changed in your code lately. To make your example work again with 0.8.5 I had to change it like this:

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

In particular, I had to change the following line:

        parent_elm = parent._document_part.body._body

to this:

        parent_elm = parent.element

scanny avatar scanny commented on June 26, 2024

Ah, yes, I see what you're saying. The way to get a reference to the <w:body> element changed as well. I think what you want is this though:

if isinstance(parent, Document):
    parent_elm = parent.element._body

... because parent.element is the <w:document> element if I'm reading the code correctly.

Apologies I don't have time to test this right now, but hope that helps. Such are the wages of workaround functions because they rely on internals that aren't guaranteed to be stable between releases.
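
Since the attribute path to the body element has moved between releases, one defensive option is to try the known paths in turn. This is only a sketch: `resolve_body_element` is a made-up helper name, the two paths are the ones mentioned in this thread (`element.body` and `_document_part.body._body`), and the stand-in objects below are plain namespaces rather than real Document instances.

```python
from types import SimpleNamespace


def resolve_body_element(document):
    """Return the body element, trying the attribute paths seen in this thread."""
    candidates = (
        lambda d: d.element.body,               # path reported to work in later releases
        lambda d: d._document_part.body._body,  # path used by the original workaround
    )
    for getter in candidates:
        try:
            return getter(document)
        except AttributeError:
            continue  # this path does not exist on this version, try the next one
    raise ValueError("could not locate the body element on this Document object")


# Stand-ins that only mimic the attribute layout, for demonstration.
new_style = SimpleNamespace(element=SimpleNamespace(body="new-body"))
old_style = SimpleNamespace(
    _document_part=SimpleNamespace(body=SimpleNamespace(_body="old-body"))
)

print(resolve_body_element(new_style))
print(resolve_body_element(old_style))
```

This does not make the workaround stable, it just fails with a clearer error when none of the known internals are present.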

emishaikh avatar emishaikh commented on June 26, 2024

In my case, I have double-column text in my paragraph. I want to extract that as well as any image in the paragraph.
If I use

for child in parent_elm.iterchildren():
    if isinstance(child, CT_P):
        yield Paragraph(child, parent)
    elif isinstance(child, CT_Tbl):
        yield Table(child, parent)

it treats the double-column text as an ordinary paragraph. Please help me detect double-column text in my paragraph as well.

igorsavinkin avatar igorsavinkin commented on June 26, 2024

How do I then get the text from the yielded Table object?

alfiyafaisy avatar alfiyafaisy commented on June 26, 2024

Hi @scanny ,
I tried your code, but I'm facing a problem in the section shown below.

for block in iter_block_items(doc):
    print(block.text)

Here I'm getting the error "AttributeError: 'Table' object has no attribute 'text'".
Can anyone please help me solve this issue?
Thanks in advance.

alfiyafaisy avatar alfiyafaisy commented on June 26, 2024

How do I then get the text from the yielded Table object?

Hey @igorsavinkin , this worked for me.

for block in iter_block_items(doc):
    if isinstance(block, Table):
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))

romran avatar romran commented on June 26, 2024

Hello everyone,
It helped me a lot when parsing docx files.
Could you tell me whether it is possible to refactor this function to also find InlineShapes?

Slowhalfframe avatar Slowhalfframe commented on June 26, 2024

Hello everyone.
I have a question: how can I use Python to read the pictures in a Word document in order?

lxj0276 avatar lxj0276 commented on June 26, 2024

I want to know how to read pictures or charts with a function like "iter_block_items".

Slowhalfframe avatar Slowhalfframe commented on June 26, 2024

I want to know how to read pictures or charts with a function like "iter_block_items".

# Imports assumed; the original comment did not show them.
from xml.dom.minidom import parseString

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, _Row, Table
from docx.text.paragraph import Paragraph

def read_item_block(parent):
    '''
    Read the contents of a Word document in order.
    :param parent: the document
    :return: p/t
    '''
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            count = 1
            count_flase = 0
            res = Paragraph(child, parent)
            if res.text != '':
                yield (res, count_flase)
            else:
                try:
                    # Try to fetch inline elements
                    DOMTree = parseString(child.xml)
                    data = DOMTree.documentElement
                    nodelist = data.getElementsByTagName('pic:blipFill')
                    print('*nodelist' * 9, nodelist)
                    if len(nodelist) < 1:
                        yield (res, count_flase)
                    else:
                        yield (res, count)
                except Exception as e:
                    print('\n' * 9, e)
                    yield (res, count_flase)
        elif isinstance(child, CT_Tbl):
            yield (Table(child, parent),)

This is how I read pictures.

devanshugupta avatar devanshugupta commented on June 26, 2024

Having this error in your code:

Traceback (most recent call last):
  File "C:/Users/home/PycharmProjects/Sentiment_analysis/yup.py", line 46, in <module>
    for cell in row.cells:
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 401, in cells
    return tuple(self.table.row_cells(self._index))
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 106, in row_cells
    return self._cells[start:end]
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 173, in _cells
    cells.append(cells[-col_count])
IndexError: list index out of range

Labyrins avatar Labyrins commented on June 26, 2024

Ok got it working now! Changed it to:

if isinstance(parent, Document):
    parent_elm = parent.element.body

Thanks for the help both of you!

Thank you. This works for me!

div1996 avatar div1996 commented on June 26, 2024

How can I read paragraphs, tables, and shapes all in one place? Kindly help ASAP.

ejaca avatar ejaca commented on June 26, 2024

Hi admin,

Do you have any idea what I can change in the code?

This is the current code I have for iterating over tables and paragraphs:

def iterate_tables_and_paragraphs(
    parent: Union[DocxDocument, _Cell]
) -> Iterator[Union[DocxParagraph, DocxTable]]:
    if isinstance(parent, DocxDocument):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("Invalid type parameter, expected DocxDocument or _Cell")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield DocxParagraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield DocxTable(child, parent)

There were no problems in running the code but the output was incorrect.

The problem is that my document has two tables, one directly above the other, with different numbers of columns. Table 1 has 18 columns and table 2 has 20, but this code sets the number of columns to the maximum, 20. As a result, reading table 1 produced incorrect results: it looped 20 times per row, so some data from the next row was included as table headers.

Please help. Thanks.

abubelinha avatar abubelinha commented on June 26, 2024

@cyrillkuettel Thanks a lot for sharing!

@abubelinha

scanny avatar scanny commented on June 26, 2024

Added BlockItemContainer.iter_inner_content() in v.1.0.2. Document, Header, Footer, and (table) _Cell are all block-item containers. The behavior is to generate Paragraph | Table in document-order from within that container. Contrast with Section.iter_inner_content() which does the same but only within a single section.

It is not recursive, so you'll need to take care of that aspect if you want it (not everyone will hence why it's not implemented here).

Maybe something like:

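from __future__ import annotations

from typing import Iterator

from docx.blkcntnr import BlockItemContainer
from docx.table import Table
from docx.text.paragraph import Paragraph

def recursively_iter_block_items(blkcntnr: BlockItemContainer) -> Iterator[Paragraph | Table]:
    for item in blkcntnr.iter_inner_content():
        if isinstance(item, Paragraph):
            yield item
        elif isinstance(item, Table):
            for row in item.rows:
                for cell in row.cells:
                    yield from recursively_iter_block_items(cell)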

LucianoMan avatar LucianoMan commented on June 26, 2024

@cyrillkuettel's snippet above (the recursive iter_block_items with the full import list) works wonderfully. However, it seems to repeat text that is in tables. Does anyone know how to stop the code from doing this? I attempted to use sets, but that also got rid of repeated text that I need.
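
One likely cause, assuming the tables contain merged cells: a merged cell can show up once per grid position it spans when iterating `row.cells`, so its text gets printed several times. A set of strings also drops legitimately repeated text, so a sketch like the following deduplicates on the identity of the underlying cell element instead. `iter_unique_cells` is a made-up helper, it assumes each cell exposes its XML element as `_tc` (as the snippets in this thread do), and the stand-in objects only mimic that attribute.

```python
from types import SimpleNamespace


def iter_unique_cells(table):
    """Yield each distinct cell of a table-like object once, in row order.

    Two cells count as the same cell only when they wrap the very same
    `_tc` object, so equal text in different cells is still kept.
    """
    seen = []  # hold references so the identity checks stay valid
    for row in table.rows:
        for cell in row.cells:
            if any(cell._tc is tc for tc in seen):
                continue  # same underlying element, already yielded
            seen.append(cell._tc)
            yield cell


# Stand-ins: a merged cell reported twice, plus two plain cells with equal text.
tc_merged, tc_x, tc_y = object(), object(), object()
fake_table = SimpleNamespace(
    rows=[
        SimpleNamespace(
            cells=[
                SimpleNamespace(_tc=tc_merged, text="merged"),
                SimpleNamespace(_tc=tc_merged, text="merged"),
                SimpleNamespace(_tc=tc_x, text="same"),
                SimpleNamespace(_tc=tc_y, text="same"),
            ]
        )
    ]
)

texts = [cell.text for cell in iter_unique_cells(fake_table)]
print(texts)
```

In the recursive function, the inner `for row ... for cell ...` loops would be replaced by `for cell in iter_unique_cells(table)`.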

LucianoMan avatar LucianoMan commented on June 26, 2024

@scanny would you happen to know how to append one word document to another?

LucianoMan avatar LucianoMan commented on June 26, 2024

I apologize for the off-topic question, but my friend and I looked everywhere and could not find anything useful. If you know of a link, it would be appreciated :'(.

cyrillkuettel avatar cyrillkuettel commented on June 26, 2024

If you have pandoc you can run

pandoc -s document1.docx document2.docx  -o merged.docx

This can work for simple cases.
