![GitHub last commit](https://camo.githubusercontent.com/4bc8a556e45f5ef394f92e7c5c68f0756c9895c929bdb187571bf68064124405/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6173742d636f6d6d69742f737571696e67646f6e672f646f63785f706172736572)
Parse all contents of a docx file with python-docx
python3 -m pip install docx-parser
- paragraph: text paragraph, with style_id
- multipart: paragraph with image or hyperlink
- table: table data with merged_cells
docx_parser --help
# parse image as file
docx_parser tests/demo.docx -D tests/media -o tests/out.file.jl
# parse image as base64 string
docx_parser tests/demo.docx -A base64 -o tests/out.base64.jl
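The `.jl` extension on the output files suggests JSON Lines, meaning one JSON value per line. That is an assumption, since the format is not described here. Under that assumption, a minimal sketch for reading the output back could look like the following. The helper name `read_json_lines` is hypothetical and not part of docx-parser.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Yield one decoded JSON value per non-empty line of `path`.

    Assumes the file is JSON Lines (one JSON value per line), which is
    inferred from the `.jl` extension and not confirmed by the source.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the command above; run the CLI first so the file exists.
    out_path = Path("tests/out.file.jl")
    if out_path.exists():
        for record in read_json_lines(out_path):
            print(record)
```

Reading lazily with a generator keeps memory use flat even when images are embedded as base64 strings, which can make individual lines large.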
from docx_parser import DocumentParser

infile = 'tests/demo.docx'
doc = DocumentParser(infile)
# parse() yields (type, item) pairs
for _type, item in doc.parse():
    print(_type, item)
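Since `doc.parse()` yields `(type, item)` pairs, one way to work with the result is to bucket the items by type. The sketch below assumes only that pair shape. The helper `group_by_type` is hypothetical, and the structure of each `item` is left untouched because it is not documented here.

```python
import os
from collections import defaultdict


def group_by_type(pairs):
    """Collect items from an iterable of (type, item) pairs into lists keyed by type."""
    groups = defaultdict(list)
    for item_type, item in pairs:
        groups[item_type].append(item)
    return dict(groups)


if __name__ == "__main__":
    try:
        from docx_parser import DocumentParser
    except ImportError:
        DocumentParser = None  # package not installed; skip the demo

    infile = "tests/demo.docx"
    if DocumentParser is not None and os.path.exists(infile):
        doc = DocumentParser(infile)
        # Print how many items of each type the document contains.
        for item_type, items in group_by_type(doc.parse()).items():
            print(item_type, len(items))
```

Grouping consumes the whole iterator up front, so for very large documents the plain loop may be preferable.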
- parse text style: color, bgcolor, font, bold, italic ...
- parse paragraph format