simplify-docx's Introduction

Overview

DOCX files are complex, and that complexity makes scraping documents for their content difficult. This package aims to simplify .docx files down to just the components that carry meaning, which eases pattern matching and data extraction by converting a .docx file into a predictable, human-readable JSON file.

Simplifying a complex document down to its meaningful parts of course requires taking a position on what does and does not convey meaning in a document. Generally, this package takes the stance that the document structure (body, paragraphs, tables, etc.) is meaningful, as is the text itself, whereas text styling (font, font-weight, etc.) is ignored almost entirely. The exception is paragraph indentation and numbering, which are often used to create lists, block quotes, etc. The opinions expressed by this package are explained in the Options section below and can be changed to suit your needs.

Usage

import docx
from simplify_docx import simplify

# read in a document 
my_doc = docx.Document("/path/to/my/favorite/file.docx")

# coerce to JSON using the standard options
my_doc_as_json = simplify(my_doc)

# or with non-standard options
my_doc_as_json = simplify(my_doc,{"remove-leading-white-space":False})
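
Once a document has been simplified, the result can be walked like any nested Python structure. The sketch below is not part of the package: iter_text is a hypothetical helper, and the sample dictionary is hand-written. It assumes only that each node carries "TYPE" and "VALUE" keys; the specific type names ("document", "body", "paragraph", "text") are assumptions made for illustration.

```python
from typing import Any, Iterator


def iter_text(node: Any) -> Iterator[str]:
    """Yield every string stored under a "VALUE" key, depth first."""
    if isinstance(node, dict):
        value = node.get("VALUE")
        if isinstance(value, str):
            yield value
        else:
            # VALUE may be a list of child nodes (or missing entirely)
            yield from iter_text(value)
    elif isinstance(node, list):
        for child in node:
            yield from iter_text(child)


# Hand-written stand-in for the output of simplify(); type names are assumed.
sample = {
    "TYPE": "document",
    "VALUE": [
        {
            "TYPE": "body",
            "VALUE": [
                {
                    "TYPE": "paragraph",
                    "VALUE": [
                        {"TYPE": "text", "VALUE": "Hello, "},
                        {"TYPE": "text", "VALUE": "world"},
                    ],
                }
            ],
        }
    ],
}

all_text = list(iter_text(sample))
```

With a real document, passing the return value of simplify(my_doc) in place of sample should work the same way, provided the output follows the TYPE/VALUE shape assumed here.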

Installation

This project relies on the python-docx package which can be installed via pip install python-docx. However, as of this writing, if you wish to scrape documents which contain (A) form fields such as drop down lists, checkboxes and text inputs or (B) nested documents (subdocs, altChunks, etc.), you'll need to clone this fork of the python-docx package.

Options

General

  • "friendly-name": (Default = True): Use user-friendly type names such as "table-cell", over standard element names like "CT_Tc"

  • "merge-consecutive-text": (Default = True): Sentences and even single words can be represented by multiple text elements. If True, concatenate consecutive text elements into a single text element.

Ignoring Invisible things

  • "ignore-empty-paragraphs": (Default = True): Empty paragraphs are often used for styling purposes and rarely have significance in the meaning of the document.
  • "ignore-empty-text": (Default = True): Empty text runs can make an otherwise empty paragraph appear to contain data.
  • "remove-leading-white-space": (Default = True): Leading white-space at the start of a paragraph is occasionally used for styling purposes and rarely has significance in the interpretation of a document.
  • "remove-trailing-white-space": (Default = True): Trailing white-space at the end of a paragraph rarely has significance in the interpretation of a document.
  • "flatten-inner-spaces": (Default = False): Collapse multiple space characters between words to a single space.
  • "ignore-joiners": (Default = False): Zero width joiner and non-joiner characters are special characters used to create ligatures in displayed text and don't typically convey meaning (at least in alphabet based languages).
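
To make the effect of the white-space options concrete, here is a minimal sketch of the kind of clean-up they describe. It is not the package's implementation: tidy_paragraph_text and its keyword arguments are hypothetical names, and the defaults simply mirror the defaults listed above.

```python
import re

# Zero width non-joiner (U+200C) and zero width joiner (U+200D)
JOINERS = {ord("\u200c"): None, ord("\u200d"): None}


def tidy_paragraph_text(
    text: str,
    *,
    remove_leading: bool = True,
    remove_trailing: bool = True,
    flatten_inner: bool = False,
    ignore_joiners: bool = False,
) -> str:
    """Apply white-space clean-up similar to what the options above describe."""
    if ignore_joiners:
        text = text.translate(JOINERS)  # drop joiner characters entirely
    if remove_leading:
        text = text.lstrip()
    if remove_trailing:
        text = text.rstrip()
    if flatten_inner:
        text = re.sub(r" {2,}", " ", text)  # collapse runs of spaces
    return text
```

For example, tidy_paragraph_text("  a   b  ") keeps the inner spacing and returns "a   b", while passing flatten_inner=True returns "a b".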

Special symbols

  • "dumb-quotes": (Default = True): Replace smart quotes with dumb quotes.
  • "dumb-hyphens": (Default = True): Replace en-dash, em-dash, figure-dash, horizontal bar, and non-breaking hyphens with ordinary hyphens.
  • "dumb-spaces": (Default = True): Replace zero width spaces, hair spaces, thin spaces, punctuation spaces, figure spaces, six per em spaces, four per em spaces, three per em spaces, em spaces, en spaces, em quad spaces, and en quad spaces with ordinary spaces.
  • "special-characters-as-text": (Default = True): Coerce special characters into text equivalents according to the following table:
      Character        Text Equivalent
      CarriageReturn   \n
      Break            \r
      TabChar          \t
      PositionalTab    \t
      NoBreakHyphen    -
      SoftHyphen       -
  • "symbol-as-text": (Default = True): Special symbols often carry meaning other than the underlying Unicode character, especially when the font is a special font such as Wingdings. If True, these are included as ordinary text and their font information is omitted.
  • "empty-as-text": (Default = False): There are a variety of "Empty" tags, such as the <w:yearLong> tag, which cause the current year to be inserted into the document text. If True, include these as text formatted as "[yearLong]".
  • "ignore-left-to-right-mark": (Default = False): Ignore the left-to-right mark, which is not writeable by Python's csv writer.
  • "ignore-right-to-left-mark": (Default = False): Ignore the right-to-left mark, which is not writeable by Python's csv writer.
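
The "dumb" replacements described above amount to a character translation table. The following sketch builds one from the characters named in the option descriptions; it is an illustration only, dumb_down is a hypothetical name, and the package itself may cover a different set of code points.

```python
# Smart quotes -> straight quotes
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

# Non-breaking hyphen, figure dash, en dash, em dash, horizontal bar -> "-"
HYPHENS = {ch: "-" for ch in "\u2011\u2012\u2013\u2014\u2015"}

# U+2000 (en quad) through U+200B (zero width space) -> ordinary space
SPACES = {chr(cp): " " for cp in range(0x2000, 0x200C)}

TABLE = str.maketrans({**QUOTES, **HYPHENS, **SPACES})


def dumb_down(text: str) -> str:
    """Replace typographic quotes, dashes and spaces with ASCII equivalents."""
    return text.translate(TABLE)
```

Using a single str.translate call keeps the replacement to one pass over the text, which is why a table is used here rather than a chain of str.replace calls.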

Paragraph style:

Paragraph style markup is one exception to the styling vs. content dichotomy. For example, block quotes are often indicated by indenting whole paragraphs, and ordered lists, unordered lists, and nested lists are often used to divide sections of a document into logical components.

  • "include-paragraph-indent": (Default = True): Include the indentation markup on paragraph (CT_P) elements. Indentation is measured in twips (twentieths of a point).
  • "include-paragraph-numbering": (Default = True): Include the numbering styles, which are included in the CT_P.pPr.numPr element. The ilvl attribute indicates the level of nesting (zero based index) and the numId attribute refers to a specific numbering style included in the document's internal styles sheet.
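
Since indentation is reported in twips, a small conversion helper can make the values easier to reason about. This is a sketch with hypothetical function names; the only facts it relies on are that a twip is one twentieth of a point and that there are 72 points (1440 twips) per inch.

```python
TWIPS_PER_POINT = 20
TWIPS_PER_INCH = 1440  # 72 points per inch * 20 twips per point


def twips_to_points(twips: float) -> float:
    """Convert a length in twips to typographic points."""
    return twips / TWIPS_PER_POINT


def twips_to_inches(twips: float) -> float:
    """Convert a length in twips to inches."""
    return twips / TWIPS_PER_INCH
```

For example, an indent of 720 twips corresponds to 36 points, or half an inch.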

Form Elements

  • "simplify-dropdown": (Default = True): Include just the selected and default values, the available options, and the name and label attributes in the form element.
  • "simplify-textinput": (Default = True): Include just the current and default values, and the name and label attributes in the form element.
  • "greedy-text-input": (Default = True): Continue consuming run elements when the text-input has not ended at the end of a paragraph, and the next block level element is also a paragraph. This typically occurs when the user presses the return key while editing a text input field.
  • "simplify-checkbox": (Default = True): Include just the current and default values, and the name and label attributes in the form element.
  • "use-checkbox-default": (Default = True): If the checkbox has no value attribute (typically because the user has not interacted with it), report the default value as the checkbox value.
  • "checkbox-as-text": (Default = False): Coerce the value of the checkbox to text, represented as either "[CheckBox:True]" or "[CheckBox:False]"
  • "dropdown-as-text": (Default = False): Coerce the value of the drop-down to text, represented as "[DropDown:<selected value>]"
  • "trim-dropdown-options": (Default = True): Remove white-space on the left and right of drop down option items.
  • "flatten-generic-field": (Default = True): generic-fields are CT_FldChar runs which are not marked as a drop-down, text-input, or checkbox. These may include special instructions which apply special formatting to a text run (e.g. a hyperlink). If True, the contents of generic-fields are included in the normal flow of text.

Special content

  • "flatten-hyperlink": (Default = True): Flatten hyperlinks, including their contents in the flow of normal text.
  • "flatten-smartTag": (Default = True): Flatten smartTag elements, including their contents in the flow of normal text.
  • "flatten-customXml": (Default = True): Flatten customXml elements, including their contents in the flow of normal text.
  • "flatten-simpleField": (Default = True): Flatten simpleField elements, including their contents in the flow of normal text.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

simplify-docx's People

Contributors

geoffjukes, jdthorpe, johnthagen, microsoft-github-policy-service[bot], microsoftopensource, msftgits, njwhite

simplify-docx's Issues

include-paragraph-indent with float indent

At least with Google Spreadsheets, the paragraph indent can be a float, which causes a crash:

Traceback (most recent call last):
  File "/Users/yunake/Documents/ТРО/audit_moves/./read_pryznachennya.py", line 8, in <module>
    my_doc_as_json = simplify(my_doc)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/elements/paragraph.py", line 181, in to_json
    out["style"] = {"indent": indentation(_indent).to_json(doc, options)}
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/docx/oxml/xmlchemy.py", line 164, in get_attr_value
    return self._simple_type.from_xml(attr_str_value)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/docx/oxml/simpletypes.py", line 21, in from_xml
    return cls.convert_from_xml(str_value)
  File "/Users/yunake/Documents/ТРО/audit_moves/.venv/lib/python3.10/site-packages/docx/oxml/simpletypes.py", line 335, in convert_from_xml
    return Twips(int(str_value))
ValueError: invalid literal for int() with base 10: '4960.6299212598415'
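
The final frame shows int() being applied directly to the attribute string, which fails for fractional values. The sketch below illustrates a more tolerant parse; it is not a patch to python-docx or simplify-docx, parse_twips is a hypothetical name, and rounding to the nearest twip is an assumption about what a caller would want.

```python
def parse_twips(raw: str) -> int:
    """Parse a twips attribute value, tolerating fractional strings."""
    try:
        return int(raw)
    except ValueError:
        # Values such as '4960.6299212598415' are not valid int literals;
        # go through float and round to the nearest whole twip instead.
        return round(float(raw))
```

With this helper, the value from the traceback above parses to 4961 instead of raising.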

Error parsing document with bullet list

I am trying to use this library to gather the indentation level from a Word document. I get the following error when the Word document includes a bulleted list:

Traceback (most recent call last):
  File "/home/src/main.py", line 9, in <module>
    text_groups = convert_document_to_text_groups(DOCUMENT_NAME, PATH_TO_DOCUMENT)
  File "/home/src/modules/document_converter.py", line 7, in convert_document_to_text_groups
    my_doc_as_json = simplify(document)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/elements/paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "/usr/local/lib/python3.8/site-packages/simplify_docx-0.1.0-py3.8.egg/simplify_docx/utils/paragrapy_style.py", line 56, in get_paragraph_ind
    num_style.pPr is not None and \
AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'

I have nothing in my word document except these two lines.

image

New release?

Please release the latest version (0.1.1) on PyPI. Without the loosening of requirements that has been introduced in this version, it is impossible to add simplify-docx to projects which, for example, rely on a version of lxml greater than 4.3.3.

something wrong with big docx file

Hey, I've run into an issue and can't tell what the problem is. My code works well when my docx file is small, but when I switch to a big file I get the error below:

Traceback (most recent call last):
  File "/home/wuyangjian/demo.py", line 123, in <module>
    extract_docx_to_excel(path)
  File "/home/wuyangjian/demo.py", line 32, in extract_docx_to_excel
    for i in extract_docx(path):
  File "/home/wuyangjian/demo.py", line 9, in extract_docx
    my_doc_as_json = simplify(my_doc)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/paragraph.py", line 142, in to_json
    out: Dict[str, Any] = super(paragraph, self).to_json(doc, options, super_iter)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/paragraph.py", line 27, in to_json
    for elt in run_iterator:
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 61, in __iter__
    for elt in xml_iter(node,
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/iterators/generic.py", line 167, in xml_iter
    for elt in xml_iter(current, handlers.TAGS_TO_NEST[current.tag], _msg):
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/iterators/generic.py", line 156, in xml_iter
    yield handlers.TAGS_TO_YIELD[current.tag](current)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/form.py", line 106, in __init__
    super(fldChar, self).__init__(x)
  File "/home/wuyangjian/miniconda3/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

AttributeError: 'lxml.etree._Element' object has no attribute 'val'

Hi Simplify-Docx team,

I tried to run simplify() on a sample word document and ran into an error: AttributeError: 'lxml.etree._Element' object has no attribute 'val'. I've included a fully reproducible example below, which I ran on Google Colab using Python 3.7.13. Would you be able to help?

Thanks for your help.

Setup

python -m pip install git+https://github.com/jdthorpe/python-docx.git
pip install folium==0.2.1
python -m pip install git+https://github.com/microsoft/Simplify-Docx.git

Reproducible example

import docx
import requests
from simplify_docx import simplify

fpath = "https://www.nidcr.nih.gov/sites/default/files/2017-12/reportable-events-table.docx"
fname = "sample.docx"
with open(fname, "wb") as f:
    f.write(requests.get(fpath).content)

doc = docx.Document(fname)
simplify(doc)

# Error
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-a6ef8e30e39f> in <module>()
----> 1 simplify(doc)

4 frames
/usr/local/lib/python3.7/dist-packages/simplify_docx/elements/table.py in to_json(self, doc, options, super_iter)
     71         _desc = self.fragment.tblPr.find(qn("w:tblDescription"))
     72         if _desc is not None:
---> 73             if (not _desc.val) and options.get("ignore-empty-table-description", True):
     74                 pass
     75             else:

AttributeError: 'lxml.etree._Element' object has no attribute 'val'

AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

When I open my .docx file, which was saved from a .doc file, using python-docx, this error comes up:

C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py:194: UnexpectedElementWarning: Skipping unexpected tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}background
  UnexpectedElementWarning)
C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py:194: UnexpectedElementWarning: Skipping unexpected tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pict
  UnexpectedElementWarning)
Traceback (most recent call last):
  File "E:/_master/硕士论文/data/data_preprocess/temp.py", line 27, in <module>
    db_json = simplify(db)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\paragraph.py", line 142, in to_json
    out: Dict[str, Any] = super(paragraph, self).to_json(doc, options, super_iter)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\paragraph.py", line 27, in to_json
    for elt in run_iterator:
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 62, in __iter__
    self.__iter_name__ if self.__iter_name__ else self.__type__):
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py", line 167, in xml_iter
    for elt in xml_iter(current, handlers.TAGS_TO_NEST[current.tag], _msg):
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\iterators\generic.py", line 156, in xml_iter
    yield handlers.TAGS_TO_YIELD[current.tag](current)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\form.py", line 106, in __init__
    super(fldChar, self).__init__(x)
  File "C:\Users\Luke\Anaconda3\lib\site-packages\simplify_docx\elements\base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

What should I do to solve it?

lxml error

Using lxml version 4.6.3, I get the following error when trying to simplify a document with a bulleted or numbered list:

.../lib/python3.9/site-packages/simplify_docx/__init__.py in simplify(doc, options)
     31     __set_options__(_options)
     32
---> 33     out = document(doc.element).to_json(doc, _options)
     34
     35     if _options.get("friendly-name", True):

.../lib/python3.9/site-packages/simplify_docx/elements/base.py in to_json(self, doc, options, super_iter)
    104         out.update({
    105                 "TYPE": self.__type__,
--> 106                 "VALUE": [ elt.to_json(doc, options) for elt in self],
    107                 })
    108         return out

.../lib/python3.9/site-packages/simplify_docx/elements/base.py in <listcomp>(.0)
    104         out.update({
    105                 "TYPE": self.__type__,
--> 106                 "VALUE": [ elt.to_json(doc, options) for elt in self],
    107                 })
    108         return out

.../lib/python3.9/site-packages/simplify_docx/elements/body.py in to_json(self, doc, options, super_iter)
     23         iter_me = peekable(self)
     24         for elt in iter_me:
---> 25             JSON = elt.to_json(doc, options, iter_me)
     26
     27             if (

.../lib/python3.9/site-packages/simplify_docx/elements/paragraph.py in to_json(self, doc, options, super_iter)
    165
    166         if options.get("include-paragraph-indent", True):
--> 167             _indent = get_paragraph_ind(self.fragment, doc)
    168             if _indent is not None:
    169                 out["style"] = {"indent": indentation(_indent).to_json(doc, options)}

.../lib/python3.9/site-packages/simplify_docx/utils/paragrapy_style.py in get_paragraph_ind(p, doc)
     54     num_style = get_num_style(p, doc)
     55     if num_style is not None and \
---> 56             num_style.pPr is not None and \
     57             num_style.pPr.ind is not None:
     58         return num_style.pPr.ind

AttributeError: 'lxml.etree._Element' object has no attribute 'pPr'
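
The failing guard on lines 55 to 58 of the excerpt assumes that num_style exposes a pPr attribute. Whatever the root cause, a lookup that tolerates objects without that attribute would avoid the crash. The sketch below is an illustration, not the project's fix: safe_numbering_indent is a hypothetical helper, and the objects used to exercise it are stand-ins rather than real python-docx elements.

```python
from types import SimpleNamespace
from typing import Any, Optional


def safe_numbering_indent(num_style: Any) -> Optional[Any]:
    """Return num_style.pPr.ind, or None if any link in the chain is missing."""
    pPr = getattr(num_style, "pPr", None)
    return getattr(pPr, "ind", None)


# Stand-in objects (assumptions for illustration only)
plain_element = object()  # behaves like an element without a pPr attribute
styled = SimpleNamespace(pPr=SimpleNamespace(ind="indent-element"))

missing = safe_numbering_indent(plain_element)
found = safe_numbering_indent(styled)
```

This only suppresses the exception; the numbering-derived indent would then simply be skipped for such paragraphs, which may or may not be acceptable depending on the use case.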

RuntimeError: Unhandled nesting of data fields

I'm trying to read a fairly small Word document and getting the above error. I have installed the recommended version of python-docx.

The document I'm trying to simplify is attached.

This is what I'm doing from Ubuntu 20.04 on WSL:

#!/usr/bin/python3

import docx
from simplify_docx import simplify

doc = docx.Document("Cayley-Test-CDD-v0_1.docx")

doc_json = simplify(doc)

Cayley-Test-CDD-v0_1.docx

AttributeError: 'NoneType' object has no attribute 'abstractNumId'

When I parse a docx file, I get the exception below. How can I fix it? Thanks.

  File "C:/WeiWei/sourcecode/Python/gpt-doc-qa/loader/parse_docx.py", line 216, in main
    doc_org_dict = simplify(doc)
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\elements\base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\elements\base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\elements\body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\elements\paragraph.py", line 167, in to_json
    _indent = get_paragraph_ind(self.fragment, doc)
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\utils\paragrapy_style.py", line 54, in get_paragraph_ind
    num_style = get_num_style(p, doc)
  File "C:\Users\nsnp577\Anaconda3\envs\gpt-doc-qa\lib\site-packages\simplify_docx\utils\paragrapy_style.py", line 28, in get_num_style

AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

All the fldChar tags in my document have a fldCharType:

> grep fldChar -r eee | grep -v fldCharType
(venv) [01:11:24] koom@dev1 /home/koom/docx2knownet [0|1] 

> pushd eee; zip -r ../eee.docx .; popd
updating: docProps/ (stored 0%)
updating: docProps/core.xml (deflated 47%)
updating: docProps/app.xml (deflated 51%)
updating: customXml/ (stored 0%)
updating: customXml/itemProps1.xml (deflated 36%)
updating: customXml/_rels/ (stored 0%)
updating: customXml/_rels/item1.xml.rels (deflated 36%)
updating: customXml/item1.xml (deflated 39%)
updating: [Content_Types].xml (deflated 80%)
updating: _rels/ (stored 0%)
updating: _rels/.rels (deflated 61%)
updating: word/ (stored 0%)
updating: word/header2.xml (deflated 57%)
updating: word/media/ (stored 0%)
updating: word/media/image1.png (stored 0%)
updating: word/media/image2.png (deflated 23%)
updating: word/media/image3.png (deflated 4%)
updating: word/webSettings.xml (deflated 86%)
updating: word/header1.xml (deflated 70%)
updating: word/footnotes.xml (deflated 65%)
updating: word/styles.xml (deflated 90%)
updating: word/document.xml (deflated 94%)
updating: word/theme/ (stored 0%)
updating: word/theme/theme1.xml (deflated 79%)
updating: word/numbering.xml (deflated 95%)
updating: word/endnotes.xml (deflated 64%)
updating: word/fontTable.xml (deflated 80%)
updating: word/settings.xml (deflated 77%)
updating: word/_rels/ (stored 0%)
updating: word/_rels/settings.xml.rels (deflated 36%)
updating: word/_rels/document.xml.rels (deflated 84%)
(venv) [01:11:27] koom@dev1 /home/koom/docx2knownet  

> ./main.py eee.docx
fn: eee.docx
Traceback (most recent call last):
  File "/home/koom/docx2knownet/./main.py", line 52, in <module>
    print(json.dumps(simplify(docx.Document(fn),{"remove-leading-white-space":False}), indent=4))
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/paragraph.py", line 142, in to_json
    out: Dict[str, Any] = super(paragraph, self).to_json(doc, options, super_iter)
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/paragraph.py", line 27, in to_json
    for elt in run_iterator:
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 61, in __iter__
    for elt in xml_iter(node,
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/iterators/generic.py", line 167, in xml_iter
    for elt in xml_iter(current, handlers.TAGS_TO_NEST[current.tag], _msg):
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/iterators/generic.py", line 156, in xml_iter
    yield handlers.TAGS_TO_YIELD[current.tag](current)
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/form.py", line 106, in __init__
    super(fldChar, self).__init__(x)
  File "/home/koom/docx2knownet/venv/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'

the offender:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
    <w:body>
        <w:p w:rsidR="008650CC" w:rsidRPr="00901306" w:rsidRDefault="00B62871" w:rsidP="00901306">
            <w:pPr>
                <w:pStyle w:val="DOCType"/>
            </w:pPr>
            <w:r w:rsidRPr="00901306">
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r w:rsidR="006D26CC" w:rsidRPr="00901306">
                <w:instrText xml:space="preserve"> set DOCnumber "</w:instrText>
            </w:r>
            <w:r w:rsidR="00130235" w:rsidRPr="00901306">
                <w:instrText>Dddd.</w:instrText>
            </w:r>
            <w:r w:rsidR="006D26CC" w:rsidRPr="00901306">
                <w:instrText xml:space="preserve">" </w:instrText>
            </w:r>
            <w:r w:rsidRPr="00901306">
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
            <w:bookmarkStart w:id="0" w:name="DOCnumber"/>
            <w:r w:rsidR="00130235" w:rsidRPr="00901306">
                <w:t>Dddd.</w:t>
            </w:r>
            <w:bookmarkEnd w:id="0"/>
            <w:r w:rsidRPr="00901306">
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>
    </w:body>
</w:document>
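
To double-check the structure of a field like the one above without python-docx, the XML can be inspected with the standard library. The sketch below uses an abbreviated copy of the paragraph (run properties and revision ids dropped) and lists the fldCharType values together with the field instruction. It does not explain the crash; it only confirms what the markup contains and shows that the attribute lives in the w namespace.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Abbreviated version of the paragraph shown above
SNIPPET = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> set DOCnumber "</w:instrText></w:r>
<w:r><w:instrText>Dddd.</w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">" </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>Dddd.</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p></w:body></w:document>"""

root = ET.fromstring(SNIPPET)

# Attribute names are namespace-qualified, so the lookup key is "{ns}fldCharType"
field_types = [
    el.get(f"{{{W}}}fldCharType") for el in root.iter(f"{{{W}}}fldChar")
]

# The field instruction is split across several instrText runs
instruction = "".join(el.text or "" for el in root.iter(f"{{{W}}}instrText"))
```

Here field_types comes out as the begin, separate, end sequence, and the joined instruction is a SET field that assigns "Dddd." to the DOCnumber bookmark.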

'lxml.etree._Element' object has no attribute 'pPr'

I'm trying to read this file, diary.docx, and I'm getting an AttributeError on line 56 of paragrapy_style.py.

I tried printing different variables. Inside get_paragraph_ind, num_style is being returned as NoneType by the function get_num_style even when the p.pPr value is not null, so most probably it is not able to find any sub-element for abstractNumbering.

Will be great if someone can help me resolve this 🙏

I've attached the docx file I'm using and a screenshot of the Error message below:

Screenshot 2020-08-31 at 1 10 00 AM

Loosen Requirements

Hello,

Is it possible to loosen the requirements for this library? Right now we have:

install_requires=[
        "lxml==4.3.3",
        "more-itertools==7.0.0",
        "python-docx==0.8.10",
        "six==1.12.0",
        "wincertstore==0.2",
    ],

Could we do something like:

install_requires=[
        "lxml>=4.3.3,<5",
        "more-itertools==7.0.0",
        "python-docx==0.8.10",
        "six>=1.12.0,<2",
        "wincertstore==0.2",
    ],

ValueError: invalid literal for int() with base 10: '515.3813934326172'

simplify tries to convert a float to an int somewhere.

Traceback (most recent call last):
  File "/home/a/parser.py", line 14, in <module>
    doc_json = simplify(my_doc)
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/__init__.py", line 33, in simplify
    out = document(doc.element).to_json(doc, _options)
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in to_json
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 106, in <listcomp>
    "VALUE": [ elt.to_json(doc, options) for elt in self],
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/elements/body.py", line 25, in to_json
    JSON = elt.to_json(doc, options, iter_me)
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/elements/paragraph.py", line 169, in to_json
    out["style"] = {"indent": indentation(_indent).to_json(doc, options)}
  File "/home/a/.local/lib/python3.10/site-packages/simplify_docx/elements/base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
  File "/home/a/.local/lib/python3.10/site-packages/docx/oxml/xmlchemy.py", line 164, in get_attr_value
    return self._simple_type.from_xml(attr_str_value)
  File "/home/a/.local/lib/python3.10/site-packages/docx/oxml/simpletypes.py", line 21, in from_xml
    return cls.convert_from_xml(str_value)
  File "/home/a/.local/lib/python3.10/site-packages/docx/oxml/simpletypes.py", line 335, in convert_from_xml
    return Twips(int(str_value))
ValueError: invalid literal for int() with base 10: '515.3813934326172'

The document I'm using is not meant to be shared with the public; if it's needed for debugging I'd rather send it privately
