booktype / python-ooxml

Python library for parsing .docx (Office Open XML) files

License: GNU Affero General Public License v3.0

Python 99.96% Shell 0.04%

python-ooxml's Introduction

Booktype

Booktype makes it easier and quicker for authors, companies and organisations to edit and publish books. It imports DOCX or EPUB files, converts them into single-source HTML for online editing and proofreading, and uses CSS Paged Media to produce good-looking output for print, the open web, and almost any ebook reader, in seconds. Booktype facilitates collaborative, agile production across time zones and borders.

Booktype is built on the Django web framework and many great Python libraries.

The Booktype user interface is being translated into many languages by our community of contributors. Your help with development or translation is always welcome!

Installation

Installation instructions for Booktype on GNU/Linux and OS X can be found in the Booktype user manual.

Files for installation using Docker can be found in the Booktype-docker repository.

More information

Check the #booktype hashtag on Twitter, or follow us @Booktypo
Booktype issue tracker
Booktype support forum
Booktype development forum
Booktype documentation forum
Developer documentation for Booktype

How to contribute

Fork the booktype/Booktype repository. Please see GitHub help on forking or use this direct link to fork.
Clone your fork to your local machine.
Create a new local branch.
Run tests and make sure your contribution works correctly.
Create a pull request with details of your new feature, bugfix or other contribution.
Sign and return the contributor agreement paperwork, either for an individual, or an entity such as a company, university or other organisation. This paperwork gives us the right to use your work in Booktype, and makes it clear that you retain ownership of the copyright in your contribution.

Testing

Booktype uses the py.test testing framework with the pytest-django plugin. This makes testing easier and also lets you run ready-made Django (unittest) tests.

To run tests:

Open a terminal and activate the virtual environment (Booktype must be installed).
Change (cd) to the instance root (the folder containing the manage.py and pytest.ini files).
Run the py.test command.
If you want pytest to print test coverage information, you should run py.test --cov-report term-missing --cov=path/to/Booktype. You can read more about coverage here: pytest-cov

License

Booktype is licensed under the GNU AGPL license.

python-ooxml's Issues

Handle character styles properly

We handle paragraph styles and direct styling on elements well, but we do not mark character styles at all. We should check whether a character style is defined on an element and, if so, add the name of the style as a CSS class on that element.
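
A minimal sketch of the idea, using only the standard library rather than this library's own parser. The helper names `character_style_class` and `run_to_span` are hypothetical; the sketch assumes the character style is referenced through `<w:rStyle w:val="..."/>` inside `<w:rPr>`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def character_style_class(run):
    """Return the character style name referenced by a <w:r> element, or None."""
    rstyle = run.find("w:rPr/w:rStyle", NS)
    if rstyle is None:
        return None
    return rstyle.get("{%s}val" % W_NS)


def run_to_span(run):
    """Build a <span> for the run; add the style name as a CSS class if present."""
    span = ET.Element("span")
    style_name = character_style_class(run)
    if style_name:
        span.set("class", style_name)
    span.text = "".join(t.text or "" for t in run.findall("w:t", NS))
    return span


# Hypothetical sample run with a character style called "Emphasis".
SAMPLE = (
    '<w:r xmlns:w="%s"><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr>'
    "<w:t>Hello</w:t></w:r>" % W_NS
)
span = run_to_span(ET.fromstring(SAMPLE))
print(ET.tostring(span, encoding="unicode"))
```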

Table serialization fails

Table serialization fails when None is passed as the root element. We need to find out why the root element is None in the first place. This should not happen, but we also should not fail with an exception when it does.
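
One defensive pattern, sketched with a hypothetical stand-in function (`serialize_table_sketch` is not the library's `serialize_table`, and falling back to a new `<div>` container is an assumption about what a sensible default would be):

```python
import xml.etree.ElementTree as ET


def serialize_table_sketch(rows, root=None):
    """Serialize rows of cell text into a <table>, tolerating a missing root."""
    if root is None:
        # Assumption: create a fallback container instead of raising.
        root = ET.Element("div")
    table = ET.SubElement(root, "table")
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            ET.SubElement(tr, "td").text = cell
    return root


result = serialize_table_sketch([["a", "b"]], root=None)
print(ET.tostring(result, encoding="unicode"))
```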

Add more code samples

Add a couple of sample files. We need samples that show parsing, how to extend the library and how to use the import. That will be enough for now.

Text is sometimes added twice

It seems we sometimes serialise text twice. For multiple <w:r> references we sometimes add the text to both .text and .tail. We should check whether the text has already been added.
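
A sketch of one way to avoid the duplication, assuming an ElementTree-style tree: route every text append through a single helper that writes to exactly one of `.text` or `.tail`. The helper name `append_text` is hypothetical.

```python
import xml.etree.ElementTree as ET


def append_text(parent, text):
    """Append text once: after the last child if there is one, else as parent.text."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


p = ET.Element("p")
append_text(p, "Hello ")
ET.SubElement(p, "b").text = "bold"
append_text(p, " world")
print(ET.tostring(p, encoding="unicode"))
```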

Underline Tags

Underline tags in docx files are missed. The offending lines are parse.py:88 and parse.py:31 on commit 833e658.

The master branch treats the underline tag as having only two possible states, on or off. It does not account for the fact that the underline tag actually contains string values such as 'single', 'double', 'dashed', etc. I have a branch that updates the rpr dictionary accordingly by altering the code around the two lines mentioned above: a 'u' field is added to the dictionary, with a string value representing the type of underlining.

However, I do not know what further implications this will have. Does another part of this project assume that the 'u' field of rpr will either not exist or take a true or false value?
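
A rough sketch of the proposal, not the code from the branch mentioned above. `parse_underline` and `is_underlined` are hypothetical helpers; treating a missing `w:val` the same as `"none"` is an assumption. The second helper shows how callers could accept both the old boolean form and the new string form of `rpr['u']`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_underline(rpr_element):
    """Return the underline type from a <w:rPr> element, or None if not underlined."""
    u = rpr_element.find("{%s}u" % W_NS)
    if u is None:
        return None
    # Assumption: a missing w:val or w:val="none" means "not underlined".
    value = u.get("{%s}val" % W_NS)
    if value is None or value == "none":
        return None
    return value


def is_underlined(rpr):
    """Accept both a boolean and a string value stored under rpr['u']."""
    value = rpr.get("u")
    if isinstance(value, str):
        return value != "none"
    return bool(value)


SAMPLE = '<w:rPr xmlns:w="%s"><w:u w:val="double"/></w:rPr>' % W_NS
rpr = {}
underline = parse_underline(ET.fromstring(SAMPLE))
if underline is not None:
    rpr["u"] = underline
print(rpr, is_underlined(rpr))
```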

Parsing breaks on some drawings elements

The parse_drawing function fails on some documents. The issue seems to be that the drawing element does not contain an <a:blip> element.
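
A sketch of the kind of guard that would avoid the failure, written against the standard library rather than the library's own `parse_drawing`. The helper name `find_image_rid` and the simplified sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def find_image_rid(drawing):
    """Return the r:embed id of the first <a:blip>, or None if there is no blip."""
    blip = drawing.find(".//{%s}blip" % A_NS)
    if blip is None:
        # Drawings such as shapes or charts may carry no embedded image.
        return None
    return blip.get("{%s}embed" % R_NS)


# Simplified, hypothetical drawings: one with an image, one without.
WITH_BLIP = (
    '<w:drawing xmlns:w="%s" xmlns:a="%s" xmlns:r="%s">'
    '<a:graphic><a:graphicData><a:blip r:embed="rId5"/></a:graphicData></a:graphic>'
    "</w:drawing>" % (W_NS, A_NS, R_NS)
)
WITHOUT_BLIP = (
    '<w:drawing xmlns:w="%s" xmlns:a="%s">'
    "<a:graphic><a:graphicData/></a:graphic>"
    "</w:drawing>" % (W_NS, A_NS)
)

print(find_image_rid(ET.fromstring(WITH_BLIP)))
print(find_image_rid(ET.fromstring(WITHOUT_BLIP)))
```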

Parsing in python3 fails due to iteritems() reference.

Traceback (most recent call last):
  File "parser.py", line 16, in <module>
    print(serialize.serialize(dfile.document))
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 1232, in serialize
    return serialize_elements(document, document.elements, options)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 1214, in serialize_elements
    root = _ser(ctx, document, elem, root)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 847, in serialize_table
    _td = _ser(ctx, document, elem, _td, embed=False)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 665, in serialize_paragraph
    if ctx.header.is_header(par, max_font_size, elem, style=style):
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 976, in is_header
    sorted_list_of_sizes = list(collections.OrderedDict(sorted(list_of_sizes.iteritems(), key=lambda t: t[0])))
AttributeError: 'dict' object has no attribute 'iteritems'
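
The failing expression relies on `dict.iteritems()`, which does not exist in Python 3. A minimal sketch of a replacement that runs on both Python 2 and 3 is to call `items()` instead. The sample `list_of_sizes` data below is made up for illustration.

```python
import collections

# Hypothetical data: font size mapped to the number of times it was used.
list_of_sizes = {24: 3, 12: 40, 18: 7}

# Same expression as in the traceback, with iteritems() replaced by items().
sorted_list_of_sizes = list(
    collections.OrderedDict(sorted(list_of_sizes.items(), key=lambda t: t[0]))
)

print(sorted_list_of_sizes)
```

Since only the sorted keys survive the conversion to a list, `sorted(list_of_sizes)` would give the same result in this sketch.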

Release 0.12 version

Increase the version, update documentation and tag the release.

Support for Word smartTags

We need to be able to support the smartTag element. A detailed description of this tag is here: http://www.datypic.com/sc/ooxml/e-w_smartTag-1.html

Here is an example of the XML structure from the document:

    <w:p w14:paraId="700B8014" w14:textId="77777777" w:rsidR="00BA11BE" w:rsidRDefault="00BA11BE" w:rsidP="00AA4ADD">
      <w:pPr>
        <w:pStyle w:val="C1Heading"/>
      </w:pPr>
      <w:bookmarkStart w:id="0" w:name="_Hlk397609529"/>
      <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="country-region">
        <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
          <w:r>
            <w:t>Benin</w:t>
          </w:r>
        </w:smartTag>
      </w:smartTag>
    </w:p>
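
One possible approach, sketched with the standard library: treat `<w:smartTag>` as a transparent wrapper and descend into it when collecting runs. The generator name `iter_runs` is hypothetical, and the sample is a trimmed version of the structure above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def iter_runs(element):
    """Yield <w:r> elements in document order, looking through nested <w:smartTag>."""
    for child in element:
        if child.tag == W + "r":
            yield child
        elif child.tag == W + "smartTag":
            for run in iter_runs(child):
                yield run


SAMPLE = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="country-region">
    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
      <w:r><w:t>Benin</w:t></w:r>
    </w:smartTag>
  </w:smartTag>
</w:p>
""".strip()

paragraph = ET.fromstring(SAMPLE)
texts = [run.findtext(W + "t") for run in iter_runs(paragraph)]
print(texts)
```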

Endnotes relationships are not being parsed

We only parse the document relationships, not the endnotes relationships. We should make it work with both and specify where to look for the content.

Change how fonts scale

We should use a fixed ratio to scale font sizes instead of calculating it as we do now.

Footnotes support only one paragraph

Footnotes support only one paragraph, although they can consist of multiple paragraphs. This is exactly as it is for endnotes. We should parse all the paragraphs and store them together.
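
A sketch of collecting every paragraph of each footnote, using the standard library instead of the library's parser. `parse_footnotes` is a hypothetical name, and the sample markup is a reduced footnotes.xml written for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def parse_footnotes(root):
    """Map each footnote id to a list with the text of all its paragraphs."""
    notes = {}
    for note in root.findall("w:footnote", NS):
        note_id = note.get("{%s}id" % W_NS)
        notes[note_id] = [
            "".join(t.text or "" for t in par.iter("{%s}t" % W_NS))
            for par in note.findall("w:p", NS)
        ]
    return notes


SAMPLE = (
    '<w:footnotes xmlns:w="%s">'
    '<w:footnote w:id="1">'
    "<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>"
    "</w:footnote>"
    "</w:footnotes>" % W_NS
)
notes = parse_footnotes(ET.fromstring(SAMPLE))
print(notes)
```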

Parsing properties will fail because we do not check if we have correct parent

Parsing properties fails because we do not check whether our parent has valid attributes. We assume it has a 'ppr' attribute and try to check whether it contains certain values. Because the attribute does not exist, we fail at this point with an exception.
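
A sketch of the missing check, with hypothetical stand-in classes (`ParagraphLike`, `OtherElement`) and a hypothetical helper `get_ppr_value`. It assumes `ppr` is a dictionary when present, which the issue does not confirm.

```python
class ParagraphLike(object):
    """Stand-in for an element that carries paragraph properties."""

    def __init__(self, ppr):
        self.ppr = ppr


class OtherElement(object):
    """Stand-in for a parent without a 'ppr' attribute."""


def get_ppr_value(parent, key, default=None):
    """Read a paragraph property from parent without assuming 'ppr' exists."""
    ppr = getattr(parent, "ppr", None)
    if not ppr:
        return default
    return ppr.get(key, default)


print(get_ppr_value(ParagraphLike({"jc": "center"}), "jc"))
print(get_ppr_value(OtherElement(), "jc"))
```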

Support for endnotes

We have support for footnotes, but we also need to be able to read endnotes. Just like footnotes, endnotes are stored in a separate file, endnotes.xml. We need to parse it and store the result in a similar way to footnotes.

Parsing the footnotes

We were only parsing endnotes and not the footnotes. We should be able to parse the footnotes and be able to render them correctly.

Option for pretty_print serialization

By default we always use pretty_print=True. There are situations where we do not want that. Add this switch to the list of options and check its value during serialization.

Organize setup.py file

The setup.py and README files need some organising. Write some basic information about the library.

Paragraph and character styling is not serialized correctly

There are situations where it just does not work as expected. Mainly, the problem is when the paragraph has style 1, part of the text has style 2, and the rest of the text inherits whatever is in the paragraph.

Leave class and style definitions for headers during the import

We do not keep class and style definitions for the main chapter header. We do it for subheaders, but not for the main header.

Just keep whatever was defined in the imported chapter as well.

Comment reference is not being parsed

We are parsing commentRangeStart and commentRangeEnd but commentReference (used by older Word) is not being parsed.

We should parse it and create a comment mark for it, without any commented text (for obvious reasons).

Rename library

OOXML seems to be a family of document types, not just the Word file type. Hence, the library's name is misleading.

Used Font size

Hi guys! I ran into a minor error and fixed it by changing int to float. Please review and let me know whether this is correct.

My suggestion is to convert to float first and then convert to int.

python-ooxml/ooxml/doc.py

Line 104 in b56990a

fsz = int(sz) / 2

Thanks in advance.
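
A small sketch of the suggested change. The issue does not say which value triggered the error; the assumption here is a size string containing a decimal point, which `int()` rejects. The function name `half_points_to_points` is hypothetical.

```python
def half_points_to_points(sz):
    """Convert a size string in half-points to points, tolerating decimal strings."""
    # float() accepts "21.0"; int("21.0") would raise ValueError.
    return int(float(sz)) / 2.0


print(half_points_to_points("24"))
print(half_points_to_points("21.0"))
```

Dividing by `2.0` keeps the result a float on Python 2 as well, which is a design choice of this sketch rather than something the issue asks for.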

Support for comments

We need to be able to parse the comments and remember the text they reference in the original document.

Comments are in a separate file, comments.xml. It would be hard to mark the text which was tagged by the user, but we can figure out the start and end of the text being commented on.

Do not use same styling on the parent and the children

We end up with output code like this:

<p style="font-size: 120%;">
  <span style="font-size: 120%">Text</span>
</p>

We should end up with something like this:

<p style="font-size: 120%;">
 Text
</p>
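
A sketch of one way to drop the redundant wrapper, covering only the simple case shown above: a single `<span>` child whose only attribute is a `style` already implied by the parent. The helper names `declarations` and `unwrap_redundant_span` are hypothetical, and the sketch uses the standard library rather than the library's serializer.

```python
import xml.etree.ElementTree as ET


def declarations(style):
    """Split a style attribute into a set of normalised declarations."""
    return {part.strip() for part in style.split(";") if part.strip()}


def unwrap_redundant_span(parent):
    """Remove a lone <span> child if its style repeats the parent's style."""
    if len(parent) != 1:
        return
    child = parent[0]
    if child.tag != "span" or len(child) or set(child.attrib) != {"style"}:
        return
    if not declarations(child.get("style")) <= declarations(parent.get("style", "")):
        return
    # Move the span's text (and any tail) into the parent, then drop the span.
    parent.text = (parent.text or "") + (child.text or "") + (child.tail or "")
    parent.remove(child)


p = ET.fromstring(
    '<p style="font-size: 120%;"><span style="font-size: 120%">Text</span></p>'
)
unwrap_redundant_span(p)
print(ET.tostring(p, encoding="unicode"))
```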