python-ooxml's Introduction

Booktype

Booktype makes it easier and quicker for authors, companies and organisations to edit and publish books. It imports DOCX or EPUB files, converts them into single-source HTML for online editing and proofreading, and uses CSS Paged Media to produce good-looking output for print, the open web, and almost any ebook reader, in seconds. Booktype facilitates collaborative, agile production across time zones and borders.

Booktype is built on the Django web framework and many great Python libraries.

The Booktype user interface is being translated into many languages by our community of contributors. Your help with development or translation is always welcome!

Installation

Installation instructions for Booktype on GNU/Linux and OS X can be found in the Booktype user manual.

Files for installation using Docker can be found in the Booktype-docker repository.

More information

How to contribute

  1. Fork the booktype/Booktype repository. Please see GitHub help on forking or use this direct link to fork.
  2. Clone your fork to your local machine.
  3. Create a new local branch.
  4. Run tests and make sure your contribution works correctly.
  5. Create a pull request with details of your new feature, bugfix or other contribution.
  6. Sign and return the contributor agreement paperwork, either for an individual, or an entity such as a company, university or other organisation. This paperwork gives us the right to use your work in Booktype, and makes it clear that you retain ownership of the copyright in your contribution.

Testing

Booktype uses the py.test testing framework with the pytest-django plugin. This makes testing easier and also lets you run existing Django (unittest) tests.

To run tests:

  1. Open a terminal and activate the virtual environment (Booktype must be installed).
  2. Change (cd) to the instance root, the folder that contains manage.py and pytest.ini.
  3. Run the py.test command.
  4. To print test coverage information, run py.test --cov-report term-missing --cov=path/to/Booktype. The pytest-cov documentation has more about coverage.

License

Booktype is licensed under the GNU AGPL license.

python-ooxml's People

Contributors

aerkalov, danielhjames, hozn

python-ooxml's Issues

Handle character styles properly

We handle paragraph styles and styling applied directly to elements well, but we do not mark character styles at all. We should check whether a character style is defined on an element and, if it is, add the name of the style as a CSS class on that element.
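A minimal sketch of the idea, using only the standard library rather than this project's own parser. The helpers character_style_class and run_to_span are hypothetical names, and the sketch uses the value of w:rStyle (the style id) as the class, which is an assumption about how the style name would be resolved.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def character_style_class(run):
    """Return the character style id of a <w:r> element, or None if it has none."""
    r_style = run.find("w:rPr/w:rStyle", NS)
    if r_style is None:
        return None
    return r_style.get("{%s}val" % W_NS)


def run_to_span(run):
    """Build a <span> for a run, adding the style id as a CSS class when present."""
    span = ET.Element("span")
    style = character_style_class(run)
    if style:
        span.set("class", style)
    span.text = "".join(t.text or "" for t in run.findall("w:t", NS))
    return span


# Hypothetical sample run with a character style applied.
SAMPLE = (
    '<w:r xmlns:w="%s"><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr>'
    "<w:t>hello</w:t></w:r>" % W_NS
)
span = run_to_span(ET.fromstring(SAMPLE))
print(ET.tostring(span, encoding="unicode"))
```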

Table serialization fails

Table serialization fails when None is passed as the root element. We need to find out why the root element is None in the first place. This should not happen, but when it does we should not fail with an exception.
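One way to avoid the exception is to fall back to a fresh container when no root is given. This is only a sketch: serialize_rows is a hypothetical stand-in and does not mirror the signature of the library's own table serializer.

```python
import xml.etree.ElementTree as ET


def serialize_rows(rows, root=None):
    """Append a <table> built from rows to root, creating a container if root is None."""
    if root is None:
        # Defensive fallback instead of raising on a missing root element.
        root = ET.Element("div")
    table = ET.SubElement(root, "table")
    for row in rows:
        tr = ET.SubElement(table, "tr")
        for cell in row:
            ET.SubElement(tr, "td").text = cell
    return root


result = serialize_rows([["a", "b"]], root=None)
print(ET.tostring(result, encoding="unicode"))
```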

Add more code samples

Add a couple of sample files. We need samples for parsing, for extending the library, and for using import. For now that will be enough.

Text is sometimes added twice

It seems we sometimes serialise text twice. With multiple <w:r> references we sometimes add the same text to both .text and .tail. We should check whether the text has already been added.
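For context, here is how .text and .tail divide the work in an element tree. It is a sketch of a slightly different approach from the check proposed above: each piece of text is written to exactly one slot, so nothing can be added twice. The helper append_text is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def append_text(parent, text):
    """Append text inside parent: after the last child if any, else as parent.text."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


p = ET.Element("p")
append_text(p, "Hello ")          # no children yet, so this goes to p.text
ET.SubElement(p, "strong").text = "bold"
append_text(p, " world")          # goes to the tail of <strong>
print(ET.tostring(p, encoding="unicode"))
```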

Underline Tags

Underline tags in docx files are missed. The offending lines are parse.py:88 and parse.py:31 on commit 833e658.

The master branch treats the underline tag as having only two possible states, on or off, and does not account for the fact that the tag actually carries string values such as 'single', 'double', 'dashed', etc. I have a branch that updates the rpr dictionary accordingly by altering the code around the two lines mentioned above: a 'u' field is added to the dictionary, with a string value representing the type of underlining.

However, I do not know what further implications this will have. Does another part of this project assume that the 'u' field of rpr either does not exist or holds a true or false value?
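A sketch of what such a change could look like, written against the standard library rather than the code in parse.py. Both parse_underline and is_underlined are hypothetical helpers; the second gives callers that expect an on/off flag a boolean view, under the assumption that a value of 'none' means no underline.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def parse_underline(rpr_element, rpr):
    """Store the underline type from <w:rPr> in the rpr dict under the 'u' key."""
    u = rpr_element.find("w:u", NS)
    if u is not None:
        # Keep the raw string, e.g. 'single', 'double' or 'dashed'.
        rpr["u"] = u.get("{%s}val" % W_NS)
    return rpr


def is_underlined(rpr):
    """Boolean view for callers that only care whether underlining is on."""
    return rpr.get("u") not in (None, "none")


# Hypothetical sample run properties.
SAMPLE = '<w:rPr xmlns:w="%s"><w:u w:val="double"/></w:rPr>' % W_NS
rpr = parse_underline(ET.fromstring(SAMPLE), {})
print(rpr, is_underlined(rpr))
```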

Parsing in python3 fails due to iteritems() reference.

Traceback (most recent call last):
  File "parser.py", line 16, in <module>
    print(serialize.serialize(dfile.document))
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 1232, in serialize
    return serialize_elements(document, document.elements, options)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 1214, in serialize_elements
    root = _ser(ctx, document, elem, root)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 847, in serialize_table
    _td = _ser(ctx, document, elem, _td, embed=False)
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 665, in serialize_paragraph
    if ctx.header.is_header(par, max_font_size, elem, style=style):
  File "/usr/local/lib/python3.5/site-packages/ooxml/serialize.py", line 976, in is_header
    sorted_list_of_sizes = list(collections.OrderedDict(sorted(list_of_sizes.iteritems(), key=lambda t: t[0])))
AttributeError: 'dict' object has no attribute 'iteritems'
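The traceback ends in dict.iteritems(), which exists only in Python 2. A likely fix, sketched here with made-up sample data, is to call items() instead, which works on both Python 2 and Python 3.

```python
import collections

# Hypothetical sample data: font size -> usage count.
list_of_sizes = {24: 3, 12: 40, 18: 7}

# Same expression as in the traceback, with iteritems() replaced by items().
sorted_list_of_sizes = list(
    collections.OrderedDict(sorted(list_of_sizes.items(), key=lambda t: t[0]))
)
print(sorted_list_of_sizes)  # [12, 18, 24]
```

Since only the sorted keys are kept, sorted(list_of_sizes) gives the same list.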

Support for Word smartTags

We need to support the smartTag element. A detailed description of this tag is here: http://www.datypic.com/sc/ooxml/e-w_smartTag-1.html

Here is an example of the XML structure from a document:

    <w:p w14:paraId="700B8014" w14:textId="77777777" w:rsidR="00BA11BE" w:rsidRDefault="00BA11BE" w:rsidP="00AA4ADD">
      <w:pPr>
        <w:pStyle w:val="C1Heading"/>
      </w:pPr>
      <w:bookmarkStart w:id="0" w:name="_Hlk397609529"/>
      <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="country-region">
        <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
          <w:r>
            <w:t>Benin</w:t>
          </w:r>
        </w:smartTag>
      </w:smartTag>
    </w:p>
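One possible way to handle this structure is to treat smartTag as a transparent wrapper and descend through it to reach the runs. The sketch below uses the standard library and a trimmed copy of the example; iter_runs and paragraph_text are hypothetical helpers, not part of the library.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def iter_runs(element):
    """Yield <w:r> children, descending through nested <w:smartTag> wrappers."""
    for child in element:
        if child.tag == W + "r":
            yield child
        elif child.tag == W + "smartTag":
            yield from iter_runs(child)


def paragraph_text(paragraph):
    """Join the text of every run found in the paragraph."""
    return "".join(
        t.text or "" for run in iter_runs(paragraph) for t in run.iter(W + "t")
    )


SAMPLE = """
<w:p xmlns:w="%s">
  <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="country-region">
    <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
      <w:r><w:t>Benin</w:t></w:r>
    </w:smartTag>
  </w:smartTag>
</w:p>
""" % W_NS

text = paragraph_text(ET.fromstring(SAMPLE.strip()))
print(text)
```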

Footnotes support only one paragraph

Footnotes support only one paragraph, although they can consist of multiple paragraphs. This should work exactly as it does for endnotes: parse all the paragraphs and store them together.
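A sketch of collecting every paragraph of a footnote, using the standard library and a made-up footnotes.xml fragment. The function parse_footnotes and the list-of-strings storage format are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS
NS = {"w": W_NS}


def parse_footnotes(xml_text):
    """Map each footnote id to the list of its paragraph texts."""
    root = ET.fromstring(xml_text)
    notes = {}
    for footnote in root.findall("w:footnote", NS):
        paragraphs = [
            "".join(t.text or "" for t in par.iter(W + "t"))
            for par in footnote.findall("w:p", NS)  # all paragraphs, not just the first
        ]
        notes[footnote.get(W + "id")] = paragraphs
    return notes


# Hypothetical footnotes.xml content with a two-paragraph footnote.
SAMPLE = """
<w:footnotes xmlns:w="%s">
  <w:footnote w:id="1">
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:footnote>
</w:footnotes>
""" % W_NS

notes = parse_footnotes(SAMPLE.strip())
print(notes)
```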

Support for endnotes

We have support for footnotes, but we also need to be able to read endnotes. Just like footnotes, endnotes are stored in a separate file, endnotes.xml. We need to parse it and store the result in the same way as footnotes.

Parsing the footnotes

We were only parsing endnotes and not the footnotes. We should be able to parse the footnotes and be able to render them correctly.

Option for pretty_print serialization

By default we always use pretty_print=True, but in some situations we do not want that. Add this switch to the list of options and check its value during serialization.

Organize setup.py file

The setup.py and README files need some organising. Write some basic info about the library.

Comment reference is not being parsed

We are parsing commentRangeStart and commentRangeEnd but commentReference (used by older Word) is not being parsed.

We should parse it and create a comment mark for it, without any commented text (for obvious reasons).

Rename library

OOXML seems to be a family of document types, not just Word files. Hence, the library's name is misleading.

Used Font size

Hi guys! I ran into a minor error and fixed it by changing int to float. Could you please review it and tell me whether this is correct?

My suggestion is to convert to float first and then convert to int!

fsz = int(sz) / 2

Thanks in advance.
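The suggestion above could look like the sketch below. The function half_points_to_points is a hypothetical name; the division by two reflects that w:sz values are expressed in half-points. The sample value "21.5" is made up to show the case where int() alone raises ValueError.

```python
def half_points_to_points(sz):
    """Convert a w:sz value (half-points, given as a string) to points."""
    # int("21.5") raises ValueError, so go through float first.
    return int(float(sz)) / 2


fsz = half_points_to_points("24")
print(fsz)
print(half_points_to_points("21.5"))
```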

Support for comments

We need to be able to parse comments and remember the text they reference in the original document.

Comments are in a separate file, comments.xml. It would be hard to mark the text that was tagged by the user, but we can figure out the start and end of the commented text.

Scale to size option does not work with importer

There is code for scaling font size in the importer, but it seems to set the wrong options, which are never used (in importer.py, function _serialize_chapter). It should set this option on the serialize_options variable.

Scaling should be defined for the HTML and CSS serialisation.

Add Sphinx documentation

We are missing basic Sphinx documentation. It should cover basic usage and document the source code.

parse oMath tags

Hi, I have a project whose idea is to parse OOXML oMath tags into LaTeX. If you are interested, here is the link.

Base font size is not calculated correctly

In the importer the base font size is not calculated correctly. We were simply using the biggest font size found, which is not always the correct number.

In the importer we ignore the document default style because people do not always use it. Instead, we calculate font size usage in the document and pick the most used font size as the default paragraph size.
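The difference between the two strategies can be sketched as follows. The function base_font_size and its input format, pairs of (font size, character count), are assumptions for illustration and not the importer's actual data structures.

```python
from collections import Counter


def base_font_size(run_sizes):
    """Return the font size that covers the most characters, or None if empty.

    run_sizes is an iterable of (font_size, character_count) pairs.
    """
    usage = Counter()
    for size, chars in run_sizes:
        usage[size] += chars
    if not usage:
        return None
    return usage.most_common(1)[0][0]


# Hypothetical document: one large heading, mostly 12pt body text.
runs = [(24, 20), (12, 500), (12, 300), (18, 60)]
print(base_font_size(runs))          # most used size
print(max(size for size, _ in runs)) # biggest size, the old behaviour
```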

Text is not fully parsed in hyperlinks

In a hyperlink we only parse the first <w:r> tag. If there are multiple text runs inside, we will not be aware of them. Besides that, we do not serialise all the elements we parse; we only look for the first element and use it.
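A sketch of reading every run inside a hyperlink instead of only the first, using the standard library and a made-up fragment. The helper hyperlink_text is a hypothetical name.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def hyperlink_text(hyperlink):
    """Join the text of all <w:r> runs inside a <w:hyperlink>."""
    return "".join(
        t.text or ""
        for run in hyperlink.findall("w:r", NS)  # findall, not find: every run counts
        for t in run.findall("w:t", NS)
    )


# Hypothetical hyperlink whose text is split across two runs.
SAMPLE = (
    '<w:hyperlink xmlns:w="%s">'
    "<w:r><w:t>Book</w:t></w:r><w:r><w:t>type</w:t></w:r>"
    "</w:hyperlink>" % W_NS
)
link = ET.fromstring(SAMPLE)
print(hyperlink_text(link))
```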

Tests are broken for Break element

It was fine while we had only one type of Break. Now that we have both a page break and an inline break (break to a new line), the tests break. We need to mock the Break object and test it for these two cases. We used to pass None as the value because it did not matter.

sdt in table cell

While parsing a table in a docx file, an sdt control is found in a cell. The parser then erroneously outputs 1 column for the row below (it thinks there is only 1 column in this table row). How can this be fixed or handled?

The cell structure looks as below.
