Git Product home page Git Product logo

Comments (24)

waylan avatar waylan commented on June 18, 2024 3

@fokoenecke I just stumbled across this and am the lead dev for Python-Markdown. If there is anything I can do to help, let me know. BTW, while it is not documented, you can swap out Python-Markdown's serializer for your own. So a library that converted ElementTree HTML objects to docx.Document objects would be were I would start. You could plug that into Python-Markdown as a serializer... or after parsing any HTML with lxml or [html5lib](both output ElementTree Objects), use it to convert any HTML document (not just markdown) to docx.

In fact, the latter is probably preferable. For example, Markdown permits raw HTML. The Python-Markdown parser runs a pre-processor that just stores the raw HTML and inserts placeholders in the document that pass through the parser. A post-processor (run after the serializer) then swaps out the placeholders and reinserts the raw HTML. For complete markdown support, you would need to run through the entire markdown parser and then pass the HTML string into an HTML parser before converting to docx. Of course, that requires getting an HTML parser working first. Swapping out Python-Markdown's serializer might be an easier place to start - even if it doesn't give you the ability to support raw HTML.

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024 1

@tooh i pushed my current code to https://github.com/fokoenecke/html_docx. I am using it in a project i'm working on at the moment. Unfortunately the priorities shifted and that part is not as important at the moment. But i guess, i will come back to it next year.

What it does: It takes an element tree, representing the HTML structure of your markdown (i am using https://github.com/waylan/Python-Markdown to convert) and creates the corresponding docx elements inside a paragraph. Obviously it uses python-docx to to that.

paragraph = document.add_paragraph()
html_tree = markdown(markdown_string)
add_html(paragraph, html_tree)

How it works:
It's basically a module you put inside your sources and import the interface method add_html. This method takes your container element (paragraph) and recursively steps through the html element tree. This happens in _append_docx_elements inside the DocxBuilder class. It maps the HTML tag to a dispatcher class from the tag_dispatchers directory, hands over the tag content, recursively calls its children and hands over the tag tail after doing so.
The dispatcher classes call the python-docx functions to create the appropriate objects. They share some helper methods which figure out which kind of container we are in and provide fitting paragraph objects.

Like i said, this won't be able to handle all kind of HTML input, but the basic output of markdown is mapped pretty OK to docx.

from python-docx.

scanny avatar scanny commented on June 18, 2024

I think it's an interesting idea. I believe that ReportLab has something analogous for adding content to a PDF document. There they use HTML as I recall.

It would be important for it to be fairly well separated from the main API, and I'm thinking it would be good for it to have something like a plug-in architecture since there's quite a variety of formats that folks might want. reStructuredText and HTML both come to mind in addition to Markdown.

I'm inclined to think something like this could be done with a function call as a start, something like:

append_markdown_content(document, markdown_text)

that would interpret markdown_text into Word objects and add them to document in the order they were encountered.

I don't think this will be a feature that will come soon, given we still have a good deal of core library work to do. So a separate module that implemented append_markdown_content() might be a useful adjunct for a year or two and perhaps beyond.

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

Today i took a look at python-markdown. They parse the markdown code with regular expressions and convert it into HTML by using the python xml library. They also provide an Extension API for writing own processors, handling the markdown elements.

I think I'll give it a try and write an extension that generates OOXML instead of HTML based on your low level API.

from python-docx.

scanny avatar scanny commented on June 18, 2024

That would be great! Even an early prototype I'm sure would flush out a lot of learning about what would be possible and how much work it might take. I'll look forward to hearing what you come up with :)

from python-docx.

scanny avatar scanny commented on June 18, 2024

Hmm, I think @waylan's idea for making the 'translator' based on HTML is an idea worth exploring seriously. There are a lot of simplified markup languages out there, but they all have HTML in common. Plus it can be a handy way to build up some rich text in any program, even if you're doing it "manually". If there were an HTML translator then any way we ended up with it would work. Great call @waylan :)

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

@waylan thanks for your reply! My idea was to build my own processors for python-markdown that generate a ElementTree of docx objects directly. But I see that it would be far more flexible parsing HTML and converting those objects to docx.

In fact, I tried that before on a higher level by rendering a HTML page with mako and converting it via pypandoc to docx. I have not been satisfied by the results and the lack of control this solution provides.

I guess I have to take a deeper look at python-docx to find the best place to insert the generated doxc. This dictates if i can use text generated by a serializer or if I have to get the results as a branch of an ElementTree.

As soon as I make a plan I will inform both of you and ask for advice ;)

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

I decided to go with waylans plan and hand a html string to pydocx. That string will be parsed to an ElementTree by lxml. Then I am going to translate the different tags to pydocx functions.

A very early prototype for this already worked, at least for my requirements.

from python-docx.

scanny avatar scanny commented on June 18, 2024

Glad to hear it worked out @fokoenecke :)

I'll close this for now then since there's no immediate prospect of development on it. Happy if someone wants to reopen it in future.

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

oh, I hope you didn't get me wrong on that. I am far from done now and I'd love to contribute what i did afterwards. I just don't know, if anyone can use it, because I won't do a full-flectched HTML conversion, but only the stuff needed to get from markdown to docx. I will tell you, when I am satisfied and the code is tested, so you can decide, if it makes sense.

I opened a branch for it here: https://github.com/fokoenecke/python-docx/tree/html

from python-docx.

scanny avatar scanny commented on June 18, 2024

Ah, ok, no problem, opening it back up :)

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

Just a little update on what i did so far:
I created an API function that takes an HTML string and a docx container to add it to. The function hands the HTML to a builder class, that recusively iterates through the HTML elements and tries to find a fitting OOXML object to append to the given container. This works by mapping the HTML tags to various dispatcher objects.

By now my code can barely handle the HTML I get from the markdown converter. It's far from covering full-stack HTML. But it does all the things I need it to. I can append small markdown snippets to the document and paragraphs pretty well.

If no one else is interested in this, it's not likely that I will change much more on it. I will fix problems that arise from unlikely markdown combinations and add missing features as they appear in python-docx.

from python-docx.

tooh avatar tooh commented on June 18, 2024

@fokoenecke : Any progress on this ? I still have a use case for this.

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

@tooh I pulled the functionality out of python-docx and implemented some improvements. Unfortunately it is still far from stable, but it works for my use cases. If you are interested in it, I could upload my current code. The basic idea behind it did not change from what you can see in the fork posted above.

from python-docx.

tooh avatar tooh commented on June 18, 2024

@fokoenecke I would appreciate it if you could upload your current code. I think your solution could work for me as well.

from python-docx.

tooh avatar tooh commented on June 18, 2024

@fokoenecke Thx for uploading. I have a basic sample running now. Headings are created correctly from # and ## and so on. I had to change Heading to Kop in heading.py, because I run a Dutch Word version. This is the same problem as I had to fix in python- docx.
However I have a problem with bullets. The code generates list items and applies style BulletList.
I assume I have a localization problem here as well. Did not figure out how to handle this.

from python-docx.

tooh avatar tooh commented on June 18, 2024

Example code

from html import add_html

from docx import Document

document = Document("template.docx")

import markdown

markdown_string= "# heading 1  \n## heading 2 \n - bullet \n - bullet" 

paragraph = document.add_paragraph()
html_tree = markdown.markdown(markdown_string)
print html_tree

add_html(paragraph, html_tree)
document.save('demo.docx')

from python-docx.

tooh avatar tooh commented on June 18, 2024

Example code is working but in my real project I encounter following error when calling it like this:

html = markdown(story['detail'])
                add_html(paragraph_detail, html)
File "/Scripts/html/tag_dispatchers/__init__.py", line 41, in get_new_paragraph
    new_paragraph = container._parent.add_paragraph()
AttributeError: '_Cell' object has no attribute 'add_paragraph'

paragraph_detail is a paragraph within a table cell.

@fokoenecke
Would appreciate it if you can give me a hint.

from python-docx.

fokoenecke avatar fokoenecke commented on June 18, 2024

I agree that these style problems originate from localization. It depends on your template.docx and the styles saved in it. If there is no ListBullet or ListNumbers, the styles will break. I took my a while to get my templates and styles working, but the docs helped a lot, it's more of a MS Word problem. I suggest trying the default template included in python-docx. If it works there, it's a style-naming problem.

The '_Cell' object has no attribute 'add_paragraph' could be an error in my code, but it's likely that you didn't install the latest python-docx version. _Cell.add_paragraph has been implemented in 0.7.4 (July). I suggest updating the package (pip install -U python-docx).

I hope this is helping! :)

from python-docx.

tooh avatar tooh commented on June 18, 2024

@fokoenecke Wow, spot on. Normally I install updates when they are released, but I missed this one.
Updating solved :The '_Cell' object has no attribute 'add_paragraph' error.
Thx.

Next to solve the styling problem. To be continued.

from python-docx.

tooh avatar tooh commented on June 18, 2024

Can't say I completely understand why the bullets are working now as well.
What I did:

  1. In my template docx I put some text in bulllets
  2. removed the bullets
  3. saved the template
  4. In list_item.py I changed
_list_style = dict(
    ol='ListNumber',
    ul='Bulletedlist',
)

and I changed

style = _list_style.get(element.getparent().tag, 'Bulletedlist')

from python-docx.

tooh avatar tooh commented on June 18, 2024

I leave it up to @scanny if this item should be closed or not.
For my use case I'm happy now.

from python-docx.

abubelinha avatar abubelinha commented on June 18, 2024

No idea if its code can be useful for this issue, but I found this project:

https://github.com/pqzx/html2docx

from python-docx.

abubelinha avatar abubelinha commented on June 18, 2024

Any advances on this? My use case:

  • I have a .docx where I want to embed chunks of python code (limited by custom markers). Mostly, for database extraction.
  • Then I'd use python-docx to locate these chunks, exec them and generate an html output.
  • Then I'd insert that output in the original .docx (replacing the original embedded python code).
    So I need a way to translate those html strings into python-docx stuff which I can insert.

What would you suggest for this last step?
I looked into @fokoenecke's html_docx repo but it looks pretty focused in a particular configuration and difficult to reuse.
There's a @scanny's comment about an rst conversor with a "Service class that knows how to render a RestructuredText string to a python-docx Document object".
So I wonder if that could be an easier way to go (i.e., learning some RestructuredText syntax -something new to me- and changing my python docx-embedded code to genrate Rst strings, instead of html strings).

@scanny would you mind to share a simple example on how to use your class to inject some rst stuff into an extant .docx document?

Thanks
@abubelinha

from python-docx.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.