elliotgao2 / tomd

Convert HTML to Markdown.

License: GNU General Public License v3.0

Python 100.00%

html markdown python

tomd's Issues

tag parse

Hi, I found the following case.
When the tag is written as <img></img>:
tomd.convert('''<img src="https://github.com" class="dsad"></img>''')
the parsed result is \n![](https://github.com)\n, which is what I expect.
But when the tag is written as <img />:
tomd.convert('''<img src="https://github.com" class="dsad"/>''')
the result is \n<img src=\"https://github.com\" class=\"dsad\"/>\n, so it seems that self-closing tags cannot be parsed.
Can this be fixed?
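Until the library handles this itself, one possible workaround is to rewrite self-closing img tags into explicit open/close pairs before calling tomd.convert. The sketch below assumes that approach; the helper name expand_self_closing is hypothetical and not part of tomd, and the regex only targets img tags whose attribute values contain no ">" character.

```python
import re

# Hypothetical helper, not part of tomd.
# Matches <img ... /> and captures the attribute text before the closing "/>".
_SELF_CLOSING_IMG = re.compile(r'<img\b([^>]*?)\s*/>', re.IGNORECASE)


def expand_self_closing(html):
    """Rewrite <img ... /> as <img ...></img>, leaving other markup untouched."""
    return _SELF_CLOSING_IMG.sub(r'<img\1></img>', html)


html = '<img src="https://github.com" class="dsad"/>'
print(expand_self_closing(html))
```

The rewritten string can then be passed to tomd.convert in place of the original HTML.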

Cannot parse tag and all images are lost

Can't convert " " tag

I am trying to convert my HTML code.
In particular, the   tag is not replaced with Markdown syntax.

No support for Chinese?

Chinese text comes out garbled when I use this. Is there any way to fix it?

Advise

It is better to add header space.

bold only works inside

tomd.convert('<p><b> bold </b></p>')  # '\n** bold **\n',   works
tomd.convert('<b> bold </b>')  # "", does not work

Maybe pyquery can be useful, something like this:

from pyquery import PyQuery as pq  # the class is named PyQuery
from tomd import MARKDOWN

html = "<b> bold </b>"
doc = pq(html)
for elm, val in MARKDOWN.items():
    # for item in doc(elm): replace item.html() with val[0] + pq(item).text() + val[1]
    pass  # placeholder so the loop is valid Python; the replacement step is still to be written

Images in a web page cannot be converted to Markdown

Cannot convert img tags that are not inside p tags

html = """
<p>paragraph
<img src="https://github.com"></img>
</p>

<img src="https://github.com"></img>
"""
print(tomd.convert(html))

Can't convert self-closing tag.

Hello, I found that it can't convert a self-closing tag like <img src="https://github.com" class="dsad"/>.
But it works fine with <img src="https://github.com" class="dsad"></img>

Minor problem converting some tags

While scraping a CSDN article, I ran into a problem converting the tags below.
Original article: https://blog.csdn.net/weixin_38405253/article/details/100151657

<li>
	RetentionPolicy.SOURCE: 注解只保留在源文件中
	</li>
	<li>
	RetentionPolicy.CLASS : 注解保留在class文件中，在加载到JVM虚拟机时丢弃
	</li>
	<li>
	RetentionPolicy.RUNTIME: 注解保留在程序运行期间，此时可以通过反射获得定义在某个类上的所有注解。
	</li>

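I looked at the tomd source code but could not quite follow it, so I was not sure how to fix it there. I wrote my own patch instead; the code is below.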

import re

str_ = '''<li>
        RetentionPolicy.SOURCE: 注解只保留在源文件中
        </li>
        <li>
        RetentionPolicy.CLASS : 注解保留在class文件中，在加载到JVM虚拟机时丢弃
        </li>
        <li>
        RetentionPolicy.RUNTIME: 注解保留在程序运行期间，此时可以通过反射获得定义在某个类上的所有注解。
        </li>'''

# Match each <li>...</li> (across newlines) and turn it into a "+ item" Markdown bullet
pattern = re.compile(' *<li.*?>(.*?)</li>', re.S)
s = re.sub(pattern, lambda temp: "+ " + temp.group(1).strip(), str_)
print(s)

em converts to bold instead of italic

-tags convert to example instead example

issue with `'` in words.

Suppose we have HTML like this:

this was jhon's car finally arrived at jane's place

and we get:
this was jhon
'
s car finally arrived at jane
'
s place

I'm currently busy with something else, so I have no time to toy around with this, but the bug is present, I guess.

Code contained in the HTML is not displayed

Please take a look.

Output is not clean with \n\t

Problem

The processed output does not form a correct Markdown table.

Solution

It seems that \n and \t have to be removed before the HTML is processed.
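A minimal sketch of that preprocessing step, assuming the stray newlines and tabs are only layout whitespace in the HTML source. The helper name strip_layout_whitespace is hypothetical and not part of tomd. Note that it also joins text that wraps across lines, so it is not suitable for pre or code blocks.

```python
import re


def strip_layout_whitespace(html):
    """Remove every whitespace run that contains a newline or a tab."""
    # \s*[\n\t]\s* matches a newline or tab plus any whitespace around it,
    # so the indentation between tags disappears entirely.
    return re.sub(r'\s*[\n\t]\s*', '', html)


table = '<table>\n\t<tr>\n\t\t<th>head1</th>\n\t\t<th>head2</th>\n\t</tr>\n</table>\n'
print(strip_layout_whitespace(table))
```

The cleaned string can then be given to Tomd in place of the raw table markup.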

Input

`
table='''

head1	head2	head3
content1	content2	content3

''' `

Process

md = Tomd(table).markdown

Output of md

                |head1            |head2            |head3        
|------
        |content1|            |content2|            |content3|

Cannot parse tr/th/td tags with attributes

Example:

<tr height="19">
<td style="border-bottom:#000000 1px solid;text-align:center;border-left:#000000 1px solid;font-style:normal;width:72px;height:19px;color:#000000;font-size:12px;vertical-align:middle;border-top:#000000 1px solid;font-weight:700;border-right:#000000 1px solid;text-decoration:none;mso-text-control:shrinktofit;mso-protection:locked visible" class="et2" height="19" width="72">

The tags above are not parsed and are turned into an empty string.
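One possible workaround, sketched here under the assumption that the attributes are not needed in the Markdown output, is to strip them from the table tags before conversion. The helper name strip_table_attributes is hypothetical and not part of tomd, and the regex assumes that no attribute value contains a ">" character.

```python
import re

# Hypothetical helper, not part of tomd.
# Matches an opening tr, th or td tag together with any attributes it carries.
_TABLE_TAG = re.compile(r'<(tr|th|td)\b[^>]*>', re.IGNORECASE)


def strip_table_attributes(html):
    """Replace <tr ...>, <th ...> and <td ...> with their bare lowercase forms."""
    return _TABLE_TAG.sub(lambda m: '<' + m.group(1).lower() + '>', html)


row = '<tr height="19"><td style="width:72px" class="et2">x</td></tr>'
print(strip_table_attributes(row))
```

Closing tags and other elements such as thead are left unchanged, because the pattern requires a word boundary right after the tag name.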

New release?

Hi,

Can we have a new release of this, even with the code as is? I'm a big fan and user of this lib, and I'd like to have a new release instead of hand-patching on every computer I use.

Would be greatly appreciated.

Too strict with the HTML format

It is very strict about the HTML format and does not work in this situation:

<ul>
<li>123123131</li></ul><ul><li>1
</li>
<li>2</li>
<li>3</li>
</ul>