

tomd's Issues

tag parse

Hi, I found the following case.
When the tag is <img></img>:
tomd.convert('''<p><img src="https://github.com" class="dsad"></img></p>''')
the parsed result is \n![](https://github.com)\n, which is what I expect.
But when the tag is <img />:
tomd.convert('''<p><img src="https://github.com" class="dsad"/></p>''')
the result is \n<img src=\"https://github.com\" class=\"dsad\"/>\n, so it seems that self-closing tags are not parsed.
Can this be fixed?
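
Until this is handled in the library, one possible workaround is to rewrite self-closing <img ... /> tags into the paired form before conversion. The sketch below is only an illustration: the helper name expand_self_closing_img is hypothetical and not part of tomd, and the regex assumes that attribute values contain no > character.

```python
import re

# Matches <img ... />. Assumes no '>' appears inside attribute values.
_SELF_CLOSING_IMG = re.compile(r'<img\b([^>]*?)\s*/>', re.IGNORECASE)


def expand_self_closing_img(html):
    """Rewrite <img .../> as <img ...></img>.

    Hypothetical helper for illustration; not part of tomd.
    """
    return _SELF_CLOSING_IMG.sub(r'<img\1></img>', html)


html = '<p><img src="https://github.com" class="dsad"/></p>'
print(expand_self_closing_img(html))
```

The rewritten string could then be passed to tomd.convert, which according to the report above handles the paired form.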

Advice

It is better to add header space.

<b> bold </b> only works inside <p> </p>

tomd.convert('<p><b> bold </b></p>')  # '\n** bold **\n',   works
tomd.convert('<b> bold </b>')  # "", does not work 

Maybe pyquery could be useful here, something like this:

from pyquery import PyQuery as pq  # the class is named PyQuery
from tomd import MARKDOWN

html = "<b> bold </b>"
doc = pq(html)
for elm, val in MARKDOWN.items():
    for item in doc(elm):
        # Sketch only: replace item's HTML with val[0] + pq(item).text() + val[1]
        pass

Can't convert self-closing tags

Hello, I found that it can't convert a self-closing tag like <img src="https://github.com" class="dsad"/>.
But it works fine with <img src="https://github.com" class="dsad"></img>.

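Minor problem converting some tags

While I was scraping a CSDN article, the tags below were not converted correctly.
Original article: https://blog.csdn.net/weixin_38405253/article/details/100151657

<li>
	RetentionPolicy.SOURCE: the annotation is retained only in the source file
	</li>
	<li>
	RetentionPolicy.CLASS : the annotation is retained in the class file and discarded when loaded into the JVM
	</li>
	<li>
	RetentionPolicy.RUNTIME: the annotation is retained at runtime, so all annotations defined on a class can be obtained through reflection.
	</li>

I looked at the tomd source code but could not quite follow it, so I was not sure how to fix it there. Instead I wrote my own patch; the code is as follows: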

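import re

str_ = '''<li>
        RetentionPolicy.SOURCE: the annotation is retained only in the source file
        </li>
        <li>
        RetentionPolicy.CLASS : the annotation is retained in the class file and discarded when loaded into the JVM
        </li>
        <li>
        RetentionPolicy.RUNTIME: the annotation is retained at runtime, so all annotations defined on a class can be obtained through reflection.
        </li>'''

# re.S lets '.' match newlines, so each <li> body may span several lines.
pattern = re.compile(' *<li.*?>(.*?)</li>', re.S)
# Replace every <li>...</li> with a "+ " bullet followed by its trimmed text.
s = re.sub(pattern, lambda temp: "+ " + temp.group(1).strip(), str_)
print(s)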

issue with `'` in words.

Suppose we have HTML like this:

<p>this was jhon's car finally arrived at jane's place</p>

and we get:
this was jhon
'
s car finally arrived at jane
'
s place

I'm currently busy with something else, so I have no time to dig into this, but the bug does seem to be present.

Output is not clean with \n\t

Problem

The processed output does not form a valid Markdown table.

Solution

It seems that \n\t has to be removed before the HTML is processed.
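
A minimal sketch of that clean-up step, using only the standard library. The helper name strip_inter_tag_whitespace and the sample markup are assumptions made for illustration; they are not part of tomd. Note that this removes all whitespace between adjacent tags, so a space that separates two inline elements would be lost as well.

```python
import re


def strip_inter_tag_whitespace(html):
    """Drop whitespace (newlines, tabs, spaces) that sits between two tags.

    Hypothetical helper for illustration; not part of tomd.
    """
    # '>' followed by only whitespace and then '<' collapses to '><'.
    return re.sub(r'>\s+<', '><', html).strip()


# Assumed sample markup; indentation uses \n and \t as in the report.
table = '''
<table>
\t<tr>
\t\t<th>head1</th>
\t\t<th>head2</th>
\t</tr>
</table>
'''
print(strip_inter_tag_whitespace(table))
```

The cleaned string could then be passed to Tomd(...) in place of the raw input.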

Input

table='''

head1 head2 head3
content1 content2 content3
'''

Process

md = Tomd(table).markdown

Output of md

`

                |head1            |head2            |head3        
|------
        |content1|            |content2|            |content3|        

`

Cannot parse tr/th/td tags with attributes

Example:

<tr height="19">
<td style="border-bottom:#000000 1px solid;text-align:center;border-left:#000000 1px solid;font-style:normal;width:72px;height:19px;color:#000000;font-size:12px;vertical-align:middle;border-top:#000000 1px solid;font-weight:700;border-right:#000000 1px solid;text-decoration:none;mso-text-control:shrinktofit;mso-protection:locked visible" class="et2" height="19" width="72">

The tags above are not parsed and are converted to an empty string.
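
One possible workaround, sketched here under assumptions, is to strip the attributes from the opening tr/th/td tags before conversion. The helper name strip_table_attrs is hypothetical and not part of tomd, and the regex assumes attribute values contain no > character.

```python
import re

# Matches opening <tr>, <th> or <td> tags, with or without attributes.
# Closing tags start with '</' and are therefore left untouched.
_TABLE_OPEN_TAG = re.compile(r'<(tr|th|td)\b[^>]*>', re.IGNORECASE)


def strip_table_attrs(html):
    """Reduce <tr ...>, <th ...>, <td ...> to bare <tr>, <th>, <td>.

    Hypothetical helper for illustration; not part of tomd.
    """
    return _TABLE_OPEN_TAG.sub(r'<\1>', html)


row = '<tr height="19"><td class="et2" height="19" width="72">cell</td></tr>'
print(strip_table_attrs(row))
```

Any styling carried by the attributes is discarded, which is usually acceptable since Markdown tables cannot express it anyway.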

New release?

Hi,

Could we have a new release of this, even with the code as it is? I'm a big fan and user of this library, and I'd like to have a new release instead of hand-patching it on every computer I use.

It would be greatly appreciated.

Too strict with the HTML format

tomd is very strict about the HTML format and does not work in this situation:

<ul>
<li>123123131</li></ul><ul><li>1
</li>
<li>2</li>
<li>3</li>
</ul>
