
medium-to-jekyll's People

Contributors

boltgolt, clawfire, dependabot[bot], donohue


medium-to-jekyll's Issues

Misplaced '*' characters for emphasized elements

When a post is converted to Markdown, the trailing '*' on emphasized text is shifted one character to the right. For example, text that should read:

"This is *italic.* The following isn't."

is instead rendered as:

"This is *italic. *The following isn't."
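One way to repair already-converted output is a small post-processing pass that moves whitespace trapped inside the closing '*' back outside it. This is only a sketch of that idea, not part of the converter; `fix_trailing_emphasis` is a hypothetical helper name.

```python
import re


def fix_trailing_emphasis(text):
    # Match "*inner *" where whitespace sits just before the closing '*',
    # and rewrite it as "*inner* " so the space lands outside the emphasis.
    # [^*]+? keeps the match inside a single emphasis span.
    return re.sub(r'\*([^*]+?)\s\*', r'*\1* ', text)


broken = "This is *italic. *The following isn't."
print(fix_trailing_emphasis(broken))
# "This is *italic.* The following isn't."
```

On text that is already correct the pattern finds no match, so the pass can be applied unconditionally.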

Emoji (Unicode characters) in the page HTML make lxml return an empty body

My posts that contain emoji (Unicode) characters make lxml.html.document_fromstring(html) return <html><body><p>! D O C T Y P E h t m l &gt; </p></body></html> instead of the proper HTML tree.

This then causes an error later while parsing the HTML:

Traceback (most recent call last):
  File "medium_to_jekyll.py", line 110, in <module>
    main()
  File "medium_to_jekyll.py", line 100, in main
    title, date = extract_metadata(doc)
  File "medium_to_jekyll.py", line 34, in extract_metadata
    title = etree.tostring(doc.xpath('//title')[0], method='text', encoding='unicode')
IndexError: list index out of range

While looking up document_fromstring() in the lxml documentation, I noticed this:

Really broken pages
The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I think there's something worth investigating here.

I'm using Python 3 ;)
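The approach the lxml docs suggest above can be sketched as: run the raw bytes through BeautifulSoup's UnicodeDammit for encoding detection only, then hand the resulting Unicode to lxml's own (faster) HTML parser. This assumes bs4 is installed; `parse_html` is a hypothetical helper, not existing code in medium_to_jekyll.py.

```python
import lxml.html
from bs4 import UnicodeDammit  # BeautifulSoup's encoding detector


def parse_html(raw_bytes):
    # Let UnicodeDammit guess the encoding of the raw page bytes,
    # then parse the clean Unicode with lxml's regular HTML parser.
    dammit = UnicodeDammit(raw_bytes)
    return lxml.html.document_fromstring(dammit.unicode_markup)


# UTF-8 bytes containing an emoji in the title
doc = parse_html(b'<html><head><title>Hi \xf0\x9f\x98\x80</title></head>'
                 b'<body><p>post</p></body></html>')
print(doc.xpath('//title')[0].text)
```

If this works, the //title xpath in extract_metadata() would find the element again instead of raising IndexError on an empty list.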
