
medium-to-jekyll's People

Contributors

boltgolt, clawfire, dependabot[bot], donohue


medium-to-jekyll's Issues

Misplaced '*' characters for emphasized elements

When a post is converted to Markdown, the trailing '*' on emphasized text is shifted one character to the right. For example, text that should read:

"This is *italic.* The following isn't."

is instead rendered as:

"This is *italic. *The following isn't."
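One way to repair already-converted output is a small post-processing pass that moves whitespace trapped inside the closing '*' back outside it. This is only a sketch of that idea, not part of the converter; `fix_trailing_emphasis` is a hypothetical helper name.

```python
import re


def fix_trailing_emphasis(text):
    # Match "*inner *" where whitespace sits just before the closing '*',
    # and rewrite it as "*inner* " so the space lands outside the emphasis.
    # [^*]+? keeps the match inside a single emphasis span.
    return re.sub(r'\*([^*]+?)\s\*', r'*\1* ', text)


broken = "This is *italic. *The following isn't."
print(fix_trailing_emphasis(broken))
# "This is *italic.* The following isn't."
```

On text that is already correct the pattern finds no match, so the pass can be applied unconditionally.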

Emoji (Unicode characters) in the page HTML make lxml return an empty body

My posts that contain emoji (Unicode) characters make lxml.html.document_fromstring(html) return <html><body><p>! D O C T Y P E h t m l &gt; </p></body></html> instead of the proper HTML tree.

This then causes an error later while parsing the HTML:

Traceback (most recent call last):
  File "medium_to_jekyll.py", line 110, in <module>
    main()
  File "medium_to_jekyll.py", line 100, in main
    title, date = extract_metadata(doc)
  File "medium_to_jekyll.py", line 34, in extract_metadata
    title = etree.tostring(doc.xpath('//title')[0], method='text', encoding='unicode')
IndexError: list index out of range

While looking up document_fromstring() in the lxml documentation, I noticed this:

Really broken pages
The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I think there's something worth investigating here.

I'm using Python 3 ;)
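The approach the lxml docs suggest above can be sketched as: run the raw bytes through BeautifulSoup's UnicodeDammit for encoding detection only, then hand the resulting Unicode to lxml's own (faster) HTML parser. This assumes bs4 is installed; `parse_html` is a hypothetical helper, not existing code in medium_to_jekyll.py.

```python
import lxml.html
from bs4 import UnicodeDammit  # BeautifulSoup's encoding detector


def parse_html(raw_bytes):
    # Let UnicodeDammit guess the encoding of the raw page bytes,
    # then parse the clean Unicode with lxml's regular HTML parser.
    dammit = UnicodeDammit(raw_bytes)
    return lxml.html.document_fromstring(dammit.unicode_markup)


# UTF-8 bytes containing an emoji in the title
doc = parse_html(b'<html><head><title>Hi \xf0\x9f\x98\x80</title></head>'
                 b'<body><p>post</p></body></html>')
print(doc.xpath('//title')[0].text)
```

If this works, the //title xpath in extract_metadata() would find the element again instead of raising IndexError on an empty list.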
