Webpage2html

Webpage2html: Save a web page to a single HTML file

This is a simple script to save a web page to a single html file. No mhtml or pdf stuff, no xxx_files directory, just one single readable and editable html file.

The basic idea is to inline all CSS and JavaScript files directly into the HTML, and to embed image data as base64 data URIs.
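As a rough illustration of the data URI part of this idea, the sketch below encodes raw bytes as a base64 data URI using only the Python standard library. The helper name `to_data_uri` and the sample file name are hypothetical and not taken from webpage2html.py; the script's actual implementation may differ.

```python
import base64
import mimetypes


def to_data_uri(data, filename):
    """Encode raw bytes as a base64 data URI, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        # Unknown extension: fall back to a generic binary type.
        mime = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return "data:{};base64,{}".format(mime, encoded)


# A few bytes stand in for real image content here.
uri = to_data_uri(b"GIF89a", "pixel.gif")
print('<img src="{}">'.format(uri))
```

The resulting string can be placed in an img tag's src attribute, so the image travels inside the HTML file instead of living in a separate file.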

Usage and Example

Save a web page directly from its URL (recommended):

$ python2 webpage2html.py http://www.google.com > google.html

Or save the web page first with a browser such as Chrome, which produces something.html with a something_files directory beside it:

$ python2 webpage2html.py /path/to/something.html > something_single.html

Note that the second method may not always work as expected: the page may reference URLs such as //ssl.gstatic.com/gb/images/v1_c69d5271.png (from the Google index page) whose files are missing from the Google_files directory saved by the browser.
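The following sketch shows why such scheme-less URLs are easier to handle when the page is fetched by URL. It uses the standard library's `urljoin` and is only an illustration, not necessarily how webpage2html.py resolves links.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

ref = "//ssl.gstatic.com/gb/images/v1_c69d5271.png"

# With the page URL as the base, the reference inherits the page's scheme
# and becomes a complete, downloadable URL.
from_url = urljoin("http://www.google.com/", ref)

# With a local file path as the base there is no scheme to inherit,
# so the reference stays scheme-less and cannot be fetched as-is.
from_file = urljoin("/path/to/something.html", ref)

print(from_url)
print(from_file)
```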

Enable JavaScript, for example to save the 2048 game page as a single HTML file for offline play:

$ python2 webpage2html.py -s http://gabrielecirulli.github.io/2048/ > 2048.html

Dependency

BeautifulSoup4, lxml, requests, termcolor (optional)

$ pip install -r requirements.txt

or install them manually

$ pip install lxml BeautifulSoup4 requests termcolor

I have tried both the default HTMLParser and html5lib as the backend parser for BeautifulSoup, but both are buggy. HTMLParser handles self-closing tags (such as <br> and <meta>) incorrectly: it waits for a closing tag for <br>, so if the HTML contains too many <br> tags, BeautifulSoup fails with RuntimeError: maximum recursion depth exceeded. html5lib re-encodes already encoded HTML entities, turning &lt; into &amp;lt;, which is definitely unacceptable. I have tested many cases and lxml handled all of them correctly, so lxml is the parser used now.

The termcolor package is optional; install it if you want colored log output.
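One common way to treat a package like termcolor as optional is to fall back to plain text when the import fails. The sketch below shows that pattern; the `log` helper and the message text are hypothetical, and webpage2html.py may handle this differently. Writing to stderr is assumed here because the usage examples redirect stdout into the generated HTML file.

```python
import sys

try:
    # termcolor is optional; use it when it is installed.
    from termcolor import colored
except ImportError:
    def colored(text, *args, **kwargs):
        """Fallback with the same call shape that returns the text unchanged."""
        return text


def log(message, color="yellow"):
    """Write a (possibly colored) message to stderr and return the formatted text."""
    line = colored(message, color)
    sys.stderr.write(line + "\n")
    return line


log("[ WARN ] resource not found")
```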

Unsupported Cases

Browser-side LESS compiling

The page references LESS stylesheets directly and uses less.js to compile them in the browser. I have not yet found a way to embed the LESS code into the generated HTML so that it still works.

<link rel="stylesheet/less" type="text/css" href="http://dghubble.com/blog/theme/css/style.less">
<script src="http://dghubble.com/blog/theme/js/less-1.5.0.min.js" type="text/javascript"></script>

srcset attribute in img tag (html5)

Currently srcset is discarded.

Contributors

  1. lukin.a.i submitted a patch fixing an issue where CSS links (rel=stylesheet) were not recognised
  2. Gruber.
  3. Java port of this project. https://github.com/cedricblondeau/webpage2html-java

License

webpage2html uses the SATA License (Star And Thank Author License), so you have to star this project before using it. Read the license carefully.
