pywebcopy7's Introduction

    ____       _       __     __    ______
   / __ \__  _| |     / /__  / /_  / ____/___  ____  __  __
  / /_/ / / / / | /| / / _ \/ __ \/ /   / __ \/ __ \/ / / /
 / ____/ /_/ /| |/ |/ /  __/ /_/ / /___/ /_/ / /_/ / /_/ /
/_/    \__, / |__/|__/\___/_.___/\____/\____/ .___/\__, /
      /____/                               /_/    /____/

Created By: Raja Tomar
License: Apache License 2.0
Email: [email protected]

PyWebCopy is a free tool for copying full or partial websites locally onto your hard-disk for offline viewing.

PyWebCopy will scan the specified website and download its content onto your hard-disk. Links to resources such as style-sheets, images, and other pages in the website will automatically be remapped to match the local path. Using its extensive configuration you can define which parts of a website will be copied and how.
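The remapping idea can be illustrated with a minimal sketch. It is not PyWebCopy's actual implementation: the helper names `url_to_local_path` and `relative_link` and the `project_root/host/path` folder layout are assumptions made for illustration only.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_local_path(url, project_root="my_site"):
    """Map an absolute URL to a path under project_root (hypothetical layout)."""
    parts = urlsplit(url)
    path = parts.path
    # Treat directory-style URLs as index pages.
    if not path or path.endswith("/"):
        path += "index.html"
    return posixpath.join(project_root, parts.netloc, path.lstrip("/"))


def relative_link(from_url, to_url, project_root="my_site"):
    """Return the link to write into the saved copy of from_url
    so that it points at the saved copy of to_url."""
    source = url_to_local_path(from_url, project_root)
    target = url_to_local_path(to_url, project_root)
    return posixpath.relpath(target, start=posixpath.dirname(source))


print(url_to_local_path("https://httpbin.org/"))
print(relative_link("https://httpbin.org/docs/page.html",
                    "https://httpbin.org/static/style.css"))
```

With this layout, a stylesheet referenced from a page in `docs/` is rewritten to a relative path, so the saved copy works from any folder on disk.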

What can PyWebCopy do?

PyWebCopy will examine the HTML mark-up of a website and attempt to discover all linked resources such as other pages, images, videos, file downloads - anything and everything. It will download all of these resources, and continue to search for more. In this manner, PyWebCopy can "crawl" an entire website and download everything it sees in an effort to create a reasonable facsimile of the source website.

What can PyWebCopy not do?

PyWebCopy does not include a virtual DOM or any form of JavaScript parsing. If a website relies heavily on JavaScript, PyWebCopy is unlikely to make a true copy, because it cannot discover links that JavaScript generates dynamically.
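The difference between links present in the mark-up and links generated by script can be shown with a small sketch built on the standard library's `html.parser`. This is an illustration only, not PyWebCopy's own parser; the class `LinkCollector`, the function `discover_links`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href/src attributes in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def discover_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = """
<html><head><link rel="stylesheet" href="/static/style.css"></head>
<body><a href="about.html">About</a><img src="img/logo.png">
<script>document.write('<a href="/hidden.html">x</a>')</script>
</body></html>
"""

for link in discover_links(sample, "https://httpbin.org/docs/"):
    print(link)
```

The stylesheet, the anchor and the image are found, but `/hidden.html` is not: it exists only inside a script that a static parser never executes.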

PyWebCopy does not download the raw source code of a website; it can only download what the HTTP server returns. While it will do its best to create an offline copy of a website, advanced data-driven websites may not work as expected once they have been copied.

Installation

pywebcopy is available on PyPI and can be installed with pip:

$ pip install pywebcopy

You are ready to go. Read the tutorials below to get started.

First steps

You should always check that pywebcopy was installed successfully and which version you have.

>>> import pywebcopy
>>> pywebcopy.__version__
'7.x.x'

Your version may be different. Now you can continue with the tutorial.
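If you would rather not hit an ImportError when the package is missing, the installed version can also be looked up through the standard library. This is a minimal sketch assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pywebcopy")
if found is None:
    print("pywebcopy is not installed; run: pip install pywebcopy")
else:
    print("pywebcopy", found)
```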

Basic Usage

To save a single page, run the following in a Python console:

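from pywebcopy import save_webpage

save_webpage(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,    # bypass the robots.txt restrictions
    debug=True,
    open_in_browser=True,  # open the saved page in the default browser when done
    delay=None,            # delay between consecutive requests to the server
    threaded=False,        # set to True to use threads for faster downloading
)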

To save a full website (this could overload the target server, so be careful):

from pywebcopy import save_website

save_website(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,
    debug=True,
    open_in_browser=True,
    delay=None,
    threaded=False,
)
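Both examples pass `bypass_robots=True`, which skips the site's robots.txt restrictions. Before crawling a whole site you may want to see what those restrictions are. The sketch below uses the standard library's `urllib.robotparser` on an in-memory rule set; it is independent of pywebcopy, and the helper name `allowed_by_robots` and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a site's real robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(rules, "https://example.com/index.html"))
print(allowed_by_robots(rules, "https://example.com/private/data.html"))
```

If the rules disallow the pages you intend to copy, consider leaving `bypass_robots` off and setting a `delay` between requests.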

Running Tests

Running the tests is simple and doesn't require any external library. Just run this command from the root directory of the pywebcopy package.

$ python -m pywebcopy --tests

Command Line Interface

pywebcopy has an easy-to-use command-line interface that lets you perform tasks without having to worry about the internals.

  • Getting list of commands

    $ python -m pywebcopy --help
  • Using CLI

    Usage: pywebcopy [-p|--page|-s|--site|-t|--tests] [--url=URL [,--location=LOCATION [,--name=NAME [,--pop [,--bypass_robots [,--quite [,--delay=DELAY]]]]]]]
    
    Python library to clone/archive pages or sites from the Internet.
    
    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      --url=URL             url of the entry point to be retrieved.
      --location=LOCATION   Location where files are to be stored.
      -n NAME, --name=NAME  Project name of this run.
      -d DELAY, --delay=DELAY
                            Delay between consecutive requests to the server.
      --bypass_robots       Bypass the robots.txt restrictions.
      --threaded            Use threads for faster downloading.
      -q, --quite           Suppress the logging from this library.
      --pop                 open the html page in default browser window after
                            finishing the task.
    
      CLI Actions List:
        Primary actions available through cli.
    
        -p, --page          Quickly saves a single page.
        -s, --site          Saves the complete site.
        -t, --tests         Runs tests for this library.
    
    
    
  • Running tests

      $ python -m pywebcopy --tests
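The options listed in the help output can also be assembled from a script. The sketch below only builds the argument list from the flags shown above and does not run anything; the function `build_command` and its parameter names are hypothetical helpers, not part of pywebcopy.

```python
import sys


def build_command(action, url, location=None, name=None, delay=None,
                  bypass_robots=False, threaded=False):
    """Assemble an argv list for `python -m pywebcopy` from the documented flags."""
    if action not in ("--page", "--site"):
        raise ValueError("action must be '--page' or '--site'")
    command = [sys.executable, "-m", "pywebcopy", action, "--url=" + url]
    if location is not None:
        command.append("--location=" + location)
    if name is not None:
        command.append("--name=" + name)
    if delay is not None:
        command.append("--delay=" + str(delay))
    if bypass_robots:
        command.append("--bypass_robots")
    if threaded:
        command.append("--threaded")
    return command


command = build_command("--page", "https://httpbin.org/",
                        location="savedpages", name="my_site")
print(" ".join(command[1:]))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once you are happy with the arguments.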

Authentication and Cookies

Authentication is often needed to access a page. It is easy to authenticate with pywebcopy because it uses a requests.Session object for its underlying HTTP activity, which can be accessed through the WebPage.session attribute. There are many tutorials on setting up authentication with requests.Session.

Here is an example of filling in and submitting a form:

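from pywebcopy.configs import get_config

# Build a config for the target site and create a page object from it.
config = get_config('http://httpbin.org/')
wp = config.create_page()
wp.get(config['project_url'])

# Fill in the first form on the page and submit it.
form = wp.get_forms()[0]
form.inputs['email'].value = 'bar' # etc
form.inputs['password'].value = 'baz' # etc
wp.submit_form(form)

# Inspect the links on the page after submitting.
wp.get_links()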

You can read more in the docs folder of the GitHub repository.

pywebcopy7's Issues

UnicodeDecodeError issue with downloaded .html file

Hi, first of all thanks for this repository. I've tried to use pywebcopy but, as was mentioned here, it hangs at some point and cannot even be interrupted. pywebcopy7 seems much better and faster than the previous version; however, I keep getting a UnicodeDecodeError when I try to open the HTML file with Flask. I tried to convert it to UTF-8 with Notepad++ but that changed nothing.

It works like a charm for very small websites, such as http://example.com/ .
But when I try more complicated websites, like news websites, the .html file looks like this in the browser:

[Screenshot: encoding_issue]

When I try to serve the HTML file locally as a Flask template, it gives this error:

 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
[2020-10-07 11:40:13,621] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:\Users\test\Desktop\g_cloner\local_deneme.py", line 17, in home
    return render_template('milliyet_index.html')
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\templating.py", line 138, in render_template
    ctx.app.jinja_env.get_or_select_template(template_name_or_list),
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\jinja2\environment.py", line 930, in get_or_select_template
    return self.get_template(template_name_or_list, parent, globals)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\jinja2\environment.py", line 883, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\jinja2\environment.py", line 857, in _load_template
    template = self.loader.load(self, name, globals)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\jinja2\loaders.py", line 115, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\templating.py", line 60, in get_source
    return self._get_source_fast(environment, template)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\flask\templating.py", line 86, in _get_source_fast
    return loader.get_source(environment, template)
  File "c:\users\test\appdata\local\programs\python\python38\lib\site-packages\jinja2\loaders.py", line 184, in get_source
    contents = f.read().decode(self.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
127.0.0.1 - - [07/Oct/2020 11:40:13] "GET / HTTP/1.1" 500 -

Is there any possible solution for this error?
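One possible reading of the traceback: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), which suggests, but does not prove, that the saved file still holds the gzip-compressed response body rather than plain HTML. The sketch below checks for that signature and decompresses before decoding. The helper name `read_html_text` is hypothetical, and UTF-8 is an assumption; the site may use a different charset.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def read_html_text(raw, encoding="utf-8"):
    """Decode HTML bytes, decompressing first if they start with the gzip magic number."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Demonstration with in-memory data standing in for the saved file.
page = "<html><body>Merhaba dünya</body></html>"
compressed = gzip.compress(page.encode("utf-8"))

print(read_html_text(compressed) == page)
print(read_html_text(page.encode("utf-8")) == page)
```

To try this on the real file, read it in binary mode (`open(path, "rb").read()`), pass the bytes to the helper, and write the result back as UTF-8 text before handing it to Flask.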
