SimpleCrawler: A Single Domain Sitemap Generator

Version 1.0.0

Given a root web page URL that uses either the http:// or https:// protocol, SimpleCrawler recursively traverses all linked web pages that are in the same domain as the root URL. The output is a list of every page reached this way. For each page listed, three additional lists are printed: one for the linked pages in the same domain, one for the linked pages in other domains, and one for the images on the page.
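The three per-page lists can be illustrated with a short sketch built only on the Python standard library. This is not SimpleCrawler's actual code: the names LinkCollector and classify_links are hypothetical, and "same domain" is assumed here to mean an identical host part of the URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect <a href> and <img src> targets of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative references against the URL of the page itself.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.page_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.page_url, attrs["src"]))


def classify_links(root_url, page_url, html):
    """Return (same_domain, other_domain, images) for one page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    root_host = urlparse(root_url).netloc
    same_domain, other_domain = [], []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parsed.netloc == root_host:
            same_domain.append(link)
        else:
            other_domain.append(link)
    return same_domain, other_domain, collector.images


sample = '<a href="/about">About</a><a href="https://other.org/">Other</a><img src="logo.png">'
print(classify_links("https://example.com/", "https://example.com/index.html", sample))
```

A recursive crawl would apply the same classification to each URL in the same-domain list, keeping a set of visited URLs so that no page is processed twice.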

Getting Started

Python 3 is required. To check which version is installed, run:

$ python --version

The output should look like Python 3.6.2. On systems where python still refers to Python 2, try python3 --version instead. If you do not have Python 3, please install the latest 3.x version from python.org.
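The same check can be made from inside the interpreter. The snippet below is only a sketch and is not part of the project; the function name is_python3 is hypothetical.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the major version number is 3."""
    return version_info[0] == 3


if not is_python3():
    raise SystemExit("SimpleCrawler needs Python 3, found " + sys.version.split()[0])
print("Python", sys.version.split()[0])
```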

There are several ways to install SimpleCrawler. Method one is to check out the project and install all of the application's dependencies. Method two is to download the source files into a directory, create a virtual environment, and use the tools in that environment to install the required Python libraries automatically. The two methods can also be combined.

This document covers method two, so virtualenv needs to be installed on your system.

Before verifying that virtualenv is installed, make sure that pip is available. You can check this by running:

$ pip --version

Pip can be installed in a number of ways, for example with sudo or Homebrew. If pip is not installed, please refer to the documentation for your platform. For example, on a Mac, pip can be installed as follows:

$ sudo easy_install pip

To verify that virtualenv is installed, run:

$ virtualenv --version

If virtualenv is not installed, install it as follows:

$ pip install --user virtualenv

Installing

Download the project files from GitHub into a project directory. The master branch should have the latest stable code.

Open a new terminal window and cd to the project directory. Create a virtual environment as follows:

$ virtualenv env

This will create a new directory in the project called env. Now activate this virtual environment for the terminal window by entering the following command:

$ source env/bin/activate

This will update the environment of the current terminal by setting the VIRTUAL_ENV variable and adding the env/bin directory to the head of the PATH.
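To confirm that the activation took effect, the interpreter itself can be asked. This is a minimal sketch, not project code; the function name in_virtualenv is hypothetical. It relies on the fact that inside a virtual environment sys.prefix differs from sys.base_prefix (older virtualenv releases set sys.real_prefix instead).

```python
import os
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases set sys.base_prefix to the base installation.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


print("virtual environment active:", in_virtualenv())
print("VIRTUAL_ENV =", os.environ.get("VIRTUAL_ENV"))
print("first PATH entry:", os.environ.get("PATH", "").split(os.pathsep)[0])
```

Run with the python found on the PATH after activation, the first line should report True and the first PATH entry should end in env/bin.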

Now install the required packages into the newly created virtual environment:

$ python3 -m pip install -r requirements.txt

This will ensure that the libraries required by the SimpleCrawler application, as specified in the requirements.txt file, are installed in the virtual environment just created.

Running the tests

After the virtual environment has been set up, the installation can be verified by running the project tests in the top level project directory as follows:

$ python3 -m unittest

This will run a battery of tests.
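When called without arguments, python3 -m unittest runs test discovery and picks up files whose names match test*.py. The project's own tests are not reproduced here; the sketch below only shows the shape of a test that discovery would find. The helper same_domain and the class SameDomainTest are hypothetical.

```python
import unittest
from urllib.parse import urlparse


def same_domain(root_url, candidate_url):
    """Hypothetical helper: True when both URLs share the same host."""
    return urlparse(root_url).netloc == urlparse(candidate_url).netloc


class SameDomainTest(unittest.TestCase):
    def test_same_host(self):
        self.assertTrue(same_domain("https://example.com/", "https://example.com/about"))

    def test_other_host(self):
        self.assertFalse(same_domain("https://example.com/", "https://other.org/"))
```

Saved under a name such as test_example.py (also an assumption) in the top level directory, this file would be run together with the rest of the test suite.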

Running the application

To run the SimpleCrawler application, enter the following command in the top level project directory:

$ python3 run.py -r'ValidUrl'

where 'ValidUrl' is a valid URL with a protocol of either http:// or https://
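How run.py actually parses its arguments is not shown in this document. As a rough sketch of what accepting and validating the -r option could look like with the standard library, assuming hypothetical names valid_root_url and build_parser and a hypothetical destination name root:

```python
import argparse
from urllib.parse import urlparse


def valid_root_url(text):
    """Accept only URLs with an http or https scheme and a host part."""
    parsed = urlparse(text)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise argparse.ArgumentTypeError(f"not an http:// or https:// URL: {text!r}")
    return text


def build_parser():
    parser = argparse.ArgumentParser(description="Single domain sitemap sketch")
    # argparse accepts the value attached to the short option, as in -r'ValidUrl';
    # the shell removes the quotes before the program sees the argument.
    parser.add_argument("-r", dest="root", required=True, type=valid_root_url,
                        help="root URL, starting with http:// or https://")
    return parser


args = build_parser().parse_args(["-rhttps://example.com/"])
print(args.root)
```

With this sketch, a mistyped protocol such as https;// is rejected by the parser before any crawling starts.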

Contributors

jdiamand
