browsercrawler's Introduction

WHAT

The BrowserCrawler plugin for Safari pulls every linked page under a particular root on a
website and uploads it directly to your S3 bucket. There is no limit to the depth it will crawl.
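
As a rough illustration of an unbounded-depth crawl, here is a minimal breadth-first sketch. It is not taken from the plugin's source: the function name `crawl` and the caller-supplied `getLinks` and `inScope` callbacks are hypothetical, and the link data is an in-memory stand-in for real pages. A visited set is what keeps a crawl with no depth limit from looping forever.

```javascript
// Hypothetical sketch, not the plugin's code.
// getLinks(url) returns the hrefs found on a page; inScope(url) decides
// whether an absolute URL belongs to the crawl. Both are assumptions.
function crawl(seedUrl, getLinks, inScope) {
  const start = new URL(seedUrl).href;
  const visited = new Set([start]); // guards against cycles
  const queue = [start];
  const order = [];
  while (queue.length > 0) {
    const url = queue.shift();
    order.push(url); // a real crawler would upload the page here
    for (const link of getLinks(url)) {
      const next = new URL(link, url); // resolve relative links
      next.hash = "";                  // ignore #fragments
      if (!visited.has(next.href) && inScope(next.href)) {
        visited.add(next.href);
        queue.push(next.href);
      }
    }
  }
  return order;
}

// Usage with a tiny made-up site.
const site = {
  "https://example.com/docs/": ["a.html", "/about.html"],
  "https://example.com/docs/a.html": ["b.html", "./"],
  "https://example.com/docs/b.html": ["a.html#top"],
};
const pages = crawl(
  "https://example.com/docs/",
  (url) => site[url] || [],
  (url) => url.startsWith("https://example.com/docs/")
);
console.log(pages);
```

The queue can grow as deep as the site goes; only the scope check and the visited set bound the work.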

WHY

Sometimes you want a permanent copy of something you find on the web. For example, I used it to grab a copy of the
Mozilla JavaScript documentation to have offline. Use it to back up your own data on sites that require authentication
but don't have a great data portability solution.

HOW

1. Install the plugin, either by downloading it or by building it in Safari with the built-in Extension Builder.
2. Configure your AWS S3 settings in the preferences.
3. When you are on the page where you want to start the crawl, either click the spider button or
   right-click and start the crawl.

It will continue the crawl as long as the page is open, updating a progress box with the URL
it is currently crawling. It will only crawl pages at the same level as the seed page or below it. You can cancel at any time
by clicking "Cancel".
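
One way to read "at the same level or below" is: same origin, and a path that starts with the seed page's directory. The sketch below assumes that reading; `scopeRoot` and `isInScope` are hypothetical names, not functions from the plugin, and the plugin's actual rule may differ.

```javascript
// Hypothetical helpers, not the plugin's code.
// Assumption: "same level or below" means "inside the seed page's directory".
function scopeRoot(seedUrl) {
  const seed = new URL(seedUrl);
  // Drop the final path segment (the page itself) and keep its directory.
  const dir = seed.pathname.slice(0, seed.pathname.lastIndexOf("/") + 1);
  return { origin: seed.origin, dir };
}

function isInScope(seedUrl, candidateUrl) {
  const root = scopeRoot(seedUrl);
  let candidate;
  try {
    // Resolve relative links against the seed page.
    candidate = new URL(candidateUrl, seedUrl);
  } catch (e) {
    return false; // unparsable link
  }
  return candidate.origin === root.origin &&
    candidate.pathname.startsWith(root.dir);
}

const seed = "https://example.com/docs/js/index.html";
console.log(isInScope(seed, "guide/intro.html"));            // below the seed
console.log(isInScope(seed, "../css/index.html"));           // above the seed
console.log(isInScope(seed, "https://other.example/docs/js/")); // other site
```

Under this assumption, sibling pages and anything in subdirectories are crawled, while parent directories and other hosts are skipped.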

WARNING

BrowserCrawler will crawl the site as YOU, using your cookies, and so will capture any data that you would normally have access to.
It will not run JavaScript, but it does download images (though it doesn't upload them). This can use
a tremendous amount of bandwidth if you happen to crawl a big website. Also, be careful with small sites: they might
not have the capacity to endure a full-speed crawl, even from a single machine.

CREDITS

Thanks to l.m.orchard at pobox.com for the S3 library and Paul Johnston, Greg Holt, Andrew Kepert, Ydnar, and Lostinet
for the SHA1 implementation that it uses. jQuery 1.5.2 isn't included, but the plugin uses it to do the actual crawl.

CONTRIBUTORS

spullara