
websitetomongo's Introduction

## What Does It Do?

WebsiteToMongo is a configurable web crawler that downloads websites locally and then indexes the results in a MongoDB database.

## How To Configure

Configuration is relatively easy and mostly requires inputting values into indexRules.settings.

Config Explained

### Defaults

##### followExternalLinks: (true/false)
Whether to follow external links.

##### respectRobots: (true/false)
Whether to respect robots.txt. This option is not implemented yet.

### Basic Config

##### homepageURL: http://www.google.com/
The URL or local file where the web crawler should start.

##### websiteURL: http://www.google.com/
URL of the website (when crawling an offline copy, this is still the site's actual URL).

##### workingDir: /Users/sampleUser/Documents/testWebsites/
Source folder.

##### subdir: google
Where to save within the source folder.

##### saveType: (fullSite/content)
fullSite downloads the entire site for offline viewing, while content downloads only certain content.
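The basic settings above could be read with a few lines of code. The sketch below is not the project's own loader. It assumes indexRules.settings holds one `key: value` pair per line and that a key may repeat (as `disallow` does further down). The name `parse_settings` is hypothetical.

```python
from collections import defaultdict


def parse_settings(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values.

    Assumption: one pair per line, split on the FIRST colon only, so that
    values such as URLs keep their own colons.
    """
    settings = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not a pair
        key, value = line.split(":", 1)
        settings[key.strip()].append(value.strip())
    return dict(settings)


sample = """\
homepageURL: http://www.google.com/
websiteURL: http://www.google.com/
workingDir: /Users/sampleUser/Documents/testWebsites/
subdir: google
saveType: fullSite
"""

config = parse_settings(sample)
print(config["homepageURL"])
```

Storing every value in a list keeps repeated keys from overwriting each other, at the cost of indexing `[0]` for single-valued settings.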

### Content Config

Only required for saveType:content.

##### contentType: video
The selector for the content type to download.

##### contentLoc: src
The tag to get the content source eg

### Database Config

##### database: wikiForSchoolsTest
Mongo database name.

##### collection: activities
Mongo collection name.

### Follow Rules

Rules about what links to follow and files to download.

##### downloadFiles: jpg, png, gif, jpeg, tif, css, js
File types to download.

##### linksToFollow: html, htm, php, asp
Href links to follow.

##### disallow: .png.htm
Example exclude path.

##### disallow: /images/
Example exclude path.
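One way to picture how the follow rules interact is a small classifier that decides, per URL, whether to download it, follow it, or skip it. This is a sketch, not the crawler's actual logic. It assumes that `disallow` entries are matched as substrings of the URL path and that file types are compared by extension. The names `classify`, `extension`, and `is_disallowed` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Values taken from the example configuration above.
DOWNLOAD_FILES = {"jpg", "png", "gif", "jpeg", "tif", "css", "js"}
LINKS_TO_FOLLOW = {"html", "htm", "php", "asp"}
DISALLOW = [".png.htm", "/images/"]


def extension(url):
    """Return the lowercase file extension of the URL path, without the dot."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lstrip(".").lower()


def is_disallowed(url, disallow=DISALLOW):
    """Assumed semantics: a rule matches if it occurs anywhere in the path."""
    path = urlparse(url).path
    return any(rule in path for rule in disallow)


def classify(url):
    """Return 'download', 'follow', or 'skip' for a URL."""
    if is_disallowed(url):
        return "skip"
    ext = extension(url)
    if ext in DOWNLOAD_FILES:
        return "download"
    if ext in LINKS_TO_FOLLOW:
        return "follow"
    return "skip"


for url in ("http://www.google.com/about.html",
            "http://www.google.com/logo.png",
            "http://www.google.com/images/logo.png"):
    print(url, classify(url))
```

Note that in this sketch a URL with no extension, such as a bare directory path, is skipped. A real crawler would probably need a rule for that case.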

### Index Rules

Rules about what files to index.

##### includes: video
Index only if the page has the selector, eg #id, .class, or an element such as button.

##### !include: .audio
Don't index if the page has the selector.

##### url-pattern: relevantFiles/
Index only if the URL has the pattern.

##### !url-pattern: uselessFiles/
Don't index if the URL has the pattern.
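The index rules can be read as four filters applied to each page. The sketch below shows one plausible reading and is not the project's implementation. It assumes only simple selectors (`#id`, `.class`, or an element name), substring matching for URL patterns, and that every `includes` selector must be present. `SelectorCollector` and `should_index` are hypothetical names. Parsing uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class SelectorCollector(HTMLParser):
    """Collect every element name, '#id', and '.class' seen in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        self.found.add(tag)
        for name, value in attrs:
            if name == "id" and value:
                self.found.add("#" + value)
            elif name == "class" and value:
                self.found.update("." + cls for cls in value.split())


def should_index(html, url, includes=(), excludes=(),
                 url_patterns=(), excluded_patterns=()):
    """Apply the four index rules; every rule must pass for the page to be indexed."""
    if any(pattern in url for pattern in excluded_patterns):
        return False
    if url_patterns and not any(pattern in url for pattern in url_patterns):
        return False
    collector = SelectorCollector()
    collector.feed(html)
    if any(selector in collector.found for selector in excludes):
        return False
    return all(selector in collector.found for selector in includes)


page = '<html><body><video src="clip.mp4"></video></body></html>'
rules = dict(includes=["video"], excludes=[".audio"],
             url_patterns=["relevantFiles/"], excluded_patterns=["uselessFiles/"])
print(should_index(page, "relevantFiles/clip.html", **rules))
```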

## To Index

How indexed results are loaded into the database. Each entry has the form field:value.

### Fields

No rules about names.

### Value

#### Defaults

A default always starts with $, followed by the name of a default value:

$fileType

The filetype eg html

$size

The size of a file in MB

$fileName

The name of the file eg index.html

$filePath

The full file path from the working directory, eg google/index.html

$linkText

The text value of a link that points to the page.

#### Literals

Denoted by text inside ' or ". A literal is always read as-is.

#### Commands

Grab certain elements inside the document. Commands are pretty limited as of now, but you can grab any selector and any of its children.

##### selector
Grabs the text of a selector.

##### selector+child(0)
Grabs the text of the first child of a selector.
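The three kinds of value (defaults, literals, and commands) can be told apart by their first character. The sketch below resolves defaults and literals and only tags commands for later handling. It is an illustration under assumptions, not the project's code: `resolve_value` and `resolve_default` are hypothetical names, `$size` is assumed to mean bytes divided by 1024 * 1024, and `$filePath` is assumed to be relative to the working directory.

```python
import os


def resolve_default(name, working_dir, file_path, link_text=""):
    """Resolve one of the $-prefixed default values for a downloaded file."""
    if name == "$fileType":
        return os.path.splitext(file_path)[1].lstrip(".")
    if name == "$size":
        # Assumption: MB means bytes / (1024 * 1024); the file must exist.
        full_path = os.path.join(working_dir, file_path)
        return os.path.getsize(full_path) / (1024 * 1024)
    if name == "$fileName":
        return os.path.basename(file_path)
    if name == "$filePath":
        return file_path
    if name == "$linkText":
        return link_text
    raise KeyError(f"unknown default: {name}")


def resolve_value(value, working_dir, file_path, link_text=""):
    """Classify a value as literal, default, or command and resolve what it can."""
    value = value.strip()
    # Literal: text wrapped in matching single or double quotes.
    if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
        return value[1:-1]
    if value.startswith("$"):
        return resolve_default(value, working_dir, file_path, link_text)
    # Anything else is treated as a selector command; evaluating it needs
    # an HTML parser and is left out of this sketch.
    return ("command", value)


print(resolve_value("$fileName", "/tmp", "google/index.html"))
print(resolve_value("'activity'", "/tmp", "google/index.html"))
```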

websitetomongo's People

Contributors

iancostello
