seethroughdev / status-crawler

A fully configurable crawler to check your website status codes, javascript errors and anything you want.

JavaScript 100.00%

status-crawler's Issues

Trouble crawling https sites

Hello, I am fairly new to programming, so forgive me if this is a simple question. I have been able to use your tool against HTTP sites with no issue, but I am having trouble getting it to work with an HTTPS site. It hits the first page and returns null, indicating that it did not crawl the exits. Do you have any suggestions for getting it to work with an HTTPS site? Thank you.

404s not being added to errors array

The spider works well but I'm finding that 404s from the same domain are not being added to the errors array. Is this by design or an issue? Is it possible to add an errorCount property such that consumers of the JSON know if there are issues with the crawl (rather than traversing the entire links collection)?

{
  "start": "http://test.xxx.com",
  "date": "2015-03-02T18:35:48.946Z",
  "dateFileName": true,
  "requiredValues": [
    "test.xxx.com"
  ],
  "skippedValues": [
    "default"
  ],
  "links": [
    {
      "url": "http://test.xxx.com#portfolioModal6",
      "status": 200
    },
    {
      "url": "http://test.xxx.com#",
      "status": 200
    },
    {
      "url": "http://test.xxx.com/help",
      "status": 404
    },
    {
      "url": "http://test.xxx.com/contact",
      "status": 404
    }
  ],
  "errors": [],
  "messages": [],
  "skippedLinksCount": 7,
  "logFile": "./logs/2015-03-02-data.json",
  "linkCount": 4,
  "userAgent": null
}
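Until something like an errorCount property exists, a consumer can derive it from the links collection. The following is a minimal sketch, assuming the log has the shape shown above; `summarizeCrawl` is a hypothetical helper name, not part of status-crawler, and treating any non-numeric status (such as null) or any status of 400 and above as an error is an assumption.

```javascript
// Hypothetical helper, not part of status-crawler.
// Assumes `log.links` is an array of { url, status } objects as in the JSON above.
function summarizeCrawl(log) {
  var failed = log.links.filter(function (link) {
    // Assumption: a missing/null status or any 4xx/5xx status counts as an error.
    return typeof link.status !== 'number' || link.status >= 400;
  });
  return {
    errorCount: failed.length,
    failedUrls: failed.map(function (link) { return link.url; })
  };
}

// Usage with the four links from the report above.
var sampleLog = {
  links: [
    { url: 'http://test.xxx.com#portfolioModal6', status: 200 },
    { url: 'http://test.xxx.com#', status: 200 },
    { url: 'http://test.xxx.com/help', status: 404 },
    { url: 'http://test.xxx.com/contact', status: 404 }
  ]
};
console.log(summarizeCrawl(sampleLog).errorCount); // 2
```

A build script could read the log file named in logFile, pass the parsed JSON to this function, and fail when errorCount is greater than zero.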

User Agent and Proxies

Hi there.

I tested your spider and it's working well. Two suggestions:

User agent option, so you can change it to whatever you want
Proxy support, so you can add proxies and crawl pages through them, in the form PROXY:PORT:USERNAME:PASSWORD
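As a rough illustration of the proxy suggestion, the PROXY:PORT:USERNAME:PASSWORD string could be split into the `--proxy` and `--proxy-auth` command-line options that PhantomJS documents. This is only a sketch: `proxyToPhantomArgs` is a hypothetical helper, status-crawler does not provide it, and whether the crawler forwards these options is an assumption.

```javascript
// Hypothetical helper, not part of status-crawler.
// Converts "HOST:PORT" or "HOST:PORT:USERNAME:PASSWORD" into PhantomJS-style options.
function proxyToPhantomArgs(spec) {
  var parts = spec.split(':');
  // Limitation of this sketch: a password containing ':' is rejected.
  if (parts.length !== 2 && parts.length !== 4) {
    throw new Error('Expected HOST:PORT or HOST:PORT:USERNAME:PASSWORD');
  }
  var args = ['--proxy=' + parts[0] + ':' + parts[1]];
  if (parts.length === 4) {
    args.push('--proxy-auth=' + parts[2] + ':' + parts[3]);
  }
  return args;
}

// Usage with placeholder values.
console.log(proxyToPhantomArgs('10.0.0.1:8080:user:secret').join(' '));
// --proxy=10.0.0.1:8080 --proxy-auth=user:secret
```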

url bugs

casperjs --start-url=http://www.ylwyl.com/ --required-values=www.ylwyl.com spider.js

Look at the URLs in the output:

......
http://www.ylwyl.com/uploads/plsel20120703032237aLgSg.jpg
http://www.ylwyl.com/uploads/plsel20120703032237GKVQI.jpg
http://www.ylwyl.com/uploads/plsel20120703032237SDQXP.jpg
http://www.ylwyl.com/uploads/plsel20120703032237irmei.jpg


http://www.ylwyl.com?typeid=35&page=2

http://www.ylwyl.com/uploads/plsel2012070303253109413.jpg
http://www.ylwyl.com/uploads/plsel2012070303253149819.jpg
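The report does not say which URL is wrong. If the problem is the entry http://www.ylwyl.com?typeid=35&page=2, which has no slash between the host and the query string, one way to avoid it is to resolve each href against the page URL with a standards-based parser. The sketch below uses the WHATWG `URL` class available in Node.js and modern browsers; it is an assumption that the crawler builds URLs by hand, and `resolveLink` is a hypothetical name. Old PhantomJS releases may not provide `URL`, so this is meant to run under Node.js.

```javascript
// Hypothetical helper, not part of status-crawler.
// Resolves an href found on a page against that page's URL.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// A query-only href keeps the "/" path of the base page.
console.log(resolveLink('?typeid=35&page=2', 'http://www.ylwyl.com/'));
// http://www.ylwyl.com/?typeid=35&page=2

// An absolute URL with an empty path is normalised to "/".
console.log(resolveLink('http://www.ylwyl.com?typeid=35&page=2', 'http://www.ylwyl.com/'));
// http://www.ylwyl.com/?typeid=35&page=2
```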

--required-values spiders ALL

casperjs --start-url=http://www.proxymis.com --required-values=proxymis.com spider.js

The spider still crawls links that do not contain the required value, for example:

200 http://www.google-analytics.com/ga.js
200 http://fonts.gstatic.com/s/economica/v4/UK4l2VEpwjv3gdcwbwXE9InF5uFdDttMLvmWuJdhhgs.ttf
200 http://fonts.gstatic.com/s/economica/v4/jObgDQiPUtmACAaaK3pMG6CWcynf_cDxXwCLxiixG1c.ttf
200 http://fonts.gstatic.com/s/lato/v11/v0SdcGFAl2aezM9Vq_aFTQ.ttf
200 http://fonts.gstatic.com/s/lato/v11/nj47mAZe0mYUIySgfn0wpQ.ttf
200 http://connect.facebook.net/fr_FR/all.js#xfbml=1

Shouldn't it ONLY spider resources that contain the required-values parameter?
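The behaviour the reporter expects can be sketched as a simple substring filter. This is not taken from spider.js; `matchesRequiredValues` is a hypothetical name, and plain substring matching against the whole URL is an assumption about what "contains the required value" should mean.

```javascript
// Hypothetical filter, not the actual spider.js logic.
// Returns true when the URL contains at least one of the required values.
function matchesRequiredValues(url, requiredValues) {
  return requiredValues.some(function (value) {
    return url.indexOf(value) !== -1;
  });
}

// Usage with URLs from the report above.
var required = ['proxymis.com'];
console.log(matchesRequiredValues('http://www.proxymis.com', required)); // true
console.log(matchesRequiredValues('http://www.google-analytics.com/ga.js', required)); // false
```

Because this matches anywhere in the URL, a third-party link that carries the value in its query string would also pass; comparing against the hostname only would be stricter.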

What versions of phantomjs

What version of phantomjs and casperjs does this work with?

Currently I'm using CasperJS 1.0.4 and PhantomJS 1.8.2.

I keep getting the following result...

^_^[sam@casperjs:~/casperjs-spider]$ casperjs spider.js
{
    "casper-path": "/home/sam/packages/casperjs",
    "cli": true
}
null https://example.com/
Crawl has completed!
Data file can be found at ./logs/data.json.

The page has many links though. I also saw this result when I tried the original spider.js from PlanZero.

load-images parameter not working

The load-images parameter doesn't seem to be working from the command line:

$ casperjs.exe --start-url=http://test.xxx.com --required-values=test.xxx.com --load-images=true spider.js

Even if I change this within config.js explicitly, the spider doesn't seem to crawl tags in my site's pages.

Versions: CasperJS 1.1.0-beta3 and PhantomJS 1.9.6.

NPM package?

Any interest in publishing this as a CLI tool on NPM?

I have a Travis CI build running on a static site that checks for various errors, and I'd like to be able to run the crawler during the build to make sure links and JavaScript aren't getting broken on the branch.

To do that, Travis would need a package it could grab and install.
