js-crawler's Introduction

js-crawler

Web crawler for Node.js; both HTTP and HTTPS are supported.

Installation

npm install js-crawler

Usage

The crawler provides an intuitive interface for crawling links on web sites. Example:

var Crawler = require("js-crawler").default;

new Crawler().configure({depth: 3})
  .crawl("http://www.google.com", function onSuccess(page) {
    console.log(page.url);
  });

The call to configure is optional; if it is omitted, the default option values will be used.

The onSuccess callback will be called for each page that the crawler has crawled. The page value passed to the callback will contain the following fields:

  • url - URL of the page
  • content - body of the page (usually HTML)
  • status - the HTTP status code

Extra information can be retrieved from the rest of the page fields: error, response and body, which are identical to the ones passed to the callback of a request invocation of the Request module. The referer field will reference the URL of the page that led the crawler to the current page.
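
For example, a success callback might inspect these fields directly (a minimal sketch using only the fields described above):

var Crawler = require("js-crawler").default;

new Crawler().configure({depth: 2})
  .crawl("http://www.google.com", function onSuccess(page) {
    // url, status, content and referer are the fields documented above
    console.log(page.status + " " + page.url + " (linked from " + page.referer + ")");
    console.log("content length: " + page.content.length);
  });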

Options-based API

An alternative API for passing callbacks to the crawl function:

var Crawler = require("js-crawler").default;

var crawler = new Crawler().configure({ignoreRelative: false, depth: 2});

crawler.crawl({
  url: "https://github.com",
  success: function(page) {
    console.log(page.url);
  },
  failure: function(page) {
    console.log(page.status);
  },
  finished: function(crawledUrls) {
    console.log(crawledUrls);
  }
});

Handling errors

It is possible to pass an extra callback to handle errors; consider this modified version of the example above:

var Crawler = require("js-crawler").default;

new Crawler().configure({depth: 3})
  .crawl("http://www.google.com", function(page) {
    console.log(page.url);
  }, function(response) {
    console.log("ERROR occurred:");
    console.log(response.status);
    console.log(response.url);
    console.log(response.referer);
  });

Here the second callback will be called for each page that could not be accessed (maybe because the corresponding site is down). status may not be defined.

Knowing when all crawling is finished

An extra callback can be passed that will be called when all the URLs have been crawled and crawling has finished. All crawled URLs will be passed to that callback as an argument.

var Crawler = require("js-crawler").default;

new Crawler().configure({depth: 2})
  .crawl("http://www.google.com", function onSuccess(page) {
    console.log(page.url);
  }, null, function onAllFinished(crawledUrls) {
    console.log('All crawling finished');
    console.log(crawledUrls);
  });

Limiting the rate at which requests are made

maxRequestsPerSecond option

By default the maximum number of HTTP requests made per second is 100, but this can be adjusted with the maxRequestsPerSecond option if one wishes not to use too much network bandwidth or, conversely, wishes for even faster crawling.

var Crawler = require("js-crawler").default;

var crawler = new Crawler().configure({maxRequestsPerSecond: 2});

crawler.crawl({
  url: "https://github.com",
  success: function(page) {
    console.log(page.url);
  },
  failure: function(page) {
    console.log(page.status);
  }
});

With this configuration at most 2 requests per second will be issued. The actual request rate also depends on the network speed; maxRequestsPerSecond only configures the upper bound.

maxRequestsPerSecond can also be fractional; the value 0.1, for example, would mean at most one request every 10 seconds.

maxConcurrentRequests option

Even more flexibility is possible with the maxConcurrentRequests option, which limits the number of HTTP requests that can be active simultaneously. If the request rate is too high for a given set of sites or for the network, requests may start to pile up; specifying maxConcurrentRequests helps ensure that the network is not overloaded with accumulating requests.
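
For illustration, a minimal sketch using only maxConcurrentRequests (the limit of 10 and the crawled URL are arbitrary choices, not defaults):

var Crawler = require("js-crawler").default;

// At most 10 HTTP requests will be in flight at any given moment.
var crawler = new Crawler().configure({maxConcurrentRequests: 10});

crawler.crawl({
  url: "https://github.com",
  success: function(page) {
    console.log(page.url);
  },
  failure: function(page) {
    console.log(page.status);
  }
});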

Specifying both options

It is possible to customize both options in case we are not sure how performant the network and sites are. maxRequestsPerSecond then limits how many requests the crawler is allowed to make, and maxConcurrentRequests specifies how the crawler should adjust its request rate depending on the real-time performance of the network and sites.

var Crawler = require("js-crawler").default;

var crawler = new Crawler().configure({
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5
});

crawler.crawl({
  url: "https://github.com",
  success: function(page) {
    console.log(page.url);
  },
  failure: function(page) {
    console.log(page.status);
  }
});

Default option values

By default the values are as follows:

  • maxRequestsPerSecond - 100

  • maxConcurrentRequests - 10

That is, we expect that on average 100 requests will be made every second with only 10 running concurrently, i.e. every request is expected to take roughly 100 ms to complete.

Reusing the same crawler instance for repeated crawling: forgetting crawled urls

By default a crawler instance will remember all the URLs it has ever crawled and will not crawl them again. To make it forget all the crawled URLs, the forgetCrawled method can be used. Another way to solve the same problem is to create a new crawler instance. Example: https://github.com/antivanov/js-crawler/blob/master/examples/github_forgetting_crawled_urls.js
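
A minimal sketch of reusing a single instance with forgetCrawled (the URL is arbitrary; see also the linked example above):

var Crawler = require("js-crawler").default;

var crawler = new Crawler().configure({depth: 2});

crawler.crawl("https://github.com", function(page) {
  console.log(page.url);
}, null, function onFirstCrawlFinished() {
  // Without this call the second crawl would skip all the URLs
  // remembered from the first crawl.
  crawler.forgetCrawled();
  crawler.crawl("https://github.com", function(page) {
    console.log(page.url);
  });
});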

Supported options

  • depth - the depth to which the links from the original page will be crawled. Example: if site1.com contains a link to site2.com, which contains a link to site3.com, depth is 2 and we crawl from site1.com, then we will crawl site2.com but will not crawl site3.com because it would be too deep.

The default value is 2.

  • ignoreRelative - ignore relative URLs: relative URLs on the same page will be ignored when crawling, so /wiki/Quick-Start will not be crawled while https://github.com/explore will be crawled. This option can be useful when we are mainly interested in the sites to which the current site refers, and not just different sections of the original site.

The default value is false.

  • userAgent - User agent to send with crawler requests.

The default value is crawler/js-crawler.

  • maxRequestsPerSecond - the maximum number of HTTP requests per second that can be made by the crawler, default value is 100

  • maxConcurrentRequests - the maximum number of concurrent requests that should not be exceeded by the crawler, the default value is 10

  • shouldCrawl - function that specifies whether a url should be crawled/requested, returns true or false, argument is the current url the crawler considers for crawling

  • shouldCrawlLinksFrom - function that specifies whether the crawler should crawl links found at a given url, returns true or false, argument is the current url being crawled

Note: shouldCrawl determines whether a given URL should be requested/visited at all, whereas shouldCrawlLinksFrom determines whether the links found at a given URL should be harvested and added to the crawling queue. Many users may find that shouldCrawl alone is sufficient, as links from a page cannot be crawled if the page is never visited in the first place. However, there is a common use case for shouldCrawlLinksFrom: if a user would like to check the external links on a site for errors without crawling those external sites, they can provide a shouldCrawlLinksFrom function that restricts link harvesting to the original site (see the sketch after the shouldCrawl example below).

Examples:

shouldCrawl: the following will crawl subreddit index pages reachable on reddit.com from the JavaScript subreddit:

var Crawler = require("js-crawler").default;

var rootUrl = "http://www.reddit.com/r/javascript";

function isSubredditUrl(url) {
  return !!url.match(/www\.reddit\.com\/r\/[a-zA-Z0-9]+\/$/g);
}

var crawler = new Crawler().configure({
  shouldCrawl: function(url) {
    return isSubredditUrl(url) || url == rootUrl;
  }
});

crawler.crawl(rootUrl, function(page) {
  console.log(page.url);
});
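
shouldCrawlLinksFrom: a minimal sketch of the use case described in the note above, reporting broken links (including external ones) of a site without crawling the external sites any further; mysite.com is a placeholder:

var Crawler = require("js-crawler").default;

var siteUrl = "http://mysite.com";

var crawler = new Crawler().configure({
  // Visit every discovered link so that broken ones are reported...
  shouldCrawl: function(url) {
    return true;
  },
  // ...but only harvest further links from pages of the original site,
  // so external sites are not crawled any deeper.
  shouldCrawlLinksFrom: function(url) {
    return url.indexOf(siteUrl) === 0;
  }
});

crawler.crawl({
  url: siteUrl,
  success: function(page) {
    console.log("OK: " + page.url);
  },
  failure: function(page) {
    console.log("Broken link: " + page.url + " (status " + page.status + ", linked from " + page.referer + ")");
  }
});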

The default value for each is a function that always returns true.

Development

Install dependencies

npm install

Running the build

npm run build

Unit tests

npm test

launches unit tests in console mode

npm run test:tdd

launches a browser in which unit tests can be debugged

End-to-end tests

mocha and express are used to set up and run the end-to-end tests

Make sure dependencies are installed (mocha is included)

npm install

Install express globally

npm install -g express

Start the end-to-end target server

cd e2e
node server.js

Now the server runs on port 3000. Run the end-to-end specs:

npm run e2e

License

MIT License (c) Anton Ivanov

Credits

The crawler depends on the following Node.js modules:

  • request
  • underscore

js-crawler's People

Contributors

amoilanen, craig-sparks, jankcat, mathewbergt, redpist, shawnsparks, tibetty

js-crawler's Issues

getting unknown encoding error on some pages

When parsing various URLs, I came across this link: http://www.sanssouci-wien.com/

which on one of its pages seems to throw the following error:

buffer.js:497
          throw new TypeError('Unknown encoding: ' + encoding);
          ^

TypeError: Unknown encoding: none
    at Buffer.slowToString (buffer.js:497:17)
    at Buffer.toString (buffer.js:510:27)
    at Crawler._getDecodedBody (/node_modules/js-crawler/crawler.js:267:24)
    at /node_modules/js-crawler/crawler.js:221:37
    at Request._callback (/node_modules/js-crawler/crawler.js:183:7)
    at Request.self.callback (/node_modules/js-crawler/node_modules/request/request.js:368:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1219:14)
    at emitOne (events.js:101:20)
    at Request.emit (events.js:188:7)
    at IncomingMessage.<anonymous> (/node_modules/js-crawler/node_modules/request/request.js:1167:12)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I couldn't figure out yet which page exactly is to blame; here is my config for the crawler:

new jsCrawler().configure({
  depth: 3,
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5,
  shouldCrawl: function (url) {
    let simplifiedUrl = starturl.substring(starturl.indexOf('//') + 2).replace('www.', '');
    return url.includes(simplifiedUrl);
  }
})

Crawler is not a function

Hi, I am trying to setup a basic crawler script but I am getting an error:

new Crawler().configure({depth: 3})
^

TypeError: Crawler is not a function
    at Object.<anonymous> (/var/www/user/test2.js:3:1)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:417:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Function.Module.runMain (module.js:442:10)
    at startup (node.js:136:18)
    at node.js:966:3

test2.js

var Crawler = require("js-crawler").default;

new Crawler().configure({depth: 3})
  .crawl("http://www.google.com", function onSuccess(page) {
    console.log(page.url);
  });

What am I doing wrong?

stop crawling

Is it possible to force the crawler to stop crawling? I have a condition that only 500 pages should be crawled; when that condition is met I want to stop the crawler.
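
This is not a built-in feature; one possible workaround, sketched from the documented shouldCrawl option (the 500-page limit comes from the question, the URL is a placeholder), is to refuse new URLs once a counter reaches the limit so that the crawl winds down:

var Crawler = require("js-crawler").default;

var maxPages = 500;
var crawledCount = 0;

var crawler = new Crawler().configure({
  depth: 5,
  // Once the limit is reached, no further URLs are accepted for crawling.
  shouldCrawl: function(url) {
    return crawledCount < maxPages;
  }
});

crawler.crawl("http://www.example.com", function(page) {
  crawledCount++;
  console.log(page.url);
});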

Crawler completes then cancels the output of "crawledUrls"?

I found that when crawling a site with the depth set to 2, it will finish and console.log(crawledUrls) correctly. But when using a higher depth like 4 or 6 (which of course takes longer), the crawler will finish and then fail, not printing console.log(crawledUrls).

Here is the script below.

var Crawler = require("js-crawler");
 
mm_url = "{the url}"
page_nbr = 0

function ismmUrl(url) {
    return !!url.match({the regex url});
  }

var crawler = new Crawler().configure({
    maxRequestsPerSecond: 100,
    maxConcurrentRequests: 40,
    ignoreRelative: false,
    depth: 4,

    shouldCrawl: function(url) {
        return ismmUrl(url) || url == mm_url;
      }
});

crawler.crawl({

    url: mm_url,
    success: function(page) {
        console.log(`Loaded page ${page_nbr++}. URL = ${page.url} content length = ${page.content.length} status = ${page.status}`);
        // console.log(page.content)
    },
    failure: function(page) {
        console.log(`Could not load page. URL = ${page.url} status = ${page.status}`);
    },
    finished: function(crawledUrls) {
        console.log('Forgetting all crawled...');
        crawler.forgetCrawled();
        console.log('Complete');
        console.log(crawledUrls);
    }
});

This is what the crawler outputs inside the console:

Loaded page 9078. URL = {theurl}/{morestuff} content length = 8850 status = 200
Loaded page 9079. URL = {theurl}/{morestuff} content length = 19070 status = 200
Loaded page 9080. URL = {theurl}/{morestuff} content length = 15481 status = 200
Loaded page 9081. URL = {theurl}/{morestuff} content length = 15776 status = 200
Forgetting all crawled...
Complete
Canceled

This is what the crawler outputs when the depth is 2. As you can see, the array is logged.

Loaded page 7. URL = {theurl}/{morestuff} content length = 8850 status = 200
Loaded page 8. URL = {theurl}/{morestuff} content length = 19070 status = 200
Loaded page 9. URL = {theurl}/{morestuff} content length = 15481 status = 200
Loaded page 10. URL = {theurl}/{morestuff} content length = 15776 status = 200
Forgetting all crawled...
Complete
Array(11) ["{theurl}/{morestuff}", "{theurl}/{morestuff}", "h{theurl}/{morestuff}", "{theurl}/{morestuff}", "{theurl}/{morestuff}", "{theurl}/{morestuff}", "{theurl}/{morestuff}", "{theurl}/{morestuff}", …]

Add <base> tag support for relative urls

Some pages use the HTML <base> tag in their <head> tag. In this case, relative URLs should be processed according to the base URL.

See also: http://www.w3schools.com/tags/tag_base.asp

bug empty response

  return response.headers && response.headers['content-type']
                 ^

TypeError: Cannot read property 'headers' of undefined
    at Crawler._isTextContent (/root/test/node_modules/js-crawler/crawler.js:257:18)
    at /root/test/node_modules/js-crawler/crawler.js:220:30
    at Request._callback (/root/test/node_modules/js-crawler/crawler.js:183:7)
    at self.callback (/root/test/node_modules/request/request.js:186:22)
    at emitOne (events.js:77:13)
    at Request.emit (events.js:169:7)
    at Request.init (/root/test/node_modules/request/request.js:274:17)
    at new Request (/root/test/node_modules/request/request.js:128:8)
    at Crawler.request (/root/test/node_modules/request/index.js:54:10)
    at /root/test/node_modules/js-crawler/crawler.js:181:10

How can we crawl local websites?

Trying to crawl a local project.

[Error: Invalid URI "./test/index.html"]

[Error: Invalid URI "/var/www/sites/test/index.html"]

[Error: Invalid URI "file:///var/www/sites/test/index.html"]

js-crawler seems to crawl the same url multiple times

Hello,

I am using js-crawler and it's awesome, but I have an issue when crawling a small website.

Here is my code

var crawler = new Crawler().configure({
    ignoreRelative: false,
    depth: Number.POSITIVE_INFINITY,
    shouldCrawl: function(url) {
        return url.indexOf('.pdf') < 0 && url.indexOf(host) >= 0;
    }
});

var crawledPages = {};

crawler.crawl({
    url: entryPage,
    success: function(page) {
        console.log('success crawling -> ', page.url);
        crawledPages[page.url] = page.body;
    },
    failure: function(page) {
        console.log("ERROR crawling url => ", page);
    },
    finished: function(crawledUrls) {
        console.log('length of pages crawled is => ', crawledUrls.length); // output 177
        console.log('length of different urls => ', Object.size(crawledPages)); // output 22
        htmlStripper.strip(crawledPages);
    }
});

As you can see, the outputs of my console.log calls in the "finished" callback are different. Why is that?
It is not a major issue because I have my 22 pages in my crawledPages variable, but how can I avoid crawling the same URL twice?

Thanks :)

Follows Redirects Outside shouldCrawl Function

If a page returns a redirect, the crawler automatically follows the redirect url without validating it against the shouldCrawl function.

This may be a separate concern, but it also seems to get confused by relative urls on the resulting page if the redirect went to a different domain. So if www.example.com returns a 301 to www.different.com and www.different.com has "/about" as a relative link, js-crawler thinks it needs to crawl www.example.com/about instead of www.different.com/about.

Know When All Crawling is Complete

I'd like to know when all crawling is complete so I can output aggregated results. Right now, it seems to spin off each page on its own which makes it difficult to pull everything back together.

How to deal with shortened URLs

Hi,

Is there a way to retrieve the landing URL of a shortened URL like goo.gl/89234fIASVHAS?
Right now the crawler will pass the shortened URL into the callback, which messes up all relative links on the crawled pages...
Thanks!

Basics

Hi,

Sorry for this noob question but I've lost quite some time trying to get this to work.
I'm completely new to grunt, karma, etc., but I know my way around JavaScript.

I'm trying to get this to run but I don't know where to put my file in order for it to run.
I've installed everything needed:

Package.json

{
  "name": "crawler",
  "version": "0.3.18",
  "description": "Web crawler for Node.js",
  "main": "crawler.js",
  "directories": {
    "example": "examples"
  },
  "dependencies": {
    "request": "~2.55.0",
    "underscore": "~1.8.3"
  },
  "devDependencies": {
    "browserify": "^13.0.0",
    "grunt": "^0.4.5",
    "grunt-eslint": "^18.0.0",
    "grunt-karma": "^0.12.2",
    "jasmine-core": "^2.4.1",
    "js-crawler": "^0.3.18",
    "karma": "^0.13.22",
    "karma-browserify": "^5.0.2",
    "karma-chrome-launcher": "^0.2.2",
    "karma-firefox-launcher": "^0.1.7",
    "karma-jasmine": "^0.3.8",
    "karma-jasmine-html-reporter": "^0.2.0",
    "karma-phantomjs-launcher": "^1.0.0",
    "phantomjs-prebuilt": "^2.1.6",
    "watchify": "^3.7.0"
  },
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "repository": {
    "type": "git",
    "url": "https://github.com/antivanov/js-crawler"
  },
  "keywords": [
    "web-crawler",
    "crawler",
    "scraping",
    "website-crawler",
    "crawling",
    "web-bot"
  ],
  "author": "Anton Ivanov, [email protected]",
  "license": "MIT",
  "bugs": {
    "url": "https://github.com/antivanov/js-crawler/issues"
  },
  "homepage": "https://github.com/antivanov/js-crawler"
}

Gruntfile:

require('grunt-karma');

module.exports = function(grunt) {

  grunt.initConfig({
    eslint: {
      target: ['Gruntfile.js', 'crawler.js', 'spec/**/*.js']
    },
    karma: {
      options: {
        frameworks: ['jasmine', 'browserify'],
        files: ['crawler.js', 'spec/*.spec.js'],
        browsers: ['PhantomJS'],
        singleRun: true,
        preprocessors: {
          'crawler.js': ['browserify'],
          'spec/**/*.js': ['browserify']
        },
        browserify: {
          debug: true
        }
      },
      unit: {
        files:
          {
            src: ['spec/**/*.js']
          }
      },
      unit_browser: {
        browsers: ['Firefox'],
        reporters: ['kjhtml'],
        singleRun: false,
      }
    }
  });

  grunt.loadNpmTasks('grunt-eslint');
  grunt.loadNpmTasks('grunt-karma');

  grunt.registerTask('default', ['eslint', 'karma:unit']);
};

My own created file:

var Crawler = require("js-crawler");
var url = "https://github.com";

var cra = new Crawler().configure({
    // maxRequestsPerSecond: 1,
    // maxConcurrentRequests: 5,
    // ignoreRelative: true,
    // depth: 1,
});

cra.crawl({
    url: url,
    success: function(page) {
        console.log(page.url);
    },
    failure: function(page) {
        console.log(page.status);
    },
    finished: function (crawledUrls) {
        console.log(crawledUrls);
    }
}, function(response) {
    console.log("ERROR occurred:");
    console.log(response.status);
    console.log(response.url);
    console.log(response.referer);
});

I've used the commands grunt karma:unit & grunt karma:unit_browser in my 'gruntfile.js' folder.
I've placed my own file in the same directory as gruntfile.js.
I also tried placing it in spec/.
Both commands give no errors and run, but I'm not getting any output. When the browser opens, the browser console is empty; if I refresh the page it takes a bit to load but the console remains empty.

I know the solution is probably very simple but forgive me for being new to this and wanting to learn :).
I just want the output of the crawler displayed somewhere so I can start playing with it.

forgetCrawled method

How do I use this method? I am confused and didn't find anything on how to use it.

robots.txt

Currently js-crawler does not read and follow the robots.txt file. This feature should be added. There should be one option to control whether the crawler reads and follows robots.txt or not.

Publishing latest fixes?

Hello, it looks like your recent fixes have resolved an issue I was having when the content-type was undefined. Will you be publishing this to npm anytime soon?

Thanks for a great package!

Ajax crawling

I'm trying to crawl a page which has some content loaded by JavaScript after a certain time.

Is there a way to pause the script in order to wait for the ajax content?

When run asynchronously by Executor, depth loses its scope

Yesterday I tried to crawl a simple 3-layer-structured website (http://transit.yahoo.co.jp/station/top) and found that it never reached the 3rd layer. When I looked inside js-crawler to figure out what happened by instrumenting console.log calls in the crawlUrl method, I found that depth becomes "undefined" except when requesting the very first page. It seems to be a scope-related issue, but I have not come up with a solution to fix it yet. I posted this issue here for @antivanov's consideration.

Pair js-crawler with PhantomJS

Hey,

First of all I would like to express my gratitude for developing this sweet web crawler.

Is there a way to integrate js-crawler with PhantomJS?

I really need their functionalities in a single place as I would like to use PhantomJS for network monitoring.

Looking forward to hearing from you!

Page that linked to current page

Hello,

Is it possible to find what page linked to the current page in a callback?

Example use case, I am crawling all the links on my site for 404 errors. I can currently use the failure callback to retrieve the error, but a useful piece of information would be understanding what pages linked to that page (how the crawler 'got to that page' from my site). That way I can remedy the situation by fixing the dead link on the link-out page.

Does this feature exist currently?

Network Challenges with Depth > 2

I'm starting here because I can't figure this out well enough to even know where to start looking. The moment I increase depth > 2, I start getting a lot of network errors such as "socket hang up" or "tunneling socket could not be established". If I drop down to depth 2, it works fine. This behavior seems to be consistent regardless of what I crawl; for example, I was able to recreate it on https://www.yahoo.com. All that being said, I don't see how js-crawler changes any behavior in the way it calls request to cause this. I've been staring at the request and http documentation without any luck. Manipulating the js-crawler code to use an http.Agent with maxSockets didn't seem to have an effect, so I'm looking for any help I can get on what's causing this.

Can we promisify js-crawler

Can you help me show how we can promisify the js-crawler call below?
Is there a better way to return the response from each crawler state?

function runCrawler(url) {
  var crawler = new Crawler().configure({ignoreRelative: false, depth: 2});
  crawler.crawl({
    url: url,
    success: function (page) {
      console.log(page.url + ' --- ' + page.status);
    },
    failure: function (page) {
      console.log(page.url + ' --- ' + page.status);
    },
    finished: function (page) {
      return console.log('COMPLETED***********');
    }
  });
}
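
One possible approach is to wrap the options-based API in a Promise. A sketch of a hypothetical crawlAsPromise helper (not part of js-crawler) that resolves once the finished callback fires:

var Crawler = require("js-crawler").default;

function crawlAsPromise(url) {
  return new Promise(function(resolve) {
    var pages = [];
    var crawler = new Crawler().configure({ignoreRelative: false, depth: 2});
    crawler.crawl({
      url: url,
      success: function(page) {
        pages.push({url: page.url, status: page.status});
      },
      failure: function(page) {
        pages.push({url: page.url, status: page.status});
      },
      finished: function(crawledUrls) {
        // Resolve with everything collected during the crawl.
        resolve({crawledUrls: crawledUrls, pages: pages});
      }
    });
  });
}

crawlAsPromise("https://github.com").then(function(result) {
  console.log('COMPLETED***********');
  console.log(result.crawledUrls);
});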

Crawler stopped without reason and any error

I am trying to crawl a big website (arezzo.com.br); however, after ~1700 URLs crawled, it simply stopped. No errors were printed, and the finished callback wasn't called either.


Can someone point me in the right direction regarding what could happen in this case?
I think you can reproduce the issue by trying to crawl www.arezzo.com.br with my configuration.

Link crawling gets stuck in Wordpress sites

If we try to crawl websites that are WordPress, crawling gets stuck after a few links and nothing happens after that; the crawler just stalls. Can you suggest a possible fix if there is one?

path in variable

If I set the path in a variable then the crawler function does not work,

e.g. var path = process.env.path

var Crawler = require("js-crawler");

new Crawler().configure({depth: 3})
  .crawl(path, function onSuccess(page) {
    console.log(page.url);
  });

buggy url resolve

Solution:
at line 3: var url = require('url');
at line 61: link = url.resolve(baseUrl, link);

I wasn't able to commit the change due to a permission issue. You could update it.

Explanation:

Example: if baseUrl is http://www.syr.edu/facultyandstaff/index.html and the anchor URL is ../admissions/index.html, then it was trying to fetch http://www.syr.edu/facultyandstaff/index.html../admissions/index.html, which is wrong.

Fixed with the Node built-in function url.resolve(from, to). Ref: https://nodejs.org/docs/latest/api/url.html
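
For illustration, a small sketch of the difference using Node's built-in url module (the URLs are the ones from the example above):

var url = require('url');

var baseUrl = 'http://www.syr.edu/facultyandstaff/index.html';
var link = '../admissions/index.html';

// Naive concatenation produces the broken URL described above:
console.log(baseUrl + link);
// http://www.syr.edu/facultyandstaff/index.html../admissions/index.html

// url.resolve resolves the relative link against the base URL:
console.log(url.resolve(baseUrl, link));
// http://www.syr.edu/admissions/index.html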

Crawler stopped without reason and any error

It stops working on some URLs for no reason, even without any non-standard configuration.

Domains where it stops:

paraleloiluminacao.com.br
tcengenhariaeletrica.com.br
kplojista.com.br
bsgrafo.com.br

How to deal with basic auth?

Dear developers,

I am crafting a tool that lets me automatically crawl a few sites.
However, they are protected by a username and password (that I have).

What is the correct way of passing this information to js-crawler?

Best regards,
Pitter.

The "depth" for crawling a website completely

I see that setting the depth value gets you to crawl the website to that particular level.

I am wondering how to configure the crawler so as to crawl the website completely without having it go to other domains. For example, if I want to crawl https://www.someDomain.com, I want it to return all URLs in this website like https://www.someDomain.com/something, https://www.someDomain.com/somethingMore/abc, and so on. If https://www.someDomain.com/something has an external link like https://www.externalDomain.com/, it shouldn't go ahead with that.
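
A sketch of one possible configuration based on the options documented above (someDomain.com is the placeholder from the question): combine an unrestricted depth with a shouldCrawl function that keeps the crawler on a single domain.

var Crawler = require("js-crawler").default;

var rootUrl = "https://www.someDomain.com";

var crawler = new Crawler().configure({
  // Crawl as deep as the site goes...
  depth: Number.POSITIVE_INFINITY,
  // ...but never follow links that leave the original domain.
  shouldCrawl: function(url) {
    return url.indexOf("someDomain.com") >= 0;
  }
});

crawler.crawl(rootUrl, function(page) {
  console.log(page.url);
});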

Ultimately, I am looking to find all the email addresses in a given website. So far, I know that the node.js module node-scrapy can help me find email addresses for a particular URL/webpage.

Thanks,
Rahul

Getting every type of url from the page source

Hello,
Thanks for the nice, robust crawler. Currently js-crawler crawls through all visible links on the page, but I was wondering if it supports other URL sources like img/rel/src attributes to be scraped from the page as well.

Thanks
shekar

Usage

Maybe I have problems with my npm, but I can't run js-crawler... The usage section only gives example configurations, and how to actually run js-crawler with those configurations is omitted.

I have run npm install js-crawler but I didn't get a js-crawler command to which I could pass a JS config file. It doesn't seem to work with npm run or serve either...

I tried downloading the package via git and running npm install, and I just get two warnings which I think are not a problem:

npm WARN optional SKIPPING OPTIONAL DEPENDENCY: fsevents@^1.0.0 (node_modules/chokidar/node_modules/fsevents):
npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for [email protected]: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"x64"})

I don't get it; how do I start using js-crawler? Please help.
