
node-webcrawler's People

Contributors

bkw, blakmatrix, chloe899, connorweng, curtisj44, darrenqc, digitalfrost, edrury, headconnect, heshamsafi, jaredmansaakintola, jellyfrog, jhurliman, keithpitt, mike442144, namuol, ozlevka, paulvalla, racheet, sylvinus, trantorliu, tzbee, wadouk


node-webcrawler's Issues

travis build fails

  1. Request tests should crawl a gzip response:
    Uncaught AssertionError: expected false to be true

I have determined that the problem is with the test case.
I will send a pull request with a fix in a few minutes.

Using tor?

What would it take, and how hard would it be, to add support for using this with Tor?

TypeError: $ is not a function

I'm trying to execute the sample code and I get the following error:

TypeError: $ is not a function

What do I need to do to fix this?

SECURITY: lodash out of date

npm WARN notice [SECURITY] lodash has the following vulnerability: 1 low. Go here for more details: https://nodesecurity.io/advisories?search=lodash&version=3.8.0 - Run `npm i npm@latest -g` to upgrade your npm version, and then `npm audit` to get more info.

Please update to latest version.

still have memory leaks?

With this code:

var Crawler = require("node-webcrawler");
var url = require('url');
var jsdom = require('jsdom');
var c = new Crawler({
    maxConnections : 10,
    jQuery: jsdom,
    // This will be called for each crawled page
    callback : function (error, result, $) {
        // $ is Cheerio by default
        //a lean implementation of core jQuery designed specifically for the server
        if(error){
            console.log(error);
        }else{

            try {
                console.log($("title").text());
                $('a').each(function (index, a) {
                    var toQueueUrl = $(a).prop('href');
                    //console.log(toQueueUrl);
                    c.queue(toQueueUrl);
                });

            } catch (e) {
                console.log(e);
            }
        }
    }
});

c.queue('http://www.wandoujia.com/');

Memory usage climbs to about 800 MB after around 10 minutes.

Am I doing something wrong, or is this a problem with the module itself?

Thanks,
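
One thing that may contribute to the growth, independent of any leak in the module: the callback above queues every link on every page, including off-site links and links already seen, so the pending queue itself can grow without bound. Below is a minimal sketch of a filter that could sit in front of `c.queue`, assuming the crawl is meant to stay on one host and stop after a fixed number of pages. `makeFrontier`, `accept`, and `maxPages` are hypothetical names for illustration and are not part of node-webcrawler.

```javascript
// Hypothetical helper, not part of node-webcrawler: decides whether a link
// found on a page should be queued, and returns its normalized form.
function makeFrontier(startUrl, maxPages) {
    var start = new URL(startUrl);
    var seen = new Set([start.href]); // the start page counts as seen

    return {
        // Returns the normalized absolute URL to queue, or null to skip it.
        accept: function (href, baseUrl) {
            if (!href || seen.size >= maxPages) {
                return null;
            }
            var target;
            try {
                // Resolve relative links against the page they came from.
                target = new URL(href, baseUrl || start.href);
            } catch (e) {
                return null; // not a parseable URL
            }
            target.hash = ''; // "#fragment" variants are the same page
            if (target.protocol !== 'http:' && target.protocol !== 'https:') {
                return null; // skips javascript:, mailto:, and similar
            }
            if (target.host !== start.host) {
                return null; // stay on the starting host
            }
            if (seen.has(target.href)) {
                return null; // already queued once
            }
            seen.add(target.href);
            return target.href;
        }
    };
}
```

In the callback above, `c.queue(toQueueUrl)` would then become something like `var next = frontier.accept(toQueueUrl); if (next) { c.queue(next); }`. This only bounds how much work is queued; it does not rule out a leak elsewhere.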

OTHER DATABASE SPEED

Very nice. Unique URLs are crawled by looping over an array of URLs.

How can this process be sped up? I also want to revisit pages that have already been crawled, so where should the array variable live? The array is a lot to handle even for a single site :(

var Crawler = require("node-webcrawler");
var urls = []; // every URL that has already been queued

var c = new Crawler({
    maxConnections : 1,
    userAgent: 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-BR; rv:1.8.1.17) Gecko/20080829 Firefox/2.0.0.17',
    //cache:true,
    // This will be called for each crawled page
    callback : function (error, result, $) {
        // $ is Cheerio by default
        //a lean implementation of core jQuery designed specifically for the server
        if(error){
            console.log(error);
        }else{
            //console.log($("title").text());
            $("a").each(function() {
                var href = $(this).attr("href");
                spider(href);
            });
        }
    }
});

// Queue a URL only if it has not been seen and is not a fragment or javascript: link
var spider = function(url){
    //console.log(urls.indexOf(url));
    if(!url){
        return false;
    }
    if((urls.indexOf(url)<0) && (url.indexOf("#")<0) && (url.indexOf("javascript:")<0)){
        urls.push(url);
        console.log(url);
        c.queue(url);
    }
};

// Queue just one URL, with default callback
c.queue('http://www.example.com');
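
On the speed question, one likely cost in the code above is `urls.indexOf(url)`, which scans the whole array on every link and so gets slower as the list grows. Below is a minimal sketch that keeps the same filtering rules but stores visits in a `Map`, which also makes it possible to revisit a page once enough time has passed. `makeVisitLog`, `shouldVisit`, and `revisitAfterMs` are hypothetical names for illustration and are not part of node-webcrawler.

```javascript
// Hypothetical helper, not part of node-webcrawler.
function makeVisitLog(revisitAfterMs) {
    var lastVisit = new Map(); // url -> timestamp (ms) when it was last queued

    return {
        // Returns true if the URL should be queued now, and records the visit.
        shouldVisit: function (url, now) {
            // Same rules as spider(): skip empty, fragment, and javascript: links.
            if (!url || url.indexOf('#') >= 0 || url.indexOf('javascript:') >= 0) {
                return false;
            }
            var last = lastVisit.get(url);
            if (last !== undefined && now - last < revisitAfterMs) {
                return false; // visited too recently
            }
            lastVisit.set(url, now);
            return true;
        }
    };
}
```

Inside `spider`, the `indexOf` check could then be replaced with something like `if (log.shouldVisit(url, Date.now())) { console.log(url); c.queue(url); }`. Passing `Infinity` as `revisitAfterMs` gives plain visit-once behavior.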
