yujiosaka / headless-chrome-crawler
Distributed crawler powered by Headless Chrome
License: MIT License
For example,
Input:
url1 => http://www.google.com
url2 => http://www.bing.com
...
urlN => http://www.N.com
Output:
url1.html
url2.html
...
urlN.html
Could you share some sample code? Thanks very much.
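A minimal sketch of that flow, assuming the evaluatePage/onSuccess options described in the README; the index lookup against the queued URL list is naive (the crawler may normalize URLs), so treat it as illustration only:
const fs = require('fs');
const HCCrawler = require('headless-chrome-crawler');

const urls = ['http://www.google.com', 'http://www.bing.com'];

HCCrawler.launch({
  // Runs inside the browser and returns the full HTML of the page.
  evaluatePage: (() => document.documentElement.outerHTML),
  // Runs in Node for every successful request; writes urlN.html based on queue order.
  onSuccess: (result => {
    const index = urls.indexOf(result.options.url) + 1; // naive match on the queued URL
    fs.writeFileSync(`url${index}.html`, result.result);
  }),
})
  .then(crawler => {
    crawler.queue(urls);
    return crawler.onIdle().then(() => crawler.close());
  });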
Could I use it in a business project?
Does it have any restrictions?
There are tons of breaking changes from 0.13.0 to 1.0.0
WARC is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses, along with other network interactions such as DNS lookups, together with their headers, payloads, and other metadata. It is usually used by web archives, but there are other use cases as well. WARC is the default format in which the Heritrix crawler (originally developed by the Internet Archive) stores captures. Wget supports the WARC format as well. There are other tools such as WARCreate (a Chrome extension) that saves web pages in WARC format along with all their page requisites while browsing, and Squidwarc (a Headless Chrome-based crawler) built specifically for archival purposes.
That said, adding support for WARC format will immediately make this project more useful for the web archiving community.
In most cases you will want to save crawled results in a database. However, it's nice to have a feature to export results to a file in CSV, JSON, and so on during the experimental phase.
Currently functions are imported in the following style:
const { extend } = require('lodash');
This way, the whole library is loaded even though most of its functions are not used.
The following style should be preferred:
const extend = require('lodash/extend');
By using Redis as cache storage, headless-chrome-crawler can run in a distributed environment.
It would be useful to write up tips for setting up a distributed crawler.
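For instance, a minimal sketch of the Redis setup, assuming the RedisCache class bundled under cache/redis as in the project's examples (the host, port, and example URL are placeholders):
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

// Workers on different machines point at the same Redis instance,
// so the request queue and the set of already-crawled URLs are shared.
const cache = new RedisCache({ host: '127.0.0.1', port: 6379 });

HCCrawler.launch({
  cache,
  persistCache: true, // keep the queue in Redis after this worker closes
  onSuccess: (result => console.log(result.options.url)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });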
Background
LOVE this project!
I tried to write my own BaseCache instance to use LevelDB and have some general feedback.
What is the current behavior?
The difference between get(key), set(key, value), enqueue(key, value, priority), and dequeue(key) is very confusing. key seems to be more of a keyspace?
Since the underlying Chrome browser has its own cache, the term is overloaded. E.g. does the clearCache setting clear Chrome's cache or the persistent priority queue?
What is the expected behavior?
I expected the API to be more like standard priority queue APIs (similar to what is used for the PriorityQueue class used internally in the code), but that's not the BaseCache API.
Here's what the current API looks like (for reference).
class BaseCache {
init();
close();
clear();
get(key);
set(key, value);
enqueue(key, value, priority);
dequeue(key);
size(key);
remove(key);
}
Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using? How are get() / set() supposed to be used outside the class compared to enqueue() and dequeue()?
I kind of expected the API to persist a queue to look more like this:
class BasePersistentPriorityQueue {
init(namespace);
close();
clear();
enqueue( [{value, priority}] )
dequeue();
peek();
queue_length();
}
Notice that enqueue() takes an array like enqueue([{value: value, priority: priority}, ...]), since batch queuing might be supported by the underlying mechanism and can significantly improve performance.
Higher-level code queues all links found in a page in a tight loop. It can and should be done in a batch.
From existing code:
each(urls, url => void this._push(extend({}, options, { url }), depth));
This loops over potentially hundreds of links found on a page. As it is now, each call reads, modifies, and writes a single hotspot key. For a shared network queue, this has really horrible performance implications. A batched variant is sketched below.
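A hypothetical batched variant of that loop; enqueueAll() is not part of the current API and only stands in for whatever bulk method the backing store could expose:
// Hypothetical only: build all requests for the page first, then hand them to the
// store in one call, so a shared backend such as Redis is hit once per page
// instead of once per link. enqueueAll() does not exist in the current API.
const requests = urls.map(url => ({
  value: extend({}, options, { url }),
  priority: options.priority,
}));
await queue.enqueueAll(requests, depth);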
What is the motivation / use case for changing the behavior?
Performance and readability.
Please tell us about your environment:
What is the current behavior?
With the robots.txt option set to false, the crawler waits using waitFor and a specified timeout of 10 seconds, but is still not able to crawl nested documents. For example, I want to extract the src of an <iframe> which is an ad (display advertisement).
I enabled the screenshot option to check whether the ad iframe had loaded before the evaluatePage function was executed. I can see the ad in the screenshot, but the function does not return the src of the <iframe>.
What is the expected behavior?
Be able to crawl nested HTML documents, such as ads, which are loaded dynamically by JavaScript.
Can you please provide an example or solution for this?
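A possible evaluatePage sketch for collecting iframe sources, assuming the ad iframe is already attached to the DOM when the page is evaluated (dynamically injected frames may still need a delay, which relates to the waitFor request elsewhere in this tracker; the queued URL is a placeholder):
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  jQuery: false,
  evaluatePage: (() => ({
    // Collect the src attribute of every iframe currently in the DOM.
    iframeSrcs: Array.from(document.querySelectorAll('iframe')).map(frame => frame.src),
  })),
  onSuccess: (result => console.log(result.options.url, result.result.iframeSrcs)),
})
  .then(crawler => {
    crawler.queue('https://example.com/page-with-ad-iframe'); // placeholder URL
    return crawler.onIdle().then(() => crawler.close());
  });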
Please tell us about your environment:
What is the current behavior?
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
It happens when you crawl GitHub pages.
See puppeteer/puppeteer#1229
Thanks for this nice project.
I'm playing with it, and now I'm trying to store the result of a crawl in a database. I noticed that the BaseExporter expects the export target to be a file.
I also noticed that you had DB export in mind (#15).
So what would be the proper way? Use the onSuccess callback? Or have a BaseExporter that is agnostic of the storage medium (with a FileExporter on top)?
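For what it's worth, here is a minimal sketch of the onSuccess route, assuming a hypothetical db.insert() on whatever database client you use (the './my-db-client' require is a placeholder, not part of the project):
const HCCrawler = require('headless-chrome-crawler');
const db = require('./my-db-client'); // placeholder for your database client

HCCrawler.launch({
  evaluatePage: (() => ({ title: $('title').text() })),
  // Write each crawled result straight to the database instead of using a file exporter.
  onSuccess: (result => db.insert({
    url: result.response.url,
    status: result.response.status,
    title: result.result.title,
  })),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });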
What is the current behavior?
The type definition file is missing.
What is the expected behavior?
Provide a type definition file.
What is the motivation / use case for changing the behavior?
Support TypeScript and make it easier to code in an IDE.
I've already written the type definitions from the API documentation in the README. I'd be willing to open a pull request here or at https://github.com/DefinitelyTyped/DefinitelyTyped. I'd like someone to review it, though.
I am not sure how I can achieve this, but my requirement is that I need to know the order of the links in the initial request and pass it to the next request, in order to save it with more data from the link's request.
Let's say that my initial request contains three pages in that order:
foo.html -> 1st link in the HTML
bar.html -> 2nd link in the HTML
baz.html -> etc.
When I request foo.html (because I configured the crawler with depth: 2), I would like to know that this page was the 1st link on the previous page.
Is that possible?
Thanks,
Although it's not a strict rule, it's considered polite to obey robots.txt. I believe the option's default should be true, but we have to provide an option to disable it.
What is the current behavior?
_collectLinks only keeps the href of URLs.
What is the expected behavior?
It would be nice to have, or to be able to also request:
What is the motivation / use case for changing the behavior?
You may decide to follow (or not follow) some types of links based on more than their depth.
Currently headless-chrome-crawler utilizes p-queue for its job management.
p-queue is great, but it's not suitable for a distributed environment for the following reasons:
Since the first one is the core part of p-queue, it's a bit hard to support the feature without breaking the current APIs.
I feel it's easier to implement our own promise queue from scratch.
By doing so, we can use Redis, for example, for shared job management.
sitemap.xml tells you the architecture of a website. It's common to first find the sitemap.xml and start following links from it. It would be nice to provide a method that does this automatically.
I'm considering using this crawler. I couldn't find in the doc a way to filter:
Is there a way that I missed? Or a workaround? Or is it a planned feature?
Three functions that return a boolean (to filter or not) might be enough.
Without this configuration, the crawler does not stop and keeps crawling. You can easily program it to stop after specific requests, but it would be convenient to provide an option for that.
Currently you can use the preRequest function to skip duplicate requests by returning false. However, remembering already-accessed URLs is painstaking, and if you are not using a caching DB, you have to fight against memory leaks. It would be nice to provide an option to choose which DB to use as a cache for remembering URLs, with the crawler taking care of everything.
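A sketch of that workaround as it stands today, keeping visited URLs in an in-memory Set, which is exactly the memory concern described above:
const HCCrawler = require('headless-chrome-crawler');

const visited = new Set(); // grows without bound on long crawls

HCCrawler.launch({
  // Return false to skip a request; remember every URL we let through.
  preRequest: (options => {
    if (visited.has(options.url)) return false;
    visited.add(options.url);
    return true;
  }),
  onSuccess: (result => console.log(result.options.url)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });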
What is the current behavior?
Currently, evaluatePage is requested right after the event specified in the waitUntil option ("load" by default).
What is the expected behavior?
Be able to use the puppeteer page.waitFor method to add a delay before evaluating the page or taking a screenshot.
What is the motivation / use case for changing the behavior?
I'd like to be able to take a screenshot of a requested page, but after a short delay, to wait for any animations triggered by JS code to complete, or, for instance, to wait for asynchronous JS code to be executed.
I tried to set the waitUntil option to "networkidle0", but it's not always enough.
When requests are skipped, retried, or given up, there is no clue that those events happened. It would be nice to emit those events so that users have more control over unexpected situations.
Checking crawled results by extracted JSON data alone is painstaking. It would be nice to have a feature to screenshot the page so that it becomes easy to know whether the crawled page is what you wanted or not. Since this is headless Chrome, it's very easy to do!
What is the current behavior?
Only the allowedDomains option is supported.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support a deniedDomains option.
What is the motivation / use case for changing the behavior?
I noticed that Amazon accepts bots and crawlers in its robots.txt.
However, it explicitly says it does not allow those bots in its conditions of use.
I believe it's useful to provide a feature to kindly avoid crawling these sites.
Please tell us about your environment:
In real-world crawling cases, you don't want to remember accessed pages forever. Instead, you want to specify an expiration duration.
The current code is written with JSDoc-style type definitions, but the type definitions are not validated at all, so I noticed some declarations are broken or wrong. TypeScript supports the checkJs option, which works even for JavaScript, so it can be used for this purpose.
Hi,
First of all, thanks for releasing this. Awesome work!
I would like to access the browser page's requestfinished event.
I believe that event: 'requestfinished' is the crawler's requestfinished event.
My intention is to download the resources of the page (images, fonts, ...).
I'm planning to do something like this:
// On my previous setup, page originates from
// const browser = await puppeteer.launch()
// const page = await browser.newPage()
page.on('requestfinished', async request => {
  // do something with request
  const url = request.url()
  const match = /.*\.(jpg|png|gif)$/.exec(url)
  if (match && match.length === 2) {
    const split = url.split('/')
    let filename = split[split.length - 1]
    const response = request.response()
    const buffer = await response.buffer()
    fs.writeFileSync(`${crawlPathImages}/${filename}`, buffer, 'base64')
  }
})
How can I do this using your crawler?
process.env.HTTP_PROXY is not being passed to puppeteer.launch(); I tried with a running proxy and:
env HTTP_PROXY=localhost:8000 node crawl.js
const HCCrawler = require('headless-chrome-crawler');
const CSVExporter = require('headless-chrome-crawler/exporter/csv'); // exporter path as shown in the project's examples
const htmlToText = require('html-to-text');

const FILE = './tmp/result-sample.csv';

const exporter = new CSVExporter({
  file: FILE,
  fields: ['response.url', 'response.status', 'links.length', 'depth', 'result.content'],
});

HCCrawler.launch({
  maxDepth: 2,
  exporter,
  // Function to be evaluated in browsers
  evaluatePage: (() => ({
    content: htmlToText.fromString($('html').text()),
  })),
})
Currently, when attempting to crawl a non-HTML page such as a PDF, the crawler seems to hang for a while and eventually fails silently without giving any error. How should we go about handling non-HTML files? My thought was to use the preRequest function to modify some options after checking the MIME type, like setting jQuery to false, etc.; but I've not been able to get anything to work for non-HTML content types. How would you recommend accommodating this functionality?
Here is a sample to reproduce this, which shows that the URL is queued, but eventually the crawler exits with neither onSuccess nor onError being called. Ideally, it would be nice to have some logic to intercept non-HTML requests and route them to Apache Tika or something similar.
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  args: ['--disable-web-security'],
  jQuery: false,
  onSuccess: (result => {
    console.log(`crawled: ${result.response.url}`);
  }),
  onError: (error => {
    console.error(error);
  }),
})
  .then(crawler => {
    crawler.queue(['https://pdfs.semanticscholar.org/8f86/834ae39f46447fd588b5817d6d9171f518e6.pdf']);
    crawler.queueSize()
      .then(size => {
        console.log('%s items in cache', size);
      });
    crawler.onIdle()
      .then(() => crawler.close());
  });
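One possible stop-gap, assuming you only need to keep such URLs out of the queue rather than route them to Apache Tika: a preRequest hook that skips URLs with obviously non-HTML extensions (the extension list is illustrative):
const { URL } = require('url');
const HCCrawler = require('headless-chrome-crawler');

// Illustrative list of extensions to skip; extend as needed.
const NON_HTML = /\.(pdf|zip|jpe?g|png|gif|docx?|xlsx?)$/i;

HCCrawler.launch({
  // Returning false here drops the request, so the crawler never hangs
  // trying to render a PDF as a page.
  preRequest: (options => !NON_HTML.test(new URL(options.url).pathname)),
  onSuccess: (result => console.log(`crawled: ${result.response.url}`)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });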
Can the exporter receive fields from the evaluatePage serialized data?
What is the current behavior?
timeout is not overridable in the crawler.queue() options.
If the current behavior is a bug, please provide the steps to reproduce
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  evaluatePage: (() => ({
    title: $('title').text(),
  })),
  onSuccess: (result => {
    console.log(result);
  }),
})
  .then(crawler => {
    crawler.queue({ url: 'https://example.com/', timeout: 5 });
    crawler.onIdle()
      .then(() => crawler.close());
  });
What is the expected behavior?
timeout should be a valid queue option.
What is the motivation / use case for changing the behavior?
I'd like to be able to set a per-request timeout like it's possible with puppeteer's goto function.
Please tell us about your environment:
What is the current behavior?
Caching is enabled by default.
It's OK for most cases. However, especially for crawling purposes, we'd like to get content not from the cache but from the freshest data.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support an option to enable/disable the cache for each request, defaulting to true.
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
CI frequently fails on response delay test cases.
For example: https://circleci.com/gh/yujiosaka/headless-chrome-crawler/569
If you can store the already-crawled URLs, you will probably want to have pause/resume/clear features. You can pause requests and schedule a resume or clear at any time.
What is the current behavior?
The allowedDomains option only checks the end of the hostname.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
The allowedDomains option should accept a list of regular expressions.
What is the motivation / use case for changing the behavior?
When crawling an international service like Amazon, it has several TLDs for several countries (though crawling Amazon is not allowed by its Conditions of Use; it's just an example).
In that case, a simple string match on the hostname is not enough. It would be useful to accept regular expressions; a sketch of what that might look like follows.
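A purely illustrative sketch of how the requested option might look; the regular expression entry below is not supported by the current API:
const HCCrawler = require('headless-chrome-crawler');

// Hypothetical: the regular expression entry is the proposed addition.
HCCrawler.launch({
  allowedDomains: [
    'example.com',                    // plain suffix match, as supported today
    /^www\.amazon\.(com|co\.jp|de)$/, // proposed: regex matched against the hostname
  ],
});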
Please tell us about your environment:
What is the current behavior?
The front page is crawled twice when the enqueued URL doesn't contain a trailing slash.
If the current behavior is a bug, please provide the steps to reproduce
Configure a queue with a domain such as https://example.com, where the page has a link such as <a href="/">Home</a>.
What is the expected behavior?
It should crawl the front page only once.
What is the motivation / use case for changing the behavior?
In case we export data from the initial page, we might have the data twice.
Please tell us about your environment:
The current workaround is to ensure that we set the URL in the queue with a trailing slash.
What is the current behavior?
Greenkeeper is great, but it does not update yarn.lock.
This is dangerous because updated modules are not used for tests on CI without an updated yarn.lock.
It may cause serious problems when I forget to update yarn.lock manually.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Automatically update yarn.lock when the branch is created by Greenkeeper.
There is a specific package for this purpose: greenkeeperio/greenkeeper#314
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
What is the current behavior?
Today the project automatically resolves the robots.txt.
What is the expected behavior?
It would be useful to be able to provide the robots.txt directly, bypassing the default behavior of resolving it automatically.
What is the motivation / use case for changing the behavior?
You may want to provide a different set of rules (let's say I'm the owner of the site and I want to check how the crawler would behave with a different robots.txt).
In a big distributed environment, you may want to resolve the robots.txt once and share it with all the workers.
What is the current behavior?
I don't believe the crawler handles sitemaps broken out into multiple sitemaps. This is common on large sites since sitemaps are limited to 50k URLs. See "Simplify multiple sitemap management".
A good example is NASA: https://www.nasa.gov/sitemap.xml
What is the expected behavior?
Successfully crawl large sites via sitemap(s)
What is the motivation / use case for changing the behavior?
Large enterprise sites are not being crawled via their sitemaps.
In some cases you would crawl only one or several websites. It would be nice to provide an option to restrict requests to one or multiple hostnames.
What is the current behavior?
Currently the slowMo option is supported only when launching a Chromium instance, not when connecting to an existing Chromium instance.
What is the expected behavior?
It should be supported when connecting to an existing Chromium instance.
What is the motivation / use case for changing the behavior?
It was fixed upstream here.
Please tell us about your environment:
What is the current behavior?
When skipDuplicates is set to true, already-crawled URLs will not be executed.
However, the request is queued anyway, even when the URL is not allowed or already crawled.
What is the expected behavior?
Do not enqueue the request in the first place if it has already been executed.
What is the motivation / use case for changing the behavior?
I want to make this crawler memory-friendly.
What is the current behavior?
In rare cases (in my case, less than 3%), the crawler fails to enqueue the next request and never stops.
The error goes like this:
TypeError: Cannot read property 'priority' of undefined
at lowerBound (/Users/yujiosaka/work/headless-chrome-crawler/cache/session.js:71:63)
at lowerBound (/Users/yujiosaka/work/headless-chrome-crawler/lib/helper.js:117:11)
at SessionCache.enqueue (/Users/yujiosaka/work/headless-chrome-crawler/cache/session.js:71:15)
at PriorityQueue.push (/Users/yujiosaka/work/headless-chrome-crawler/lib/priority-queue.js:34:17)
at PriorityQueue.<anonymous> (/Users/yujiosaka/work/headless-chrome-crawler/lib/helper.js:178:23)
at HCCrawler._push (/Users/yujiosaka/work/headless-chrome-crawler/lib/hccrawler.js:271:17)
at _skipRequest.then.skip (/Users/yujiosaka/work/headless-chrome-crawler/lib/hccrawler.js:545:16)
at process._tickCallback (internal/process/next_tick.js:103:7)
If the current behavior is a bug, please provide the steps to reproduce
The following script sometimes causes errors.
const HCCrawler = require('headless-chrome-crawler');
HCCrawler.launch({
  maxDepth: 3,
  maxConcurrency: 10,
  allowedDomains: ['www.emin.co.jp'],
  evaluatePage: (() => window.document.title),
  onSuccess: (result => { // function evaluated on success
    console.log(`${result.options.url}\t${result.result}`);
  }),
})
  .then(crawler => {
    crawler.queue('https://www.emin.co.jp/');
    crawler.onIdle()
      .then(() => crawler.close());
  });
What is the expected behavior?
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
What is the current behavior?
Only breadth-first search is supported.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support depth-first search also.
What is the motivation / use case for changing the behavior?
Breadth-first search is great for finding a targeted page, especially when the page is not far from the first page. However, it tends to require more memory, and it's not always the fastest algorithm when the target page is very far from the first page.
Users should be able to choose which algorithm to crawl with on a case-by-case basis. If the user needs more control over the algorithm, he/she can do that by using the priority option.
Please tell us about your environment:
It would be a nice feature to collect possible links and automatically follow them. It would also be nice to be able to modify the rules by providing an option.