yujiosaka / headless-chrome-crawler
Distributed crawler powered by Headless Chrome
License: MIT License
For example,
Input:
url1 => http://www.google.com
url2 => http://www.bing.com
...
urlN => http://www.N.com
Output:
url1.html
url2.html
...
urlN.html
Could you share some sample code? Thanks very much.
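A minimal sketch of that flow, assuming the evaluatePage/onSuccess options described in the README; the index lookup against the queued URL list is naive (the crawler may normalize URLs), so treat it as illustration only:
const fs = require('fs');
const HCCrawler = require('headless-chrome-crawler');

const urls = ['http://www.google.com', 'http://www.bing.com'];

HCCrawler.launch({
  // Runs inside the browser and returns the full HTML of the page.
  evaluatePage: (() => document.documentElement.outerHTML),
  // Runs in Node for every successful request; writes urlN.html based on queue order.
  onSuccess: (result => {
    const index = urls.indexOf(result.options.url) + 1; // naive match on the queued URL
    fs.writeFileSync(`url${index}.html`, result.result);
  }),
})
  .then(crawler => {
    crawler.queue(urls);
    return crawler.onIdle().then(() => crawler.close());
  });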
Could I use it in a business project?
Does it have any restrictions?
There are tons of breaking changes from 0.13.0 to 1.0.0
WARC is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses, along with other network interactions such as DNS lookups, together with their headers, payloads, and other metadata. It is usually used by web archives, but there are other use cases as well. WARC is the default format in which the Heritrix crawler (originally developed by the Internet Archive) stores captures. Wget supports the WARC format as well. There are other tools such as WARCreate (a Chrome extension) that saves web pages in WARC format along with all their page requisites while browsing, and Squidwarc (a Headless Chrome-based crawler) built specifically for archival purposes.
That said, adding support for WARC format will immediately make this project more useful for the web archiving community.
In most cases you will want to save crawled results in a database. However, it's nice to have a feature to export results to a file in CSV, JSON, and so on during the experimental phase.
Currently functions are imported in the following style:
const { extend } = require('lodash');
This way, the whole library is loaded even though most of its functions are not used.
The following style should be preferred:
const extend = require('lodash/extend');
By using Redis as cache storage, headless-chrome-crawler can run in a distributed environment.
It would be useful to write up tips for setting up a distributed crawler.
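For instance, a minimal sketch of the Redis setup, assuming the RedisCache class bundled under cache/redis as in the project's examples (the host, port, and example URL are placeholders):
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

// Workers on different machines point at the same Redis instance,
// so the request queue and the set of already-crawled URLs are shared.
const cache = new RedisCache({ host: '127.0.0.1', port: 6379 });

HCCrawler.launch({
  cache,
  persistCache: true, // keep the queue in Redis after this worker closes
  onSuccess: (result => console.log(result.options.url)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });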
Background
LOVE this project!
I tried to write my own BaseCache instance to use LevelDB and have some general feedback.
What is the current behavior?
The difference between get(key), set(key, value), enqueue(key, value, priority), and dequeue(key) is very confusing. key seems to be more of a keyspace?
Since the underlying Chrome browser has its own cache, the term is overloaded. E.g. does the clearCache setting clear Chrome's cache or the persistent priority queue?
What is the expected behavior?
I expected the API to be more like standard priority queue APIs (similar to what is used for the PriorityQueue class used internally in the code), but that's not the BaseCache API.
Here's what the current API looks like (for reference).
class BaseCache {
init();
close();
clear();
get(key);
set(key, value);
enqueue(key, value, priority);
dequeue(key);
size(key);
remove(key);
}
Maybe I'm missing something, but why does the outside caller need to know anything about what key the queue is using? How are get() / set() supposed to be used outside the class compared to enqueue() and dequeue()?
I kind of expected the API to persist a queue to look more like this:
class BasePersistentPriorityQueue {
init(namespace);
close();
clear();
enqueue( [{value, priority}] )
dequeue();
peek();
queue_length();
}
Notice that enqueue() takes an array like enqueue([{value: value, priority: priority}, ...]), since batch queuing might be supported by the underlying mechanism and can significantly improve performance.
Higher-level code queues all links found in a page in a tight loop. It can and should be done in a batch.
From existing code:
each(urls, url => void this._push(extend({}, options, { url }), depth));
This loops over potentially hundreds of links found on a page. As it is now, each call reads, modifies, and writes a single hotspot key. For a shared network queue, this has really horrible performance implications. A batched variant is sketched below.
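A hypothetical batched variant of that loop; enqueueAll() is not part of the current API and only stands in for whatever bulk method the backing store could expose:
// Hypothetical only: build all requests for the page first, then hand them to the
// store in one call, so a shared backend such as Redis is hit once per page
// instead of once per link. enqueueAll() does not exist in the current API.
const requests = urls.map(url => ({
  value: extend({}, options, { url }),
  priority: options.priority,
}));
await queue.enqueueAll(requests, depth);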
What is the motivation / use case for changing the behavior?
Performance and readability.
Please tell us about your environment:
What is the current behavior?
With the robots.txt option set to false, the crawler waits using waitFor and a specified timeout of 10 seconds, but is still not able to crawl nested documents. For example, I want to extract the src of an <iframe> which is an ad (display advertisement).
I enabled the screenshot option to check whether the ad iframe had loaded before the evaluatePage function was executed. I can see the ad in the screenshot, but the function does not return the src of the <iframe>.
What is the expected behavior?
Be able to crawl nested HTML documents, such as ads, which are loaded dynamically by JavaScript.
Can you please provide an example or solution for this?
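A possible evaluatePage sketch for collecting iframe sources, assuming the ad iframe is already attached to the DOM when the page is evaluated (dynamically injected frames may still need a delay, which relates to the waitFor request elsewhere in this tracker; the queued URL is a placeholder):
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  jQuery: false,
  evaluatePage: (() => ({
    // Collect the src attribute of every iframe currently in the DOM.
    iframeSrcs: Array.from(document.querySelectorAll('iframe')).map(frame => frame.src),
  })),
  onSuccess: (result => console.log(result.options.url, result.result.iframeSrcs)),
})
  .then(crawler => {
    crawler.queue('https://example.com/page-with-ad-iframe'); // placeholder URL
    return crawler.onIdle().then(() => crawler.close());
  });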
Please tell us about your environment:
What is the current behavior?
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
It happens when you crawl GitHub pages.
See puppeteer/puppeteer#1229
Thanks for this nice project.
I'm playing with it, and now I'm trying to store the result of a crawl in a database. I noticed that the BaseExporter expects the export target to be a file.
I also noticed that you had DB export in mind (#15).
So what would be the proper way? Use the onSuccess callback? Or have a BaseExporter that is agnostic of the storage medium (with a FileExporter on top)?
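For what it's worth, here is a minimal sketch of the onSuccess route, assuming a hypothetical db.insert() on whatever database client you use (the './my-db-client' require is a placeholder, not part of the project):
const HCCrawler = require('headless-chrome-crawler');
const db = require('./my-db-client'); // placeholder for your database client

HCCrawler.launch({
  evaluatePage: (() => ({ title: $('title').text() })),
  // Write each crawled result straight to the database instead of using a file exporter.
  onSuccess: (result => db.insert({
    url: result.response.url,
    status: result.response.status,
    title: result.result.title,
  })),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });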
What is the current behavior?
The type definition file is missing.
What is the expected behavior?
Provide a type definition file.
What is the motivation / use case for changing the behavior?
Support TypeScript and make it easier to code in an IDE.
I've already written the type definitions from the API documentation in the README. I'd be willing to open a pull request here or at https://github.com/DefinitelyTyped/DefinitelyTyped. I'd like someone to review it, though.
I am not sure how I can achieve this, but my requirement is that I need to know the order of the links in the initial request and pass it to the next request, in order to save it with more data from the link's request.
Let's say that my initial request contains three pages in that order:
foo.html -> 1st link in the HTML
bar.html -> 2nd link in the HTML
baz.html -> etc.
When I request foo.html (because I configured the crawler with depth: 2), I would like to know that this page was the 1st link on the previous page.
Is that possible?
Thanks,
Although it's not a strict rule, it's considered polite to obey robots.txt. I believe the option's default should be true, but we have to provide an option to disable it.
What is the current behavior?
_collectLinks only keeps the href of URLs.
What is the expected behavior?
It would be nice to have, or to be able to also request:
What is the motivation / use case for changing the behavior?
You may decide to follow (or not follow) some types of links based on more than their depth.
Currently headless-chrome-crawler utilizes p-queue for its job management.
p-queue is great, but it's not suitable for a distributed environment for the following reasons:
Since the first one is the core part of p-queue, it's a bit hard to support the feature without breaking the current APIs.
I feel it's easier to implement our own promise queue from scratch.
By doing so, we can use Redis, for example, for shared job management.
sitemap.xml tells you the architecture of a website. It's common to first find the sitemap.xml and start following links from it. It would be nice to provide a method that does this automatically.
I'm considering using this crawler. I couldn't find in the doc a way to filter:
Is there a way that I missed? Or a workaround? Or is it a planned feature?
Three functions that return a boolean (to filter or not) might be enough.
Without this configuration, the crawler does not stop and keeps crawling. You can easily program it to stop after specific requests, but it would be convenient to provide an option for that.
Currently you can use the preRequest function to skip duplicate requests by returning false. However, remembering already-accessed URLs is painstaking, and if you are not using a caching DB, you have to fight against memory leaks. It would be nice to provide an option to choose which DB to use as a cache for remembering URLs, with the crawler taking care of everything.
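A sketch of that workaround as it stands today, keeping visited URLs in an in-memory Set, which is exactly the memory concern described above:
const HCCrawler = require('headless-chrome-crawler');

const visited = new Set(); // grows without bound on long crawls

HCCrawler.launch({
  // Return false to skip a request; remember every URL we let through.
  preRequest: (options => {
    if (visited.has(options.url)) return false;
    visited.add(options.url);
    return true;
  }),
  onSuccess: (result => console.log(result.options.url)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });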
What is the current behavior?
Currently, evaluatePage is requested right after the event specified in the waitUntil option ("load" by default).
What is the expected behavior?
Be able to use the puppeteer page.waitFor method to add a delay before evaluating the page or taking a screenshot.
What is the motivation / use case for changing the behavior?
I'd like to be able to take a screenshot of a requested page, but after a short delay, to wait for any animations triggered by JS code to complete, or, for instance, to wait for asynchronous JS code to be executed.
I tried to set the waitUntil option to "networkidle0", but it's not always enough.
When requests are skipped, retried, or given up, there is no clue that those events happened. It would be nice to emit those events so that users have more control over unexpected situations.
Checking crawled results by extracted JSON data alone is painstaking. It would be nice to have a feature to screenshot the page so that it becomes easy to know whether the crawled page is what you wanted or not. Since this is headless Chrome, it's very easy to do!
What is the current behavior?
Only the allowedDomains option is supported.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support a deniedDomains option.
What is the motivation / use case for changing the behavior?
I noticed that Amazon accepts bots and crawlers in its robots.txt.
However, it explicitly says it does not allow those bots in its conditions of use.
I believe it's useful to provide a feature to kindly avoid crawling these sites.
Please tell us about your environment:
In real-world crawling cases, you don't want to remember accessed pages forever. Instead, you want to specify an expiration duration.
The current code is written with JSDoc-style type definitions, but the type definitions are not validated at all, so I noticed some declarations are broken or wrong. TypeScript supports the checkJs option, which works even for JavaScript, so it can be used for this purpose.
Hi,
First of all, thanks for releasing this. Awesome work!
I would like to access the browser page's requestfinished event.
I believe that event: 'requestfinished' is the crawler's requestfinished event.
My intention is to download the resources of the page (images, fonts, ...).
I'm planning to do something like this:
// On my previous setup, page originates from
// const browser = await puppeteer.launch()
// const page = await browser.newPage()
page.on('requestfinished', async request => {
  // do something with request
  const url = request.url()
  const match = /.*\.(jpg|png|gif)$/.exec(url)
  if (match && match.length === 2) {
    const split = url.split('/')
    let filename = split[split.length - 1]
    const response = request.response()
    const buffer = await response.buffer()
    fs.writeFileSync(`${crawlPathImages}/${filename}`, buffer, 'base64')
  }
})
How can I do this using your crawler?
process.env.HTTP_PROXY is not being passed to puppeteer.launch(); I tried with a running proxy and:
env HTTP_PROXY=localhost:8000 node crawl.js
const HCCrawler = require('headless-chrome-crawler');
const CSVExporter = require('headless-chrome-crawler/exporter/csv'); // exporter path as shown in the project's examples
const htmlToText = require('html-to-text');

const FILE = './tmp/result-sample.csv';

const exporter = new CSVExporter({
  file: FILE,
  fields: ['response.url', 'response.status', 'links.length', 'depth', 'result.content'],
});

HCCrawler.launch({
  maxDepth: 2,
  exporter,
  // Function to be evaluated in browsers
  evaluatePage: (() => ({
    content: htmlToText.fromString($('html').text()),
  })),
})
Currently, when attempting to crawl a non-HTML page such as a PDF, the crawler seems to hang for a while and eventually fails silently without giving any error. How should we go about handling non-HTML files? My thought was to use the preRequest function to modify some options after checking the MIME type, like setting jQuery to false, etc.; but I've not been able to get anything to work for non-HTML content types. How would you recommend accommodating this functionality?
Here is a sample to reproduce this, which shows that the URL is queued, but eventually the crawler exits with neither onSuccess nor onError being called. Ideally, it would be nice to have some logic to intercept non-HTML requests and route them to Apache Tika or something similar.
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  args: ['--disable-web-security'],
  jQuery: false,
  onSuccess: (result => {
    console.log(`crawled: ${result.response.url}`);
  }),
  onError: (error => {
    console.error(error);
  }),
})
  .then(crawler => {
    crawler.queue(['https://pdfs.semanticscholar.org/8f86/834ae39f46447fd588b5817d6d9171f518e6.pdf']);
    crawler.queueSize()
      .then(size => {
        console.log('%s items in cache', size);
      });
    crawler.onIdle()
      .then(() => crawler.close());
  });
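One possible stop-gap, assuming you only need to keep such URLs out of the queue rather than route them to Apache Tika: a preRequest hook that skips URLs with obviously non-HTML extensions (the extension list is illustrative):
const { URL } = require('url');
const HCCrawler = require('headless-chrome-crawler');

// Illustrative list of extensions to skip; extend as needed.
const NON_HTML = /\.(pdf|zip|jpe?g|png|gif|docx?|xlsx?)$/i;

HCCrawler.launch({
  // Returning false here drops the request, so the crawler never hangs
  // trying to render a PDF as a page.
  preRequest: (options => !NON_HTML.test(new URL(options.url).pathname)),
  onSuccess: (result => console.log(`crawled: ${result.response.url}`)),
})
  .then(crawler => {
    crawler.queue('https://example.com/');
    return crawler.onIdle().then(() => crawler.close());
  });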
Can the exporter receive fields from the evaluatePage serialized data?
What is the current behavior?
timeout is not overridable in the crawler.queue() options.
If the current behavior is a bug, please provide the steps to reproduce
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  evaluatePage: (() => ({
    title: $('title').text(),
  })),
  onSuccess: (result => {
    console.log(result);
  }),
})
  .then(crawler => {
    crawler.queue({ url: 'https://example.com/', timeout: 5 });
    crawler.onIdle()
      .then(() => crawler.close());
  });
What is the expected behavior?
timeout should be a valid queue option.
What is the motivation / use case for changing the behavior?
I'd like to be able to set a per-request timeout like it's possible with puppeteer's goto function.
Please tell us about your environment:
What is the current behavior?
Caching is enabled by default.
It's OK for most cases. However, especially for crawling purposes, we'd like to get content not from the cache but from the freshest data.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support an option to enable/disable the cache for each request, defaulting to true.
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
CI frequently fails on response delay test cases.
For example: https://circleci.com/gh/yujiosaka/headless-chrome-crawler/569
If you can store the already-crawled URLs, you will probably want to have pause/resume/clear features. You can pause requests and schedule a resume or clear at any time.
What is the current behavior?
The allowedDomains option only checks the end of the hostname.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
The allowedDomains option should accept a list of regular expressions.
What is the motivation / use case for changing the behavior?
When crawling an international service like Amazon, it has several TLDs for several countries (though crawling Amazon is not allowed by its Conditions of Use; it's just an example).
In that case, a simple string match on the hostname is not enough. It would be useful to accept regular expressions; a sketch of what that might look like follows.
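A purely illustrative sketch of how the requested option might look; the regular expression entry below is not supported by the current API:
const HCCrawler = require('headless-chrome-crawler');

// Hypothetical: the regular expression entry is the proposed addition.
HCCrawler.launch({
  allowedDomains: [
    'example.com',                    // plain suffix match, as supported today
    /^www\.amazon\.(com|co\.jp|de)$/, // proposed: regex matched against the hostname
  ],
});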
Please tell us about your environment:
What is the current behavior?
The front page is crawled twice when the enqueued URL doesn't contain a trailing slash.
If the current behavior is a bug, please provide the steps to reproduce
Configure a queue with a domain such as https://example.com, where the page has a link such as <a href="/">Home</a>.
What is the expected behavior?
It should crawl the front page only once.
What is the motivation / use case for changing the behavior?
In case we export data from the initial page, we might have the data twice.
Please tell us about your environment:
The current workaround is to ensure that we set the URL in the queue with a trailing slash.
What is the current behavior?
Greenkeeper is great, but it does not update yarn.lock.
This is dangerous because updated modules are not used for tests on CI without an updated yarn.lock.
It may cause serious problems when I forget to update yarn.lock manually.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Automatically update yarn.lock when the branch is created by Greenkeeper.
There is a specific package for this purpose: greenkeeperio/greenkeeper#314
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
What is the current behavior?
Today the project automatically resolves the robots.txt.
What is the expected behavior?
It would be useful to be able to provide the robots.txt directly, bypassing the default behavior of resolving it automatically.
What is the motivation / use case for changing the behavior?
You may want to provide a different set of rules (let's say I'm the owner of the site and I want to check how the crawler would behave with a different robots.txt).
In a big distributed environment, you may want to resolve the robots.txt once and share it with all the workers.
What is the current behavior?
I don't believe the crawler handles sitemaps broken out into multiple sitemaps. This is common on large sites since sitemaps are limited to 50k URLs. See "Simplify multiple sitemap management".
A good example is NASA: https://www.nasa.gov/sitemap.xml
What is the expected behavior?
Successfully crawl large sites via sitemap(s)
What is the motivation / use case for changing the behavior?
Large enterprise sites are not being crawled via their sitemaps.
In some cases you would crawl only one or several websites. It would be nice to provide an option to restrict requests to one or multiple hostnames.
What is the current behavior?
Currently the slowMo option is supported only when launching a Chromium instance, not when connecting to an existing Chromium instance.
What is the expected behavior?
It should be supported when connecting to an existing Chromium instance.
What is the motivation / use case for changing the behavior?
It was fixed upstream here.
Please tell us about your environment:
What is the current behavior?
When skipDuplicates is set to true, already-crawled URLs will not be executed.
However, the request is queued anyway, even when the URL is not allowed or already crawled.
What is the expected behavior?
Do not enqueue the request in the first place if it has already been executed.
What is the motivation / use case for changing the behavior?
I want to make this crawler memory-friendly.
What is the current behavior?
In rare cases (in my case, less than 3%), the crawler fails to enqueue the next request and never stops.
The error goes like this:
TypeError: Cannot read property 'priority' of undefined
at lowerBound (/Users/yujiosaka/work/headless-chrome-crawler/cache/session.js:71:63)
at lowerBound (/Users/yujiosaka/work/headless-chrome-crawler/lib/helper.js:117:11)
at SessionCache.enqueue (/Users/yujiosaka/work/headless-chrome-crawler/cache/session.js:71:15)
at PriorityQueue.push (/Users/yujiosaka/work/headless-chrome-crawler/lib/priority-queue.js:34:17)
at PriorityQueue.<anonymous> (/Users/yujiosaka/work/headless-chrome-crawler/lib/helper.js:178:23)
at HCCrawler._push (/Users/yujiosaka/work/headless-chrome-crawler/lib/hccrawler.js:271:17)
at _skipRequest.then.skip (/Users/yujiosaka/work/headless-chrome-crawler/lib/hccrawler.js:545:16)
at process._tickCallback (internal/process/next_tick.js:103:7)
If the current behavior is a bug, please provide the steps to reproduce
The following script sometimes causes errors.
const HCCrawler = require('headless-chrome-crawler');
HCCrawler.launch({
  maxDepth: 3,
  maxConcurrency: 10,
  allowedDomains: ['www.emin.co.jp'],
  evaluatePage: (() => window.document.title),
  onSuccess: (result => { // function evaluated on success
    console.log(`${result.options.url}\t${result.result}`);
  }),
})
  .then(crawler => {
    crawler.queue('https://www.emin.co.jp/');
    crawler.onIdle()
      .then(() => crawler.close());
  });
What is the expected behavior?
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
What is the current behavior?
Only breadth-first search is supported.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior?
Support depth-first search also.
What is the motivation / use case for changing the behavior?
Breadth-first search is great for finding a targeted page, especially when the page is not far from the first page. However, it tends to require more memory, and it's not always the fastest algorithm when the target page is very far from the first page.
Users should be able to choose which algorithm to crawl with on a case-by-case basis. If the user needs more control over the algorithm, he/she can do that by using the priority option.
Please tell us about your environment:
It would be a nice feature to collect possible links and automatically follow them. It would also be nice to be able to modify the rules by providing an option.