nrabinowitz / pjscrape
A web-scraping framework written in Javascript, using PhantomJS and jQuery
Home Page: http://nrabinowitz.github.io/pjscrape/
License: MIT License
I get the above error message when I try to run the tutorial in the README file.
This happens on a Mac with Snow Leopard running PhantomJS.
Don't rule out that this may be a problem on my side.
Links to //domain.tld/ are quite common for dealing with the http/https protocol split. They should be expanded to http://domain.tld/ on an http site and to https://domain.tld/ on an https site. _pjs.toFullUrl() instead expands such links into base//domain.tld/, which is of course wrong.
Simple real-world example: http://en.wikipedia.org/wiki/A
Check the footer for //wikimediafoundation.org/ and //www.mediawiki.org/, which are expanded to http://en.wikipedia.org//wikimediafoundation.org/ and http://en.wikipedia.org//www.mediawiki.org/.
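The correct expansion can be sketched in plain JavaScript. Note that toFullUrl below is a hypothetical stand-in illustrating the expected behavior, not the actual _pjs.toFullUrl implementation:

```javascript
// Hypothetical helper showing the expected behavior: a protocol-relative
// URL ("//host/path") should inherit only the scheme of the base page.
function toFullUrl(base, href) {
    if (href.indexOf('//') === 0) {
        // take just the scheme ("http:" or "https:") from the base URL
        return base.split('//')[0] + href;
    }
    if (/^https?:\/\//.test(href)) {
        return href; // already absolute
    }
    // naive relative resolution, for illustration only
    return base.replace(/\/[^\/]*$/, '/') + href;
}

console.log(toFullUrl('http://en.wikipedia.org/wiki/A', '//wikimediafoundation.org/'));
// prints "http://wikimediafoundation.org/"
```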
I'm getting "ERROR: Page did not load (status=fail):" errors with PhantomJS 1.5 and the latest version of pjscrape:
$ phantomjs --load-images=no --web-security=no --ignore-ssl-errors=yes --disk-cache=yes --cookies-file=/tmp/cookies.txt --load-plugins=yes pjscrape-head.js cnn.js
* Suite 0 starting
* Opening http://www.cnn.com
* Scraping http://www.cnn.com
TypeError: 'null' is not an object
phantomjs://webpage.evaluate():4
phantomjs://webpage.evaluate():1
pjscrape-head.js:723
pjscrape-head.js:720
pjscrape-head.js:488
* Found 215 additional urls to scrape
* Suite 0 complete
* Suite 0-sub0 starting
* Opening http://edition.cnn.com/
ERROR: Page did not load (status=fail): http://edition.cnn.com/
Looks similar to this issue, but I'm sure the Content-Type header is available:
http://code.google.com/p/phantomjs/issues/detail?id=128
ariya/phantomjs#135
Here's the code I'm using:
var scraper = function() {
    var links = $('a');
    links = links.map(function(index, elem) {
        return $(elem).text();
    }).toArray();
    return links;
};
pjs.addSuite({
url: 'http://www.cnn.com',
maxDepth: 1,
noConflict: true,
moreUrls: function() {
return _pjs.getAnchorUrls('a', false)
},
scraper: scraper
});
Adding some debug code I see that status and statusText are null:
* Opening http://edition.cnn.com/cnni/
received: {
"contentType": null,
"headers": [],
"id": 220,
"redirectURL": null,
"stage": "end",
"status": null,
"statusText": null,
"time": "2012-05-20T08:05:14.977Z",
"url": "http://edition.cnn.com/CNNI/"
}
ERROR: Page did not load (status=fail): http://edition.cnn.com/cnni/
Note that removing the noConflict option fixes the "TypeError: 'null' is not an object" issue.
I've searched all over and did not find a thing.
How do you change the user agent from the config file?
pjs.config({
    userAgent: 'someuseragent'
});
Am I missing something?
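One hedged guess: pjscrape's docs mention a pageSettings config option elsewhere; if that maps onto PhantomJS's page.settings (where the user agent actually lives), something like the following might work. This is an assumption, not documented API:

```javascript
// Assumption: pjscrape forwards pageSettings to PhantomJS's page.settings.
// In raw PhantomJS the equivalent would be page.settings.userAgent = '...'.
pjs.config({
    pageSettings: {
        userAgent: 'someuseragent'
    }
});
```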
Not sure how to handle pages with an infinite scroll feature, where a window scroll event is used to trigger the next page load.
One example is below, where the next page gets loaded after you scroll down to the bottom:
http://www.shopbychoice.com/mobile-phones/smartphones
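pjscrape has no built-in hook for this, but in raw PhantomJS one approach is to scroll inside page.evaluate() and pause between scrolls before scraping. The selector, timings, and scroll count below are guesses, and this is an untested sketch rather than a pjscrape feature:

```javascript
// Untested sketch in raw PhantomJS: scroll to the bottom a few times,
// pausing so each AJAX batch can load, then count the items.
var page = require('webpage').create();
page.open('http://www.shopbychoice.com/mobile-phones/smartphones', function (status) {
    var scrolls = 0, maxScrolls = 5;
    function scrollOnce() {
        page.evaluate(function () {
            window.scrollTo(0, document.body.scrollHeight);
        });
        if (++scrolls < maxScrolls) {
            setTimeout(scrollOnce, 2000); // give the next batch time to load
        } else {
            var count = page.evaluate(function () {
                // ".product" is a placeholder selector, not the site's real markup
                return document.querySelectorAll('.product').length;
            });
            console.log('items after scrolling: ' + count);
            phantom.exit();
        }
    }
    scrollOnce();
});
```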
Just a stub. Trying to make it work atm....
I've read the docs and checked the code, but the following config doesn't result in a file being written. I'm using the latest version of PhantomJS (1.6.1) and master branch of pjscrape (0.1.4 in VERSION.txt)
pjs.addSuite({
url: 'http://www.google.com/',
writer: 'itemfile',
// outFile: '/tmp/pjscrape-out.txt',
scraper: function() {
return {
filename: '/tmp/pjscrape-out.txt',
content: document.documentElement.outerHTML
}
}
});
I've also tried 'file' writer and outFile in the top-level config, but neither results in a file being written.
The configuration field batchSize is essentially ignored, as the file writer's write() method clears the file every time it is called. If a script crashes, all data is lost.
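A possible fix can be sketched using PhantomJS's fs module: write each batch in append mode so earlier batches survive a crash. This is an illustration under that assumption, not the actual pjscrape writer code:

```javascript
// Sketch only: append each batch rather than truncating and rewriting the
// whole output file on every write() call.
var fs = require('fs'); // PhantomJS fs module, not Node's
function appendBatch(outFile, items) {
    // 'a' opens the file in append mode; each call adds one JSON line
    fs.write(outFile, JSON.stringify(items) + '\n', 'a');
}
```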
Use case:
Let's call http://www.example.com/ "root". "root" contains links to root.1, root.2, root.3...root.250 (see hermitageart.com...an actual example with 260 links!!!). Each of these 250 links contains links to other pages. If my feature of interest were found only in root.3 and root.102, then ideally root.4, root.5, ...root.250 should not be accessed, i.e. page.open should not be called on them.
I think this would need to be addressed by setting a flag (maybe on the _pjs.state object?) to end the suite early, which could be checked in the page completion callback, emptying out the array of still-to-scrape pages. Question: this only affects the current level of recursion. Is that good? Do we need an early exit from the entire suite?
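The flag idea might look like the following. checkEarlyExit, state.stopSuite, and the queue argument are all hypothetical names for illustration, not existing pjscrape API:

```javascript
// Hypothetical: in the page completion callback, empty the still-to-scrape
// queue if the client-side scraper has set a stop flag on the shared state.
function checkEarlyExit(state, queue) {
    if (state.stopSuite) {
        queue.length = 0; // drop root.4 ... root.250 without opening them
    }
    return queue;
}
```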
Settings in pjs.config() will cascade down to all suites. So you could set your suite settings this way, and then add a conditional block to the CL argument parsing, around here, that would check for urls instead of file names, and, if found, do:
pjs.addSuite({ url: configFile });
I'd have to think whether there are any issues adding many individual suites if many urls were supplied...
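The conditional in the argument parsing could be sketched like this; classifyArg is a hypothetical helper, not existing code:

```javascript
// Hypothetical: decide whether a command-line argument is a URL (to be
// wrapped in a one-off suite via pjs.addSuite({ url: arg })) or a config
// file to be loaded as before.
function classifyArg(arg) {
    return /^https?:\/\//.test(arg) ? 'url' : 'configFile';
}
```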
https://nrabinowitz.github.io/pjscrape/#quickstart has no content, the other tabs on the site don't either.
I am building an RPM, but I am not very used to this new JavaScript world. So, what is the best directory to place this in? (Note: I have already packaged PhantomJS as an RPM.)
Maybe /usr/share/phantomjs/libs?
I am trying to create a very basic scraper with PhantomJS and pjscrape framework.
pjs.config({
timeoutInterval: 6000,
timeoutLimit: 10000,
format: 'csv',
csvFields: ['productTitle','price'],
writer: 'file',
outFile: 'D:\\prod_details.csv'
});
pjs.addSuite({
title: 'ChainReactionCycles Scraper',
url: productURLs, //This is an array of URLs, two example are defined below
scrapers: [
function() {
var results = [];
var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
results.push([linkTitle[0],linkPrice[0]]);
return results;
}
]
});
This first array DOES NOT WORK and fails after the 3rd or 4th URL.
var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for(var i=0;i<productURLs.length;++i){
productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}
This second array WORKS and does not fail, even though it is from the same site.
var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for(var i=0;i<categoriesURLs.length;++i){
categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}
When iterating through productURLs, the PhantomJS page.open optional callback automatically assumes failure, even when the page hasn't finished loading.
I know this because I started the script while running an HTTP debugger, and the HTTP requests were still running even after PhantomJS had reported a page load failure.
However, the code works fine when running with categoriesURLs.
These are solutions I have tried thus far.
Any ideas?
I think pjscrape's .ready() function should execute last, but I'm not sure. We should add tests for $(document).ready() execution order (maybe with Prototype, etc., too).
Wouldn't it be good to package this for npm?
Sorry to be so blunt, but the last update was a year ago and a lot of issues keep coming in.
Has the author given up on this project? Is there any alternative you recommend?
I found:
https://www.npmjs.com/package/jedi-crawler
https://www.npmjs.com/package/console-crawler
https://www.npmjs.com/package/spa-crawler
But they also have not been updated for a while. (BTW, can this be put on npm?)
There is a lot of stuff at https://www.npmjs.com/search?q=crawler&page=2, but I feel I should pick something built on top of PhantomJS or another headless browser so as not to limit myself.
What do you think? Thank you very much!
Hi there,
First of all: I really like the project, good stuff!
Now to my request ;)
I use pjscrape to scrape a single file and then do some post-processing of the collected data. Right now that post-processing doesn't seem to be possible.
It would be great to have a post scrape callback, ideally with the data collected available.
There is no documentation on the page about how to use it.
I am trying to scrape an entire page and save it into a JSON file using pjscrape. The following code runs, and I can see the entire DOM on standard output, but I don't see the file scrape_output.json in the current directory.
pjs.addSuite({
// single URL or array
url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
// single function or array, evaluated in the client
scraper: function() {
return $(document).text();
},
// options: 'json' or 'csv'
format: 'json',
// options: 'stdout' or 'file' (set in config.outFile)
writer: 'file',
outFile: 'scrape_output.json'
});
Add a client-side data persistence mechanism - localStorage would work (only if you're on the same domain - otherwise you'd need PhantomJS-level persistence, requiring a separate function...). Use case: scraping sub-pages of a category page and persisting the category.
More thinking on this: I could support cross-domain data using _pjs.state on the client side (default {}), grabbing it in page.open() at the end of the scrape and then writing it as JSON to the object at the beginning.
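The grab-and-rewrite step amounts to a JSON round trip between pages. A minimal sketch, with illustrative function names only:

```javascript
// Hypothetical sketch of carrying _pjs.state across pages: serialize it
// when a page finishes, then seed the next page's client state from the JSON.
function saveState(state) {
    return JSON.stringify(state || {});
}
function restoreState(json) {
    return json ? JSON.parse(json) : {};
}
```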
The selector used in the examples no longer works, update as necessary.
* CLIENT: mw.loader::execute> Exception thrown by jquery.ui.mouse: Result of expression '$.widget' [undefined] is not a function. (http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20111213T185322Z line 142)
* CLIENT: TypeError: Result of expression '$.widget' [undefined] is not a function. (http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery.ui.dialog%2Cdraggable%2Cmouse%2Cposition%2Cresizable&skin=vector&version=20110919T161434Z&* line 40)
I've posted this on the PhantomJS Google Group, but I thought I'd ask here as well in case you knew the answer.
Run the following script using: phantomjs test.coffee
page = require('webpage').create()
page.onResourceReceived = (response) ->
console.log('Received ' + response.url)
url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
page.open url, (status) ->
console.log 'Finished loading'
document.location.href = '#!foo'
The fiddle that is loaded (see http://jsfiddle.net/simonvwade/wpstb/11/) makes an AJAX request whenever the hash fragment changes (try clicking the link). However, this doesn't happen when document.location.href is set in PhantomJS.
I would expect to see "Received ...normalize.css" showing after "Finished loading"
Using the first example here: http://nrabinowitz.github.io/pjscrape/
pjs.addSuite({
    // url to scrape
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    // selector to look for
    scraper: '#sortable_table_id_0 tr td:nth-child(2)'
});
// Output: ["Addison","Albany","Alburgh", ...]
If I launch it as is, I get suites starting. But if I only replace the url with anything else (e.g. https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Utah)
I get
`RangeError: Maximum call stack size exceeded.
undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.
undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.
phantomjs://code/pjscrape.js:894 in global code
:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.
undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
FATAL ERROR: No suites configured
`
Nothing else changed...
Hello
I run a pjscrape script using the proxy settings in PhantomJS, something like:
phantomjs --proxy=ip:port pjscrape.js my_config.js scraping.js
but I get the following error:
ERROR: Page did not load (status=fail)
Any idea? I use different proxy IPs, and they work within the Firefox browser.
Thanks
Right now there's no great way to recover from a stalled scrape - setting a timeout in suite.scrape() could prevent this from killing the rest of the scrape.
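A per-page watchdog might be sketched like this (hypothetical, not current pjscrape API):

```javascript
// Hypothetical: start a timer when a scrape begins; if it fires first, the
// caller skips the stalled page. A successful scrape cancels the timer.
function makeWatchdog(ms, onTimeout) {
    var id = setTimeout(onTimeout, ms);
    return function cancel() {
        clearTimeout(id);
    };
}

// usage sketch: var cancel = makeWatchdog(30000, skipPage); ... cancel();
```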
I'm trying to figure out how to do a nested scrape which relies on data from the first scraper in the second.
I'm pulling the Artist names from: http://www.billboard.com/artists/top-100?page=0
This part works:
pjs.addSuite({
url: 'http://www.billboard.com/artists/top-100?page=0',
scraper: function() {
var artists=[];
$(".artist-top-100 h1 > a").each(function(i,el){
artists.push( {'name':$(el).text(), 'url':$(el).attr("href")} );
});
return artists;
}
});
I then want to go into the individual Artist's page and grab the top songs: http://www.billboard.com/artist/371422/taylor-swift
Individually this works too:
pjs.addSuite({
url: 'http://www.billboard.com/artist/371422/taylor-swift',
scraper: function() {
var songs=[];
$(".module_chart_position b").each(function(i,el){
songs.push( $(el).text() );
});
return songs;
}
});
but what I want to get is the return from scrape #2 as a part of the return for scrape #1, so that it looks more like:
[
{
name: 'Taylor Swift',
url: '/artist/371422/taylor-swift',
songs: ['I Knew You Were Trouble', '22']
}
...
]
When I try to nest them, it says _name and _url are undefined.
pjs.addSuite({
url: 'http://www.billboard.com/artists/top-100?page=0',
scraper: function() {
var artists=[];
$(".artist-top-100 h1 > a:eq(0)").each(function(i,el){
var _name = $(el).text();
var _url = $(el).attr("href");
artists.push( {'name':_name, 'url':_url} );
});
(function(_name,_url){
pjs.addSuite({
url: _url,
scraper: function() {
var songs=[];
$(".module_chart_position b").each(function(i,el){
songs.push( $(el).text() );
});
return songs;
}
});
})(_name, _url);
return artists;
}
});
I see the note on the documentation page about the private scope, and I don't quite understand how to apply the suggested evaluate. I guess the question is: is there a workaround for this, or is there another way to accomplish the above?
How do I specify a file to export scraped data to, instead of printing it to the screen? This would be nice to have in the documentation. I am using PhantomJS 1.6. Thanks!
Allows for delays between runs.
pjs.config({
delayBetweenRuns: 1000,
});
This is the original inspiration for the code:
http://stackoverflow.com/a/14238881
I've searched but can't seem to find any documentation on how to use the pageSettings option.
Due to an issue with PhantomJS (I've tested 1.7.0 on both Mac OS X and Debian 6), issue 353, the page.open() callback is being called multiple times on some URLs. This is related to iframes being created within the page (you can find more details in the open issue).
line 680 // run the scrape
line 681 page.open(url, function(status) {
Below you can see an example of what the log looks like when scraping is invoked many times:
xxxxx@ip-xxxxxxxxx:~/crawler$ phantomjs --web-security=no --load-images=no --ignore-ssl-errors=yes ./pjscrape-600e20a/pjscrape.js ./bin/pjscrape-600e20a/pjscrape.js ./config.js
Using config file: src/main/resources/com/apicube/crawler/pjscrape/config.js
* Suite 0 starting
* Opening http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
In the meantime, I've applied a workaround to stop the duplicated events: I use the visited array to know whether a page was already visited. I added a condition before line 700, as you can see below:
if(visited[url]) {
log.msg('Page recalled: ' + url);
return;
}
// mark as visited
visited[url] = true;
Hope this helps to fix this bug.
Diego
It is not possible to pass any arguments to the script, because there is an error evaluating the arguments:
phantomjs pjscrape/pjscrape.js pjs.config.js fooparm
FATAL ERROR: Config file not found: fooparm
Any thoughts on obeying robots.txt file?
Is it possible to output to multiple files when scraping multiple URLs? For example, if I'm scraping page1.html and page2.html is it possible to have each of the scrapes output to page1.csv and page2.csv instead of simply one concatenated output.csv?
Hello,
I would like to keep the result in a variable for subsequent manipulation. In JS terms, is there a complete event I can attach a function to?
Similar to:
pjs.addSuite({
url: urlCPB,
moreUrls: '#recherche > table > tbody > tr > td.torrent-aff > a',
maxDepth: 1,
scraper: function() {
return {
name: $('#center-middle > div.content > div.torrent > h1').text(),
fileSize: $('#center-middle > div.content > div.torrent > div:nth-child(6) > fieldset > strong:nth-child(2)').text(),
torrent: $('#center-middle > div.content > div.torrent > div:nth-child(8) > div:nth-child(2) > a').attr('href')
};
},
complete: function(data){
console.log('This is the final data object being scraped: ' + data);
}
});
pjs.addSuite({
url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
scraper: function() {
return $('#sortable_table_id_0 tr').slice(1).map(function() {
var name = $('td:nth-child(2)', this).text(),
county = $('td:nth-child(3)', this).text(),
// convert relative URLs to absolute
link = _pjs.toFullUrl(
$('td:nth-child(2) a', this).attr('href')
);
return {
model: "myapp.town",
fields: {
name: name,
county: county,
link: link
}
}
}).toArray(); // don't forget .toArray() if you're using .map()
}
});
An example in the tutorials does not work anymore. I can't reference _pjs within the map function above. I get "ReferenceError: Can't find variable: _pjs" on the command line.
UPDATE: OK, non-issue, can be closed. Hope this helps somebody.
I was running my scraper as:
$ phantomjs pjscrape.js myscraper.js
Instead, you want all the contents of the downloaded zip/tarball on the path:
$ phantomjs pjscrape/pjscrape.js myscraper.js // here the pjscrape dir has all files from the zip or tarball
This project is really great, and easy to set up and get started with.
I've used it before to download files by getting their URLs, but now I'm facing the problem that the URLs (to images) are tied to a session, and I can't just grab the URLs and fetch them later anymore. Ideally I should grab the files while I scrape.
So, how can I download files from within the scraper (maybe with some writer other than JSON or STDOUT)? Is it even possible to use pjscrape to download files?
Sometimes an a tag contains an href like "//www.foo.com/bar", meaning that the protocol of the link is the same as the current window's; it is indeed not a relative URL but an absolute one.
On the YouTube home page:
<a href="//www.youtube.com/upload">...</a>
The pjscrape function will produce the URL http://www.foo.com//www.foo.com/bar, leading to a 404 error when the page is visited.
Is there any way to call pjscrape via the command line? Something like pjscrape.sh my_config.json "http://myurl.org", appending the results to a file?
I have to scrape thousands of similar sites (only an id in the URL changes). All those URLs are saved in a database, so a CLI call would be perfect.
Thanks for your help! Great project!
The PhantomJS phantom.args.length call is outdated.
TypeError: undefined is not an object (evaluating 'phantom.args.length')
pjscrape.js:877 in global code
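PhantomJS replaced phantom.args with the system module, so a fix along these lines should work; note that system.args includes the script path at index 0, unlike the old phantom.args:

```javascript
// system.args replaces the removed phantom.args; index 0 is the script
// name, so the old phantom.args[i] corresponds to system.args[i + 1].
var system = require('system');
if (system.args.length > 1) {
    var configFile = system.args[1];
}
```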
Add an option for a preSuite function, in the PhantomJS environment, with page passed in as an argument, to support things like session-based authentication before a scrape
I'm interested in using pjscrape to scrape pages with an infinite scroll. Trying to figure out the best way to do this. I'm happy to try and implement this feature myself, but I'm curious if you've given this any thought and if so, how you would approach doing this.
How can I combine data from number of pages into one object as if they were a single page (without post-processing)?
I am trying to figure out whether this script can run independently in a user's browser. By this I mean: could a URL be entered into any browser that, in effect, executes the scrape using the user's IP, all without communicating back to a central host server?
Additionally, can one specify variables such as custom user agents and DNS to be used when such requests are made?
If a value for timeoutInterval is provided, PhantomJS will eventually stop using cookies: phantom.addCookie() will return false (as will page.addCookie()).
The larger the timeoutInterval, the fewer suites can be run before cookie functionality fails.
With a timeoutInterval of 3,000, cookies fail around suite 22. With an interval of 10,000, they fail around suite 7.
I believe the issue stems from the number of times page.evaluate() is called, but I'm not sure.
Suite 0 starting
Opening http://golem.de
TypeError: 'undefined' is not an object
pjscrape/pjscrape.js:660
I'm trying to save the HTML of a page with this code, but with no result.
Any help would be much appreciated.
var scraper = function() {
return $().html();
};
pjs.addSuite({
url: 'http://www.expoquimia.com/exhibitors',
moreUrls: function() {
return _pjs.getAnchorUrls('li a');
},
maxDepth: 1,
scraper: scraper
});
pjs.config({
// options: 'stdout', 'file' (set in config.logFile) or 'none'
log: 'stdout',
// options: 'json' or 'csv'
format: 'json',
// options: 'stdout' or 'file' (set in config.outFile)
writer: 'file',
outFile: 'scrape_output.json'
});