pjscrape's Issues

_pjs.toFullUrl can't properly handle //domain.tld

Protocol-relative links like //domain.tld/ are quite common as a way to handle both the http and https protocols. They should be expanded to http://domain.tld/ on an http site or https://domain.tld/ on an https site.

_pjs.toFullUrl() expands such links into base//domain.tld/, which is of course wrong.

Simple real world example: http://en.wikipedia.org/wiki/A
Check the footer for //wikimediafoundation.org/ and //www.mediawiki.org/ which are expanded to http://en.wikipedia.org//wikimediafoundation.org/ and http://en.wikipedia.org//www.mediawiki.org/.
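
A sketch of the expected behaviour (the helper below is a hypothetical stand-in, not the actual _pjs.toFullUrl implementation):

// Expand protocol-relative URLs using the current page's protocol.
function expandProtocolRelative(url) {
    if (url.indexOf('//') === 0) {
        // on http://en.wikipedia.org/wiki/A this turns //wikimediafoundation.org/
        // into http://wikimediafoundation.org/
        return window.location.protocol + url;
    }
    return url;  // absolute and relative URLs would keep their existing handling
}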

ERROR: Page did not load (status=fail)

I'm getting "ERROR: Page did not load (status=fail):" errors with PhantomJS 1.5 and the latest version of pjscrape:

$ phantomjs --load-images=no --web-security=no --ignore-ssl-errors=yes --disk-cache=yes --cookies-file=/tmp/cookies.txt --load-plugins=yes pjscrape-head.js cnn.js 
* Suite 0 starting
* Opening http://www.cnn.com
* Scraping http://www.cnn.com
TypeError: 'null' is not an object

  phantomjs://webpage.evaluate():4
  phantomjs://webpage.evaluate():1
  pjscrape-head.js:723
  pjscrape-head.js:720
  pjscrape-head.js:488
* Found 215 additional urls to scrape
* Suite 0 complete
* Suite 0-sub0 starting
* Opening http://edition.cnn.com/
ERROR: Page did not load (status=fail): http://edition.cnn.com/

Looks similar to this issue, but I'm sure the Content-Type header is available:
http://code.google.com/p/phantomjs/issues/detail?id=128
ariya/phantomjs#135

Here's the code I'm using:

var scraper = function() {
  var links = $('a')
  links = links.map(function(index, elem) { 
    return $(elem).text()
  }).toArray()
  return links
}

pjs.addSuite({
  url: 'http://www.cnn.com',
  maxDepth: 1,
  noConflict: true,
  moreUrls: function() {
    return _pjs.getAnchorUrls('a', false)
  },
  scraper: scraper
});

Adding some debug code, I see that status and statusText are null:

* Opening http://edition.cnn.com/cnni/
received: {
    "contentType": null,
    "headers": [],
    "id": 220,
    "redirectURL": null,
    "stage": "end",
    "status": null,
    "statusText": null,
    "time": "2012-05-20T08:05:14.977Z",
    "url": "http://edition.cnn.com/CNNI/"
}
ERROR: Page did not load (status=fail): http://edition.cnn.com/cnni/

Note that removing the noConflict option fixes the "TypeError: 'null' is not an object" issue.
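
For reference, the kind of debug hook that produces output like the above is a plain PhantomJS callback; a minimal sketch (the JSON formatting is just for readability):

page.onResourceReceived = function(response) {
    // logs every response PhantomJS sees, including the entries with null status
    console.log('received: ' + JSON.stringify(response, undefined, 4));
};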

How do you change the useragent?

I've searched everywhere and couldn't find a thing. How do you change the user agent from the config file?

pjs.config({
    userAgent: 'someuseragent'
});

Am I missing something?
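
I'm not sure pjscrape's config exposes a userAgent option; at the PhantomJS level the user agent lives on page.settings, so a sketch outside pjscrape (or in a patched copy that has access to the page object) would be:

var page = require('webpage').create();
// PhantomJS-level override; pjscrape would need to apply this to its own page object
page.settings.userAgent = 'someuseragent';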

Can't get the itemfile or file writers to work

I've read the docs and checked the code, but the following config doesn't result in a file being written. I'm using the latest version of PhantomJS (1.6.1) and master branch of pjscrape (0.1.4 in VERSION.txt)

pjs.addSuite({
  url: 'http://www.google.com/',
  writer: 'itemfile',
  // outFile: '/tmp/pjscrape-out.txt',
  scraper: function() {
    return {
      filename: '/tmp/pjscrape-out.txt',
      content: document.documentElement.outerHTML
    }
  }
});

I've also tried 'file' writer and outFile in the top-level config, but neither results in a file being written.

Batch writing to file broken

The configuration field batchSize is essentially ignored as the file writer's write() method clears the file every time it is called. If a script crashes all data is lost.
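
One possible fix is to append each batch instead of rewriting the whole file on every write(). A minimal sketch, assuming PhantomJS's fs module and that items arrive already serialized as strings:

var fs = require('fs');

// Sketch: append the current batch so a crash only loses that batch,
// not everything written so far. 'a' opens the file in append mode.
function writeBatch(outFile, items) {
    fs.write(outFile, items.join('\n') + '\n', 'a');
}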

Early suite exit

Use case:

Let's call http://www.example.com/ "root". "root" contains links to root.1, root.2, root.3...root.250 (see hermitageart.com...an actual example with 260 links!!!). Each of these 250 links contains links to other pages. If my feature of interest was found only in root.3 and root.102, then ideally I would have liked root.4, root.5,....root.250 to not be accessed, i.e. page.open should not be called on them.

I think this would need to be addressed by setting a flag (maybe on the _pjs.state object?) to end the suite early, which could be checked in the page completion callback, emptying out the array of still-to-scrape pages. Question: this only affects the current level of recursion. Is that good? Do we need an early exit from the entire suite?
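
A rough sketch of that flag idea (the names _pjs.state.exitEarly and pendingUrls are hypothetical, only to show where the check could sit):

// Client side, inside a scraper: signal that nothing further is needed.
_pjs.state.exitEarly = true;  // hypothetical flag

// PhantomJS side, in the page completion callback (sketch):
if (scrapedState.exitEarly) {
    pendingUrls.length = 0;  // hypothetical queue; emptying it skips the remaining pages at this level
}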

Take urls from the command line

Settings in pjs.config() will cascade down to all suites. So you could set your suite settings this way, and then add a conditional block to the CL argument parsing, around here, that would check for urls instead of file names, and, if found, do:

pjs.addSuite({ url: configFile });

I'd have to think whether there are any issues adding many individual suites if many urls were supplied...
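
A sketch of that conditional block (the surrounding argument loop and the use of phantom.injectJs for config files are assumptions about how the argument parsing is structured):

args.forEach(function(arg) {
    if (/^https?:\/\//.test(arg)) {
        // looks like a URL: add it as its own single-URL suite
        pjs.addSuite({ url: arg });
    } else {
        // otherwise keep the existing behavior and treat it as a config file
        phantom.injectJs(arg);
    }
});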

RPM

I am building an RPM, but I am not very familiar with this new JavaScript world. So, what is the best directory to place this in? (Note: I have already packaged PhantomJS as an RPM.)

maybe /usr/share/phantomjs/libs ???

Failing on some multiple URLs

Overview

I am trying to create a very basic scraper with PhantomJS and the pjscrape framework.

My Code

pjs.config({
    timeoutInterval: 6000,
    timeoutLimit: 10000,
    format: 'csv',
    csvFields: ['productTitle', 'price'],
    writer: 'file',
    outFile: 'D:\\prod_details.csv'
});

pjs.addSuite({
    title: 'ChainReactionCycles Scraper',
    url: productURLs, // this is an array of URLs; two examples are defined below
    scrapers: [
        function() {
            var results = [];
            var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
            var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
            results.push([linkTitle[0], linkPrice[0]]);
            return results;
        }
    ]
});

URL Arrays Used

This first array DOES NOT WORK and fails after the 3rd or 4th URL.

var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for(var i=0;i<productURLs.length;++i){
  productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}

This second array WORKS and does not fail, even though it is from the same site.

var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for(var i=0;i<categoriesURLs.length;++i){
  categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}

Problem

When iterating through productURLs, the PhantomJS page.open optional callback automatically assumes failure, even when the page hasn't finished loading.

I know this because I started the script while running an HTTP debugger, and the HTTP requests were still running even after PhantomJS had reported a page load failure.

However, the code works fine when running with categoriesURLs.

Assumptions

  1. All the URLs listed above are VALID
  2. I have the latest versions of both PhantomJS and pjscrape

Possible Solutions

These are solutions I have tried thus far.

  1. Disabling image loading (page.options.loadImages = false)
  2. Setting a larger timeoutInterval in pjs.config; this was not useful, apparently, as the error generated was a page.open failure and NOT a timeout failure.

Any ideas?

Tests for .ready() execution order

I think Pjscrape's .ready() function should execute last, but I'm not sure. Should add tests for $(document).ready() execution order (maybe with Prototype, etc, too?).

Is this project dead?

Sorry to be so blunt, but the last update was a year ago and there are a lot of issues coming in.
Has the author given up on this project? Is there any alternative you recommend?

I found https://www.npmjs.com/package/jedi-crawler https://www.npmjs.com/package/console-crawler https://www.npmjs.com/package/spa-crawler
But they also have not been updated for a while. (BTW, can this be put on NPM?)

There is a lot of stuff at https://www.npmjs.com/search?q=crawler&page=2, but I feel I should pick something built on top of PhantomJS or another headless browser so as not to limit myself.
What do you think? Thank you very much!

post scrape callback

Hi there,

First of all: I really like the project, good stuff!

Now to my request ;)

I use pjscrape to scrape a single file and then do some post processing of the collected data. Right now the post processing doesn't seem to be possible.
It would be great to have a post scrape callback, ideally with the data collected available.

how to start

There is no documentation on the page about how to use it.

I am not able to output to file using pjscrape

I am trying to scrape the entire page and save it into a JSON file using pjscrape. The following code runs and I can see the entire DOM in standard output, but I don't see the file scrape_output.json in the current directory.

pjs.addSuite({
    // single URL or array
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    // single function or array, evaluated in the client
    scraper: function() {
        return $(document).text();
    },

    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
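
One thing worth checking, based on how other configs in these issues are set up (an assumption, not a confirmed fix): format, writer and outFile are set via pjs.config() elsewhere, so moving them out of the suite may help:

pjs.config({
    format: 'json',
    writer: 'file',
    outFile: 'scrape_output.json'
});

pjs.addSuite({
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    scraper: function() {
        return $(document).text();
    }
});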

Add persistent data support

Add a client-side data persistence mechanism - localStorage would work (only if you're on the same domain - otherwise you'd need PhantomJS-level persistence, requiring a separate function...). Use case: scraping sub-pages of a category page and persisting the category.

More thinking on this: I could support cross-domain data using _pjs.state on the client side (default {}), grabbing it in page.open() at the end of the scrape and then writing it as JSON to the object at the beginning.

Error on example scrape

* CLIENT: mw.loader::execute> Exception thrown by jquery.ui.mouse: Result of expression '$.widget' [undefined] is not a function. (http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20111213T185322Z line 142)
* CLIENT: TypeError: Result of expression '$.widget' [undefined] is not a function. (http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery.ui.dialog%2Cdraggable%2Cmouse%2Cposition%2Cresizable&skin=vector&version=20110919T161434Z&* line 40)

Hash fragments don't seem to be supported by PhantomJS

I've posted this on the PhantomJS Google Group, but I thought I'd ask here as well in case you knew the answer.

Run the following script using: phantomjs test.coffee

page = require('webpage').create()
page.onResourceReceived = (response) ->
  console.log('Received ' + response.url)
url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
page.open url, (status) ->
  console.log 'Finished loading'
  document.location.href = '#!foo'

The fiddle that is loaded (see http://jsfiddle.net/simonvwade/wpstb/11/) makes a request using AJAX whenever the hash fragment changes (try clicking the link). However this doesn't happen when document.location.href is called in PhantomJS.

I would expect to see "Received ...normalize.css" showing after "Finished loading"

Can't understand this

Using the first example here: http://nrabinowitz.github.io/pjscrape/

pjs.addSuite({
    // url to scrape
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    // selector to look for
    scraper: '#sortable_table_id_0 tr td:nth-child(2)'
});
// Output: ["Addison","Albany","Alburgh", ...]

If I launch it as is, the suites start. But if I only replace the url with anything else (e.g. https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Utah)
I get

RangeError: Maximum call stack size exceeded.

undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.

undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.

phantomjs://code/pjscrape.js:894 in global code
:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
RangeError: Maximum call stack size exceeded.

undefined:0 in injectJs
phantomjs://code/pjscrape.js:886
:0 in forEach
FATAL ERROR: No suites configured
Nothing else changed...

pjscrape with proxy

Hello
I run a pjscrape script using the proxy settings in PhantomJS, something like:

phantomjs --proxy=ip:port pjscrape.js my_config.js scraping.js

but I get the following error :

ERROR: Page did not load (status=fail)

Any idea? I use different proxy IPs and they work within the Firefox browser.

Thanks

Timeout for scrape()

Right now there's no great way to recover from a stalled scrape - setting a timeout in suite.scrape() could prevent this from killing the rest of the scrape.
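
A rough sketch of what a per-scrape timeout could look like on the PhantomJS side (the runScraper/onComplete names are hypothetical; this only illustrates racing the scrape against a timer):

// Sketch: give each scraper a fixed window, then move on even if it never returns.
function scrapeWithTimeout(runScraper, onComplete, ms) {
    var finished = false;
    var timer = setTimeout(function() {
        if (!finished) {
            finished = true;
            onComplete(undefined, new Error('scrape timed out'));  // recover instead of stalling the suite
        }
    }, ms);
    runScraper(function(result) {
        if (!finished) {
            finished = true;
            clearTimeout(timer);
            onComplete(result);
        }
    });
}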

Nested Scrapes

I'm trying to figure out how to do a nested scrape which relies on data from the first scraper in the second.

I'm pulling the Artist names from: http://www.billboard.com/artists/top-100?page=0
This part works:

pjs.addSuite({
    url: 'http://www.billboard.com/artists/top-100?page=0',
    scraper: function() {
        var artists=[]; 
        $(".artist-top-100 h1 > a").each(function(i,el){ 
            artists.push( {'name':$(el).text(), 'url':$(el).attr("href")} ); 
        }); 
        return artists;  
    }
});

I then want to go into the individual Artist's page and grab the top songs: http://www.billboard.com/artist/371422/taylor-swift

Individually this works too:

pjs.addSuite({
    url: 'http://www.billboard.com/artist/371422/taylor-swift',
    scraper: function() {
        var songs=[]; 
        $(".module_chart_position b").each(function(i,el){ 
            songs.push( $(el).text() ); 
        }); 
        return songs;  
    }
});

but what I want to get is the return from scrape #2 as a part of the return for scrape #1, so that it looks more like:

[
    {
      name: 'Taylor Swift',
      url: '/artist/371422/taylor-swift',
      songs: ['I Knew You Were Trouble', '22']
    }
    ...
]

When I try to nest, it says _name and _url are undefined.

pjs.addSuite({
    url: 'http://www.billboard.com/artists/top-100?page=0',
    scraper: function() {
        var artists=[]; 
        $(".artist-top-100 h1 > a:eq(0)").each(function(i,el){ 
            var _name = $(el).text();
            var _url = $(el).attr("href");
            artists.push( {'name':_name, 'url':_url} ); 
        });

        (function(_name,_url){
            pjs.addSuite({
                url: _url,
                scraper: function() {
                    var songs=[]; 
                    $(".module_chart_position b").each(function(i,el){ 
                        songs.push( $(el).text() ); 
                    }); 
                    return songs;  
                }
            });
        })(_name, _url);


        return artists;  
    }
});

Result: (error screenshot omitted)

I see the note on the documentation page about the private scope, and I don't quite understand how to apply the suggested evaluate. I guess the question is: is there a workaround for this, or is there another way to accomplish the above?
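
One workaround that avoids nesting suites entirely (a sketch, not a drop-in answer: the two kinds of records are joined on the URL in post-processing, and the selectors are taken from the snippets above):

pjs.addSuite({
    url: 'http://www.billboard.com/artists/top-100?page=0',
    moreUrls: function() {
        // follow each artist link found on the top-100 page
        return _pjs.getAnchorUrls('.artist-top-100 h1 > a');
    },
    maxDepth: 1,
    scraper: function() {
        if (window.location.href.indexOf('/artists/top-100') !== -1) {
            // top-100 page: emit one record per artist
            return $('.artist-top-100 h1 > a').map(function() {
                return { name: $(this).text(), url: $(this).attr('href') };
            }).toArray();
        }
        // artist page: emit this page's songs, keyed by its URL
        var songs = $('.module_chart_position b').map(function() {
            return $(this).text();
        }).toArray();
        return { url: window.location.pathname, songs: songs };
    }
});
// The artist records and the song records can then be joined on "url" afterwards.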

Saving JSON to external file

How do I specify and export scraped data into a file instead of printing it to the screen? Would be nice to have in the documentation. I am using PhantomJS 1.6. Thanks!

Scraping event called multiple times

Due to an issue with PhantomJS (I've tested 1.7.0 on both Mac OS X and Debian 6), issue 353, the page.open() callback is being called multiple times on some URLs. This is related to iframes being created within the page (you can find more details in the open issue).

pjscrape.js (master branch)

line 680 // run the scrape
line 681 page.open(url, function(status) {

Below is an example of what the log looks like when scraping is invoked many times:

xxxxx@ip-xxxxxxxxx:~/crawler$ phantomjs   --web-security=no --load-images=no --ignore-ssl-errors=yes ./pjscrape-600e20a/pjscrape.js  ./bin/pjscrape-600e20a/pjscrape.js ./config.js

Using config file: src/main/resources/com/apicube/crawler/pjscrape/config.js
* Suite 0 starting
* Opening http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items

I've applied a workaround in the meantime in order to stop the duplicated events. I use the visited array to know if a page was already visited. I added a condition before line 700, as you can see below:

pjscrape.js (master branch) line 700

if (visited[url]) {
    log.msg('Page recalled: ' + url);
    return;
}

// mark as visited
visited[url] = true;

Hope this helps to fix this bug.
Diego

FATAL ERROR: Config file not found

It is not possible to pass any extra arguments to the script, because there is an error evaluating the arguments:
phantomjs pjscrape/pjscrape.js pjs.config.js fooparm
FATAL ERROR: Config file not found: fooparm

Robots.txt

Any thoughts on obeying the robots.txt file?
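
A minimal sketch of what such a check could look like (it only considers Disallow lines and ignores User-agent sections entirely, which is a big simplification of the real format):

// Sketch: fetch /robots.txt yourself, then test a path against its Disallow rules.
function isAllowed(robotsTxt, path) {
    var disallowed = robotsTxt.split('\n')
        .map(function(line) { return line.trim(); })
        .filter(function(line) { return line.indexOf('Disallow:') === 0; })
        .map(function(line) { return line.replace('Disallow:', '').trim(); });
    // the path is allowed only if no Disallow prefix matches it
    return !disallowed.some(function(prefix) {
        return prefix && path.indexOf(prefix) === 0;
    });
}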

Output to multiple files when using multiple urls

Is it possible to output to multiple files when scraping multiple URLs? For example, if I'm scraping page1.html and page2.html is it possible to have each of the scrapes output to page1.csv and page2.csv instead of simply one concatenated output.csv?

How to keep the result in a variable?

Hello,

I would like to keep the result in a variable for subsequent manipulation. In JS terms, is there a complete event I can attach a function to?

Similar to:

pjs.addSuite({
    url: urlCPB,
    moreUrls: '#recherche > table > tbody > tr > td.torrent-aff > a',
    maxDepth: 1,
    scraper: function() {
        return {
            name: $('#center-middle > div.content > div.torrent > h1').text(),
            fileSize: $('#center-middle > div.content > div.torrent > div:nth-child(6) > fieldset > strong:nth-child(2)').text(),
            torrent: $('#center-middle > div.content > div.torrent > div:nth-child(8) > div:nth-child(2) > a').attr('href')
        };
    },
    complete: function(data){
        console.log('This is the final data object being scrapped: ' + data);
    }
});

Reference error _.pjs

pjs.addSuite({
    url: 'http://en.wikipedia.org/wiki/List_of_towns_in_Vermont',
    scraper: function() {
        return $('#sortable_table_id_0 tr').slice(1).map(function() {
            var name = $('td:nth-child(2)', this).text(),
                county = $('td:nth-child(3)', this).text(),
                // convert relative URLs to absolute
                link = _pjs.toFullUrl(
                    $('td:nth-child(2) a', this).attr('href')
                );
            return {
                model: "myapp.town",
                fields: {
                    name: name,
                    county: county,
                    link: link
                }
            }
        }).toArray(); // don't forget .toArray() if you're using .map()
    }
});

An example in the tutorials does not work anymore. I can't reference _pjs within the map function above. I get "ReferenceError: Can't find variable: _pjs" on the command line.

UPDATE: OK, non-issue, can be closed. Hope this helps somebody.

I was running my scraper as:
$ phantomjs pjscrape.js myscraper.js

Instead, you want all the contents of the zip/tarball you downloaded to be on the path:

$ phantomjs pjscrape/pjscrape.js myscraper.js  // here the pjscrape dir has all files from the zip or tarball

How to download files with pjscrape?

This project is really great, and easy to set up and get started with.

I once used it to download files by getting their URLs, but now I'm facing the problem that the URLs (to images) are tied to a session, and I can't just grab the URLs and fetch them later anymore. Ideally I should grab them while I scrape.

So, how can I download files from within the scraper (maybe with some writer other than JSON or STDOUT)? Is it even possible to use pjscrape to download files?

Phantomjs2.0

The phantom.args.length call used by pjscrape is outdated; phantom.args was removed in PhantomJS 2.0.

TypeError: undefined is not an object (evaluating 'phantom.args.length')

pjscrape.js:877 in global code
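
In PhantomJS 2.0 the arguments live on the system module instead, so the fix is roughly along these lines (a sketch, not a tested patch to pjscrape.js):

var system = require('system');
// system.args[0] is the script path, so the real arguments start at index 1,
// whereas the old phantom.args did not include the script name.
var args = system.args.slice(1);
if (args.length === 0) {
    console.log('FATAL ERROR: No config file specified');
    phantom.exit(1);
}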

preSuite(page) function

Add an option for a preSuite function, in the PhantomJS environment, with page passed in as an argument, to support things like session-based authentication before a scrape
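
A sketch of what the requested option might look like (preSuite is hypothetical; it does not exist in pjscrape today):

pjs.addSuite({
    url: 'http://www.example.com/members/',
    // hypothetical option: runs in the PhantomJS environment before any page in the suite is opened
    preSuite: function(page) {
        page.customHeaders = { 'Authorization': 'Basic ...' };  // e.g. session-based auth setup
    },
    scraper: function() {
        return $('title').text();
    }
});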

Infinite Scroll scraping

I'm interested in using pjscrape to scrape pages with an infinite scroll. Trying to figure out the best way to do this. I'm happy to try and implement this feature myself, but I'm curious if you've given this any thought and if so, how you would approach doing this.
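
Setting pjscrape aside for a moment, the scrolling part in plain PhantomJS might look like the sketch below; the URL, delay and scroll count are placeholders, and feeding the resulting HTML back into a pjscrape scraper is the open question:

var page = require('webpage').create();
page.open('http://www.example.com/feed', function(status) {
    var scrolls = 0;
    var interval = setInterval(function() {
        // scroll to the bottom so the page's infinite-scroll handler fires and loads more items
        page.evaluate(function() {
            window.scrollTo(0, document.body.scrollHeight);
        });
        scrolls += 1;
        if (scrolls >= 5) {    // placeholder: stop after a fixed number of scrolls
            clearInterval(interval);
            console.log(page.content.length + ' bytes of HTML after scrolling');
            phantom.exit();
        }
    }, 2000);                  // placeholder delay for the AJAX content to arrive
});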

Multipage objects

How can I combine data from number of pages into one object as if they were a single page (without post-processing)?

client side extraction with useragent?

I am trying to figure out if this script can run independently in a user's browser. By this I mean: could a URL be entered into the browser, by any browser, which in effect could execute the scrape using the user's IP, all without communicating back to a central host server?

Additionally, can one specify variables such as custom user agents and DNS to be used when such requests are made?

Using timeoutInterval causes PhantomJS to stop accepting cookies

If a value for timeoutInterval is provided, PhantomJS will eventually stop utilizing cookies. phantom.addCookie() will return false (as will page.addCookie()).

The larger the timeoutInterval, the fewer suites can be run before cookie functionality fails.

With a timeoutInterval of 3,000, cookies fail around suite 22. With an interval of 10,000 they fail at around 7 suites.

I believe the issue stems from the number of times page.evaluate() is called, but I'm not sure.

Scrape website

I'm trying to save the HTML of a site with this code, but get no result.
Any help would be much appreciated.

var scraper = function() {
    return $().html();
};

pjs.addSuite({
    url: 'http://www.expoquimia.com/exhibitors',
    moreUrls: function() {
        return _pjs.getAnchorUrls('li a');
    },
    maxDepth: 1,
    scraper: scraper
});

pjs.config({
    // options: 'stdout', 'file' (set in config.logFile) or 'none'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
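
For comparison, a scraper that returns the whole document markup could look like this sketch; $() with an empty selector matches nothing, so $().html() yields no markup, which may be why nothing comes out:

var scraper = function() {
    // document.documentElement.outerHTML is the full page markup as currently rendered
    return document.documentElement.outerHTML;
};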
