Comments (3)
Hi Isaac,
I need to write some better documentation and/or change the API to better account for this use case -- it's not an edge case, but a relatively central one, and it's one that keeps tripping people up. So please accept my apologies for this not being clearer.
But first, I'm not sure which of two central cases you're going for here: Are you trying to scrape the data off the pages linked from that directory, or the data on the directory page itself?
If you're trying to scrape data off the pages linked from the directory, you need to give a CSS or XPath selector as the second argument to `Scraper.new`, i.e. `Scraper.new("http://whatever", "a.relevant-link")`. You can paginate automatically by setting `pagination` to true, `pagination_param` to `"page"` and `pagination_max` to whatever the max is.
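To make the pagination settings above concrete, here is a minimal sketch of how paginated index URLs could be generated from a parameter name and a maximum page count. `paginated_urls` is a hypothetical helper written for illustration, not Upton's actual implementation, and the values passed to it are assumptions.

```ruby
require 'uri'

# Hypothetical helper (not part of Upton): builds the URL of each index page
# by setting the pagination query parameter to 1..max_pages.
def paginated_urls(index_url, param: "page", max_pages: 3)
  (1..max_pages).map do |page_number|
    uri = URI.parse(index_url)
    # Drop any existing value of the pagination parameter, then append the new one.
    query = URI.decode_www_form(uri.query || "").reject { |key, _| key == param }
    query << [param, page_number.to_s]
    uri.query = URI.encode_www_form(query)
    uri.to_s
  end
end

urls = paginated_urls("http://shops.oscommerce.com/directory?country=US", max_pages: 2)
puts urls
```

Each generated URL would then be downloaded as an index page and searched with the link selector.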
from upton.
I'm trying to do both. This helped me solve the problem of scraping data off the pages linked from the directory. I wasn't aware that the Scraper object needed to target a link.
Now how do I then go about scraping the data on the directory page itself? I'm trying to get the `ul`s on the page by using `table + table ul li`. Scraper seems to recognize that there are 12 instances but doesn't give me any more info.
s = Upton::Scraper.new("http://shops.oscommerce.com/directory?country=US&page=1", "table + table ul li")
s.sleep_time_between_requests = 1
=> 1
s.verbose = true
=> true
s.scrape { |html| puts html }
-------
Stashing disabled. Will download from the internet.
Downloading from http://shops.oscommerce.com/directory?country=US&page=1
Downloaded http://shops.oscommerce.com/directory?country=US&page=1
sleeping 1 secs
Scraping 12 instances
=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]
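The array of `nil`s in this output is probably explained by the block itself: assuming `scrape` collects each block's return value (as the helper-based usage later in this thread suggests), a block that only calls `puts` returns `nil` for every instance. The sketch below uses a hypothetical stand-in method, `scrape_pages`, rather than Upton itself.

```ruby
# Hypothetical stand-in for Scraper#scrape: yields each page's HTML and
# returns an array of whatever the block returned.
def scrape_pages(pages)
  pages.map { |html| yield html }
end

pages = ["<li>Shop A</li>", "<li>Shop B</li>"]

printed = scrape_pages(pages) { |html| puts html } # puts always returns nil
kept    = scrape_pages(pages) { |html| html }      # block returns the HTML itself

p printed
p kept
```

Returning a value from the block, instead of only printing it, is what makes the result array useful.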
from upton.
Hey Isaac,
To scrape just a single page, put the URL in an array. (I should really write some helpers for this so it's clearer, and use them in the README (#TODO).) For example:
require 'upton'
require 'nokogiri'

s = Upton::Scraper.new(["http://website.com/whatever"])
# This block executes only once, with the HTML from the URL given to Scraper.new.
s.scrape do |instance_html|
  page = Nokogiri::HTML(instance_html)
  page.search("table + table ul li").each { |li| puts li.text }
end
You can also use a helper like this:
s = Upton::Scraper.new(["http://website.com/whatever"])
my_list = s.scrape(&Upton::Utils.list("table + table ul li"))
or even better
my_list = Upton::Scraper.new(["http://website.com/whatever"]).scrape(&Upton::Utils.list("table + table ul li"))
which'll just return an array of the contents of the `li` elements.
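For anyone unsure what the `&` in the helper call is doing, the sketch below shows the mechanism in plain Ruby: a method returns a Proc, and `&` passes that Proc as the block. Both `list_extractor` and `scrape_pages` are hypothetical stand-ins, and a regex replaces Nokogiri only to keep the example dependency-free; this is not how Upton's helper is implemented.

```ruby
# Stand-in for a helper like Upton::Utils.list: returns a Proc that extracts
# the text of every <li>. The regex is for illustration only and is not
# robust HTML parsing.
def list_extractor
  proc { |html| html.scan(%r{<li>(.*?)</li>}m).flatten.map(&:strip) }
end

# Hypothetical stand-in for Scraper#scrape: calls the block once per page
# and collects the block's return values.
def scrape_pages(pages, &block)
  pages.map(&block)
end

pages = ["<ul><li>Shop A</li><li>Shop B</li></ul>"]
my_list = scrape_pages(pages, &list_extractor)
p my_list
```

Since there is one result per page, scraping a single URL gives an outer array with one element, which itself holds the list items.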
There's no built-in way to scrape data from both the index page AND the instance page with one pass. You're not the first one to ask for it though. I think what I might do for the next release of Upton (#TODO) is have `scrape` yield an instance of `InstancePage` (which sounds awkward...) -- which would include the Nokogiri'd HTML, the plain HTML, the URL, and a reference to the Nokogiri'd index page, etc.
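As a rough sketch of that idea, the object yielded to the block might look something like the following. Everything here is hypothetical: the field names, the `Struct` layout, and the `scrape_with_context` stand-in are assumptions for illustration, and the parsed (Nokogiri) fields are left out to keep the example dependency-free.

```ruby
# Hypothetical shape for the proposed InstancePage: the plain HTML, the URL,
# and a reference back to the index page it was linked from.
InstancePage = Struct.new(:html, :url, :index_html, keyword_init: true)

# Hypothetical stand-in for a future scrape: yields one InstancePage per
# instance and collects the block's return values.
def scrape_with_context(index_html, instances)
  instances.map do |url, html|
    yield InstancePage.new(html: html, url: url, index_html: index_html)
  end
end

index_html = "<ul><li><a href='/shop-a'>Shop A</a></li></ul>"
instances  = { "http://website.com/shop-a" => "<h1>Shop A</h1>" }

results = scrape_with_context(index_html, instances) do |page|
  # The block can read instance data and index data in a single pass.
  { url: page.url, has_index: !page.index_html.empty?, html: page.html }
end
p results
```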
from upton.