fallanic / cheers
Scrape a website efficiently, block by block, page by page. Based on cheerio and curl.
License: MIT License
Can we pass RegEx as value for extract key?
Can't run example code from readme:
events.js:72
throw er; // Unhandled 'error' event
^
Error: spawn ENOENT
at errnoException (child_process.js:1000:11)
at Process.ChildProcess._handle.onexit (child_process.js:791:34)
P.S. All example scripts have the same error. walgreens.js also prints these:
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
price:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
img:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
quantity:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
reviews:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
shipping:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
inStore:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
pickupInStore:{"selector":"","extract":"text"}
before the error above.
I have this structure
<div id="otherArticles">
<a href="/article1.html">
<div class="dateStr"> 01.01.1001</div>
<div class="titleStr"> article1Title </div>
</a>
<a href="/article2.html">
<div class="dateStr"> 02.02.2002 </div>
<div class="titleStr"> article2Title </div>
</a>
</div>
and this config:
{
  url: 'url',
  blockSelector: '#otherArticles',
  scrape: {
    link: {
      selector: 'a',
      extract: 'href'
    },
    date: {
      selector: '.dateStr',
      extract: 'text'
    },
    title: {
      selector: '.titleStr',
      extract: 'text'
    }
  }
}
I get this output:
[{
link: '/article1.html',
date: '01.01.1001 02.02.2002',
title: 'article1Title article2Title'
}]
As you can see, article1 and article2 are joined together. How can I scrape them separately so I get this:
[{
link: '/article1.html',
date: '01.01.1001',
title: 'article1Title'
},{
link: '/article2.html',
date: '02.02.2002',
title: 'article2Title'
}]
How can I scrape this structure? Is looping supported? What I need is a list of all the hrefs, in this case ['href1', 'href2', 'href3']:
<ul>
  <li><a href="href1"></a></li>
  <li><a href="href2"></a></li>
  <li><a href="href3"></a></li>
</ul>
Could you add an option to pass curl options? For example, I need to pass form data or cookies to the webpage.
Is it possible to support scraping from HTML, without making any HTTP request, in case you somehow obtained the HTML already? I was thinking maybe:
config.html = '<body><div>Im an HTML</div></body>'
and assigning priority, or throwing an error, in case both config.url and config.html are found.