fallanic / cheers
Scrape a website efficiently, block by block, page by page. Based on cheerio and curl.
License: MIT License
Can we pass RegEx as value for extract key?
Can't run example code from readme:
events.js:72
throw er; // Unhandled 'error' event
^
Error: spawn ENOENT
at errnoException (child_process.js:1000:11)
at Process.ChildProcess._handle.onexit (child_process.js:791:34)
P.S. All example scripts have the same error. walgreens.js also prints these:
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
price:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
img:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
quantity:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
reviews:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
shipping:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
inStore:{"selector":"","extract":"text"}
Incorrect scraper mapping, each item must have a 'selector' and an 'extract' attribute.
pickupInStore:{"selector":"","extract":"text"}
before the error above.
I have this structure
<div id="otherArticles">
<a href="/article1.html">
<div class="dateStr"> 01.01.1001</div>
<div class="titleStr"> article1Title </div>
</a>
<a href="/article2.html">
<div class="dateStr"> 02.02.2002 </div>
<div class="titleStr"> article2Title </div>
</a>
</div>
and this config:
{
  url: 'url',
  blockSelector: '#otherArticles',
  scrape: {
    link: {
      selector: 'a',
      extract: 'href'
    },
    date: {
      selector: '.dateStr',
      extract: 'text'
    },
    title: {
      selector: '.titleStr',
      extract: 'text'
    }
  }
}
I get this output:
[{
link: '/article1.html',
date: '01.01.1001 02.02.2002',
title: 'article1Title article2Title'
}]
As you can see, article1 and article2 are joined together. How can I scrape them separately so I get this:
[{
link: '/article1.html',
date: '01.01.1001',
title: 'article1Title'
},{
link: '/article2.html',
date: '02.02.2002',
title: 'article2Title'
}]
How can I scrape this structure? Is looping supported? What I need is a list of all the hrefs, in this case ['href1', 'href2', 'href3']:
<ul>
  <li><a href="href1"></a></li>
  <li><a href="href2"></a></li>
  <li><a href="href3"></a></li>
</ul>
Could you add an option to pass curl options? For example, I need to pass form data or cookies to the webpage.
Is it possible to support scraping from HTML, without making any HTTP request, in case you somehow obtained the HTML already? I was thinking maybe:
config.html = '<body><div>Im an HTML</div></body>'
and assigning priority, or throwing an error, in case both config.url and config.html are found.