scrape-html-web's Introduction

scrape-html-web

Extract content from a static HTML website.

Supports ESM and CJS. Requires Node >=16.

Unlike Puppeteer, for example, installing Scrape HTML Web does not download any version of Chromium. This keeps the library fast and light.

Access is not guaranteed for every website; it may depend on the authorization each site requires.

The library is asynchronous.

Note: two dependencies are included for the library to work:

  • axios - retrieves the web page
  • cheerio - extracts content from the page using the selectors you provide

Installation

To use Scrape HTML Web in your project, run:

npm install scrape-html-web
or
yarn add scrape-html-web

Usage

// ESM
import { scrapeHtmlWeb } from "scrape-html-web";
// or CommonJS
const { scrapeHtmlWeb } = require("scrape-html-web");

//example
const options = {
  url: "https://nodejs.org/en/blog/",
  bypassCors: true, // avoids CORS errors in ESM
  mainSelector: ".blog-index",
  childrenSelector: [
    { key: "date", selector: "time", type: "text" },
    // by default, the first option that is taken into consideration is att
    { key: "version", selector: "a", type: "text" },
    { key: "link", selector: "a", attr: "href" },
  ],
};

(async () => {
  const data = await scrapeHtmlWeb(options);
  console.log(data);
})();

Response

//Example response

[
  {
    date: "04 Nov",
    version: "Node v18.12.1 (LTS)",
    link: "/en/blog/release/v18.12.1/",
  },
  {
    date: "04 Nov",
    version: "Node v19.0.1 (Current)",
    link: "/en/blog/release/v19.0.1/",
  },
  // ...
  {
    date: "11 Jan",
    version: "Node v17.3.1 (Current)",
    link: "/en/blog/release/v17.3.1/",
  },
  {
    date: "11 Jan",
    version: "Node v12.22.9 (LTS)",
    link: "/en/blog/release/v12.22.9/",
  },
];
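
The link values in the example response are relative paths. If absolute URLs are needed, they can be resolved against the scraped page with the standard URL class. The following is a minimal sketch and not part of the library: toAbsoluteLinks is a hypothetical helper name, and the sample rows are copied from the response above.

```javascript
// Hypothetical helper (not part of scrape-html-web): resolves each
// relative `link` against the URL of the page that was scraped.
function toAbsoluteLinks(rows, baseUrl) {
  return rows.map((row) => ({
    ...row,
    link: new URL(row.link, baseUrl).href,
  }));
}

// Sample rows copied from the example response.
const rows = [
  { date: "04 Nov", version: "Node v18.12.1 (LTS)", link: "/en/blog/release/v18.12.1/" },
  { date: "04 Nov", version: "Node v19.0.1 (Current)", link: "/en/blog/release/v19.0.1/" },
];

const absoluteRows = toAbsoluteLinks(rows, "https://nodejs.org/en/blog/");
console.log(absoluteRows[0].link);
// https://nodejs.org/en/blog/release/v18.12.1/
```

The original rows are left untouched; each result is a shallow copy with only link rewritten.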

options

  • url - URL of the website to scrape (required)
  • bypassCors - true, or a custom proxy configuration, to bypass CORS errors in ESM
  • mainSelector - the main selector from which scraping starts (required)
  • list - whether to iterate through the list of elements under mainSelector (not required, default: true)
  • childrenSelector - an array of parameters defining the object you expect to receive (required)
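
The required fields and the default for list can be checked before calling the scraper. This is only a sketch under assumptions: normalizeOptions is a hypothetical helper, not a function exported by the library, and how the library itself reacts to missing fields is not documented here.

```javascript
// Hypothetical pre-flight check (not part of scrape-html-web): verifies the
// fields the list above marks as required and applies the documented
// default for `list`.
function normalizeOptions(options) {
  const { url, mainSelector, childrenSelector, list = true } = options;

  if (typeof url !== "string" || url.length === 0) {
    throw new TypeError("options.url is required");
  }
  if (typeof mainSelector !== "string" || mainSelector.length === 0) {
    throw new TypeError("options.mainSelector is required");
  }
  if (!Array.isArray(childrenSelector) || childrenSelector.length === 0) {
    throw new TypeError("options.childrenSelector must be a non-empty array");
  }

  // Return a copy so the caller's object is not modified.
  return { ...options, list };
}

const normalized = normalizeOptions({
  url: "https://nodejs.org/en/blog/",
  mainSelector: ".blog-index",
  childrenSelector: [{ key: "date", selector: "time", type: "text" }],
});
console.log(normalized.list); // true
```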

url

const options = {
  url: "https://nodejs.org/en/blog/", // URL from which you want to extract the data
  // ...
};

bypassCors

// custom proxy URL
const options = {
  bypassCors: {
    customURI: "https://api.allorigins.win/get?url=",
    paramExstract: "contents",
  }, // bypass CORS errors in ESM
  // ...
};

// or the default proxy URL
const options = {
  bypassCors: true,
  // ...
};

You can use the default URL or a custom one.

  1. If you pass true, the default URL is used: https://api.allorigins.win/get?url=

  2. You can also pass an object describing a custom URL, with the following parameters:

    - customURI: the custom URL (required)

    - paramExstract: the parameter to extract from the response of the call (not required)
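
To make the two forms concrete, here is a rough sketch of how a request URL could be derived from bypassCors and how paramExstract could be applied to the proxy response. Both helpers are hypothetical and the library's internals may differ. The sketch assumes that the target URL is URL-encoded and appended to the proxy prefix, and that paramExstract names the property of the JSON response that holds the HTML.

```javascript
// Default proxy prefix quoted in the text above.
const DEFAULT_PROXY = "https://api.allorigins.win/get?url=";

// Hypothetical helper: builds the URL that would actually be requested.
// Assumption: the target URL is URL-encoded and appended to the prefix.
function buildRequestUrl(url, bypassCors) {
  if (!bypassCors) return url; // no proxy requested
  const prefix = bypassCors === true ? DEFAULT_PROXY : bypassCors.customURI;
  return prefix + encodeURIComponent(url);
}

// Hypothetical helper: pulls the HTML out of the proxy response.
// Assumption: when `paramExstract` is set, the HTML sits under that key.
function extractHtml(responseBody, paramExstract) {
  return paramExstract ? responseBody[paramExstract] : responseBody;
}

const custom = {
  customURI: "https://api.allorigins.win/get?url=",
  paramExstract: "contents",
};

console.log(buildRequestUrl("https://nodejs.org/en/blog/", custom));
// https://api.allorigins.win/get?url=https%3A%2F%2Fnodejs.org%2Fen%2Fblog%2F

console.log(extractHtml({ contents: "<ul></ul>" }, custom.paramExstract));
// <ul></ul>
```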

mainSelector

const options = {
  // ...
  mainSelector: ".blog-index", // the parent selector where you want to start from
  // ...
};

Example HTML to extract from:

<ul class="blog-index">
    <li>
      <time datetime="2022-11-04T22:34:29+0000">04 Nov</time>
      <a href="/en/blog/release/v18.12.1/">Node v18.12.1 (LTS)</a>
    </li>
    <li>
      <time datetime="2022-11-04T18:05:19+0000">04 Nov</time>
      <a href="/en/blog/release/v19.0.1/">Node v19.0.1 (Current)</a>
    </li>
</ul>

list

const options = {
  // ...
  list: true, // or false
  // if false it will only loop once over the parent element
  // if true it will loop through all elements below the parent element
  // ...
};

childrenSelector

const options = {
  // ...
  childrenSelector: [
    { key: "date", selector: "time", type: "text" },
    { key: "version", selector: "a", type: "text" },
    { key: "link", selector: "a", attr: "href" },
  ],
};

  • key: the name of the key in the resulting object (required)

  • selector: the selector searched for in the HTML contained by the parent (required)

  • attr: the attribute you want to get (not required)

    Some of the more common attributes are: [ className, tagName, id, href, title, rel, src, style ]

  • type: the type of value to be obtained (not required, default: "text")

    possible values: [ text, html ]

    optional:

  • replace: replaces text or html inside a selector. It accepts either a RegExp or a custom function (not required)

  • canBeEmpty: allows the value of an element to be left blank; false by default (not required)

    Example: { key: "title", selector: ".title", type: "text", canBeEmpty: true }
    Example response: { title: "" } if the text in the selector is empty

replace

const options = {
  url: "https://nodejs.org/en/blog/",
  mainSelector: ".blog-index",
  childrenSelector: [
    {
      key: "date",
      selector: "time",
      type: "text",
      replace: (text) => text + " 2022",
      /* I pass a custom function that appends the
      text " 2022" to the date I get from the selector */
    },
    {
      key: "version",
      selector: "a",
      type: "html",
      replace: /[{()}]/g,
      /* I pass a regex to remove the parentheses
      (and curly braces) within the html */
    },
    {
      key: "link",
      selector: "a",
      attr: "href",
    },
  ],
};

(async () => {
  const data = await scrapeHtmlWeb(options);

  console.log("example 2 :", data);
})();

//Example response

[
  {
    date: "04 Nov 2022",
    version: '<a href="/en/blog/release/v18.12.1/">Node v18.12.1 LTS</a>',
    link: "/en/blog/release/v18.12.1/",
  },
  {
    date: "04 Nov 2022",
    version: '<a href="/en/blog/release/v19.0.1/">Node v19.0.1 Current</a>',
    link: "/en/blog/release/v19.0.1/",
  },
  // ...
  {
    date: "11 Jan 2022",
    version: '<a href="/en/blog/release/v17.3.1/">Node v17.3.1 Current</a>',
    link: "/en/blog/release/v17.3.1/",
  },
  {
    date: "11 Jan 2022",
    version: '<a href="/en/blog/release/v12.22.9/">Node v12.22.9 LTS</a>',
    link: "/en/blog/release/v12.22.9/",
  },
];
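
The two forms of replace can be pictured with a small stand-alone function. This is a sketch only: applyReplace is a hypothetical name, and it assumes that a RegExp removes every match, which is consistent with the example response above but is not confirmed against the library's source.

```javascript
// Hypothetical helper mirroring the two documented forms of `replace`.
// Assumption: a RegExp removes every match (replaces it with "").
function applyReplace(value, replace) {
  if (typeof replace === "function") return replace(value);
  if (replace instanceof RegExp) return value.replace(replace, "");
  return value; // no `replace` given: the value is unchanged
}

// Function form: append the year to the extracted date.
console.log(applyReplace("04 Nov", (text) => text + " 2022"));
// 04 Nov 2022

// RegExp form: strip parentheses and curly braces from the html.
console.log(
  applyReplace('<a href="/en/blog/release/v18.12.1/">Node v18.12.1 (LTS)</a>', /[{()}]/g)
);
// <a href="/en/blog/release/v18.12.1/">Node v18.12.1 LTS</a>
```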

❤️ Support

If you make any profit from this or you just want to encourage me, you can offer me a coffee and I'll try to accommodate you.

Please note: 🙏

This library was created for educational purposes. It is not intended for taking information without authorization to do so.

scrape-html-web's People

Contributors

batman110391

scrape-html-web's Issues

cheerio.load() expects a string

Hi! I'm continuously getting this error with the way my code is structured and wondering if you can point me in the right direction.

Trying to use it in a component that will pull the img url from zillow, but even when I'm using your test case I'm getting this error.

import { scrapeHtmlWeb } from "scrape-html-web";

export const getImageUrls = async (url) => {

    const options = {
        url: url,
        bypassCors: true,
        mainSelector:".blog-index",
        childrenSelector: [
        { key: "img", selector: "time", attr: "text" },

        ],
    };
    
    (async () => {
        const data = await scrapeHtmlWeb(options);
        console.log(data);
        // return data
    })();
}

import { useEffect, useState } from "react";
import { getImageUrls } from "../hooks/getImageUrls";

const ZillowImage = (url) => {

    const [imgUrl,setImageUrl] = useState(null);

    useEffect(() => {
        getImageUrls(url)
        .then((data) => {
            setImageUrl(data)
    });

        },[])
    
    return (!imgUrl ? (
        <div>Loading...</div>
        ) : (
            <div>
            <img src={imgUrl[0].img}></img>
            </div>
        )

    )

}

export default ZillowImage;

Thanks in advance if you have any suggestions on how to overcome this error.
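
Independently of the cheerio error, one thing is visible in the snippets above: getImageUrls wraps the call in an immediately invoked async function and never returns its result, so the promise it returns resolves to undefined and imgUrl is never set. Below is a minimal sketch of returning the data instead. It is not claimed to resolve the cheerio.load() error. fakeScrape is a hypothetical stand-in for scrapeHtmlWeb so the example runs without network access, and "text" is passed as type, as in the README, rather than as attr.

```javascript
// Hypothetical stand-in for scrapeHtmlWeb, so the sketch runs offline.
const fakeScrape = async (options) => [{ img: "04 Nov", from: options.url }];

// Returns the scraped rows to the caller instead of discarding them
// inside an IIFE. `scrape` is injectable only for this demonstration.
const getImageUrls = async (url, scrape = fakeScrape) => {
  const options = {
    url,
    bypassCors: true,
    mainSelector: ".blog-index",
    // The README passes "text" as `type`, not as `attr`.
    childrenSelector: [{ key: "img", selector: "time", type: "text" }],
  };
  return scrape(options);
};

getImageUrls("https://nodejs.org/en/blog/").then((data) => {
  console.log(data[0].img);
});
```

With the value returned, the .then((data) => ...) callback in the component receives the array rather than undefined.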
