
web-auto-extractor's Introduction

Web Auto Extractor

Parse semantically structured information from any HTML webpage.

Supported formats:

  • Encodings that support Schema.org vocabularies:
    • Microdata
    • RDFa-lite
    • JSON-LD
  • Arbitrary meta tags

Many websites mark up their pages with Schema.org vocabularies for better SEO. This library parses that information into JSON.

Demo it on tonicdev

Installation

npm install web-auto-extractor

// IF CommonJS
var WAE = require('web-auto-extractor').default
// IF ES6
import WAE from 'web-auto-extractor'

var parsed = WAE().parse(sampleHTML)

Let's use the following text as the sampleHTML in our example. It uses Schema.org vocabularies to describe a Product and is encoded in microdata format.

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="brand">ACME</span>
  <span itemprop="name">Executive Anvil</span>
  <img itemprop="image" src="anvil_executive.jpg" alt="Executive Anvil logo" />
  <span itemprop="description">Sleeker than ACME's Classic Anvil, the
    Executive Anvil is perfect for the business traveler
    looking for something to drop from a height.
  </span>
  Product #: <span itemprop="mpn">925872</span>
  <span itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4.4</span> stars, based on <span itemprop="reviewCount">89
      </span> reviews
  </span>

  <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    Regular price: $179.99
    <meta itemprop="priceCurrency" content="USD" />
    $<span itemprop="price">119.99</span>
    (Sale ends <time itemprop="priceValidUntil" datetime="2020-11-05">
      5 November!</time>)
    Available from: <span itemprop="seller" itemscope itemtype="http://schema.org/Organization">
                      <span itemprop="name">Executive Objects</span>
                    </span>
    Condition: <link itemprop="itemCondition" href="http://schema.org/UsedCondition"/>Previously owned,
      in excellent condition
    <link itemprop="availability" href="http://schema.org/InStock"/>In stock! Order now!
  </span>
</div>

The parsed object should look like this:

{
  "microdata": {
    "Product": [
      {
        "@context": "http://schema.org/",
        "@type": "Product",
        "brand": "ACME",
        "name": "Executive Anvil",
        "image": "anvil_executive.jpg",
        "description": "Sleeker than ACME's Classic Anvil, the\n    Executive Anvil is perfect for the business traveler\n    looking for something to drop from a height.",
        "mpn": "925872",
        "aggregateRating": {
          "@context": "http://schema.org/",
          "@type": "AggregateRating",
          "ratingValue": "4.4",
          "reviewCount": "89"
        },
        "offers": {
          "@context": "http://schema.org/",
          "@type": "Offer",
          "priceCurrency": "USD",
          "price": "119.99",
          "priceValidUntil": "5 November!",
          "seller": {
            "@context": "http://schema.org/",
            "@type": "Organization",
            "name": "Executive Objects"
          },
          "itemCondition": "http://schema.org/UsedCondition",
          "availability": "http://schema.org/InStock"
        }
      }
    ]
  },
  "rdfa": {},
  "jsonld": {},
  "metatags": {
    "priceCurrency": [
      "USD",
      "USD"
    ]
  }
}

The parsed object contains four objects: microdata, rdfa, jsonld and metatags. Since the HTML above has no information encoded in RDFa or JSON-LD, those two objects are empty.
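As a rough illustration of working with that shape, the sketch below reads values out of a result object. The `parsed` literal is a trimmed stand-in for the output shown above, and `nonEmptySections` and `firstOfType` are hypothetical helper names, not part of the library.

```javascript
// Trimmed stand-in for the value returned by WAE().parse(sampleHTML).
const parsed = {
  microdata: {
    Product: [
      {
        '@type': 'Product',
        name: 'Executive Anvil',
        offers: { '@type': 'Offer', price: '119.99', priceCurrency: 'USD' },
      },
    ],
  },
  rdfa: {},
  jsonld: {},
  metatags: { priceCurrency: ['USD', 'USD'] },
};

// Names of the top-level sections that actually contain data.
function nonEmptySections(result) {
  return Object.keys(result).filter((key) => Object.keys(result[key]).length > 0);
}

// First item of a given Schema.org type in a section, or undefined.
function firstOfType(result, section, type) {
  const items = (result[section] || {})[type];
  return Array.isArray(items) ? items[0] : undefined;
}

const product = firstOfType(parsed, 'microdata', 'Product');
console.log(nonEmptySections(parsed)); // [ 'microdata', 'metatags' ]
console.log(product.offers.price);     // 119.99
```

Note that every extracted value is a string, so numeric fields such as price need converting by the caller.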

Caveat

This is less a caveat than a design decision: the parser is strict. If the HTML is not marked up correctly, the output may not be what you expect, which can make the parser look broken.

For example, take the following HTML snippet.

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Ghostbusters</h1>
  <div itemprop="productionCompany" itemscope itemtype="http://schema.org/Organization">Black Rhino</div>
  <div itemprop="countryOfOrigin" itemscope itemtype="http://schema.org/Country">
    Country: <span itemprop="name" content="USA">United States</span>
  </div>
</div>

The problem is that the productionCompany itemprop, whose itemtype is Organization, has no itemprop among its children (in this case, name).

The parser assumes that every itemtype contains an itemprop or, for RDFa, that every typeof contains a property. So the "Black Rhino" information is lost.

It would be nice to fix this with a non-strict parsing mode. PRs are welcome.
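One possible shape for such a mode, purely as a sketch: when a nested item ends up with no properties of its own, fall back to the element's text. The function `resolveNestedItem` and its `strict` option are hypothetical and do not exist in the library, and returning plain text drops the item's type, which a real implementation might prefer to keep.

```javascript
// Hypothetical helper: decide what value a nested item contributes to its parent.
// `item` is the object built for the nested scope, `textContent` the element's text.
function resolveNestedItem(item, textContent, { strict = true } = {}) {
  // Keys such as @context and @type are bookkeeping, not properties.
  const ownProps = Object.keys(item).filter((key) => !key.startsWith('@'));
  if (strict || ownProps.length > 0) return item;

  // Non-strict fallback: use the element text when no itemprop was found.
  const text = textContent.trim();
  return text === '' ? item : text;
}

const company = { '@context': 'http://schema.org/', '@type': 'Organization' };
console.log(resolveNestedItem(company, 'Black Rhino', { strict: false })); // Black Rhino
```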

License

MIT

web-auto-extractor's People

Contributors

addnab, andypang, bartleusink, dwolters, lakshmirajagopalan, manojlds

web-auto-extractor's Issues

Error in jsonld parse

Hey there,

How can I prevent this error thrown by jsonld-parser.js?
The input HTML contains no JSON-LD, which is why w-a-e throws the error.

Maybe w-a-e should first check whether JSON-LD is present, because the error message is misleading when there is no JSON-LD in the HTML, right?

Error in jsonld parse - SyntaxError: Unexpected end of JSON input
POST /detailpage 200 4729.179 ms - 678
POST /detailpage 200 5661.727 ms - 2388
Error in jsonld parse - SyntaxError: Unexpected end of JSON input
POST /detailpage 200 4616.781 ms - 664
Error in jsonld parse - SyntaxError: Unexpected end of JSON input
POST /detailpage 200 4659.136 ms - 1339
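For context, JSON.parse('') throws exactly this "Unexpected end of JSON input" error, so one plausible cause is an empty or truncated ld+json script block rather than a missing one. A minimal sketch of a guard, assuming the caller already has the text of each script block; `parseJsonLdBlocks` is a hypothetical name, not the library's parser:

```javascript
// Parse the text of each <script type="application/ld+json"> block.
// Blank blocks are skipped instead of being handed to JSON.parse, and
// failures are collected so the caller decides whether to log them.
function parseJsonLdBlocks(blocks) {
  const items = [];
  const errors = [];
  blocks.forEach((text, index) => {
    if (typeof text !== 'string' || text.trim() === '') return;
    try {
      const parsed = JSON.parse(text);
      items.push(...(Array.isArray(parsed) ? parsed : [parsed]));
    } catch (e) {
      errors.push({ index, message: e.message });
    }
  });
  return { items, errors };
}

const result = parseJsonLdBlocks(['', '{"@type":"Organization"}', '{"@type":']);
console.log(result.items.length);  // 1
console.log(result.errors.length); // 1
```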

Extraction of arrays

I'm using the web auto extractor to retrieve recipes embedded as microdata. Unfortunately, the extractor does not extract the ingredient list as an array. Instead, the ingredients overwrite each other in the object, so only the last ingredient remains.

Example:

<ul class="ingredientList">
  <li itemprop="ingredients">&frac12; cup butter</li>
  <li itemprop="ingredients">&frac12; cup powdered sugar</li>
  <li itemprop="ingredients">&frac12; cup chocolate chips</li>
</ul>

The web auto extractor only extracts:
{ ..., ingredients : "&frac12; cup chocolate chips", ...}
instead of:
{ ..., ingredients : ["&frac12; cup butter","&frac12; cup powdered sugar","&frac12; cup chocolate chips"], ...}

An example page with recipe microdata can be found here: getmecooking.com

Is this behavior by design?

The official schema.org Recipe example allows this kind of repeated declaration of ingredients.
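A minimal sketch of the behavior asked for here: the first value for a property is stored as-is, and repeated values promote it to an array. `addProperty` is a hypothetical helper for illustration and is not taken from the library's source.

```javascript
// Store `value` under `key`, promoting to an array when the key repeats.
function addProperty(scope, key, value) {
  if (!Object.prototype.hasOwnProperty.call(scope, key)) {
    scope[key] = value;                // first occurrence: plain value
  } else if (Array.isArray(scope[key])) {
    scope[key].push(value);            // third and later occurrences
  } else {
    scope[key] = [scope[key], value];  // second occurrence: promote to array
  }
  return scope;
}

const recipe = {};
['½ cup butter', '½ cup powdered sugar', '½ cup chocolate chips'].forEach((text) => {
  addProperty(recipe, 'ingredients', text);
});
console.log(recipe.ingredients.length); // 3
```

The trade-off is that a property is a string when it appears once and an array when it repeats, so consumers have to handle both shapes.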

Remove lodash

You are using 1% of lodash... is it worth including?

_.find
_.keys
_.isArray

can all be switched to

arr.find
Object.keys(obj)
Array.isArray
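One thing to check before swapping: the lodash helpers accept null or undefined input, whereas Object.keys(null) throws a TypeError and arr.find needs a real array. The sketch below shows null-safe native equivalents; `findIn`, `keysOf` and `isList` are hypothetical names used only for this example.

```javascript
// Null-safe native stand-ins for the three lodash helpers listed above.
const findIn = (list, predicate) => (Array.isArray(list) ? list.find(predicate) : undefined);
const keysOf = (obj) => (obj == null ? [] : Object.keys(obj));
const isList = Array.isArray;

const items = [{ '@type': 'Product' }, { '@type': 'Offer' }];
console.log(findIn(items, (item) => item['@type'] === 'Offer')); // { '@type': 'Offer' }
console.log(keysOf({ microdata: {}, rdfa: {} }));                // [ 'microdata', 'rdfa' ]
console.log(keysOf(null));                                       // []
console.log(isList('not an array'));                             // false
```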

Can't use web-auto-extractor with browserify

Hello,

If I take the example from the README and run:
browserify parser.js -o bundle.js
It fails and writes only:


^
ParseError: Unexpected character ' '

I tried another plugin for parsing schema.org. It works with browserify, but its usage is horrible, and so is its documentation :/

Any idea how to fix it?

Thanks

`currentScope[attribs[PROP]].push is not a function` error when using `parse` even when wrapped in a try/catch

When I try to parse this page (https://www.discountnaturalhealth.com/bilberry-eye-support-advanced-30-tablets-blackmores/), I get the following error:

contentScript.bundle.js:190 TypeError: currentScope[attribs[PROP]].push is not a function
    at Object.onopentag (contentScript.bundle.js:146442:37)
    at Parser.onopentagend (contentScript.bundle.js:48652:23)
    at Tokenizer._stateBeforeAttributeName (contentScript.bundle.js:49170:19)
    at Tokenizer._parse (contentScript.bundle.js:49664:18)
    at Tokenizer.write (contentScript.bundle.js:49639:10)
    at Tokenizer.end (contentScript.bundle.js:49817:21)
    at Parser.end (contentScript.bundle.js:48833:21)
    at exports.default (contentScript.bundle.js:146528:44)
    at Object.parse (contentScript.bundle.js:146253:48)
    at isCategoryPage (contentScript.bundle.js:143319:87)

You can see in the W3C validator that this page has some errors, including a stray </i> tag: https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.discountnaturalhealth.com%2Fbilberry-eye-support-advanced-30-tablets-blackmores%2F

Here's the code you can use to reproduce the issue:

try {
  const parsed = WAE().parse(document?.body?.innerHTML);
} catch {
}

I would expect the parse to fail because of the bad HTML, but I would also expect the try/catch to catch the failure rather than letting the error propagate.

Improve parsing microdata when itemprop contains multiple space-separated properties

Thanks for creating such a great project!

I ran into a bug parsing microdata content where itemprop contained multiple properties, like in these examples and thought I'd share what I ran into:

<meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-07-21T09:00:06.000Z"/>
<span itemProp="publisher copyrightHolder provider sourceOrganization" itemscope="" itemType="http://schema.org/NewsMediaOrganization" itemID="https://www.nytimes.com">
<figure itemprop="associatedMedia image" itemscope itemtype="http://schema.org/ImageObject" data-component="image" class="element element-image img--landscape  fig--narrow-caption fig--has-shares " data-media-id="f82028d62b1edd7417d7d3773c4abf0d4fa86174" id="img-3">
  <meta itemprop="url" content="https://i.guim.co.uk/img/media/f82028d62b1edd7417d7d3773c4abf0d4fa86174/0_272_6435_3861/master/6435.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=016df6a3f33eabe3cbca39eb389a60fb">
</figure>

Markup like this is parsed correctly in Google's Structured Data Testing Tool, but web-auto-extractor does not currently split input based on spaces.

I resolved this in a project which uses web-auto-extractor by doing this:

const __transformStructuredData = (structuredData) => {
  let result = structuredData
  Object.keys(result.microdata).forEach(schema => {
    result.microdata[schema].forEach(object => {
      Object.keys(object).forEach(key => {
        if (key.includes(' ')) {
          key.split(' ').forEach(newKey => {
            object[newKey] = object[key]
          })
          delete object[key]
        }
      })
    })
  })
  return result
}

I'm aware there are some other open PRs related to whitespace trimming.

If an enhancement like this appeals I'd be happy to raise a PR.
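A variant of the workaround above, as a sketch only: it splits on any run of whitespace (so double spaces do not produce empty keys) and recurses into nested items, which the version above does not. `splitPropertyNames` and `expandCompoundKeys` are hypothetical names, and the function returns a new object rather than mutating its input.

```javascript
// Split an itemprop value such as "datePublished  dateCreated" into names.
function splitPropertyNames(value) {
  return value.trim().split(/\s+/).filter(Boolean);
}

// Return a copy of `node` in which every compound key is expanded into
// one key per property name, at every nesting level.
function expandCompoundKeys(node) {
  if (Array.isArray(node)) return node.map(expandCompoundKeys);
  if (node === null || typeof node !== 'object') return node;
  const out = {};
  for (const [key, value] of Object.entries(node)) {
    const expanded = expandCompoundKeys(value);
    // A key that is empty or only whitespace yields no names and is dropped.
    for (const name of splitPropertyNames(key)) out[name] = expanded;
  }
  return out;
}

const item = { 'datePublished  dateCreated': '2019-07-21T09:00:06.000Z' };
console.log(Object.keys(expandCompoundKeys(item))); // [ 'datePublished', 'dateCreated' ]
```

It can be applied to the whole parse result, for example expandCompoundKeys(parsed.microdata).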

Parsing microdata strips spaces

Given the following HTML:

<div itemscope itemtype="http://schema.org/Product"><h1 itemprop="name"><span>Foo</span> Bar</h1></div>

I would expect the library to extract a Product with the name of Foo Bar, but it extracts FooBar omitting the space.

Do you think this would be an easy fix?
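The cause is not stated in this issue. One guess is that each text node is trimmed before the pieces are joined. Under that assumption, the sketch below contrasts per-chunk trimming with joining first and normalizing whitespace once at the end. Both functions are hypothetical and operate on an array of text chunks rather than on the library's internals.

```javascript
// Suspected behavior (an assumption): trim every chunk, then join.
function joinTrimmedChunks(chunks) {
  return chunks.map((chunk) => chunk.trim()).join('');
}

// Alternative: join the raw chunks, then collapse whitespace once.
function joinThenNormalize(chunks) {
  return chunks.join('').replace(/\s+/g, ' ').trim();
}

// Text chunks of <h1 itemprop="name"><span>Foo</span> Bar</h1>
const chunks = ['Foo', ' Bar'];
console.log(joinTrimmedChunks(chunks)); // FooBar
console.log(joinThenNormalize(chunks)); // Foo Bar
```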

babel seems to mess with the ES export

Hi!

// in src/index.js
export default function

becomes

// in dist/index.js
exports.default = function () {

so the ES import currently needs to look like this:

import WAE from 'web-auto-extractor'
WAE.default().parse(html)

instead of

import WAE from 'web-auto-extractor'
WAE().parse(html)

which is what the readme says.

That could maybe be solved by making ES imports directly import the src files, something like this:

// in package.json
{
    "exports": {
        "import": "./src/index.js",
        "require": "./dist/index.js"
    }
}

see documentation
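Until the packaging changes, a small interop helper on the caller's side can hide the difference. This is a sketch: `resolveDefault` is a hypothetical name, and the stand-in modules below only imitate the two export shapes described in this issue.

```javascript
// Return the callable factory whether the module exposes it directly
// or under a `default` property (the Babel-compiled CommonJS shape).
function resolveDefault(mod) {
  return mod && typeof mod.default === 'function' ? mod.default : mod;
}

// Stand-ins for the two shapes an import might produce.
const factory = () => ({ parse: (html) => ({ length: html.length }) });
const compiledModule = { default: factory };

console.log(resolveDefault(compiledModule) === factory); // true
console.log(resolveDefault(factory) === factory);        // true

// Intended use (not executed here):
//   import imported from 'web-auto-extractor'
//   const WAE = resolveDefault(imported)
//   WAE().parse(html)
```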

Second ld+json block not shown in output

Hi,

There are two ld+json scripts at https://www.rakuten.de/produkt/ncm-milano-plus-48vtrekking-e-bike-16ah-768wh-panasonic-zellen-akku-plus-weiss-26-V2998816670.

The first one is an Organization and the second is a Product, but the parser only returns the Organization.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "name": "Rakuten"
}
</script><script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "NCM Milano Plus 48V,Trekking E-Bike, 16Ah 768Wh Panasonic Zellen Akku",
  "category": "Elektro Räder",
  "url": "https://www.rakuten.de/produkt/ncm-milano-plus-48vtrekking-e-bike-16ah-768wh-panasonic-zellen-akku-plus-weiss-26-V2998816670",
  "image": "https://files.rakuten-static.de/dbfcb25a05c4143dc35c09b98afed7f7/images/0b67f328ad3a324dd098bca6a16dab1f.jpg",
  "offers": {
    "@type": "Offer",
    "price": "1599",
    "priceCurrency": "EUR"
  }
  ,"brand": [
    {
      "@type": "Brand",
      "name": "NCM",
      "image": "https://files.rakuten-static.de/brands/ncm-b122721.png"
    }
  ]
}
}
</script>

Make LD parser more resilient

$html('script[type="application/ld+json"]').each((index, item) => {
  try {
    let parsedJSON = JSON.parse($(item).text())
    if (!Array.isArray(parsedJSON)) {
      parsedJSON = [parsedJSON]
    }
    parsedJSON.forEach(obj => {
      const type = obj['@type']
      jsonldData[type] = jsonldData[type] || []
      jsonldData[type].push(obj)
    })
  } catch (e) {
    console.log(`Error in jsonld parse - ${e}`)
  }
})

The current JSON-LD parser assumes a perfect-world scenario.

Here is how I've implemented an LD+JSON parser in my local project:

(html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);

  const nodes = Object.values(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes new-lines, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is surrounded by CDATA comments, e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};

Thus far it works with all the sites I have been testing.
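A dependency-free sketch of the same clean-up steps, for readers who want to apply them to a string they already hold. It differs from the code above in two ways that are my own assumptions: it also accepts a top-level JSON array, and it replaces raw line breaks with a space instead of deleting them. `extractJsonPayload` and `parseLooseJsonLd` are hypothetical names.

```javascript
// Cut the JSON payload out of a script body that may be wrapped in
// CDATA markers or comments. Returns null when no payload is found.
function extractJsonPayload(body) {
  const text = body
    .replace(/(\/\/\s*)?<!\[CDATA\[/g, '')
    .replace(/(\/\/\s*)?\]\]>/g, '');
  const starts = [text.indexOf('{'), text.indexOf('[')].filter((i) => i !== -1);
  if (starts.length === 0) return null;
  const start = Math.min(...starts);
  const closer = text[start] === '{' ? '}' : ']';
  const end = text.lastIndexOf(closer);
  return end > start ? text.slice(start, end + 1) : null;
}

// Parse a loosely formatted ld+json body into an array of objects.
// Returns an empty array when the body cannot be parsed.
function parseLooseJsonLd(body) {
  const payload = extractJsonPayload(body);
  if (payload === null) return [];
  // Raw line breaks inside string literals are invalid JSON.
  const cleaned = payload.replace(/[\r\n]+/g, ' ');
  try {
    const parsed = JSON.parse(cleaned);
    return Array.isArray(parsed) ? parsed : [parsed];
  } catch (e) {
    return [];
  }
}

const wrapped = '//<![CDATA[\n{"@type":"NewsArticle","headline":"A\nB"}\n//]]>';
console.log(parseLooseJsonLd(wrapped)[0].headline); // A B
```

Swallowing the parse error keeps one bad block from affecting the others. Returning the error alongside the data would be the more debuggable choice.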

Inconsistent Object Keys

Hi,

Thanks for the great library. Because of limited time, I am not able to submit a pull request, but I think it's worth raising an issue here. The keys of the object created from the data are taken directly from the HTML. For example, if a page uses itemprop="Description", the microdata gets a Description key, but if it uses itemprop="description", the key is the lowercase description.

Although the burden of using a valid schema with the proper case lies with the webmaster of the site, it would be nice if WAE could normalize the keys before adding them to the object.

Currently I work around this in my code with the following recursive function:

const _ = require('lodash');

// Recursively lower-case every string key of an object or array.
const lowerCaseKeys = function (obj) {
  return _.transform(obj, (result, value, key) => {
    const k = _.isString(key) ? key.toLowerCase() : key;
    // _.isObject is also true for arrays, so one check covers both.
    result[k] = _.isObject(value) ? lowerCaseKeys(value) : value;
  });
};

I use it for small objects, so it's not a problem for me yet, but if anyone else wishes to use it, be cautious: it recurses once per nesting level, so deeply nested data adds a stack frame per level. If the keys were normalized before being added to the object, this extra processing would be unnecessary.

I am happy to help further if the Library Owner wants/needs.

Title, redirects, and user-agent

My team is comparing this package, web-auto-extractor, with node-metainspector and meta-extractor.

TITLE - web-auto-extractor seems to return all the meta tags including og tags, along with the LRMI tags. Great. But, it doesn't seem to return some basic HTML elements like 'Title', which is actually important. Was this intentional? Perhaps I'm overlooking something. Perhaps the intent of this package is different than what I'm trying to do.

REDIRECTS - node-metainspector will also follow redirects up to 5 redirects I believe. This is nice and many sites do actually redirect as their main page changes. Does web-auto-extractor have the ability to follow redirects?

USER-AGENT - Both other packages allow the user-agent to be set. Does web-auto-extractor have this ability too?

Thanks.
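The README only shows parse() taking an HTML string, so fetching and the page title would presumably be handled by the caller. A sketch under that assumption, using the global fetch available in Node 18+: `extractTitle` and `fetchHtml` are hypothetical helpers, the user-agent string is a placeholder, and the regex is a rough approximation rather than a real HTML parser.

```javascript
// Pull the text of the first <title> element out of an HTML string.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].replace(/\s+/g, ' ').trim() : null;
}

// Download a page, following redirects and sending a custom User-Agent.
async function fetchHtml(url, userAgent = 'example-bot/1.0') {
  const response = await fetch(url, {
    redirect: 'follow', // the default, spelled out here for clarity
    headers: { 'User-Agent': userAgent },
  });
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  return response.text();
}

console.log(extractTitle('<html><head><title> Executive\n Anvil </title></head></html>'));
// Executive Anvil

// Intended use (not executed here):
//   const html = await fetchHtml('https://example.com/')
//   const result = { title: extractTitle(html), ...WAE().parse(html) }
```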

<meta http-equiv>

Do you mind adding 'http-equiv' to the list of recognized meta-tag keys?

return ['name', 'property', 'itemprop', 'http-equiv'].indexOf(attr) !== -1;
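To show what the extended list would change, here is a sketch that picks the key attribute of a meta tag from its attribute map. `META_KEY_ATTRS` and `metaTagEntry` are hypothetical and only mirror the one-line check above; the library's actual code may be structured differently.

```javascript
// Attributes that can name a meta tag, including the proposed http-equiv.
const META_KEY_ATTRS = ['name', 'property', 'itemprop', 'http-equiv'];

// Turn a meta tag's attributes into a { key, value } pair, or null when
// the tag has no recognized key attribute or no content.
function metaTagEntry(attribs) {
  const keyAttr = Object.keys(attribs).find((attr) => META_KEY_ATTRS.indexOf(attr) !== -1);
  if (!keyAttr || attribs.content === undefined) return null;
  return { key: attribs[keyAttr], value: attribs.content };
}

console.log(metaTagEntry({ 'http-equiv': 'refresh', content: '5' })); // { key: 'refresh', value: '5' }
console.log(metaTagEntry({ charset: 'utf-8' }));                      // null
```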
