mozilla / page-metadata-parser

DEPRECATED - A Javascript library for parsing metadata on a web page.

Home Page: https://www.npmjs.com/package/page-metadata-parser

License: Mozilla Public License 2.0

JavaScript 100.00%

abandoned unmaintained

page-metadata-parser's Introduction

Page Metadata Parser

A Javascript library for parsing metadata in web pages.

Overview

Purpose

The purpose of this library is to be able to find a consistent set of metadata for any given web page. Each individual kind of metadata has many rules which define how it may be located. For example, a description of a page could be found in any of the following DOM elements:

<meta name="description" content="A page's description"/>

<meta property="og:description" content="A page's description" />

Because different web pages represent their metadata in any number of possible DOM elements, the Page Metadata Parser collects rules for different ways a given kind of metadata may be represented and abstracts them away from the caller.

The output of the metadata parser for the above example would be

{description: "A page's description"}

regardless of which particular kind of description tag was used.

Supported schemas

This library employs parsers for the following formats:

opengraph

twitter

meta tags

Requirements

This library is meant to be used either in the browser (embedded directly in a website or into a browser addon/extension) or on a server (node.js).

The parser depends only on the Node URL library or the Browser URL library.

Each function expects to be passed a Document object, which may be created either directly by a browser or on the server using a Document compatible object, such as that provided by domino.

Usage

Installation

npm install --save page-metadata-parser

Usage in the browser

The library can be built to be deployed directly to a modern browser by using

npm run bundle

and embedding the resultant js file directly into a page like so:

<script src="page-metadata-parser.bundle.js" type="text/javascript"></script>

<script>

  const metadata = metadataparser.getMetadata(window.document, window.location);

  console.log("The page's title is ", metadata.title);

</script>

Usage in node

To use the library in node, you must first construct a DOM API compatible object from an HTML string, for example:

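const {getMetadata} = require('page-metadata-parser');
const domino = require('domino');

// `await` is only valid inside an async function in CommonJS modules.
async function main() {
  const url = 'https://github.com/mozilla/page-metadata-parser';
  const response = await fetch(url); // global fetch requires Node 18+
  const html = await response.text();
  const doc = domino.createWindow(html).document;
  const metadata = getMetadata(doc, url);
  console.log(metadata);
}

main().catch(console.error);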

Metadata Rules

Rules

A single rule instructs the parser on a possible DOM node to locate a specific piece of content.

For instance, a rule to parse the title of a page found in a DOM tag like this:

<meta property="og:title" content="Page Title" />

Would be represented with the following rule:

['meta[property="og:title"]', element => element.getAttribute('content')]

A rule consists of two parts, a query selector compatible string which is used to look up the target content, and a callable which receives an element and returns the desired content from that element.

Many rules together form a Rule Set. This library will apply each rule to a page and choose the 'best' result. The order in which rules are defined indicates their preference, with the first rule being the most preferred. A Rule Set can be defined like so:

const titleRules = {
  rules: [
    ['meta[property="og:title"]', element => element.getAttribute('content')],
    ['title', element => element.text],
  ]
};

In this case, the OpenGraph title will be preferred over the title tag.
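The preference order described above can be sketched as a loop that returns the first rule that yields content. This is only an illustration, not the library's actual implementation: applyRuleSet and fakeDocument are hypothetical names, and the fake document exists only so the sketch runs without a DOM library.

```javascript
// Hypothetical helper, not the library's implementation: return the value
// produced by the first rule whose selector matches and yields content.
function applyRuleSet(doc, ruleSet) {
  for (const [selector, getContent] of ruleSet.rules) {
    const element = doc.querySelector(selector);
    if (!element) continue;
    const value = getContent(element);
    if (value) return value;
  }
  return undefined;
}

// Stand-in for a Document (assumption): looks elements up by exact selector.
function fakeDocument(elementsBySelector) {
  return {querySelector: selector => elementsBySelector[selector] || null};
}

const titleRules = {
  rules: [
    ['meta[property="og:title"]', element => element.getAttribute('content')],
    ['title', element => element.text],
  ],
};

const docWithOg = fakeDocument({
  'meta[property="og:title"]': {getAttribute: () => 'OpenGraph Title'},
  'title': {text: 'Title Tag'},
});
const docWithoutOg = fakeDocument({'title': {text: 'Title Tag'}});

console.log(applyRuleSet(docWithOg, titleRules));
console.log(applyRuleSet(docWithoutOg, titleRules));
```

With both tags present the OpenGraph rule wins. When only the title tag exists, the second rule supplies the value.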

This library includes many rules for a single desired piece of metadata which should allow it to consistently find metadata across many types of pages. This library is meant to be a community driven effort, and so if there is no rule to find a piece of information from a particular website, contributors are encouraged to add new rules!

Built-in Rule Sets

This library provides rule sets to find the following forms of metadata in a page:

Field	Description
description	A user displayable description for the page.
icon	A URL which contains an icon for the page.
image	A URL which contains a preview image for the page.
keywords	The meta keywords for the page.
provider	A string representation of the sub and primary domains.
title	A user displayable title for the page.
type	The type of content as defined by opengraph.
url	A canonical URL for the page.

To use a single rule set to find a particular piece of metadata within a page, pass a Document object, a URL, and an object containing that rule set to getMetadata. It will apply each rule in that rule set until it finds a matching piece of information and return it.

Example:

const {getMetadata, metadataRuleSets} = require('page-metadata-parser');

const pageTitle = getMetadata(doc, url, {title: metadataRuleSets.title});

Extending a single rule set

To add your own custom rule to an existing rule set, you can push it onto that rule set's rules array.

Example:

const {getMetadata, metadataRuleSets} = require('page-metadata-parser');

const customDescriptionRuleSet = metadataRuleSets.description;

// Push the rule itself (a [selector, callable] pair), not an array wrapping it.
customDescriptionRuleSet.rules.push(
  ['meta[name="customDescription"]', element => element.getAttribute('content')]
);

const pageDescription = getMetadata(doc, url, {description: customDescriptionRuleSet});

Using all rules

To parse all of the available metadata on a page using all of the rule sets provided in this library, simply call getMetadata on the Document.

const {getMetadata, metadataRuleSets} = require('page-metadata-parser');

const pageMetadata = getMetadata(doc, url);

page-metadata-parser's Issues

Add support for sailthru metadata

You can find an example here:

https://techcrunch.com/2016/08/10/new-macbook-pro-with-touch-id-sensor-and-oled-mini-screen-is-coming-soon/

Add support for more twitter: meta tags

Loosely related to work in #40

Looking at some random GitHub pages, it looks like we're not scraping enough Twitter tags (on the off chance somebody specifies twitter: meta tags but not the open graph versions).

Here's a heavily prettified version of [a subset of] https://github.com/mozilla/page-metadata-parser/issues/53 meta tags:

<meta name="twitter:card" content="summary" />
<meta name="twitter:description" content="Another fascinating look into the life of @pdehaan...

Scraping http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/ returns some unexpected expected results. ..." />
<meta name="twitter:image:src" content="https://avatars1.githubusercontent.com/u/557895?v=3&amp;s=400" />
<meta name="twitter:site" content="@github" />
<meta name="twitter:title" content="Handle duplicate meta tags? · Issue #53 · mozilla/page-metadata-parser" />

<meta property="og:description" content="Another fascinating look into the life of @pdehaan...

Scraping http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/ returns some unexpected expected results. ..." />
<meta property="og:image" content="https://avatars1.githubusercontent.com/u/557895?v=3&amp;s=400" />
<meta property="og:site_name" content="GitHub" />
<meta property="og:title" content="Handle duplicate meta tags? · Issue #53 · mozilla/page-metadata-parser" />
<meta property="og:type" content="object" />
<meta property="og:url" content="https://github.com/mozilla/page-metadata-parser/issues/53" />

Looks like currently we are not:

Extracting meta[name="twitter:description"], if available.
Extracting meta[name="twitter:image:src"] at all (but we do extract twitter:image syntax). Per https://dev.twitter.com/cards/markup I'm not certain if this is valid syntax, but GitHub uses it. But I can't find an exhaustive resource for supported Twitter open graph tags.
Using correct format for meta[name="twitter:title"] — see #40.
Using correct format for meta[name="twitter:image"] — see #40.

Only return preview images

On some pages, such as https://cbc.ca, the parser will return a tracking image. We should remove the rule for previewImage.

Add support for msapplication-* meta tags?

I think this is a Windows 10 tiles thing maybe.

$ meta-scraper -u  "https://vimeo.com/180763356"

...
<meta name="msapplication-TileColor" content="#00adef">
<meta name="msapplication-TileImage" content="https://i.vimeocdn.com/favicon/main-touch_144">
...

Not sure what we'd need msapplication-TileColor for (unless we want a favicon background color fallback), but the msapplication-TileImage would probably be an appropriate [albeit lower ranked] icon_url fallback.

2 seconds of research led me to this page: https://msdn.microsoft.com/en-us/library/dn255024(v=vs.85).aspx and https://msdn.microsoft.com/en-us/library/bg183312(v=vs.85).aspx (in case we want to see what other pinned site options there are).

Off the top of my head:

application-name: Possibly useful for a <title> fallback?
msapplication-square70x70logo: favicon fallback?
msapplication-square150x150logo: favicon fallback?
msapplication-square310x310logo: favicon fallback?
msapplication-wide310x150logo: It's wide, so poor candidate for a favicon, but could be last hope for a primary image if we literally cannot find anything better.
msapplication-TileColor: favicon background color fallback? Not sure if we use that currently in Activity Stream, or if we have some Firefox library that gives us primary colors from a favicon.
msapplication-TileImage: favicon fallback?

Resolve 'rule' name collision

We are reusing the name 'rule' here:

https://github.com/mozilla/page-metadata-parser/blob/master/parser.js#L82

Do not use 'document' as a variable name

'document' is a global name in browser JS, so we should not overload that name.

Add video URL

Pull out a video player URL

Return secure HTTPS images, if possible

Ref mozilla/page-metadata-service#37

Fun! Turns out that in some cases we may get some Open Graph magic like og:image:secure_url which is an HTTPS URL (see http://ogp.me/#structured). It'd probably be nice to use this image if available instead of the HTTP og:image or og:image:url versions.

og:image:url - Identical to og:image.
og:image:secure_url - An alternate url to use if the webpage requires HTTPS.

— via http://ogp.me/#structured

Or a real-live example at https://www.mysecretwood.com/collections/our-rings?sort_by=best-selling:

<meta property="og:image" content="http://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550">
<meta property="og:image:secure_url" content="https://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550">

It looks like our current behavior matches the Embedly proxy, but it seems like a simple fix to get more secure HTTPS images [when available].

node index --url "https://www.mysecretwood.com/collections/our-rings?sort_by=best-selling"

https://www.mysecretwood.com/collections/our-rings?sort_by=best-selling:
  images:
    embedly:
      -
        caption: null
        entropy: 1.93638247295
        height:  220
        size:    46404
        url:     http://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550
        width:   303
    fathom:
      -
        entropy: 1
        height:  500
        url:     http://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550
        width:   500
  url:
    embedly: https://www.mysecretwood.com/collections/our-rings
    fathom:  https://www.mysecretwood.com/collections/our-rings?sort_by=best-selling
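The preference suggested in this issue could look roughly like the sketch below. It assumes the Open Graph properties have already been collected into a plain object. pickImageUrl is a hypothetical helper, not something the parser exposes.

```javascript
// Hypothetical helper: `meta` maps Open Graph property names to their content.
// Prefer the HTTPS variant when the page provides one.
function pickImageUrl(meta) {
  return meta['og:image:secure_url'] || meta['og:image:url'] || meta['og:image'];
}

const shopifyMeta = {
  'og:image': 'http://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550',
  'og:image:secure_url': 'https://cdn.shopify.com/s/files/1/1197/3550/t/12/assets/logo.png?13093450641738266550',
};

console.log(pickImageUrl(shopifyMeta));
console.log(pickImageUrl({'og:image': 'http://example.com/a.png'}));
```

In rule-set terms, the same effect would presumably come from listing a meta[property="og:image:secure_url"] rule ahead of the og:image rule.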

Add Docker file?

Not sure how we're planning on deploying this, but just a placeholder bug in case we want to have some supported Docker flow for this repo.

Ref: https://github.com/mozilla-services/Dockerflow

Add coveralls.io support

Integrate coverage reporting with coveralls.io

Inconsistent results for `favicon_url`

Ref: mozilla/page-metadata-service#28

Code: ./parser.js:41-49

Using my crappy proxy diff tool, it looks like we're not catching favicon_urls in all cases.

node index --url "https://techcrunch.com/2016/08/10/new-macbook-pro-with-touch-id-sensor-and-oled-mini-screen-is-coming-soon/"

https://techcrunch.com/2016/08/10/new-macbook-pro-with-touch-id-sensor-and-oled-mini-screen-is-coming-soon/:
  ...
  favicon_url:
    embedly: https://s0.wp.com/wp-content/themes/vip/techcrunch-2013/assets/images/favicon.ico
    fathom:  https://techcrunch.com/favicon.ico

So it looks like we're failing, and just assuming a /favicon.ico at the site root, whereas the Embedly proxy is finding a favicon.

Looking at the source, it looks like Embedly is matching on this (whereas Fathom isn't):

<link rel="shortcut icon" type="image/x-icon" href="https://s0.wp.com/wp-content/themes/vip/techcrunch-2013/assets/images/favicon.ico">

Another case:

node index --url "https://github.com/yannickcr/eslint-plugin-react/blob/master/lib/rules/no-string-refs.js"

https://github.com/yannickcr/eslint-plugin-react/blob/master/lib/rules/no-string-refs.js:
  favicon_url:
    embedly: https://assets-cdn.github.com/favicon.ico
    fathom:  https://github.com/apple-touch-icon.png

With the following markup:

    <link rel="apple-touch-icon" href="/apple-touch-icon.png">
    <link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="60x60" href="/apple-touch-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="72x72" href="/apple-touch-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="76x76" href="/apple-touch-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="114x114" href="/apple-touch-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="120x120" href="/apple-touch-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="144x144" href="/apple-touch-icon-144x144.png">
    <link rel="apple-touch-icon" sizes="152x152" href="/apple-touch-icon-152x152.png">
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon-180x180.png">
    ...
    <link rel="icon" type="image/x-icon" href="https://assets-cdn.github.com/favicon.ico">

Not sure how/if we can optimize that to return the largest size icon (and presumably the highest quality apple-touch-icon) versus just the first one in the list, or if we even need to.

Unable to scrape pages with invalid markup

I can't figure out why this isn't scraping, or why it isn't returning an error:

$ http https://pageshot.net

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 829
Content-Security-Policy: default-src 'self'; img-src 'self' pageshot-usercontent.net data:; script-src 'self' www.google-analytics.com 'nonce-4716a6ab-9fd6-4306-84b7-2e53eacc39e7'; style-src 'self' 'unsafe-inline'
Content-Type: text/html; charset=utf-8
Date: Mon, 22 Aug 2016 16:50:14 GMT
ETag: W/"33d-U6zVFiG1y8SeEae8uDoO+Q"
X-Powered-By: Express

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <title>
      Welcome to PageShot
    </title>

    <meta name="viewport" content="width=320, initial-scale=1.0, maximum-scale=1.0, user-scalable=0"/>
    <meta name="description" content="Share anything on the web with anyone using PageShot.">
    <link rel="stylesheet" href="https://code.cdn.mozilla.net/fonts/fira.css">
    <link href="/homepage/css/style.css" rel="stylesheet">

  </head>
  <body>
    <div class="vertical-centered-content-wrapper">
      <div class="stars"></div>
      <div class="copter fly-up"></div>
      <h1>Welcome to PageShot</h1>
      <a class="button primary" href="https://testpilot.firefox.com/">Coming soon to Firefox Test Pilot</a>
    <div>
  </body>
</html>

So the page definitely looks like it is there, and is valid-ish. But when I go to scrape it via the metadata proxy, I get no errors or response:

$ http https://metadata.dev.mozaws.net/v1/metadata urls:='["https://pageshot.net"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 34
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "https://pageshot.net"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json; charset=utf-8
Date: Mon, 22 Aug 2016 16:50:24 GMT
ETag: W/"16-urTtGfwwfQX5N25qpNbXOg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000
X-Powered-By: Express

{
    "error": "",
    "urls": {}
}

Unable to scrape kickstarter pages

Scraping https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description gives me no usable data.

Looking at view-source:https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description shows me a bunch of <meta> tags that aren't available when I cURL the page directly, so I'm guessing there are some shenanigans happening on the Kickstarter side.

$ http https://page-metadata.services.mozilla.com/v1/metadata urls:='["https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description"]' -j

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 475
Content-Type: application/json; charset=utf-8
Date: Sat, 24 Sep 2016 00:30:14 GMT
ETag: W/"1db-rGqz4BBDtZ7heHcjgO72Ow"

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description": {
            "favicon_url": "https://www.kickstarter.com/favicon.ico",
            "images": [],
            "original_url": "https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description",
            "url": "https://www.kickstarter.com/projects/1280803647/muzo-your-personal-zone-creator-with-noise-blockin/description"
        }
    }
}

Find the largest high res icon

When we parse for high res icons, like apple touch icons, we need to take into account the size properties and sort for the largest one.
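A minimal sketch of that sort, assuming each candidate icon has already been reduced to a plain object with href and an optional sizes string in a single WIDTHxHEIGHT form. iconArea and largestIcon are hypothetical names, and values such as "any" or space-separated size lists are not handled here.

```javascript
// Parse a `sizes` value such as "180x180" into an area. Icons without a
// usable size count as 0 so they sort last.
function iconArea(sizes) {
  const match = /^(\d+)x(\d+)$/i.exec(sizes || '');
  return match ? Number(match[1]) * Number(match[2]) : 0;
}

// Return the icon with the largest declared area (undefined for an empty list).
function largestIcon(icons) {
  return icons.slice().sort((a, b) => iconArea(b.sizes) - iconArea(a.sizes))[0];
}

const icons = [
  {href: '/apple-touch-icon.png'},
  {sizes: '57x57', href: '/apple-touch-icon-57x57.png'},
  {sizes: '180x180', href: '/apple-touch-icon-180x180.png'},
  {sizes: '152x152', href: '/apple-touch-icon-152x152.png'},
];

console.log(largestIcon(icons).href);
```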

Extended support for page keywords

Ref: #43 (comment)

Not sure if we want to add support for article:tag, which I've seen a few times in "the wild":

Example: https://www.engadget.com/2016/08/19/the-best-headlamps/
Source:

<meta property="og:url" content="https://www.engadget.com/2016/08/19/the-best-headlamps/">
<meta property="og:title" content="The best headlamps">
<meta property="og:description" content="Go for the Black Diamond Spot.">

<meta property="og:image" content="https://s.aolcdn.com/dims5/amp:7a9ea64b5117cd0b2d3e3df595c52b67aa6a6709/t:1200,630/q:80/?url=https%3A%2F%2Fs.aolcdn.com%2Fhss%2Fstorage%2Fmidas%2F7697ed6dc5ea00ddff3537c34c17dde3%2F204221484%2F01-headlamps-2000.jpg">
<meta property="og:image:width" content="1200">
<meta property="og:image:height" content="630">

<meta property="og:type" content="article">
<meta property="article:tag" content="BlackDiamond">
<meta property="article:tag" content="BlackDiamondRevolt">
<meta property="article:tag" content="BlackDiamondSpot">
<meta property="article:tag" content="CoastFL75">
<meta property="article:tag" content="gadgetry">
<meta property="article:tag" content="gadgets">
<meta property="article:tag" content="gear">
<meta property="article:tag" content="headlamp">
<meta property="article:tag" content="headlamps">
<meta property="article:tag" content="LED Lights">
<meta property="article:tag" content="ONeill">
<meta property="article:tag" content="partner">
<meta property="article:tag" content="Shining Buddy">
<meta property="article:tag" content="syndicated">
<meta property="article:tag" content="The Revolt">
<meta property="article:tag" content="thewirecutter">
<meta property="article:tag" content="Vitchelo">
<meta property="article:tag" content="VitcheloV800">
<meta property="article:tag" content="wirecutter">

Interestingly, I can't even see a <meta name="keywords" /> on that page...

Also, it looks like they do have the same values repeated for swiftype tags:

<meta class="swiftype" name="tags" data-type="string" content="BlackDiamond">
<meta class="swiftype" name="tags" data-type="string" content="BlackDiamondRevolt">
<meta class="swiftype" name="tags" data-type="string" content="BlackDiamondSpot">
<meta class="swiftype" name="tags" data-type="string" content="CoastFL75">
<meta class="swiftype" name="tags" data-type="string" content="gadgetry">
<meta class="swiftype" name="tags" data-type="string" content="gadgets">
<meta class="swiftype" name="tags" data-type="string" content="gear">
<meta class="swiftype" name="tags" data-type="string" content="headlamp">
<meta class="swiftype" name="tags" data-type="string" content="headlamps">
<meta class="swiftype" name="tags" data-type="string" content="LED Lights">
<meta class="swiftype" name="tags" data-type="string" content="ONeill">
<meta class="swiftype" name="tags" data-type="string" content="partner">
<meta class="swiftype" name="tags" data-type="string" content="Shining Buddy">
<meta class="swiftype" name="tags" data-type="string" content="syndicated">
<meta class="swiftype" name="tags" data-type="string" content="The Revolt">
<meta class="swiftype" name="tags" data-type="string" content="thewirecutter">
<meta class="swiftype" name="tags" data-type="string" content="Vitchelo">
<meta class="swiftype" name="tags" data-type="string" content="VitcheloV800">
<meta class="swiftype" name="tags" data-type="string" content="wirecutter">

And again for AMP, using ld+json:

<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "url": "https://www.engadget.com/2016/08/19/the-best-headlamps/",
    "author": "The Wirecutter",
    "headline": "The best headlamps",
    "datePublished": "2016-08-19 12:23:00.000000",
    ...
    "articleBody": "...",
    "articleSection": "Gear",
    "keywords": ["BlackDiamond","BlackDiamondRevolt","BlackDiamondSpot","CoastFL75","gadgetry","gadgets","gear","headlamp","headlamps","LED Lights","ONeill","partner","Shining Buddy","syndicated","The Revolt","thewirecutter","Vitchelo","VitcheloV800","wirecutter"],
    ...
    "dateModified": "2016-08-19 12:39:44.000000"
  }
</script>

Not sure if we want to add the latter two right now, or leave those until the amp and swiftype implementation bugs.

But it also brings up a semi-related issue I keep forgetting to ask. Given that OpenGraph and swiftype and others can sometimes have multiple tags that match a ruleset, does Fathom or our parser somehow convert those to an array, or will it just pluck the first value that matches (giving us one keyword, instead of an array of keywords)?

For example, will it work for tags like this:

<meta class="swiftype" name="tags" data-type="string" content="BlackDiamond">
<meta class="swiftype" name="tags" data-type="string" content="BlackDiamondRevolt">
<meta class="swiftype" name="tags" data-type="string" content="BlackDiamondSpot">
<meta class="swiftype" name="tags" data-type="string" content="CoastFL75">
...
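If the answer turns out to be "only the first match", one option would be to collect every matching element into an array. This is a hedged sketch: collectContents and fakeMeta are hypothetical, and the fake elements stand in for what Array.from(doc.querySelectorAll(...)) would return in a real DOM.

```javascript
// Gather the `content` of every matching element, dropping empty values
// and duplicates while keeping first-seen order.
function collectContents(elements) {
  const values = elements
    .map(element => element.getAttribute('content'))
    .filter(Boolean);
  return [...new Set(values)];
}

// Stand-in for a <meta> element (assumption, only for this sketch).
const fakeMeta = content => ({
  getAttribute: name => (name === 'content' ? content : null),
});

const tagElements = [
  fakeMeta('BlackDiamond'),
  fakeMeta('BlackDiamondRevolt'),
  fakeMeta('BlackDiamond'),
  fakeMeta('CoastFL75'),
];

console.log(collectContents(tagElements));
```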

YouTube doesn't return a "type" attribute

Not sure if we have a response schema or if we guarantee that you'll get all fields, but I noticed this while browsing using the super cool ffmetadata tool.

youtube.com:

$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["https://www.youtube.com"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 500
Date: Thu, 30 Jun 2016 20:01:21 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "https://www.youtube.com": {
      "description": "Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.",
      "icon_url": "https://s.ytimg.com/yts/img/favicon_32-vfl8NGn4k.png",
      "image_url": "//s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png",
      "title": "YouTube",
      "url": "https://www.youtube.com",
      "original_url": "https://www.youtube.com",
      "provider_url": "https://www.youtube.com",
      "favicon_url": "https://www.youtube.com/favicon.ico"
    }
  }
}

Whereas www.cnn.com gives me slightly different fields:

cnn.com:

$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["http://www.cnn.com"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 560
Date: Thu, 30 Jun 2016 20:02:00 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "http://www.cnn.com": {
      "description": "View the latest news and breaking news today for U.S., world, weather, entertainment, politics and health at CNN.com.",
      "icon_url": "http://i.cdn.turner.com/cnn/.e/img/3.0/global/misc/apple-touch-icon.png",
      "image_url": "http://i.cdn.turner.com/cnn/.e1mo/img/4.0/logos/menu_politics.png",
      "title": "CNN - Breaking News, Latest News and Videos",
      "type": "website",
      "url": "http://www.cnn.com",
      "original_url": "http://www.cnn.com",
      "provider_url": "http://www.cnn.com",
      "favicon_url": "http://www.cnn.com/favicon.ico"
    }
  }
}

Strip newlines from title value?

A bit of a curious case, but should we strip any newlines from the parsed title value?

I found the following markup in a random page:

<title>
Imperfect -Ugly produce delivery in Oakland and Berkeley
 :: FAQ
</title>

And scraping that page indeed returns a \n in the title value:

"title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",

Full response below:

$ http POST https://metadata.dev.mozaws.net/v1/metadata urls:='["http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 103
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 801
Content-Type: application/json; charset=utf-8
Date: Fri, 12 Aug 2016 00:00:12 GMT
ETag: W/"321-F7b8qrfWsYA4UlqUUNvgqg"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000
X-Powered-By: Express

{
    "error": "",
    "urls": {
        "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a": {
            "description": "Imperfect offers home and office delivery of 'ugly' produce for 30% less than grocery store prices.  We are located in Oakland, California and deliver fruits and vegetables to Oakland, Berkeley, and Emeryville",
            "favicon_url": "http://shop.imperfectproduce.com/favicon.ico",
            "images": [
                {
                    "entropy": 1,
                    "height": 500,
                    "url": "http://shop.imperfectproduce.com/skin1/images/heartlogo.png",
                    "width": 500
                }
            ],
            "original_url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a",
            "title": "Imperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ",
            "url": "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"
        }
    }
}

And running this through my lame proxy diff tool shows that Embedly proxy seemingly strips the newline, while the Fathom proxy doesn't:

$ node index --url "http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a"

Scraping http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:

https://embedly-proxy.services.mozilla.com/v2/extract: 179.735ms
https://metadata.dev.mozaws.net/v1/metadata: 397.871ms

http://shop.imperfectproduce.com/pages.php?pageid=27&xid=db023392b90263112dd62807c8814d8a:
  ...
  title:
    embedly: Imperfect -Ugly produce delivery in Oakland and Berkeley :: FAQ
    fathom:
      """
        Imperfect -Ugly produce delivery in Oakland and Berkeley
         :: FAQ
      """


TOTAL TIME: 426.121ms
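One possible normalization, sketched here as a hypothetical normalizeTitle helper, collapses every run of whitespace (including newlines) into a single space and trims the ends. Whether the parser should do this is still the open question above.

```javascript
// Collapse runs of whitespace, newlines included, into single spaces.
function normalizeTitle(title) {
  return title.replace(/\s+/g, ' ').trim();
}

const rawTitle = '\nImperfect -Ugly produce delivery in Oakland and Berkeley\n :: FAQ\n';

console.log(JSON.stringify(normalizeTitle(rawTitle)));
```

For the markup in this issue, that yields the same single-line title that the Embedly proxy returns.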

Add domino/cheerio support

JSDom has significant memory leaks, so we should test and add support for domino and/or cheerio.

Run webpack during CI

We should run our webpack bundle task during CI to ensure we don't break this feature.

Add documentation

Document basic installation and usage.

Define rules as arrays so they can be extended

Rather than wrapping rules in RuleSet, we should just leave them as arrays so that they can be dynamically extended.

Suspicious www URL parsing in getProvider()

Ref /parser.js:14-22,

function getProvider(url) {
  return urlparse.parse(url)
    .hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('.co.', '.')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

The one suspicious bit in there is the .replace(/www[a-zA-Z0-9]*\./, '') bit.
It looks like [in theory] it would murder any domain that would start with "www", such as "wwwapple.com":

const urlparse = require('url');

function getProvider(url) {
  return urlparse.parse(url)
    .hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('.co.', '.')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

console.log(getProvider('https://www.apple.com')); // "apple"
console.log(getProvider('https://bbc.co.uk')); // "bbc"
console.log(getProvider('https://redirect.ca')); // "redirect"
console.log(getProvider('https://aol.go.com')); // "aol go"
console.log(getProvider('https://wwwwwapple.com')); // ""
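A possible tightening, offered only as a sketch: anchor the pattern to the start of the hostname and allow only digits after "www". getProviderStrict is a hypothetical name, and narrowing the suffix to digits is an assumption that changes behavior for hosts such as "wwwabc.example.com". It uses the global URL class rather than the url module.

```javascript
// Strip only a leading "www" label, optionally followed by digits
// (www., www2.), instead of any run of characters starting with "www".
function getProviderStrict(url) {
  return new URL(url).hostname
    .replace(/^www\d*\./, '')
    .replace('.co.', '.')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

console.log(getProviderStrict('https://www.apple.com'));
console.log(getProviderStrict('https://www2.example.com'));
console.log(getProviderStrict('https://bbc.co.uk'));
console.log(getProviderStrict('https://wwwwwapple.com'));
```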

Fall back to provided URL if no canonical URL found

When we invoke the canonical URL rules, we should pass in the URL in the context, and if the parser is unable to find a URL in the page, return the passed in URL as the canonical URL.

Fully qualified paths

Any time we pull out a path, if it is relative, we need to fully qualify it with the parent path of the page.
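The WHATWG URL constructor already does this kind of resolution, so a sketch could be as small as the following. resolveUrl is a hypothetical helper name. The sample inputs are taken from other issues on this page.

```javascript
// Resolve a possibly relative path against the URL of the page it came from.
// Absolute URLs pass through unchanged.
function resolveUrl(pageUrl, path) {
  return new URL(path, pageUrl).href;
}

const githubPage = 'https://github.com/yannickcr/eslint-plugin-react/blob/master/lib/rules/no-string-refs.js';

console.log(resolveUrl(githubPage, '/apple-touch-icon.png'));
console.log(resolveUrl('https://www.youtube.com', '//s.ytimg.com/yts/img/yt_1200-vfl4C3T0K.png'));
console.log(resolveUrl(githubPage, 'https://assets-cdn.github.com/favicon.ico'));
```

This also covers protocol-relative values such as the YouTube image_url, which inherit the scheme of the page.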

Add support for oembed service?

Was reading https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254#.cchchs8s3 and started looking into http://oembed.com/.

Not sure if it's worth it to try and search for oembed service stuff in a webpage instead of trying to scrape a page's DOM and extract tags.

Here are 3 results from random tabs that I happened to have open:

https://www.youtube.com/watch?v=qjwujvv1E0w&feature=youtu.be:

<link rel="alternate"
    type="application/json+oembed"
    href="http://www.youtube.com/oembed?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dqjwujvv1E0w&amp;format=json"
    title="Urban Agriculture in Focus: Greenhouse vs. Indoor Growing">

Output:

{
    "author_name": "Bright Agrotech",
    "author_url": "https://www.youtube.com/user/BrightAgrotechLLC",
    "height": 270,
    "html": "<iframe width=\"480\" height=\"270\" src=\"https://www.youtube.com/embed/qjwujvv1E0w?feature=oembed\" frameborder=\"0\" allowfullscreen></iframe>",
    "provider_name": "YouTube",
    "provider_url": "https://www.youtube.com/",
    "thumbnail_height": 360,
    "thumbnail_url": "https://i.ytimg.com/vi/qjwujvv1E0w/hqdefault.jpg",
    "thumbnail_width": 480,
    "title": "Urban Agriculture in Focus: Greenhouse vs. Indoor Growing",
    "type": "video",
    "version": "1.0",
    "width": 480
}

http://www.greenkitchenstories.com/the-green-kitchen-smoothies-book/:

<link rel="alternate"
    type="application/json+oembed"
    href="http://www.greenkitchenstories.com/wp-json/oembed/1.0/embed?url=http%3A%2F%2Fwww.greenkitchenstories.com%2Fthe-green-kitchen-smoothies-book%2F" />

Output:

{
  "author_name": "Green Kitchen Stories",
  "author_url": "http://www.greenkitchenstories.com/author/greenkit/",
  "height": 338,
  "html": "<iframe sandbox=\"allow-scripts\" security=\"restricted\" src=\"http://www.greenkitchenstories.com/the-green-kitchen-smoothies-book/embed/\" width=\"600\" height=\"338\" title=\"Embedded WordPress Post\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" class=\"wp-embedded-content\"></iframe>",
  "provider_name": "Green Kitchen Stories",
  "provider_url": "http://www.greenkitchenstories.com",
  "title": "Green Kitchen Smoothies",
  "type": "rich",
  "version": "1.0",
  "width": 600
}

https://www.flickr.com/photos/bees/2341623661/:

<link rel="alternative"
    type="application/json+oembed"
    href="https://www.flickr.com/services/oembed?url&#x3D;https://www.flickr.com/photos/bees/2341623661&amp;format&#x3D;json" />

Output:

{
  "author_name": "‮‭‬bees‬",
  "author_url": "https://www.flickr.com/photos/bees/",
  "cache_age": 3600,
  "flickr_type": "photo",
  "height": "683",
  "html": "<a data-flickr-embed=\"true\" href=\"https://www.flickr.com/photos/bees/2341623661/\" title=\"ZB8T0193 by ‮‭‬bees‬, on Flickr\"><img src=\"https://farm4.staticflickr.com/3123/2341623661_7c99f48bbf_b.jpg\" width=\"1024\" height=\"683\" alt=\"ZB8T0193\"></a><script async src=\"https://embedr.flickr.com/assets/client-code.js\" charset=\"utf-8\"></script>",
  "license": "All Rights Reserved",
  "license_id": 0,
  "provider_name": "Flickr",
  "provider_url": "https://www.flickr.com/",
  "thumbnail_height": 150,
  "thumbnail_url": "https://farm4.staticflickr.com/3123/2341623661_7c99f48bbf_q.jpg",
  "thumbnail_width": 150,
  "title": "ZB8T0193",
  "type": "photo",
  "url": "https://farm4.staticflickr.com/3123/2341623661_7c99f48bbf_b.jpg",
  "version": "1.0",
  "web_page": "https://www.flickr.com/photos/bees/2341623661/",
  "web_page_short_url": "https://flic.kr/p/4yVr8K",
  "width": "1024"
}

Use getAttribute to access element attributes

We must use getAttribute to access attributes of DOM elements. Accessing it as a property on the element itself is not an officially supported behaviour and leads to confusing results.
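A small sketch of the difference, using a stand-in object instead of a real DOM element so it runs anywhere. The helper name getAttr and the fakeMeta object are hypothetical. One concrete case where property access falls short: the HTML standard defines no reflected "property" DOM property on meta elements, so the Open Graph "property" attribute can only be read through getAttribute.

```javascript
// Hypothetical helper: read an attribute via getAttribute.
function getAttr(element, name) {
  // getAttribute returns null when the attribute is not present.
  const value = element.getAttribute(name);
  return value === null ? undefined : value;
}

// Minimal stand-in for a DOM element, only for demonstration.
const fakeMeta = {
  attrs: { property: 'og:title', content: 'Hello' },
  getAttribute(name) {
    return Object.prototype.hasOwnProperty.call(this.attrs, name)
      ? this.attrs[name]
      : null;
  },
};

console.log(getAttr(fakeMeta, 'content')); // → Hello
console.log(getAttr(fakeMeta, 'missing')); // → undefined
// Property access is not guaranteed to mirror attributes;
// on this stand-in it simply yields undefined.
console.log(fakeMeta.content); // → undefined
```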

Relative URLs are broken

If a page uses a protocol-relative URL for a resource such as:

<link rel="shortcut icon" href="//s.ytimg.com/yts/img/favicon-vflz7uhzw.ico" type="image/x-icon">

the parser accidentally mangles it into:

https://www.youtube.com/yts/img/favicon_32-vfl8NGn4k.png

when we attempt to make it absolute.
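For reference, this is how the standard URL constructor treats a protocol-relative href: it keeps the host from the href and only borrows the scheme from the page. The page URL below is an assumption picked for illustration; the href is the one from the example above.

```javascript
// Assumed page URL; only its scheme should be reused.
const pageUrl = 'https://www.youtube.com/watch?v=qjwujvv1E0w';
const iconHref = '//s.ytimg.com/yts/img/favicon-vflz7uhzw.ico';

const resolved = new URL(iconHref, pageUrl).href;
console.log(resolved);
// → https://s.ytimg.com/yts/img/favicon-vflz7uhzw.ico
```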

Scraping an image doesn't return said image in images[] array

Not sure if valid use case, but worth discussion somewhere, I imagine...

I looked at some random image on The Internet, http://i.imgur.com/qu1D1i2.png

When I try to look up that URL via the metadata proxy, it returns a response, but the images[] array is empty, even though the URL itself is an image.

Supporting this use case may require too much special casing, since we would have to inspect the HTTP Content-Type and inject the image when the URL points at the image itself (and not at an HTML page with scrape-able meta tags).
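A rough sketch of that special case, assuming the proxy already has the response's Content-Type header at hand. The function name and the shape of the entries in images[] are guesses for illustration, not the proxy's actual format.

```javascript
// Hypothetical helper, not part of the metadata proxy.
// Returns a metadata object when the URL itself is an image, else null.
function metadataForDirectImage(url, contentType) {
  // Content-Type may carry parameters, e.g. "image/png; charset=binary".
  const mediaType = String(contentType || '')
    .split(';')[0]
    .trim()
    .toLowerCase();

  if (!mediaType.startsWith('image/')) {
    return null; // not an image: fall back to normal HTML scraping
  }

  // Assumed response shape: the image entry format is a guess.
  return { url, original_url: url, images: [{ url }] };
}

console.log(metadataForDirectImage('http://i.imgur.com/qu1D1i2.png', 'image/png'));
console.log(metadataForDirectImage('http://example.com/', 'text/html; charset=utf-8'));
// → null
```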

$ http https://metadata.dev.mozaws.net/v1/metadata urls:='["http://i.imgur.com/qu1D1i2.png"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 44
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://i.imgur.com/qu1D1i2.png"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 226
Content-Type: application/json; charset=utf-8
Date: Wed, 24 Aug 2016 23:29:59 GMT
ETag: W/"e2-8nExhyCWBstSo0Fyi5WAGw"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://i.imgur.com/qu1D1i2.png": {
            "favicon_url": "http://i.imgur.com/favicon.ico",
            "images": [],
            "original_url": "http://i.imgur.com/qu1D1i2.png",
            "url": "http://i.imgur.com/qu1D1i2.png"
        }
    }
}

$ curl -v http://i.imgur.com/qu1D1i2.png

*   Trying 151.101.40.193...
* Connected to i.imgur.com (151.101.40.193) port 80 (#0)
> GET /qu1D1i2.png HTTP/1.1
> Host: i.imgur.com
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Last-Modified: Thu, 18 Aug 2016 17:41:22 GMT
< ETag: "ff147ca3989a0d3a1459a68197703c9b"
< Content-Type: image/png
< Fastly-Debug-Digest: 51baf4e16eef201d40ad7972b9cf48d2c78f559e1c1dc2d152ecfdffd5ff2a37
< cache-control: public, max-age=31536000
< Content-Length: 1224831
< Accept-Ranges: bytes
< Date: Wed, 24 Aug 2016 23:38:56 GMT
< Age: 539625
< Connection: keep-alive
< X-Served-By: cache-iad2129-IAD, cache-sjc3129-SJC
< X-Cache: HIT, HIT
< X-Cache-Hits: 1, 1
< X-Timer: S1472081936.673075,VS0,VE1
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Origin: *
< Server: cat factory 1.0
<
...

provider_url response includes full domain (not top-level domain)

TL;DR: Embed.ly returns just the site root (scheme and host, without path or query) for the provider_url, whereas our page-metadata-parser returns the original URL:

SERVICE	`provider_url`
Embed.ly	`"https://www.youtube.com/"`
page-metadata-parser	`"https://www.youtube.com/watch?v=nh0ac5HUpDU"`

page-metadata-parser:

$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["https://www.youtube.com/watch?v=nh0ac5HUpDU"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 682
Date: Thu, 30 Jun 2016 20:27:50 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "https://www.youtube.com/watch?v=nh0ac5HUpDU": {
      "description": "The United Kingdom voted to leave the European Union, and it looks like it may not be an especially smooth transition. Connect with Last Week Tonight online....",
      "favicon_url": "https://www.youtube.com/favicon.ico",
      "icon_url": "https://s.ytimg.com/yts/img/favicon_32-vfl8NGn4k.png",
      "image_url": "https://i.ytimg.com/vi/nh0ac5HUpDU/hqdefault.jpg",
      "original_url": "https://www.youtube.com/watch?v=nh0ac5HUpDU",
      "provider_url": "https://www.youtube.com/watch?v=nh0ac5HUpDU",
      "title": "Last Week Tonight With John Oliver: Brexit Update (HBO)",
      "type": "video",
      "url": "https://www.youtube.com/watch?v=nh0ac5HUpDU"
    }
  }
}

But here's the [modified] response from the embed.ly extract API:

{
    "description": "The United Kingdom voted to leave the European Union, and it looks like it may not be an especially smooth transition. Connect with Last Week Tonight online...",
    "favicon_url": "https://www.youtube.com/favicon.ico",
    "images": [ ... ],
    "original_url": "https://www.youtube.com/watch?v=nh0ac5HUpDU",
    "provider_display": "www.youtube.com",
    "provider_name": "YouTube",
    "provider_url": "https://www.youtube.com/",
    "title": "Last Week Tonight With John Oliver: Brexit Update (HBO)",
    "type": "html",
    "url": "https://www.youtube.com/watch?v=nh0ac5HUpDU"
}
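If we want to match Embed.ly here, one option is to derive the value from the page URL's origin. This is a sketch under that assumption; the function name getProviderUrl is made up, and the trailing slash is added only to mirror the Embed.ly output above.

```javascript
// Hypothetical helper: reduce a page URL to scheme + host.
function getProviderUrl(pageUrl) {
  // URL#origin has no trailing slash, so append one to match Embed.ly.
  return new URL(pageUrl).origin + '/';
}

console.log(getProviderUrl('https://www.youtube.com/watch?v=nh0ac5HUpDU'));
// → https://www.youtube.com/
```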

Parser silently? fails on Google Maps

Fathom proxy 👎

curl -XPOST 'https://metadata.dev.mozaws.net/v1/metadata' -d '{"urls":["https://www.google.com/maps/@37.3743066,-122.0030697,14.14z"]}' -H "Content-Type: application/json" | json

{
  "error": "",
  "urls": {}
}

Embedly proxy 👍

curl -XPOST 'https://embedly-proxy.services.mozilla.com/v2/extract' -d '{"urls":["https://www.google.com/maps/@37.3743066,-122.0030697,14.14z"]}' -H "Content-Type: application/json" | json

{
  "urls": {
    "https://www.google.com/maps/@37.3743066,-122.0030697,14.14z": {
      "provider_url": "http://google.com/maps",
      "authors": [],
      "provider_display": "www.google.com",
      "related": [],
      "favicon_url": "https://www.google.com/images/branding/product/ico/maps_32dp.ico",
      "images": [
        {
          "width": 250,
          "url": "http://maps-api-ssl.google.com/maps/api/staticmap?center=37.3743066,-122.0030697&zoom=15&size=250x250&sensor=false",
          "height": 250,
          "caption": null,
          "colors": [
            {
              "color": [
                230,
                231,
                233
              ],
              "weight": 0.876953125
            },
            {
              "color": [
                247,
                248,
                250
              ],
              "weight": 0.12304687500000001
            }
          ],
          "entropy": 0.878253271477,
          "size": 5221
        }
      ],
      "app_links": [],
      "original_url": "https://www.google.com/maps/@37.3743066,-122.0030697,14.14z",
      "media": {
        "type": "rich"
      },
      "content": null,
      "entities": [],
      "provider_name": "Google Maps",
      "type": "html",
      "description": null,
      "embeds": [],
      "safe": true,
      "offset": null,
      "cache_age": 86400,
      "lead": null,
      "language": null,
      "url": "https://www.google.com/maps/@37.3743066,-122.0030697,14.14z?dg=dbrw&newdg=1",
      "title": "37.3743066,-122.0030697",
      "favicon_colors": [
        {
          "color": [
            6,
            163,
            104
          ],
          "weight": 0.0891113281
        },
        {
          "color": [
            80,
            131,
            228
          ],
          "weight": 0.046875
        },
        {
          "color": [
            213,
            82,
            70
          ],
          "weight": 0.041259765600000005
        },
        {
          "color": [
            193,
            199,
            193
          ],
          "weight": 0.0283203125
        },
        {
          "color": [
            233,
            214,
            75
          ],
          "weight": 0.0244140625
        }
      ],
      "keywords": [],
      "published": null
    }
  },
  "error": ""
}

Add support for 'picture' element?

I guess this kind of assumes that if somebody went to the effort of marking up an element with a <picture> and <source> tag, it has some sort of value:

<picture>
  <source srcset="mdn-logo.svg" type="image/svg+xml" />
  <img src="mdn-logo.png" />
</picture>

— via https://developer.mozilla.org/en-US/docs/Web/HTML/Element/picture

Randomly saw this on http://www.espn.com/nfl/story/_/id/17554369/chris-ivory-back-jacksonville-jaguars-emergency:

<aside class="inline inline-photo full">
  <figure>
    <picture >
      <source data-srcset="http://a3.espncdn.com/combiner/i?img=%2Fphoto%2F2016%2F0913%2Fr125376_1296x729_16%2D9.jpg&w=570, http://a3.espncdn.com/combiner/i?img=%2Fphoto%2F2016%2F0913%2Fr125376_1296x729_16%2D9.jpg&w=1140&cquality=40 2x" media="(min-width: 376px)">
      <source data-srcset="http://a3.espncdn.com/combiner/i?img=%2Fphoto%2F2016%2F0913%2Fr125376_1296x729_16%2D9.jpg&w=375, http://a3.espncdn.com/combiner/i?img=%2Fphoto%2F2016%2F0913%2Fr125376_1296x729_16%2D9.jpg&w=750&cquality=40 2x" media="(max-width: 375px)">
      <img  class=" lazyload lazyload" data-image-container=".inline-photo"  />
    </picture>
    <figcaption class="photoCaption">
      Chris Ivory was back at the Jaguars' facility Thursday following an emergency for which he was hospitalized earlier this week.&nbsp;<cite>David Rosenblum/Icon Sportswire</cite>
    </figcaption>
  </figure>
</aside>

Also worth noting: the real-world ESPN example above uses data-srcset (possibly a polyfill) rather than srcset. The srcset syntax is also "unique" (see MDN): it is a comma-separated list of candidates, here a regular image and a 2x "retina" image, so we'd have to wrap it in a postprocessor that pulls the larger retina image, if available.
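Such a postprocessor could look roughly like the sketch below. It is deliberately simplified: it assumes candidates are separated by a comma followed by whitespace, it only understands density descriptors such as "2x", and it treats anything else (including width descriptors such as "570w") as 1x. The function name is hypothetical.

```javascript
// Hypothetical postprocessor: pick the highest-density srcset candidate.
function pickLargestCandidate(srcset) {
  const candidates = srcset
    .split(/,\s+/) // simplification: comma followed by whitespace
    .map(part => part.trim())
    .filter(Boolean)
    .map(part => {
      const [url, descriptor = '1x'] = part.split(/\s+/);
      // Only density descriptors ("2x") are parsed; others count as 1.
      const density = descriptor.endsWith('x') ? parseFloat(descriptor) : 1;
      return { url, density: Number.isNaN(density) ? 1 : density };
    });

  if (candidates.length === 0) return undefined;
  return candidates.reduce((best, c) => (c.density > best.density ? c : best)).url;
}

console.log(pickLargestCandidate('photo.jpg?w=570, photo.jpg?w=1140&cquality=40 2x'));
// → photo.jpg?w=1140&cquality=40
```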

Plus, that page has appropriate <meta property="og:image"/> and <meta name="twitter:image"/> tags, so we [presumably] wouldn't use that particular <source> fallback anyways.

Possibly not worth the effort, and that's "OK".

Add a command line interface

It would be handy if we could invoke the parser from the command line and have it output the metadata as JSON.

Add support for amp metadata

https://www.ampproject.org/docs/guides/discovery.html

Fall back on domain/favicon.ico when no favicon specified

Many sites do not specify a favicon in the meta tags at all. In that case we should just fall back on inferring the presence of /favicon.ico on the page's domain.
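A minimal sketch of that fallback, assuming the rule has access to the page URL. The function name and parameters are hypothetical; note that it only infers the URL and does not check that the file actually exists.

```javascript
// Hypothetical helper: use the declared icon if any, else /favicon.ico.
function faviconUrl(pageUrl, declaredIconHref) {
  // Both branches are resolved against the page URL.
  return new URL(declaredIconHref || '/favicon.ico', pageUrl).href;
}

console.log(faviconUrl('http://i.imgur.com/qu1D1i2.png', null));
// → http://i.imgur.com/favicon.ico
```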

Thoughts on site specific rules

First off, really nice work guys, I hope we can begin using this shortly!

On our side we have some requirements to support parsing pages like Facebook and Twitter profiles. Currently this is a messy and brittle process. Facebook, for example, requires that we extract the main cover image from some JSON buried in some inline JS via a regex.

Gross huh? But needs must :)

It doesn't really make sense to throw very very specific rules like this into the page-metadata-parser Fathom 'image_url' ruleset, especially if we know that this URL is not a Facebook profile URL.
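One possible shape for this, purely as a starting point: keep host-specific rules in a separate registry keyed by hostname and only prepend them when the page URL matches. Everything below is hypothetical, including the registry, the function name, and the placeholder rule strings; none of it exists in page-metadata-parser.

```javascript
// Generic rules that apply to every page (placeholder selector).
const genericImageRules = ['meta[property="og:image"]'];

// Hypothetical registry of host-specific rules.
const siteSpecificImageRules = new Map([
  ['www.facebook.com', ['<hypothetical profile cover rule>']],
]);

// Host-specific rules come first so they win over the generic ones.
function imageRulesFor(pageUrl) {
  const host = new URL(pageUrl).hostname;
  const extra = siteSpecificImageRules.get(host) || [];
  return [...extra, ...genericImageRules];
}

console.log(imageRulesFor('https://www.facebook.com/some.profile'));
console.log(imageRulesFor('https://example.com/article'));
```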

Opening this issue to start the discussion.

Add build and coverage to readme

Add the circle build and coverage badges to README.md

Try and fix invalid open graph tags?

This is a stretch, but I'm not sure how common bad markup is, or if we should try and correct mistakes.

Random example; http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm has the following markup:

<meta name="og:title" content="Used 2012 Ford Fiesta SES For Sale in Sunnyvale CA & the Bay Area | Serving San Jose, Fremont & Palo Alto | VIN: 3FADP4FJ0CM129464" />
<meta name="og:type" content="website" />
<meta name="og:image" content="https://images.dealer.com/evox/color_0640_001/7623/7623_cc0640_001_VG.jpg" />
<meta name="og:url" content="http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm" />
<meta name="og:description" content="Buy a used 2012 Ford Fiesta Hatchback VIN: 3FADP4FJ0CM129464 from Toyota Sunnyvale in Sunnyvale, CA.Call 408-831-0005 for more information about Stock#TC11688. Proudly serving San Jose, Fremont & Palo Alto." />
<meta name="locale" content="en_US" />

That superficially LOOKS right, but according to http://ogp.me/ it should probably be:

<meta property="og:image" content="..." />

Instead of the incorrect (name instead of property):

<meta name="og:image" content="..." />

Not sure if this is worth adding a few fallback rules in our https://github.com/mozilla/page-metadata-parser/blob/master/parser.js to look for the incorrect version as well, so we "do the right thing"(tm).
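Such a fallback might look like the sketch below: try the correct property form first, then the incorrect name form. The helper names are made up, and the code only relies on the standard querySelector and getAttribute DOM methods, so it should work with any Document compatible object.

```javascript
// Hypothetical helper: selectors for an Open Graph key, correct form first.
function ogSelectors(key) {
  return [`meta[property="og:${key}"]`, `meta[name="og:${key}"]`];
}

// Hypothetical helper: return the content of the first matching tag.
function findOgContent(doc, key) {
  for (const selector of ogSelectors(key)) {
    const element = doc.querySelector(selector);
    if (element) return element.getAttribute('content');
  }
  return undefined;
}

console.log(ogSelectors('image'));
// → [ 'meta[property="og:image"]', 'meta[name="og:image"]' ]
```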

I was a bit disappointed when I tried scraping that page and got an empty images[] array, especially when it looks like the correct open graph tags were there (except the small bit about the invalid markup).

$ http https://metadata.dev.mozaws.net/v1/metadata urls:='["http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm"]' -j -v

POST /v1/metadata HTTP/1.1
Accept: application/json
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 130
Content-Type: application/json; charset=utf-8
Host: metadata.dev.mozaws.net
User-Agent: HTTPie/0.9.1

{
    "urls": [
        "http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm"
    ]
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 904
Content-Type: application/json; charset=utf-8
Date: Wed, 24 Aug 2016 19:24:29 GMT
ETag: W/"388-7++HeucCZxyUJSPyNTcEwA"
Server: nginx/1.9.9
Strict-Transport-Security: max-age=31536000

{
    "request_error": "",
    "url_errors": {},
    "urls": {
        "http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm": {
            "description": "Buy a used 2012 Ford Fiesta Hatchback VIN: 3FADP4FJ0CM129464 from Toyota Sunnyvale in Sunnyvale, CA.Call 408-831-0005 for more information about Stock#TC11688. Proudly serving San Jose, Fremont & Palo Alto.",
            "favicon_url": "http://www.toyotasunnyvale.com/v8/global/images/site-favicon-default.ico?1356028138000",
            "images": [],
            "original_url": "http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm",
            "title": "Used 2012 Ford Fiesta SES For Sale in Sunnyvale CA & the Bay Area | Serving San Jose, Fremont & Palo Alto | VIN: 3FADP4FJ0CM129464",
            "url": "http://www.toyotasunnyvale.com/used/Ford/2012-Ford-Fiesta-sunnyvale-bay-area-ca-8ae3ecd20a0e0adf41c9e44a5ea8b80b.htm"
        }
    }
}

Add getMetadata tests

Test the getMetadata function.

Handle duplicate meta tags?

Another fascinating look into the life of @pdehaan...

Scraping http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/ returns some unexpected results. Not sure if we can make this better, or whether it falls under the "don't try and fix content" banner.

Like most WordPress sites, the user is probably using 9 different plugins which are trying to improve SEO and garbage like that. This leaves us with some interesting and conflicting data.
By my guess, we yank the first matching DOM rule and ignore the rest, but in this edgey edge case that seems a bit suboptimal.

<meta> tags include, but are not limited to:

<!-- Open Graph tags generated by Open Graph Metabox for WordPress -->
<meta property="fb:app_id" content="689951950" />
<meta property="og:description" content="&nbsp;" />
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg" />
<meta property="og:title" content="Easy Falafel Recipe Made in a Vitamix or Blendtec Blender" />
<meta property="og:type" content="blog" />
<meta property="og:url" content="http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/" />
<!-- /Open Graph tags generated by Open Graph Metabox for WordPress -->

<!-- WordPress Facebook Open Graph protocol plugin (WPFBOGP v2.0.13) http://rynoweb.com/wordpress-plugins/ -->
<meta property="fb:admins" content="689951950"/>
<meta property="og:description" content="Check out our latest Blender Giveaway!  EASY FALAFEL RECIPE If you&#039;re unfamiliar with this Middle Eastern street food, you&#039;re in luck! This relatively easy v"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:image" content="http://www.blenderbabes.com/wp-content/uploads/Easy-Falafel-Recipe.jpg"/>
<meta property="og:locale" content="en_us"/>
<meta property="og:site_name" content="Blender Babes - Healthy Smoothie Recipes | Blendtec vs Vitamix Reviews"/>
<meta property="og:title" content="Easy Falafel Recipe"/>
<meta property="og:type" content="article"/>
<meta property="og:url" content="http://www.blenderbabes.com/lifestyle-diet/dairy-free/lower-calories-lunch-falafel-recipe/"/>
<!-- // end wpfbogp -->

I sorted for easy reference, but the highlights are:

First cluster of meta tags returns &nbsp; for the og:description
Second cluster of meta tags returns a valid-ish (albeit spammy) og:description
First cluster returns blog for the og:type
Second cluster returns article for the og:type

This may not be worth the effort. The only solution I can think of is to return an array of all matching elements and then try to calculate what the "best" result is, either by content length or something else. Also in the hideous mess of <meta> tags above is an og:image which is duplicated four times with the same value, plus the confusing og:type issue above, where we'd really just need to randomly return a type, or return an array of values, or whatever.
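As a sketch of the "best by content length" idea: drop values that are empty once whitespace and non-breaking spaces are stripped, then keep the longest remaining one. The function name and the longest-wins heuristic are assumptions, not something the parser does today.

```javascript
// Hypothetical helper: choose the most useful of several duplicate values.
function pickBestValue(values) {
  const cleaned = values
    // Handle both the raw entity and the decoded character;
    // trim() also strips U+00A0 non-breaking spaces.
    .map(v => v.replace(/&nbsp;/g, ' ').trim())
    .filter(v => v.length > 0);

  if (cleaned.length === 0) return undefined;
  // Heuristic: the longest candidate wins.
  return cleaned.reduce((best, v) => (v.length > best.length ? v : best));
}

console.log(pickBestValue(['&nbsp;', 'Check out our latest Blender Giveaway!']));
// → Check out our latest Blender Giveaway!
```

Duplicates with identical values, such as the repeated og:image above, collapse naturally because any one of them is as good as another.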

Optimise webpack bundle

Ref #68 (review)

This isn't ideal since it doesn't do any compressing. Richard suggested using:
$ webpack --optimize-minimize --optimize-dedupe
But when I run it here it throws some errors I have to dig into...

$ npm run clientize

> [email protected] clientize /Users/pdehaan/dev/github/mozilla/page-metadata-parser
> webpack

[BABEL] Note: The code generator has deoptimised the styling of "/Users/pdehaan/dev/github/mozilla/page-metadata-parser/node_modules/wu/dist/wu.js" as it exceeds the max of "100KB".
Hash: fb2fc43f052c74004a87
Version: webpack 1.13.2
Time: 7381ms
                         Asset    Size  Chunks             Chunk Names
page-metadata-parser.bundle.js  490 kB       0  [emitted]  main
   [0] multi main 40 bytes {0} [built]
    + 308 hidden modules

Add a provider field

Parse the URL into a top-level name with 'www' and any TLD removed, e.g. 'https://www.amazon.ca/product?id=123' becomes 'amazon'.
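A naive sketch of that parsing, with loud caveats: it treats only the last hostname label as the TLD, so multi-part suffixes such as "co.uk" would need extra handling (for example a public suffix list), and joining leftover labels with a space is an arbitrary choice. The function name is hypothetical.

```javascript
// Hypothetical helper: derive a short provider name from a URL.
function getProvider(pageUrl) {
  const labels = new URL(pageUrl).hostname.split('.');
  if (labels[0] === 'www') labels.shift(); // drop a leading "www"
  if (labels.length > 1) labels.pop(); // naive: last label is the TLD
  return labels.join(' ');
}

console.log(getProvider('https://www.amazon.ca/product?id=123'));
// → amazon
```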

Getting undefined url response on "bad" request

I mistakenly tried to curl https://www.cnn.com (which mysteriously doesn't support HTTPS, only HTTP) and it's giving me undefined in the response JSON.

Maybe just a case of "don't do that" and we can close this, but it looked weird and maybe we should be returning an error (even though we get an "HTTP/1.1 200 OK" response):

$ curl -i -XPOST -H "content-type: application/json" -d '{"urls": ["https://www.cnn.com"]}' http://localhost:7001 | JSON

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8
cache-control: no-cache
content-length: 36
Date: Thu, 30 Jun 2016 20:01:40 GMT
Connection: keep-alive

{
  "error": "",
  "urls": {
    "undefined": {}
  }
}

Add keywords

We can extract and store the keywords metadata tag.

Audit dependencies

I think most of these are fine, but wasn't sure if we're using wu, or if it is safe to remove:

$ depcheck

Unused Dependencies
* karma-chrome-launcher
* wu

Unused devDependencies
* eslint-plugin-mozilla
* istanbul-instrumenter-loader
* karma-coverage
* karma-firefox-launcher
* karma-mocha
* karma-mocha-reporter
* karma-sourcemap-loader
* karma-webpack
* webpack

Add highres_icon_url rules

We need a separate set of rules for just high res icons for callers who want to only know if a high res icon is available for a page.

Add support for swiftype metadata

Example here:

https://techcrunch.com/2016/08/10/new-macbook-pro-with-touch-id-sensor-and-oled-mini-screen-is-coming-soon/:

Add CircleCI

Add circleci support for the project

Add rule set for 'provider' name

Embed.ly returns a provider name:
http://docs.embed.ly/docs/extract

That's something we want to use in the mobile UI of Activity Stream (if available). Also see mozilla/page-metadata-service#90.

Some potential sources are:

Meta tags

Open Graph (og:site_name)
application-name
twitter:site (Actually the twitter account, but sometimes a good fallback title)
al:android:app_name
al:ios:app_name
apple-mobile-web-app-title
twitter:app:name:iphone
twitter:app:name:ipad
twitter:app:name:googleplay

Other tags:

title of <link> tag for RSS/Atom feeds (Not always the best title or very feed specific)
title of <link> tag for OpenSearch (Sometimes search specific)

Secondary resources:

Return keywords as array instead of string?

Ref: #43 (comment)

Do we want to return the content as string value for the <meta name="keywords" content="foo bar" /> tag, or return as an array of [trimmed] keywords?

I'd probably vote for the latter; plus, it's more in line with what the Embedly proxy does:

"keywords": [
  {
      "name": "charcoal",
      "score": 95
  },
  {
      "name": "lemon",
      "score": 63
  },
  {
      "name": "drink",
      "score": 32
  },
  {
      "name": "lemonade",
      "score": 32
  },
  {
      "name": "water",
      "score": 29
  },
  {
      "name": "juice",
      "score": 25
  },
  {
      "name": "detox",
      "score": 24
  },
  {
      "name": "toxins",
      "score": 21
  },
  {
      "name": "activated",
      "score": 20
  },
  {
      "name": "detoxifier",
      "score": 20
  }
],

$ http https://embedly-proxy.services.mozilla.com/v2/extract urls:='["http://yogimami.com/Empire/tag/homemade-charcoal-lemonade-juice/"]' -j -v
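For the array option, a sketch of the trimming could be as simple as the following. It assumes the content attribute is comma-separated, which is the common convention for the keywords meta tag; a value without commas, like "foo bar" above, would come back as a single keyword. The function name is hypothetical, and unlike Embedly it produces plain strings without scores.

```javascript
// Hypothetical helper: turn a keywords content string into an array.
function parseKeywords(content) {
  return String(content || '')
    .split(',') // assumption: keywords are comma-separated
    .map(keyword => keyword.trim())
    .filter(keyword => keyword.length > 0);
}

console.log(parseKeywords('charcoal, lemon ,, drink'));
// → [ 'charcoal', 'lemon', 'drink' ]
```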

Add support for `link rel="image_src"`?

Randomly saw this, and surprisingly is/was a thing:

<link rel="image_src" href="http://uncrate.com/p/2016/08/pf-flyer.jpg" />

— via http://uncrate.com/stuff/pf-flyers-x-todd-snyder-grounder/

That site also has the proper open graph tags, so we find the thumbnail anyways, but not sure if we want to add this [here] as yet another fallback.

const imageRules = buildRuleset('image', [
  ['meta[property="og:image:secure_url"]', node => node.element.content],
  ['meta[property="og:image:url"]', node => node.element.content],
  ['meta[property="og:image"]', node => node.element.content],
  ['meta[property="twitter:image"]', node => node.element.content],
  ['meta[name="thumbnail"]', node => node.element.content],
]);
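If we did add it, the extra entry could follow the same [selector, extractor] shape as the rules above and sit at the end of the list as the lowest-priority fallback. This is only a sketch: the constant name is made up, it reads href via getAttribute rather than as a property, and the stand-in node exists purely so the snippet runs on its own.

```javascript
// Hypothetical extra rule in the same [selector, extractor] shape.
const imageSrcRule = [
  'link[rel="image_src"]',
  node => node.element.getAttribute('href'),
];

// Stand-in for a matched node, only for demonstration.
const fakeNode = {
  element: {
    getAttribute: name =>
      name === 'href' ? 'http://uncrate.com/p/2016/08/pf-flyer.jpg' : null,
  },
};

const [selector, extract] = imageSrcRule;
console.log(selector, extract(fakeNode));
```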

Ref: http://stackoverflow.com/questions/19274463/what-is-link-rel-image-src
Ref: http://www.niallkennedy.com/blog/2009/03/enhanced-social-share.html
