url-inspector

Synopsis

npx url-inspector <url>

Description

Get normalized metadata about what a URL mainly represents.

This is a Node.js module.

Sources of information:

  • HTTP response headers
  • embedded tags in binary formats (using exiftool)
  • OpenGraph, Twitter Cards, schema.org, JSON-LD, title and meta tags in HTML pages
  • oEmbed endpoints
  • if a URL mainly wraps a media resource, that media might be inspected too

Inspection stops when enough information has been gathered, or when a maximum number of bytes (depending on media type) have been downloaded.

Format

  • url: url of the inspected resource
  • title: title of the resource, or filename, or last component of pathname with query
  • description: optional longer description, without title in it, and only the first line.
  • site: the name of the site, or the domain name
  • mime: RFC 7231 mime type of the resource (defaults to Content-Type). The inspected mime type could be more accurate than the HTTP header.
  • ext: The file extension, only derived from the mime type. Safe to be used as file extension.
  • what: what the resource represents: page, image, video, audio, file
  • type: how the resource is used: link, image, video, audio, embed. Example: if what:image and mime:text/html, and no html snippet is found, type will be 'link'.
  • html: the html representation of the resource, according to type and use.
  • script: url of a script that must be installed along with the html representation.
  • date: creation or modification date (YYYY-MM-DD format)
  • author: optional credit, author (without the @ prefix and with _ replaced by spaces)
  • keywords: optional array of collected keywords (lowercased words that are not in title words).
  • size: optional Content-Length as integer; discarded when type is embed
  • icon: optional link to the favicon of the site
  • width, height: optional dimensions as integers
  • duration: optional hh:mm:ss string
  • thumbnail: optional URL of a thumbnail, could be a data-uri for embedded images
  • source: optional URL that can go in a 'src' attribute; for example a resource can be an html page representing an image type. The URL of the image itself would be stored here; same thing for audio, video, embed types.
  • error: optional http error code, or string
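
As a rough illustration of how a caller might consume these fields, the sketch below uses a hand-written sample object (not real url-inspector output) and two hypothetical helpers, durationToSeconds and pickSrc. Neither helper is part of url-inspector; they only assume the field shapes listed above.

```javascript
// Hand-written sample shaped like the fields above; not real url-inspector output.
const sample = {
  url: 'https://example.com/watch/42',
  type: 'video',
  mime: 'text/html',
  duration: '2:10:23',
  source: 'https://example.com/media/42.mp4'
};

// Hypothetical helper: convert an "hh:mm:ss" duration string to seconds.
function durationToSeconds(duration) {
  return duration.split(':').reduce((total, part) => total * 60 + Number(part), 0);
}

// Hypothetical helper: prefer `source` for a 'src' attribute, else fall back to `url`.
function pickSrc(meta) {
  return meta.source ?? meta.url;
}

console.log(durationToSeconds(sample.duration)); // 7823
console.log(pickSrc(sample)); // https://example.com/media/42.mp4
```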

Install

url-inspector currently requires these external libraries/tools:

  • exiftool
  • libcurl (and libcurl-dev if node-libcurl needs to be rebuilt)

Both are well-maintained, and available in most Linux distributions.

Usage

import Inspector from 'url-inspector';

const opts = {
 ua: "Mozilla/5.0", // override ua, defaults to somewhat modern browser
 nofavicon: false, // disable additional requests to get a favicon
 nosource: false, // disable main embedded media sub-inspection
 file: true, // local files inspection is only enabled by default when using CLI
 meta: {}, // user-entered metadata, to be merged and normalized
 providers: null // custom providers (module path or array)
};

const inspector = new Inspector(opts);

const obj = await inspector.look(url);

Inspector throws http-errors instances.
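
Since errors are thrown, a caller may want to catch them and keep going. The following is a minimal sketch, not part of url-inspector: safeLook is a hypothetical wrapper that accepts any object with an async look(url) method and folds a thrown error into an object with an error field. It relies on http-errors instances exposing a statusCode property.

```javascript
// Hypothetical wrapper: turn a thrown error into an object carrying an `error` field.
// `inspector` is anything with an async look(url) method.
async function safeLook(inspector, url) {
  try {
    return await inspector.look(url);
  } catch (err) {
    // http-errors instances expose `statusCode`; other errors only have a message.
    return { url, error: err.statusCode ?? err.message };
  }
}
```

With the inspector created above, this would be called as await safeLook(inspector, url).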

oEmbed providers are:

  • found from a curated list of providers
  • found from a custom list, required from opts.providers
  • discovered in the inspected web pages

It is possible to add custom providers in the options, by passing an array or a path to a module exporting an array.

See src/custom-oembed-providers.js for examples.

To normalize an already existing metadata object, including url rewriting done by providers, and other changes in fields, do:

await inspector.norm(obj);

url-inspector uses node-libcurl to make http requests, and exposes it as:

const req = await Inspector.get(urlObj);

where req.abort() stops the request, req.res is the response stream, and req.res.statusCode and req.res.headers are available.
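
A byte-limited read on top of this interface could look like the sketch below. readHead is a hypothetical helper, not part of url-inspector, and it assumes req.res can be consumed with for await and yields Buffers, as Node.js Readable streams do.

```javascript
// Hypothetical helper: collect at most maxBytes from req.res, then abort the request.
// Assumes req.res is an async-iterable stream of Buffers (true for Node.js Readable streams).
async function readHead(req, maxBytes) {
  const chunks = [];
  let total = 0;
  for await (const chunk of req.res) {
    chunks.push(chunk);
    total += chunk.length;
    if (total >= maxBytes) {
      req.abort(); // stop downloading once the limit is reached
      break;
    }
  }
  // The last chunk may overshoot the limit, so trim the result.
  return Buffer.concat(chunks).subarray(0, maxBytes);
}
```

It would be used as const head = await readHead(await Inspector.get(urlObj), 4096); where the 4096 limit is an arbitrary example value.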

Proxy support

url-inspector configures http(s) proxies through the proxy-from-env package and environment variables (http_proxy, https_proxy, all_proxy, no_proxy).

See the proxy-from-env documentation for details.

License

Open Source, see ./LICENSE.

url-inspector's Issues

Inconsistent results on some urls

url-inspector reports incorrect data on some urls.
For example:
./url-inspector.js http://www.lavieenbois.com/
does not find any title and returns:

{
  "url": "http://www.lavieenbois.com/",
  "mime": "text/html; charset=ISO-8859-1",
  "type": "link",
  "size": 4070,
  "icon": "http://www.lavieenbois.com/favicon.ico",
  "site": "www.lavieenbois.com",
  "ext": "html",
  "html": "<a href=\"http://www.lavieenbois.com/\">undefined</a>"
}

and ./url-inspector.js https://myspace.com/unefemmemariee/music/songs returns:

{
  "url": "https://myspace.com/unefemmemariee/music/songs",
  "mime": "text/html; charset=utf-8",
  "type": "audio",
  "size": 108104,
  "title": "UNE FEMME MARIÉE",
  "icon": "https://x.myspacecdn.com/new/common/images/favicons/favicon.ico",
  "thumbnail": "https://a2-images.myspacecdn.com/images03/31/31cc9883f6e14a18b96e4ea5a8f82a83/600x600.jpg",
  "site": "myspace",
  "ext": "html",
  "html": "<audio src=\"https://myspace.com/unefemmemariee/music/songs\"></audio>"
}

where type is audio whereas mime is "text/html".
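
A caller can detect this kind of mismatch on its side. The sketch below is a hypothetical check, not part of url-inspector: hasSuspiciousType flags results whose type is a media type while mime is text/html and no source URL is present, which is the situation in the second result above.

```javascript
// Hypothetical check: a media `type` backed only by an HTML page, without a `source`, is suspicious.
function hasSuspiciousType(meta) {
  const mediaTypes = ['image', 'video', 'audio'];
  const isHtml = (meta.mime ?? '').startsWith('text/html');
  return mediaTypes.includes(meta.type) && isHtml && !meta.source;
}

console.log(hasSuspiciousType({ type: 'audio', mime: 'text/html; charset=utf-8' })); // true
console.log(hasSuspiciousType({ type: 'link', mime: 'text/html; charset=ISO-8859-1' })); // false
```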

Getting spawn exiftool error

Hi,
I am getting the error below after installing url-inspector and starting node. My node version is 6.10.2 and the OS is Windows 10.
Error: spawn exiftool ENOENT
at exports._errnoException (util.js:1018:11)
at Process.ChildProcess._handle.onexit (internal/child_process.js:193:32)
at onErrorNT (internal/child_process.js:367:16)
at _combinedTickCallback (internal/process/next_tick.js:80:11)
at process._tickDomainCallback (internal/process/next_tick.js:128:9)
at Module.runMain (module.js:606:11)
at run (bootstrap_node.js:393:7)
at startup (bootstrap_node.js:150:9)
at bootstrap_node.js:508:3

example with server and client showing inspected resources

<div class="inspector-thumbnail">
    <div class="inspector-header">
        <img src="image.svg" class="inspector-type" />
        <div class="inspector-title">Title</div>
    </div>
    <div class="inspector-sample">
        <img src="favicon.png" class="inspector-favicon" />
        <img src="sample.png" class="inspector-image" />
    </div>
    <div class="inspector-footer">
        <div class="inspector-dimensions">1024x768</div>
        <div class="inspector-duration">2:10:23</div>
        <div class="inspector-size">12 MB</div>
        <div class="inspector-modified">Ten minutes ago</div>
    </div>
</div>

Refactor: URL

This would avoid many pitfalls of url.parse:

const obj = new URL(str);
obj.searchParams.set("test", 1);
obj.href

works as expected, without surprises...
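
A slightly fuller sketch of the same API, using the standard WHATWG URL class built into Node.js. The example.com URL is made up for illustration; the second call shows relative resolution against a base URL, which is what favicon discovery needs.

```javascript
// Parse, edit the query string, and serialize again.
const parsed = new URL('http://example.com/path?a=1#frag');
parsed.searchParams.set('test', '1');
console.log(parsed.href); // http://example.com/path?a=1&test=1#frag
console.log(parsed.hostname); // example.com
console.log(parsed.pathname); // /path

// Resolve a relative reference against a base URL.
const icon = new URL('/favicon.ico', 'http://www.lavieenbois.com/');
console.log(icon.href); // http://www.lavieenbois.com/favicon.ico
```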

get picture from mp3

mp3 files have ID3 tags, which exiftool reports as:

  "PictureMIMEType": "image/jpeg",
  "PictureType": "Front Cover",
  "PictureDescription": "Mike_Cahen.jpg",
  "Picture": "base64:/9j/4Q8NRXhpZgAATU0AKgAAAAgADAEAAAMAAAABBAAAAAEBAAMAAAABAkAAAAECAAMAAAADAAAAngEGAAMAAAABAAIAAAESAAMAAAABAAEAAAEVAAMAAAABAAMAAAEaAAUAAAABAAAApAEbAAUAAAABAAAArAEoAAMAAAABAAIAAAExAAIAAAAkAAAAtAEyAAIAAAAUAAAA2

so url-inspector can build a data-uri out of this very easily.
Depends on issue #6.
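
A minimal sketch of that conversion, assuming the tags arrive as an object with the PictureMIMEType and Picture keys shown above and that Picture carries a "base64:" prefix. pictureToDataUri is a hypothetical name, not an existing url-inspector function.

```javascript
// Hypothetical helper: build a data-URI from exiftool-style picture tags.
// Assumes `Picture` is a string with a "base64:" prefix, as in the excerpt above.
function pictureToDataUri(tags) {
  const prefix = 'base64:';
  if (!tags.PictureMIMEType || typeof tags.Picture !== 'string' || !tags.Picture.startsWith(prefix)) {
    return null; // no usable embedded picture
  }
  return `data:${tags.PictureMIMEType};base64,${tags.Picture.slice(prefix.length)}`;
}
```

The result could then be stored in the thumbnail field, which already allows data-uris for embedded images.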

Dockerfile and backends

  • url-inspector is a microservice that should be deployable in the cloud
  • configurable http clients as backends are needed to overcome origin restrictions (IP, countries, etc.)
