Git Product home page Git Product logo

page-meta's Introduction

Page Meta ๐Ÿ•ต

Get detailed info for any URL on the internet

Page Meta is a PHP library than can retrieve detailed info on any URL from the internet! It uses data from HTML meta tags and OpenGraph with fallback to detailed HTML scraping.

Highlights

  • Works for any valid URL on the internet!
  • Follows page redirects
  • Uses all scraping methods available: HTML tags, OpenGraph, Schema data

Potention use cases

  • Display Info Cards for links in a article
  • Rich preview for links in messaging apps
  • Extract info from a user-submitted URL

How to use

Installation

Add layered/page-meta as a dependency in your project's composer.json file:

$ composer require layered/page-meta

Usage

Create a UrlPreview instance, then call loadUrl($url) method with your URL as first argument. Preview data is retrieved with get($section) or getAll() methods:

require 'vendor/autoload.php';

$preview = new Layered\PageMeta\UrlPreview;
$preview->loadUrl('https://www.instagram.com/p/BbRyo_Kjqt1/');

$allPageData = $preview->getAll();	// contains all scraped data
$siteInfo = $preview->get('site');	// get general info about the website

Returned data

Returned data will be an Array with following format:

{
	"site": {
		"secure":		true,
		"url":			"https:\/\/www.instagram.com",
		"icon":			"https:\/\/www.instagram.com\/static\/images\/ico\/favicon-192.png\/b407fa101800.png",
		"language":		"en",
		"responsive":	true,
		"name":			"Instagram"
	},
	"page": {
		"type":			"photo",
		"url":			"https:\/\/www.instagram.com\/p\/BbRyo_Kjqt1\/",
		"title":		"GitHub on Instagram",
		"description":	"There\u2019s still time to join the #GitHubGameOff and build a game inspired by throwbacks. Get started\u2026",
		"image":		{
			"url": "https:\/\/scontent-mad1-1.cdninstagram.com\/vp\/73b1790d77548031327e64ee83196706\/5B4AD567\/t51.2885-15\/e35\/23421974_1768724519826754_3855913942043852800_n.jpg"
		}
	},
	"author": {
		"name":			"GitHub",
		"handle":		"@github",
		"url":			"https:\/\/www.instagram.com\/github\/"
	},
	"app_links": {
		"ios": {
			"url": "nflx:\/\/www.netflix.com\/title\/80014749",
			"app_store_id": "363590051",
			"app_name": "Netflix",
			"store_url": "https:\/\/itunes.apple.com\/us\/app\/Netflix\/id363590051"
		},
		"android": {
			"url": "nflx:\/\/www.netflix.com\/title\/80014749",
			"package": "com.netflix.mediaclient",
			"app_name": "Netflix",
			"store_url": "https:\/\/play.google.com\/store\/apps\/details?id=com.netflix.mediaclient"
		}
	}
}

Public API

UrlPreview class provides the following public methods:

__construct(array $headers)

Start the UrlPreview instance. Pass extra headers to send when requesting the page URL

Returns: UrlPreview instance

loadUrl(string $url)

Load and start the scrape process for any valid URL

Returns: UrlPreview instance

getAll()

Get all data scraped from page

Return: Array with scraped data in following format

  • site - info about the website
    • url - main site URL
    • name - site name, ex: 'Instagram' or 'Medium'
    • secure - Boolean true|false depending on http connection
    • responsive - Boolean true|false. True if site has viewport meta tag present. Basic check for responsiveness
    • icon - site icon
    • language - ISO 639-1 language code, ex: en, es
  • page - info about the page at current URL
    • type - page type, ex: website, article, profile, video, etc
    • url - canonical URL for the page
    • title - page title
    • description - page description
    • image - Array containing image info, if present:
      • url - image URL
      • width - image width
      • height - image width
    • video - Array containing video info, if found on page:
      • url - video URL
      • width - video width
      • height - video width
  • author - info about the content author, ex:
    • name - Author's name on a blog, person's name on social network sites
    • handle - Social media site username
    • url - Author URL for more articles or Profile URL on social network sites
  • app_links - Array containing apps linked to page, like:
    • ios - iOS app
      • url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
      • app_store_id - Apple AppStore app ID
      • app_name - name of the app
      • store_url - link to installable app
    • android - Android app
      • url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
      • package - Android PlayStore app ID
      • app_name - name of the app
      • store_url - link to installable app

get(string $section)

Get data in one scraped section site, page, profile or app_links

Return: Array with section scraped data. See getAll for data format

addListener(string $eventName, callable $listener, int $priority = 0)

Attach an event on UrlPreview for data processing or scrape process. Arguments:

  • $eventName - on which event to listen, available:
    • page.scrape - fired when the scraping process starts
    • data.filter - fired when data is requested by getData() or getAll() methods
  • $listener - a callable reference, which will get the $event parameter with available data
  • $priority - order on which the callable should be executed

Extending the library

If there's need to more scraped data for a URL, more functionality can be attached to PageMeta library. Example for returing the 'Terms and Conditions' link from pages:

use Symfony\Component\EventDispatcher\Event;

$previewer = new \Layered\PageMeta\UrlPreview;
$previewer->addListener('page.scrape', function(Event $event) {
	$currentScrapedData = $event->getData();	// check data from other scrapers
	$crawler = $event->getCrawler();			// instance of DomCrawler Symfony Component
	$termsLink = '';

	$crawler->filter('a[href*=terms]')->each(function($node) use(&$termsLink) {
		$termsLink = $node->attr('href');
	});

	// forwards the scraped data
	$event->addData('site', [
		'termsLink'	=>	$termsLink
	]);
});
$previewer->loadUrl('http://github.com');

More

Please report any issues here on GitHub.

Any contributions are welcome

page-meta's People

Contributors

andreiigna avatar vlafranca avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.