Page Meta is a PHP library that can retrieve detailed info on any URL from the internet! It uses data from HTML meta tags and OpenGraph, with a fallback to detailed HTML scraping.
- Works for any valid URL on the internet!
- Follows page redirects
- Uses all scraping methods available: HTML tags, OpenGraph, Schema data
- Display Info Cards for links in an article
- Rich preview for links in messaging apps
- Extract info from a user-submitted URL
Add layered/page-meta as a dependency in your project's composer.json file:
$ composer require layered/page-meta
Create a UrlPreview instance, then call the loadUrl($url) method with your URL as the first argument. Preview data is retrieved with the get($section) or getAll() methods:
require 'vendor/autoload.php';
$preview = new Layered\PageMeta\UrlPreview;
$preview->loadUrl('https://www.instagram.com/p/BbRyo_Kjqt1/');
$allPageData = $preview->getAll(); // contains all scraped data
$siteInfo = $preview->get('site'); // get general info about the website
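When the URL comes from user input, it can help to check it before handing it to loadUrl($url). The following is a minimal sketch, not part of Page Meta: the helper name isFetchableUrl is hypothetical, and the choice to accept only absolute http and https URLs is an assumption about what your application wants to fetch.

```php
<?php
// Hypothetical helper (not part of Page Meta): accept only absolute http(s) URLs.
function isFetchableUrl(string $url): bool
{
    // filter_var() rejects strings that are not syntactically valid URLs.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Valid URLs may still use other schemes (ftp, mailto, ...), so check explicitly.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

    return in_array($scheme, ['http', 'https'], true);
}

var_dump(isFetchableUrl('https://www.instagram.com/p/BbRyo_Kjqt1/')); // bool(true)
var_dump(isFetchableUrl('ftp://example.com/file.txt'));               // bool(false)
var_dump(isFetchableUrl('not a url'));                                // bool(false)
```

Only URLs that pass such a check would then be passed on to loadUrl($url).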
The returned data is an Array with the following format (shown here JSON-encoded):
{
    "site": {
        "secure": true,
        "url": "https:\/\/www.instagram.com",
        "icon": "https:\/\/www.instagram.com\/static\/images\/ico\/favicon-192.png\/b407fa101800.png",
        "language": "en",
        "responsive": true,
        "name": "Instagram"
    },
    "page": {
        "type": "photo",
        "url": "https:\/\/www.instagram.com\/p\/BbRyo_Kjqt1\/",
        "title": "GitHub on Instagram",
        "description": "There\u2019s still time to join the #GitHubGameOff and build a game inspired by throwbacks. Get started\u2026",
        "image": {
            "url": "https:\/\/scontent-mad1-1.cdninstagram.com\/vp\/73b1790d77548031327e64ee83196706\/5B4AD567\/t51.2885-15\/e35\/23421974_1768724519826754_3855913942043852800_n.jpg"
        }
    },
    "author": {
        "name": "GitHub",
        "handle": "@github",
        "url": "https:\/\/www.instagram.com\/github\/"
    },
    "app_links": {
        "ios": {
            "url": "nflx:\/\/www.netflix.com\/title\/80014749",
            "app_store_id": "363590051",
            "app_name": "Netflix",
            "store_url": "https:\/\/itunes.apple.com\/us\/app\/Netflix\/id363590051"
        },
        "android": {
            "url": "nflx:\/\/www.netflix.com\/title\/80014749",
            "package": "com.netflix.mediaclient",
            "app_name": "Netflix",
            "store_url": "https:\/\/play.google.com\/store\/apps\/details?id=com.netflix.mediaclient"
        }
    }
}
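As an illustration of the "Info Cards" use case, here is a minimal sketch that turns an array in the format above into a small HTML card. The function renderInfoCard and the info-card CSS class are hypothetical and not part of Page Meta, and the sample array is hand-written rather than scraped. Every field is read defensively, on the assumption that any key may be missing for a given page.

```php
<?php
// Hypothetical helper (not part of Page Meta): render a minimal HTML card
// from an array shaped like the one returned by getAll().
function renderInfoCard(array $data): string
{
    // Escape every value before placing it into HTML.
    $escape = static function ($value): string {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    // Any of these keys may be absent, so fall back to safe defaults.
    $title = $data['page']['title'] ?? '';
    $url = $data['page']['url'] ?? ($data['site']['url'] ?? '');
    $description = $data['page']['description'] ?? '';
    $siteName = $data['site']['name'] ?? '';
    $image = $data['page']['image']['url'] ?? null;

    $html = '<a class="info-card" href="' . $escape($url) . '">';
    if ($image !== null) {
        $html .= '<img src="' . $escape($image) . '" alt="">';
    }
    $html .= '<strong>' . $escape($title) . '</strong>';
    $html .= '<p>' . $escape($description) . '</p>';
    $html .= '<small>' . $escape($siteName) . '</small>';
    $html .= '</a>';

    return $html;
}

// Hand-written sample data in the documented shape (values shortened).
$sample = [
    'site' => ['name' => 'Instagram', 'url' => 'https://www.instagram.com'],
    'page' => [
        'url' => 'https://www.instagram.com/p/BbRyo_Kjqt1/',
        'title' => 'GitHub on Instagram',
        'description' => 'Join the #GitHubGameOff',
    ],
];

echo renderInfoCard($sample), PHP_EOL;
```

In real use, the argument would be the result of $preview->getAll() instead of the sample array.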
The UrlPreview class provides the following public methods:
Constructor
Start the UrlPreview instance. Pass extra headers to send when requesting the page URL
Returns: UrlPreview instance
loadUrl($url)
Load and start the scrape process for any valid URL
Returns: UrlPreview instance
getAll()
Get all data scraped from page
Returns: Array with scraped data in the following format:
- site - info about the website
  - url - main site URL
  - name - site name, ex: 'Instagram' or 'Medium'
  - secure - Boolean true|false, true when the site is served over HTTPS
  - responsive - Boolean true|false. True if the site has a viewport meta tag present. Basic check for responsiveness
  - icon - site icon
  - language - ISO 639-1 language code, ex: en, es
- page - info about the page at current URL
  - type - page type, ex: website, article, profile, video, etc
  - url - canonical URL for the page
  - title - page title
  - description - page description
  - image - Array containing image info, if present:
    - url - image URL
    - width - image width
    - height - image height
  - video - Array containing video info, if found on page:
    - url - video URL
    - width - video width
    - height - video height
- author - info about the content author, ex:
  - name - Author's name on a blog, person's name on social network sites
  - handle - Social media site username
  - url - Author URL for more articles or Profile URL on social network sites
- app_links - Array containing apps linked to page, like:
  - ios - iOS app
    - url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - app_store_id - Apple AppStore app ID
    - app_name - name of the app
    - store_url - link to installable app
  - android - Android app
    - url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - package - Android PlayStore app ID
    - app_name - name of the app
    - store_url - link to installable app
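Several of the fields listed above are only present when the page provides them (image, video, app links), so reading nested keys directly can fail. One way to handle this, sketched below, is a small accessor that walks a dot-separated path and returns a default when any step is missing. The helper previewValue is hypothetical and not part of Page Meta, and the sample array is hand-written.

```php
<?php
// Hypothetical helper (not part of Page Meta): read a nested value from the
// scraped data by dot-separated path, returning $default when a key is missing.
function previewValue(array $data, string $path, $default = null)
{
    $current = $data;
    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return $default;
        }
        $current = $current[$key];
    }

    return $current;
}

// Hand-written sample in the documented shape; note there is no video or android entry.
$data = [
    'page' => [
        'title' => 'Example',
        'image' => ['url' => 'https://example.com/a.jpg'],
    ],
    'app_links' => [
        'ios' => ['app_name' => 'Netflix'],
    ],
];

$imageUrl = previewValue($data, 'page.image.url');               // 'https://example.com/a.jpg'
$videoUrl = previewValue($data, 'page.video.url', '');           // '' (no video on this page)
$storeUrl = previewValue($data, 'app_links.android.store_url');  // null (no Android app linked)
```

For one or two lookups, PHP's null coalescing operator (for example $data['page']['image']['url'] ?? null) gives the same safety without a helper.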
get($section)
Get data from one scraped section: site, page, author or app_links
Returns: Array with the section's scraped data. See getAll() for the data format
addListener($eventName, $listener, $priority)
Attach an event listener on UrlPreview for data processing or the scrape process. Arguments:
- $eventName - on which event to listen, available:
  - page.scrape - fired when the scraping process starts
  - data.filter - fired when data is requested by getData() or getAll() methods
- $listener - a callable reference, which will get the $event parameter with available data
- $priority - order in which the callable should be executed
If more scraped data is needed for a URL, extra functionality can be attached to the PageMeta library. Example for returning the 'Terms and Conditions' link from pages:
use Symfony\Component\EventDispatcher\Event;

$previewer = new \Layered\PageMeta\UrlPreview;
$previewer->addListener('page.scrape', function(Event $event) {
    $currentScrapedData = $event->getData(); // check data from other scrapers
    $crawler = $event->getCrawler(); // instance of DomCrawler Symfony Component

    $termsLink = '';
    $crawler->filter('a[href*=terms]')->each(function($node) use(&$termsLink) {
        $termsLink = $node->attr('href');
    });

    // forwards the scraped data
    $event->addData('site', [
        'termsLink' => $termsLink
    ]);
});
$previewer->loadUrl('http://github.com');
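To make the selection logic in the listener easier to follow, here is a standalone sketch that applies the same rule to a hand-written HTML snippet using PHP's built-in DOM extension instead of the DomCrawler component. The CSS selector a[href*=terms] matches links whose href contains "terms"; the XPath expression below is an equivalent written for DOMXPath. The HTML and its paths are made up for the example.

```php
<?php
// Hand-written HTML standing in for a fetched page.
$html = '<html><body>'
    . '<a href="/about">About</a>'
    . '<a href="/site/terms">Terms</a>'
    . '<a href="/privacy">Privacy</a>'
    . '</body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// XPath equivalent of the CSS selector a[href*=terms].
$termsLink = '';
foreach ($xpath->query('//a[contains(@href, "terms")]') as $node) {
    // Each match overwrites the previous one, so the last match wins,
    // mirroring the each() callback in the listener.
    $termsLink = $node->getAttribute('href');
}

echo $termsLink, PHP_EOL; // /site/terms
```

Note that the href is returned as written in the page, so relative links like this one would still need to be resolved against the site URL.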
Please report any issues here on GitHub.