
sitemapparser's Introduction


XML Sitemap parser

An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.

The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.


Features

Formats supported

  • XML .xml
  • Compressed XML .xml.gz
  • Robots.txt rule sheet robots.txt
  • Line separated text (disabled by default)

Requirements:

Installation

The library is available via Composer. Just add this to your composer.json file:

{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}

Then run composer update.
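Alternatively, the same dependency can be added from the command line in one step (standard Composer usage; no manual edit of composer.json needed):

```shell
# Adds vipnytt/sitemapparser with the ^1.0 constraint and installs it
composer require vipnytt/sitemapparser:^1.0
```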

Getting Started

Basic example

Returns a list of URLs only.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Advanced

Returns all available tags, for both Sitemaps and URLs.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Recursive

Parses any sitemap detected while parsing, producing a complete list of URLs.

Use url_black_list to skip sitemaps that are referenced by a parent sitemap. Exact match only.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
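A minimal sketch of using url_black_list together with parseRecursive; the sitemap URL being skipped here is hypothetical and only illustrates that the match must be exact:

```php
<?php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

// Hypothetical child sitemap to skip; url_black_list entries must match exactly.
$config = [
    'url_black_list' => [
        'http://www.example.com/sitemap-news.xml',
    ],
];

try {
    $parser = new SitemapParser('MyCustomUserAgent', $config);
    // Any sitemap other than the blacklisted one is still parsed recursively.
    $parser->parseRecursive('http://www.example.com/robots.txt');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```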

Parsing of line separated text strings

Note: this is disabled by default to avoid false positives when XML is expected but plain text is fetched instead.

To disable strict standards, simply pass this configuration as constructor parameter #2: ['strict' => false].

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Throttling

  1. Install middleware:
composer require hamburgscleanest/guzzle-advanced-throttle
  2. Define host rules:
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        [
            'max_requests'     => 20,
            'request_interval' => 1
        ],
        [
            'max_requests'     => 100,
            'request_interval' => 120
        ]
    ]
]);
  3. Create handler stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
  4. Create middleware:
$throttle = new ThrottleMiddleware($rules);

// Invoke the middleware
$stack->push($throttle());

// OR: alternatively call the handle method directly
$stack->push($throttle->handle());
  5. Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
  6. Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);

More details about this middleware are available here
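The throttling steps above can be assembled into one snippet. Note the use statements for the throttle package are assumptions inferred from its package name and may differ; verify them against the hamburgscleanest/guzzle-advanced-throttle documentation:

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Handler\CurlHandler;
// Assumed namespaces -- check the guzzle-advanced-throttle package docs.
use hamburgscleanest\GuzzleAdvancedThrottle\Middleware\ThrottleMiddleware;
use hamburgscleanest\GuzzleAdvancedThrottle\RequestLimitRuleset;
use vipnytt\SitemapParser;

// At most 20 requests per second, and 100 requests per 120 seconds, to this host.
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        ['max_requests' => 20,  'request_interval' => 1],
        ['max_requests' => 100, 'request_interval' => 120],
    ],
]);

// Build a handler stack with the throttle middleware pushed onto it.
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push((new ThrottleMiddleware($rules))());

// Hand the throttled client to the parser.
$client = new Client(['handler' => $stack]);
$parser = new SitemapParser();
$parser->setClient($client);
```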

Automatic retry

  1. Install middleware:
composer require caseyamcl/guzzle_retry_middleware
  2. Create stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
  3. Add middleware to the stack:
$stack->push(GuzzleRetryMiddleware::factory());
  4. Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
  5. Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);

More details about this middleware are available here

Advanced logging

  1. Install middleware:
composer require gmponos/guzzle_logger
  2. Create a PSR-3 compatible logger:
$logger = new Logger();
  3. Create handler stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
  4. Push the logger middleware onto the stack:
$stack->push(new LogMiddleware($logger));
  5. Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
  6. Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);

More details about this middleware's configuration (log levels, when to log and what to log) are available here

Additional examples

Even more examples available in the examples directory.

Configuration

Available configuration options, with their default values:

$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
    // Use this to ignore URLs when parsing sitemaps that reference multiple other sitemaps. Exact match only.
    'url_black_list' => []
];
$parser = new SitemapParser('MyCustomUserAgent', $config);

If a User-agent is also set using the GuzzleHttp request options, it takes priority and replaces the User-agent passed to the constructor.
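A minimal sketch of that priority rule, using Guzzle's standard 'headers' request option (the User-agent strings here are illustrative):

```php
<?php
use vipnytt\SitemapParser;

// The User-Agent header set via the Guzzle request options wins over the
// one passed as the first constructor argument.
$config = [
    'guzzle' => [
        'headers' => [
            'User-Agent' => 'OverridingUserAgent', // this value is used
        ],
    ],
];
$parser = new SitemapParser('IgnoredUserAgent', $config);
```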

sitemapparser's People

Contributors

adamberryhuff, grzegorzdrozd, heathstannard, janpettermg, jszczypk, madeitbelgium, peter279k, schrojf, thomasnicoullaud, vpominchuk


sitemapparser's Issues

allow proxy in guzzle

I'm trying to test this against a local website that uses the Symfony CLI to create a server. There's a proxy that allows me to use custom domain names, e.g. https://my-demo.wip, to serve requests.

Unfortunately, I can't read the sitemap there, as I'd need to configure the proxy servers on the client.

https://www.zenrows.com/blog/guzzle-proxy#request-option

I've tried to set the proxies, but not having much luck:

$proxies = [
    'http'  => 'http://127.0.0.1:7080/',
    'https' => 'http://127.0.0.1:7080/',
];
$config = [
    'guzzle' => [
        RequestOptions::PROXY   => $proxies,
        RequestOptions::VERIFY  => false, // disable SSL certificate validation
        RequestOptions::TIMEOUT => 30,    // timeout of 30 seconds
    ]
];

$parser = new SitemapParser('MyCustomUserAgent', $config);

any suggestions?

It'd be great if that were documented.

Old dev dependency

Old versions of the dev packages phpunit and codeclimate/php-test-reporter cause outdated transitive packages to be fetched as well, for example: Package guzzle/guzzle is abandoned, you should avoid using it. Use guzzlehttp/guzzle instead.

Loading composer repositories with package information
Installing dependencies (including require-dev) from lock file
Package operations: 46 installs, 0 updates, 0 removals
  - Installing psr/http-message (1.0.1): Loading from cache
  - Installing guzzlehttp/psr7 (1.4.2): Loading from cache
  - Installing guzzlehttp/promises (v1.3.1): Loading from cache
  - Installing guzzlehttp/guzzle (6.2.3): Loading from cache
  - Installing symfony/polyfill-mbstring (v1.3.0): Loading from cache
  - Installing psr/log (1.0.2): Loading from cache
  - Installing symfony/debug (v3.2.7): Loading from cache
  - Installing symfony/console (v3.2.7): Loading from cache
  - Installing symfony/yaml (v3.2.7): Loading from cache
  - Installing symfony/stopwatch (v3.2.7): Loading from cache
  - Installing symfony/filesystem (v3.2.7): Loading from cache
  - Installing symfony/config (v3.2.7): Loading from cache
  - Installing symfony/event-dispatcher (v2.8.19): Loading from cache
  - Installing guzzle/guzzle (v3.9.3): Loading from cache
  - Installing satooshi/php-coveralls (v1.0.1): Loading from cache
  - Installing padraic/humbug_get_contents (1.0.4): Loading from cache
  - Installing padraic/phar-updater (1.0.3): Loading from cache
  - Installing codeclimate/php-test-reporter (v0.4.4): Loading from cache
  - Installing webmozart/assert (1.2.0): Loading from cache
  - Installing phpdocumentor/reflection-common (1.0): Loading from cache
  - Installing phpdocumentor/type-resolver (0.2.1): Loading from cache
  - Installing phpdocumentor/reflection-docblock (3.1.1): Loading from cache
  - Installing phpunit/php-token-stream (1.4.11): Loading from cache
  - Installing sebastian/version (2.0.1): Loading from cache
  - Installing sebastian/resource-operations (1.0.0): Loading from cache
  - Installing sebastian/recursion-context (3.0.0): Loading from cache
  - Installing sebastian/object-reflector (1.1.1): Loading from cache
  - Installing sebastian/object-enumerator (3.0.2): Loading from cache
  - Installing sebastian/global-state (2.0.0): Loading from cache
  - Installing sebastian/exporter (3.1.0): Loading from cache
  - Installing sebastian/environment (3.0.2): Loading from cache
  - Installing sebastian/diff (1.4.1): Loading from cache
  - Installing sebastian/comparator (2.0.0): Loading from cache
  - Installing phpunit/php-text-template (1.2.1): Loading from cache
  - Installing doctrine/instantiator (1.0.5): Loading from cache
  - Installing phpunit/phpunit-mock-objects (4.0.1): Loading from cache
  - Installing phpunit/php-timer (1.0.9): Loading from cache
  - Installing phpunit/php-file-iterator (1.4.2): Loading from cache
  - Installing theseer/tokenizer (1.1.0): Loading from cache
  - Installing sebastian/code-unit-reverse-lookup (1.0.1): Loading from cache
  - Installing phpunit/php-code-coverage (5.2.1): Loading from cache
  - Installing phpspec/prophecy (v1.7.0): Loading from cache
  - Installing phar-io/version (1.0.1): Loading from cache
  - Installing phar-io/manifest (1.0.1): Loading from cache
  - Installing myclabs/deep-copy (1.6.1): Loading from cache
  - Installing phpunit/phpunit (6.1.2): Loading from cache
symfony/console suggests installing symfony/process ()
symfony/event-dispatcher suggests installing symfony/dependency-injection ()
symfony/event-dispatcher suggests installing symfony/http-kernel ()
satooshi/php-coveralls suggests installing symfony/http-kernel (Allows Symfony integration)
sebastian/global-state suggests installing ext-uopz (*)
phpunit/php-code-coverage suggests installing ext-xdebug (^2.5.3)
phpunit/phpunit suggests installing ext-xdebug (*)
phpunit/phpunit suggests installing phpunit/php-invoker (^1.1)
Package guzzle/guzzle is abandoned, you should avoid using it. Use guzzlehttp/guzzle instead.

First example returns nothing

Hey @VIPnytt,

I'm considering using your library in my PHPScraper and have been playing around with the examples. The first one didn't return anything - I guess it's because the sitemap only contains other sitemaps. Maybe it would make sense to use a different one to reduce confusion?

Cheers,
Peter

Line-separated sitemaps

Hey @VIPnytt,

wondering what it would take to enable line-separated sitemaps by default. Could the type be guessed from the Content-Type of the response, or from failing to parse the XML? Keen to hear your thoughts.

Cheers,
Peter

suggest to catch exception for unexpected sitemap xml content

Sometimes we fetch a sitemap XML containing a deny message like the one below:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>7F33HDMXW5ACA6XM</RequestId><HostId>fDtErV52NjvRWoigB8xew2jT1lHs/PILta/bNsoisgcjt7QFS4i1UQeKvV/4fMk56GSF7cGu398=</HostId></Error>

For this case, I suggest adding a try/catch block in the SitemapParser class to handle it:

try {
    $response = is_string($urlContent) ? $urlContent : $this->getContent();
} catch (Exceptions\SitemapParserException $e) {
    throw new Exceptions\SitemapParserException($e->getMessage());
} catch (Exceptions\TransferException $e) {
    throw new Exceptions\TransferException($e->getMessage());
}

parseRecursive is broken when one of the sitemaps does not exist

Hi,

in parseRecursive, a TransferException thrown inside parse will cause $this->urls and $this->sitemaps to be set to [].
In the next loop, $urls and $sitemaps will have the values from $this->urls and $this->sitemaps copied into them, effectively clearing all past results.

It is enough to remove the continue in the exception handler, so the values are brought back in array_merge_recursive.
On the other hand, it might be better to refactor parse so it works on its own local urls and sitemaps instead of the main object's properties.

Best regards,
Janusz
