
phpscraper's Introduction

PHP Scraper: a web utility for PHP


For full documentation, visit phpscraper.de.

PHPScraper is a versatile web-utility for PHP. Its primary objective is to streamline the process of extracting information from websites, allowing you to focus on accomplishing tasks without getting caught up in the complexities of selectors, data structure preparation, and conversion.

Under the hood, it builds on several existing libraries. See composer.json for the full list of dependencies.

⏲️ PHPScraper in 5 Minutes Explained

Here are a few impressions of the way the library works. More examples are on the project website.

Basics: Flexible Calling as an Attribute or Method

All scraping functionality can be accessed either as a method call or as a property. For example, the title can be accessed in two ways:

// Prep
$web = new \Spekulatius\PHPScraper\PHPScraper;
$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
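One way such dual access can work in PHP is the magic `__get` method, which forwards a property read to a method of the same name. The following is a minimal sketch of that idea, not PHPScraper's actual implementation; the `Page` class and its regex-based title lookup are hypothetical and only serve the illustration.

```php
<?php

// Hypothetical class for illustration; not part of PHPScraper.
class Page
{
    public function __construct(private string $html)
    {
    }

    // Regular method: extract the text of the <title> element.
    public function title(): ?string
    {
        if (preg_match('/<title>(.*?)<\/title>/is', $this->html, $m) === 1) {
            return trim($m[1]);
        }

        return null;
    }

    // Magic getter: forward a property read such as $page->title to title().
    public function __get(string $name): mixed
    {
        if (method_exists($this, $name)) {
            return $this->$name();
        }

        throw new \BadMethodCallException("Unknown property or method: {$name}");
    }
}

$page = new Page('<html><head><title>Google</title></head></html>');

echo $page->title, PHP_EOL;   // property-style access
echo $page->title(), PHP_EOL; // method-style access
```

Both lines go through the same method, so the two styles cannot drift apart.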

🔋 Batteries included: Meta data, Links, Images, Headings, Content, Keywords, ...

Many common use cases are covered already. You can find prepared extractors for various HTML tags, including interesting attributes, and you can filter and combine these to suit your needs. Some extractors come in a simple and a detailed variant; this example uses the detailed linksWithDetails:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Contains:
// <a href="https://placekitten.com/456/500" rel="ugc">
//   <img src="https://placekitten.com/456/400">
//   <img src="https://placekitten.com/456/300">
// </a>
$web->go('https://test-pages.phpscraper.de/links/image-urls.html');

// Get the first link on the page and print the result
print_r($web->linksWithDetails[0]);
// [
//     'url' => 'https://placekitten.com/456/500',
//     'protocol' => 'https',
//     'text' => '',
//     'title' => null,
//     'target' => null,
//     'rel' => 'ugc',
//     'image' => [
//         'https://placekitten.com/456/400',
//         'https://placekitten.com/456/300'
//     ],
//     'isNofollow' => false,
//     'isUGC' => true,
//     'isSponsored' => false,
//     'isMe' => false,
//     'isNoopener' => false,
//     'isNoreferrer' => false,
// ]

If there aren't any matching elements (here links) on the page, an empty array will be returned. If a method normally returns a string it might return null. Details such as follow_redirects, etc. are optional configuration parameters (see below).

These methods should cover most of the DOM. A full list of methods with example code can be found on phpscraper.de. Further examples are in the tests.

Download Files

Besides processing the content on the page itself, you can download files using fetchAsset:

// Absolute URL
$csvString = $web->fetchAsset('https://test-pages.phpscraper.de/test.csv');

// Relative URL after navigation
$csvString = $web
  ->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html')
  ->fetchAsset('/test.csv');

You will only need to write the content into a file or cloud storage.
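Persisting the result can be as simple as a `file_put_contents` call. A minimal sketch, assuming the content is already available as a string (a literal stands in for the `fetchAsset` result here); the helper name `saveAsset` and the target path are hypothetical:

```php
<?php

// Hypothetical helper: write downloaded content to disk and fail loudly on error.
function saveAsset(string $content, string $path): int
{
    $bytes = file_put_contents($path, $content, LOCK_EX);

    if ($bytes === false) {
        throw new \RuntimeException("Could not write to {$path}");
    }

    return $bytes;
}

// Stand-in for: $csvString = $web->fetchAsset('https://test-pages.phpscraper.de/test.csv');
$csvString = "date,value\n1945-02-06,4.20\n1952-03-11,42\n";

$target = sys_get_temp_dir() . '/test.csv';
$written = saveAsset($csvString, $target);

echo "Wrote {$written} bytes to {$target}", PHP_EOL;
```

For cloud storage, the same string would be handed to the storage client of your choice instead of `file_put_contents`.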

Process the RSS feeds, sitemap.xml, etc.

PHPScraper can assist in collecting feeds such as RSS feeds, sitemap.xml-entries and static search indexes. This can be useful when deciding on the next page to crawl or building up a list of pages on a website.

Here we are processing the sitemap into a set of FeedEntry-DTOs:

var_dump(
    (new \Spekulatius\PHPScraper\PHPScraper)
        ->go('https://phpscraper.de')
        ->sitemap
);

// array(131) {
//   [0]=>
//   object(Spekulatius\PHPScraper\DataTransferObjects\FeedEntry)#165 (3) {
//     ["title"]=>
//     string(0) ""
//     ["description"]=>
//     string(0) ""
//     ["link"]=>
//     string(22) "https://phpscraper.de/"
//   }
//   [1]=>
// ...

Whenever post-processing is applied, you can fall back to the underlying *Raw-methods.

Process CSV-, XML- and JSON files and URLs

PHPScraper comes out of the box with file / URL processing methods for CSV-, XML- and JSON:

  • parseJson
  • parseXml
  • parseCsv
  • parseCsvWithHeader (generates an associative array using the first row as keys)

Each method can process both strings as well as URLs:

// Parse JSON into array:
$json = $web->parseJson('[{"title": "PHP Scraper: a web utility for PHP", "url": "https://phpscraper.de"}]');
// [
//     [
//         'title' => 'PHP Scraper: a web utility for PHP',
//         'url' => 'https://phpscraper.de'
//     ]
// ]

// Fetch and parse CSV into a simple array:
$csv = $web->parseCsv('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date', 'value'],
//     ['1945-02-06', 4.20],
//     ['1952-03-11', 42],
// ]

// Fetch and parse CSV with the first row as header into an associative array structure:
$csv = $web->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date' => '1945-02-06', 'value' => 4.20],
//     ['date' => '1952-03-11', 'value' => 42],
// ]

Additional CSV parsing parameters such as separator, enclosure and escape are possible.
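To illustrate what header-based parsing does with these parameters, here is a minimal sketch built on PHP's `str_getcsv`. It is not the library's implementation: the function name `csvWithHeader` is hypothetical, all values stay strings, and quoted fields containing line breaks are not handled.

```php
<?php

// Hypothetical helper: turn CSV text into associative rows keyed by the first row.
function csvWithHeader(
    string $csv,
    string $separator = ',',
    string $enclosure = '"',
    string $escape = '\\'
): array {
    // Split into lines; trim() drops leading and trailing blank lines.
    $lines = preg_split('/\R/', trim($csv));

    $rows = array_map(
        fn (string $line): array => str_getcsv($line, $separator, $enclosure, $escape),
        $lines
    );

    // The first row provides the keys for every following row.
    $header = array_shift($rows);

    return array_map(
        fn (array $row): array => array_combine($header, $row),
        $rows
    );
}

// Semicolon-separated input, so the separator is passed explicitly.
$rows = csvWithHeader("date;value\n1945-02-06;4.20\n1952-03-11;42", ';');

print_r($rows);
```

Rows whose column count differs from the header make `array_combine` throw in PHP 8, so real input may need validation first.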

There is more!

There are plenty of examples on the PHPScraper website and in the tests.

Check out playground.php if you prefer learning by doing. You can get it up and running with:

$ git clone git@github.com:spekulatius/PHPScraper.git && cd PHPScraper && composer update

💪 Roadmap

The future development is organized into milestones. Releases follow semver.

  • Improve documentation and examples.
  • Organize code better (move websites into separate repos, etc.)
  • Add support for feeds and some typical file types.

v2: Service Upgrade:

  • Expand to parse a wider range of types, elements, embeds, etc.
  • Improve performance with caching and concurrent fetching of assets
  • Minor improvements for parsing methods

TBC.

😍 Sponsors

PHPScraper is sponsored by:

With your support, PHPScraper can become the PHP Swiss Army knife for the web. If you find PHPScraper useful for your work, please consider a sponsorship or donation. Thank you 💪

⚙️ Configuration (optional)

If needed, you can use the following configuration options:

User Agent

You can set the browser agent using setConfig:

$web->setConfig([
  'agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'
]);

It defaults to Mozilla/5.0 (compatible; PHP Scraper/1.x; +https://phpscraper.de).

Proxy Support

You can configure proxy support with setConfig:

$web->setConfig(['proxy' => 'http://user:password@127.0.0.1:3128']);

Timeout

You can set the timeout using setConfig:

$web->setConfig(['timeout' => 15]);

Setting the timeout to zero will disable it.

Disabling SSL

While not recommended, it might be necessary to disable SSL checks. You can do so using:

$web->setConfig(['disable_ssl' => true]);

You can call setConfig multiple times. It stores the config and merges it with previous settings. Keep this in mind in the unlikely case that you need to unset a value.
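The merge behaviour can be pictured with `array_merge`. This is a sketch of the semantics only, under the assumption that later values overwrite earlier ones key by key; the `ConfigHolder` class is hypothetical and the library's internals may differ.

```php
<?php

// Hypothetical stand-in for an object that accumulates configuration.
class ConfigHolder
{
    private array $config = [];

    // Merge new settings into the stored ones; later keys win.
    public function setConfig(array $options): static
    {
        $this->config = array_merge($this->config, $options);

        return $this;
    }

    public function getConfig(): array
    {
        return $this->config;
    }
}

$holder = (new ConfigHolder())
    ->setConfig(['timeout' => 15, 'proxy' => 'http://127.0.0.1:3128'])
    ->setConfig(['timeout' => 30]);

// 'proxy' survives the second call because that call never mentions it.
print_r($holder->getConfig());
```

In this sketch, a setting only goes away when a later call overwrites it explicitly, which is why unsetting deserves attention.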

🚀 Installation with Composer

composer require spekulatius/phpscraper

After the installation, the package will be picked up by the Composer autoloader. If you are using a common PHP application or framework such as Laravel or Symfony, you can start scraping now 🚀

If not, or if you are building a standalone scraper, please include the autoloader in vendor/ at the top of your file:

<?php

require __DIR__ . '/vendor/autoload.php';

// ...

You can now use any of the examples on the documentation website or from the tests/ folder.

Please consider supporting PHPScraper with a star or sponsorship:

composer thanks

Thank you 💪

✅ Testing

The library comes with a PHPUnit test suite. To run the tests, run the following command from the project folder:

composer test

You can find the tests in the tests/ folder. The test pages are publicly available.

phpscraper's People

Contributors

alanx15a2, amurrell, datlechin, dependabot[bot], fumiya5863, imgbotapp, nadar, nathabonfim59, scotteuser, spekulatius, szepeviktor, tacman, tentacode, vitormattos


phpscraper's Issues

TypeError

When I run the sample code:

$web = new \Spekulatius\PHPScraper\PHPScraper();
$web->go('https://www.google.com/');
echo $web->title;

It returns:

Spekulatius\PHPScraper\Core::setHttpClient(): Argument #1 ($httpClient) must be of type Symfony\Component\HttpClient\CurlHttpClient, Symfony\Component\HttpClient\NativeHttpClient given, called in C:\www\web-crawer\vendor\spekulatius\phpscraper\src\PHPScraper.php on line 108

Environments

PHP: 8.1.13
PHPScraper: 1.0.1

[Proposal] Exposing Goutte/Client via client() property/callable method

Previously, there was an issue about exposing the Goutte/Client via core that was fixed but never released.

My code was depending on it via a dev branch.

In the latest version, this functionality was removed. I recommend that it be added back in.

In UsesGoutte.php,

/**
 * Retrieve the client
 *
 * @return \Goutte\Client
 */
public function client(): GoutteClient
{
    return $this->client;
}

deprecate magic properties / methods

In my branch I've removed the magic __get and __call methods, and moved what was core into the phpscraper, so now there is only one class.

After a while I got tired of find/replace, so I created a rector rule to change the properties to method.

Is there a demo repository that I can use to test my branch? Tests are passing, except for the ones related to internal/external links, which I'll address in another issue.

Keyword Extraction

To benefit from simpler analysis it makes sense to provide a keyword extraction system out-of-the-box. Rake looks promising.

charset in headers method

Is this method used? Where is charset() defined?

    /**
     * Get the header collected as an array
     *
     * @return array
     */
    public function headers()
    {
        return [
            'charset' => $this->charset(),
            'contentType' => $this->contentType(),
            'viewport' => $this->viewport(),
            'canonical' => $this->canonical(),
            'csrfToken' => $this->csrfToken(),
        ];
    }

SSL certificate problem

Symfony\Component\HttpClient\Exception\TransportException: SSL certificate problem: unable to get local issuer certificate for ***

I get this error when trying to fetch data from a link that doesn't have SSL.


get http status code

How do I check if the scraped page is an error page?
It would be helpful to have the http status for that - something like

$web->go('https://httpstat.us/404');
echo $web->status;     // prints 404

Just before posting, I found the solution myself:

$web->client->getResponse()->getStatusCode()

It would be great to have this added to the documentation.
Or maybe add $web->status as shortcut?

PHPScraper + Timeout

Hello,

How can I add a condition for when a request exceeds a timeout of a certain number of seconds?

Thanks

composer require fails on mac due to strange characters in filenames in the /tests directory

Problem

Running composer require spekulatius/phpscraper fails on macOS 10.15.5 with the following:

[RuntimeException]                                                                                                                                                                                                          
 Failed to extract spekulatius/phpscraper: (50) '/usr/bin/unzip' -qq '/xxx/xxx/vendor/composer/tmp-123dcc14bbc3272649fdd489b0ecc9a3' -d '/xxx/xxx/vendor/composer/74030a62'  
                                                                                                                                                                                                                            
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/katze-+?-++-+?.jpg                                                                    
        Illegal byte sequence                                                                                                                                                                                               
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/tests/resources/assets/???.jpg                                                                               
        Illegal byte sequence                                                                                                                                                                                               
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/katze-+?-++-+?.jpg                                                                
        Illegal byte sequence                                                                                                                                                                                               
error:  cannot create /xxx/xxx/vendor/composer/74030a62/spekulatius-PHPScraper-fb4f590/websites/test-pages/assets/???.jpg                                                                           
        Illegal byte sequence                                                                                                                                                                                               

It seems like the unzip command within composer fails due to the strange characters.

[Proposal] Add scraping API support

Motivation

I saw some references to an API service in the documentation. I don't know if you plan to make the implementation open-source as well, but I really liked the idea of abstracting the extraction process.

Combined with the proxy feature, it provides an incredibly powerful tool for gathering data.

Proposal

The idea is to implement a setApi method in the phpscraper class that supports various APIs. I went with the "namespace builder" approach, to load the API code as needed.

public $api = null;
...
public function setApi($api)
{
    $apiClass = __NAMESPACE__ . '\apis\\' . $api;

    $this->api = new $apiClass($this->core);

    return $this;
}

Then we will just need to create a file inside the new apis folder with the corresponding implementation, inside the namespace.

src/apis/example_api.php

namespace spekulatius\apis;

class example_api
{
    protected $core = null;

    public function __construct(core &$core)
    {
        $this->core = $core;
    }
...
}

Implementation example

I've implemented an API to scrape products from Mercado Libre (the biggest online marketplace in Latin America).

$web = new phpscraper;
$web->setApi('mercado_libre');

$web->go('https://mercadolivre.com.br/product_url');
$productData = $web->api->getProduct();

More details about the Mercado Libre API here.


What do you think?

Allow Symfony 6

This library is locked to Symfony 5, I'll submit a PR to allow Symfony 6 as well.

Make public function to access Client (Goutte)

I wanted to know the response code of the url but could not get access.

Looking at Goutte under the hood, I think I could get it if I could do something like:

$web->getClient()->getInternalResponse()->getStatusCode()

So, I would like a function to get access to the $client, like:

public function getClient()
{
   return $this->client;
}

Idea: Various Feeds

Allow retrieving the various feeds a site might provide:

  • rss
  • sitemap.xml
  • index.json
  • ?

Including the respective content.

Getting Deprecated: strpos() error in PHP 8.1.8

I am using this example https://phpscraper.de/examples/extract-keywords.html with PHP 8.1.8

Deprecated: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 858
PHP Deprecated:  strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated in /Users/khanakia/D1/www/php/scrap_php/vendor/spekulatius/phpscraper/src/phpscraper.php on line 859

Add link type interpretation

It would be great if the most common link types could be interpreted directly. These come to mind:

  • http / https
  • mailto
  • ftp
  • tel
  • tg
  • # - page internal anchors
  • javascript
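As a starting point for discussion, the list above could be mapped with a small classifier that looks at the scheme. This is a minimal sketch; the function name `linkType` and the labels `anchor`, `relative` and `empty` are hypothetical and not part of the library.

```php
<?php

// Hypothetical helper: classify a link by its scheme or leading character.
function linkType(string $href): string
{
    $href = trim($href);

    if ($href === '') {
        return 'empty';
    }

    // Page-internal anchors start with '#'.
    if ($href[0] === '#') {
        return 'anchor';
    }

    // A scheme is a letter followed by letters, digits, '+', '-' or '.', then ':'.
    if (preg_match('/^([a-z][a-z0-9+.-]*):/i', $href, $m) === 1) {
        return strtolower($m[1]); // e.g. 'https', 'mailto', 'tel', 'javascript'
    }

    // No scheme: a relative or protocol-relative URL.
    return 'relative';
}

foreach (['https://phpscraper.de', 'mailto:hi@example.com', 'tel:+4912345', '#top', '/about'] as $href) {
    echo str_pad($href, 24), linkType($href), PHP_EOL;
}
```

Returning the lower-cased scheme keeps the function open to schemes not listed here, such as tg.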

Provide example with authentication

How can I scrape a website that requires authentication?

That is, I want to start with at https://jardinado.herokuapp.com/login, fill in my credentials, and THEN start scraping the site.

That is, I want the $goutteClient to execute something like this first, then scrape:

if ($username) {
    // Load the login page
    $crawler = $goutteClient->request('GET', $url = $baseUrl . '/login');

    // Select the form and fill in some values
    $form = $crawler->selectButton('login-btn')->form();
    $form['_username'] = 'user';
    $form['_password'] = 'pass';

    // Submit that form
    $crawler = $goutteClient->submit($form);
    $response = $goutteClient->getResponse();
}

Now that cookies are set, when I fetch a url that requires login I should get the page instead of the 302 (redirect to login).

I'm not sure how to implement this within the context of phpscraper. One idea would be to expose the goutte client.

links from div

Hello, I have a problem: I want to get all the links from //*[@id="content"]. How do I do that?

Bump jeremykendall/php-domain-parser to version 6

I'm trying to integrate this into an app that is already locked to psr/log:^3.

jeremykendall/php-domain-parser version 5 is locked to psr/log:^1

There are 2 tests that fail when bumping to the new version. I can fix and submit a PR.

Idea: Implement low-level util to access the web.

E.g.

// GET request
$web->get('https://...');

// POST request
$response = $web->post('https://...', [
  'param' => 'first param',
]);

// ...

This could be done either directly in PHPScraper or built upon another specialized lib such as Symfony HTTP. Exposing the functionality of the existing dependency sounds like a reasonable way to go, if the idea is of interest.

Lists

Allow returning a set of lists (unordered, ordered), each with its list items.

Currently unsure how to handle more HTML within the <li>. Maybe return both (plain text and HTML as DOM)?

[Proposal] Add HTTP proxy support

    I'm working on a project that needs to change proxies constantly. Looking at the Goutte client, the underlying implementation already supports it.

    Research

    1. HttpBrowser, which Goutte is based on, supports a custom client implementing the HttpClientInterface (source code)
    ...
    class HttpBrowser extends AbstractBrowser
    {
        private $client;
    
        public function __construct(HttpClientInterface $client = null, History $history = null, CookieJar $cookieJar = null)
    ...
            $this->client = $client ?? HttpClient::create();
    ...
    2. HttpClient::create() supports a proxy through the $defaultOptions parameter, which gets passed to the selected HttpClient (source code)
            public static function create(array $defaultOptions = [], int $maxHostConnections = 6, int $maxPendingPushes = 50): HttpClientInterface

    Implementation details

    The idea is to expose this functionality through a setProxy function in the core class. The library will continue to dynamically select the httpClient accordingly.

    I just made a POC in my fork with all necessary code to make this work (a6589da).

    public function setProxy(string $proxy)
    {
        $httpClient = HttpClient::create([
            'proxy' => $proxy
        ]);
    
        $this->client = new Client($httpClient);
    
        return $this;
    }

    How to use

    $web = new phpscraper;
    $web->__call('setProxy', [
        'http://user:password@127.0.0.1:3128',
    ]);

    If this feature gets approved, I will open a PR with it. If anything needs to be changed, let me know.

    Idea: Discovery Sets

    Sets to discover certain data (e.g. URLs, etc.) from markup and plain text content

    Check errors

    Greetings,

    I would like to know how I can skip errors when the site does not load correctly or is not working, for example:

    $scrap = $web->go('https://asdsadsadassd.com');

    if ( $scrap )
    {
    echo 'ok';
    }
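One possible pattern is to wrap the navigation in a try/catch block. The sketch below assumes that a failed request surfaces as an exception, as the TransportException quoted in the SSL issue suggests; the helper name `tryFetch` is hypothetical, and a closure that throws stands in for the real request so the example runs without network access.

```php
<?php

// Hypothetical helper: run a fetch step and report failure instead of throwing.
function tryFetch(callable $fetch): bool
{
    try {
        $fetch();

        return true;
    } catch (\Throwable $e) {
        // Log the reason and let the caller move on to the next site.
        error_log('Fetch failed: ' . $e->getMessage());

        return false;
    }
}

// With PHPScraper, the callable would wrap the navigation, e.g.
// tryFetch(fn () => $web->go('https://asdsadsadassd.com'));
$ok = tryFetch(function (): void {
    throw new \RuntimeException('Could not resolve host');
});

echo $ok ? 'ok' : 'skipped', PHP_EOL;
```

Catching `\Throwable` is deliberately broad for a sketch; narrowing it to the transport exception of the HTTP client in use would avoid hiding unrelated bugs.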

    clickLink with onclick

    Navigating through web pages works wonderfully.

    But I don't see a way to handle links like this one:

    <a href="#" onclick="Postback('xxxx','2');return false;">2</a>

    Parsing structured data (microdata)

    #16 proposes adding support for JSONLD.

    There isn't only JSONLD - structured data can be provided also in the microdata notation, and: good news - there is a project which parses microdata and converts it to the same data structure as JSONLD: https://github.com/yusufkandemir/microdata-parser

    So it should be possible to use both and treat it just like an additional JSONLD block!

    A first test:

    $jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromHTML($web->client->getResponse()->getContent(), $web->currentUrl())->toArray();
    

    Internally, this project uses its own DOM document class derived from DOMDocument. It has a function to import a DOMDocument, but Symfony's response class doesn't allow access to the DOMDocument.

    I did a small test, but couldn't figure out how to pass the DOMDocument without reparsing. My attempt, which didn't work:

    $dom = new \DOMDocument('1.0', 'UTF-8');
    $dom->importNode($web->filterFirst('//*')->getNode(0), true);
    $jsonlddata = \YusufKandemir\MicrodataParser\Microdata::fromDOMDocument($dom);
    

    What makes sense?
    Adding separate PHPScraper functions for JSONLD and microdata? Or mixing both automatically? (my opinion: mixing)

    What should support for microdata look like? Adding the other project to PHPScraper? Extending the existing classes or porting the whole functionality to PHPScraper?

    SSL connect error

    Hi

    I'm getting an

    SSL connect error for "https://bbc.in/3cNMnkw".

    when connecting to some websites. I'm writing a TweetDeck-style system that scrapes the site a tweet links to in order to pick up the details, so that a nicer link can be generated. Any ideas on how to solve this? Most work fine; it's just a few that fail.

    Also, I'm sometimes getting "This browser is no longer supported" coming through as the description from the page. I notice you are passing 'Mozilla/5.0 (compatible; PHP Scraper/0.x; +https://phpscraper.de)' as the user agent. Is it worth altering the user agent? From what I can see, I can't pass it in as a parameter anywhere.

    Thanks.

    Basic Processing for CSV files

    Implement basic CSV parsing functionality.

    API

    As usual, keep it simple:

    Direct Call

    $data = $web->parseCsv('https://test-pages.phpscraper.de/1.csv');

    Chained

    $data = $web->go('https://test-pages.phpscraper.de/1.csv')->parseCsv();

    Separate Calls

    $web->go('https://test-pages.phpscraper.de/1.csv');
    $data = $web->parseCsv();

    Associative Array vs. Simple Array

    While the CSVs are flat by nature, they can be resolved into an associative array. This should be considered as an option. Suitable naming TBC.

    Consideration

    Error after installing

    I am not sure if I'm simply missing something, but after installing phpscraper and simply writing this:
    <?php require 'vendor/autoload.php'; $web = new \spekulatius\phpscraper; ?>
    Brings this error.
    Fatal error: Default value for parameters with a class type can only be NULL in .../vendor/symfony/browser-kit/AbstractBrowser.php on line 136
    I'm using vanilla php with XAMPP.

    Versioning and Changelog

    Hi

    It's a great project, and we use the library 👍 Thanks for all the work!

    We are not sure what the versioning concept of this library looks like, also we could not find information about what has changed (a changelog), there are only tags: https://github.com/spekulatius/PHPScraper/tags. So it would be nice to read something about what is the versioning strategy and when are breaks expected.

    Of course semver would be nice with an 1.0.0 release and a CHANGELOG.md and UPGRADE.md. My personal example would look like this https://github.com/nadar/quill-delta-parser/blob/master/CHANGELOG.md and this https://github.com/nadar/quill-delta-parser/blob/master/UPGRADE.md

    Thanks, and keep up the great work!

    composer update

    spekulatius/phpscraper 0.5.2 requires symfony/dom-crawler ^5.0 -> found symfony/dom-crawler[v5.0.0, ..., v5.4.3] but the package is fixed to v6.0.3 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
