phpcrawl's People

Contributors

crispy-computing-machine, gerben86, mmerian, theputzy

phpcrawl's Issues

Typed property can be null

  • Lib version: 0.9.13
  • PHP: 7.4.22 (Ubuntu 20.04)

The property PHPCrawlerStatusHandler::$crawlerStatus is typed as PHPCrawlerStatus, but getCrawlerStatus() never checks on the filesystem whether the status file for the crawler ID actually exists; it naively assumes the file is always present. When it isn't, getCrawlerStatus() assigns the null return of PHPCrawlerUtils::deserializeFromFile($this->working_directory . 'crawlerstatus.tmp') to the typed property, which results in the following PHP error:

[Emergency] Uncaught TypeError: Typed property PHPCrawl\ProcessCommunication\PHPCrawlerStatusHandler::$crawlerStatus must be an instance of PHPCrawl\PHPCrawlerStatus, null used

Hacking the file and removing the type declaration "works", but that doesn't explain why the file is missing in the first place. At a minimum, though, I'd expect the library to check that the file exists before using its presence or content as a signal to perform some other task.

Interestingly, and despite the type declaration, the logic in getCrawlerStatus() still expects $this->crawlerStatus to be null in some circumstances.

The logic would then look something like the following:

    /**
     * Returns/reads the current crawler-status
     *
     * @return PHPCrawlerStatus The current crawlerstatus as a PHPCrawlerStatus-object
     * @throws \LogicException
     */
    public function getCrawlerStatus(): PHPCrawlerStatus
    {
        $crawlFile = sprintf('%s/crawlerstatus.tmp', $this->working_directory);

        if (!file_exists($crawlFile)) {
            throw new \LogicException('Crawler status file not found!');
        }

        // Get crawler-status from file
        if ($this->write_status_to_file) {
            $this->crawlerStatus = PHPCrawlerUtils::deserializeFromFile($crawlFile);
            if ($this->crawlerStatus == null) {
                $this->crawlerStatus = new PHPCrawlerStatus();
            }
        }

        return $this->crawlerStatus;
    }

The easiest fix, however, is simply to permit the property to be null:

class PHPCrawlerStatusHandler
{
    /**
     * @var PHPCrawlerStatus|null
     */
    protected ?PHPCrawlerStatus $crawlerStatus;
   ...
}

Subprocesses not working in Docker containers...

The application never seems to complete when running more than one process.
I'm not sure whether the child processes are actually running, and MPMODE_PARENT_EXECUTES_USERCODE doesn't seem to be working either, as I don't see any console log output from the main process.

When I override the initChildProcess() function and print_r($this), I can see the crawler object for the child process.

But I'm not sure the child process is doing anything or talking back to the main process.

Basically, if I run with ->go() the whole scan is done fairly quickly. If I run with goMultiProcessed(), nothing appears to happen and the process never finishes.
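
For reference, this is roughly how I'm starting the two runs. The process count is just what I happen to pass in, and the exact namespace of the modes class in this fork is an assumption on my part:

// Single-process run: finishes fairly quickly.
$crawler->go();

// Multi-process run: never seems to finish inside the Docker container.
// Assumption: the modes class is PHPCrawl\Enums\PHPCrawlerMultiProcessModes in this fork.
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);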

Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url

I'm crawling a site and there's an issue with 1 incorrect a-href-tag:
<a href="http://example โ€œOโ€ and 'I' Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">example.com</a>

This leads to this error:
Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url: ... in ./vendor/brittainmedia/phpcrawl/libs/Utils/PHPCrawlerUtils.php on line 51

Is there a way to turn this into a non-fatal error so I can just log it and continue processing?
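
As a stopgap (plain PHP, not a phpcrawl feature), I can at least catch the exception around the go() call so the script doesn't die with a fatal error, though that still aborts the crawl at the bad link rather than skipping it:

try {
    $crawler->go();
} catch (Exception $e) {
    // Prevents the fatal error, but the crawl itself still stops here,
    // so this only lets me log the malformed URL, not skip it.
    error_log('Crawl aborted: ' . $e->getMessage());
}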

PHP 8 readResponseContentChunk Multi-process

https://github.com/crispy-computing-machine/phpcrawl/blob/master/libs/PHPCrawlerHTTPRequest.php#L852

This needs to be replaced with working code:

// Initialise the response buffer up front so the non-chunked branch can append to it too
$response = '';

if (strpos($headers, 'Transfer-Encoding: chunked') !== false) {
    while (!feof($socket)) {
        // Read the chunk size (in hexadecimal)
        $chunkSizeHex = rtrim(fgets($socket));
        // Convert the chunk size to an integer
        $chunkSize = hexdec($chunkSizeHex);

        // If the chunk size is 0, it means we've reached the last chunk
        if ($chunkSize === 0) {
            break;
        }

        // Read the chunk data
        $chunkData = '';
        while ($chunkSize > 0) {
            $buffer = fread($socket, $chunkSize);
            $chunkData .= $buffer;
            $chunkSize -= strlen($buffer);
        }

        // Add the chunk data to the response
        $response .= $chunkData;

        // Read the trailing CRLF after the chunk data
        fgets($socket);
    }
} else {
    // If the response is not chunked, read the response normally
    while (!feof($socket)) {
        $response .= fread($socket, 128);
    }
}  

Method createPEMCertificate is in example.php but doesn't exist

Hi,

In your example I see a call to createPEMCertificate:

$crawler->createPEMCertificate($passPhrase, $certificateData);

But that method doesn't exist in the code.
How can I connect properly to an HTTPS site?
Because if I remove the call to createPEMCertificate, I get this error when running example.php:
Warning: stream_socket_client(): Peer certificate CN=`www.example.org' did not match expected CN=`12.123.123.123' in /vendor/brittainmedia/phpcrawl/libs/PHPCrawlerHTTPRequest.php on line 569
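
For context, my guess is that createPEMCertificate() in older releases generated a self-signed certificate and key with PHP's openssl extension and wrote them to a PEM file, roughly like the sketch below ($passPhrase and $certificateData are the variables from example.php; everything else is my assumption):

// Rough sketch of what createPEMCertificate() presumably did:
// generate a self-signed certificate plus private key and store both in one PEM file.
$key  = openssl_pkey_new(['private_key_bits' => 2048, 'private_key_type' => OPENSSL_KEYTYPE_RSA]);
$csr  = openssl_csr_new($certificateData, $key);
$cert = openssl_csr_sign($csr, null, $key, 365);

openssl_x509_export($cert, $certPem);
openssl_pkey_export($key, $keyPem, $passPhrase);

file_put_contents('crawler-cert.pem', $certPem . $keyPem);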

problem

I got this error when running example.php:

Fatal error: Uncaught Error: Class 'PHPCrawl\PHPCrawlerHTTPRequest' not found in C:\xampp\htdocs\bzdemo\libs\PHPCrawler.php on line 245

Unexpected 'PHPCrawlerStatus'

I initially tried version 0.9.9, but the same error occurred there.
I installed it through Composer.
I think PHP hates me.

Run via OSPanel.
PHP Version 7.3.

My folder structure looks like this:

  • public (index.php)
  • tmp
  • vendor

The code is taken from example.php.

index.php:

require_once '../vendor/autoload.php';

use PHPCrawl\PHPCrawler;
use PHPCrawl\PHPCrawlerDocumentInfo;

/**
 * Class MyCrawler
 */
class MyCrawler extends PHPCrawler
{
    /**
     * @param PHPCrawlerDocumentInfo $DocInfo
     * @return int|void
     */
    public function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>")..
        if (PHP_SAPI === 'cli') {
            $lb = "\n";
        } else {
            $lb = "<br />";
        }


        // Print the URL and the HTTP-status-Code
        echo 'Page requested: ' . $DocInfo->url . ' (' . $DocInfo->http_status_code . ')' . $lb;
        // Print the referring URL
        echo 'Referer-page: ' . $DocInfo->referer_url . $lb;
        // Print whether the content of the document was received or not
        if ($DocInfo->received == true) {
            echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
        } else {
            echo "Content not received" . $lb;
        }

        echo 'Error: ' . var_export($DocInfo->error_string, TRUE);

        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example
        echo $lb;
        flush();
    }
}

$crawler = new MyCrawler();

$crawler->setURL('https://google.com/');
$crawler->enableCookieHandling(true);
$crawler->setTrafficLimit(1000 * 1024);
$crawler->setWorkingDirectory("../tmp/");
$crawler->go();

And I catch the error: Parse error: syntax error, unexpected 'PHPCrawlerStatus' (T_STRING), expecting function (T_FUNCTION) or const (T_CONST) in D:\Software\OSPanel\domains\web_scrapping\vendor\brittainmedia\phpcrawl\libs\ProcessCommunication\PHPCrawlerStatusHandler.php on line 18

What's the problem?

Notice: Uninitialized string offset: 0 in libs/Utils/PHPCrawlerUtils.php

Starting at line 291 of PHPCrawlerUtils.php, and then repeated on lines 324 and 327, references to $link[0] occur; e.g.:
elseif ($link[0] === '/')

These are triggering an uninitialized string offset notice. Removing the [0] index removes the notice and things seem to work fine, but I don't want to modify your library locally if I don't have to. Am I missing something?

BTW, this is occurring just by running your example.php file.
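
For what it's worth, the notice only fires when $link is an empty string, so a guard like this (just a sketch, not a tested patch) would avoid indexing into an empty string while keeping the original check:

// Sketch: only read $link[0] when the string is non-empty;
// $link[0] on '' is what raises "Uninitialized string offset: 0".
elseif ($link !== '' && $link[0] === '/') {
    // ... existing handling for root-relative links ...
}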

$DocInfo->received returns false if the requested page returns a 301

I'm using this piece of code:

public function handleDocumentInfo($DocInfo) {
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true) {
        echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
    } else {
        echo "Content not received for url:" . $lb . $DocInfo->url . " (" . $DocInfo->http_status_code . ")" . " Referer-page: " . $DocInfo->referer_url;
        print_r($DocInfo);
    }
}

Then I see this in the output:
"Content not received for url: http://mysite.com/abcd.html (301) Referer-page: https://mysite.com/1234.html"
So I'm crawling a https site, but some internal url's are still pointing to the http-site.
Due to the redirect, $DocInfo->received doesn't return true.

How can I fix this, so that a redirect is handled just like any other 'normal' page?
Is there a way to replace "http://" with "https://", or can I update the code somewhere so that it follows the 301 redirect and processes the page the redirect leads to?
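
One thing I considered trying (assuming this fork still has the redirect setter from the original phpcrawl API, which I haven't verified) is telling the crawler to follow redirects until it reaches actual content before starting the crawl:

// Assumption: setFollowRedirectsTillContent() exists in this fork as in the
// original phpcrawl API; it should make the crawler follow the 301 and hand
// handleDocumentInfo() the target page instead of the redirect response itself.
$crawler->setFollowRedirectsTillContent(true);
$crawler->go();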

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned

I get this error when I call $report = $crawler->getProcessReport();

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned in vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php:91
Stack trace:
#0 vendor/brittainmedia/phpcrawl/libs/PHPCrawler.php(958): PHPCrawl\PHPCrawlerBenchmark::getElapsedTime('crawling_proces...')
#1 /myScript.php(197): PHPCrawl\PHPCrawler->getProcessReport()
#2 {main}
  thrown in /var/hpwsites/u_ruiten/website/html/webroot/pub/crawl/vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php on line 91

I'm using version 0.9.5.
What could be the cause of this?
