phpcrawl's People

Contributors

crispy-computing-machine, gerben86, mmerian, theputzy

phpcrawl's Issues

Typed property can be null

  • Lib version: 0.9.13
  • PHP: 7.4.22 (Ubuntu 20.04)

The property PHPCrawlerStatusHandler::$crawlerStatus is typed as PHPCrawlerStatus, but getCrawlerStatus() never checks on the filesystem whether the status file for the crawler ID actually exists; it naively assumes the file is always present. When it isn't, getCrawlerStatus() assigns the null return of PHPCrawlerUtils::deserializeFromFile($this->working_directory . 'crawlerstatus.tmp') to the typed property, which results in the following PHP error:

[Emergency] Uncaught TypeError: Typed property PHPCrawl\ProcessCommunication\PHPCrawlerStatusHandler::$crawlerStatus must be an instance of PHPCrawl\PHPCrawlerStatus, null used

Hacking the file and removing the type declaration "works", but that doesn't explain why the file is missing in the first place. At a minimum, though, I'd expect the library to check that the file exists before using its presence or content as a signal to perform some other task.

Interestingly, and despite the type declaration, the logic in getCrawlerStatus() still expects $this->crawlerStatus to be null in some circumstances.

The logic would then look something like the following:

    /**
     * Returns/reads the current crawler-status
     *
     * @return PHPCrawlerStatus The current crawlerstatus as a PHPCrawlerStatus-object
     * @throws \LogicException
     */
    public function getCrawlerStatus(): PHPCrawlerStatus
    {
        $crawlFile = sprintf('%s/crawlerstatus.tmp', $this->working_directory);

        if (!file_exists($crawlFile)) {
            throw new \LogicException('Crawler status file not found!');
        }

        // Get crawler-status from file
        if ($this->write_status_to_file) {
            $this->crawlerStatus = PHPCrawlerUtils::deserializeFromFile($crawlFile);
            if ($this->crawlerStatus == null) {
                $this->crawlerStatus = new PHPCrawlerStatus();
            }
        }

        return $this->crawlerStatus;
    }

The easiest fix, however, is simply to permit the property to be null:

class PHPCrawlerStatusHandler
{
    /**
     * @var PHPCrawlerStatus|null
     */
    protected ?PHPCrawlerStatus $crawlerStatus;
   ...
}

Subprocesses not working in Docker containers...

The application never seems to complete when running more than one process.
I'm not sure whether the child processes are actually running, and MPMODE_PARENT_EXECUTES_USERCODE doesn't seem to be working either, as I don't see any console log output from the main process.

When I override the initChildProcess() function and print_r($this), I can see the crawler object for the child process.

But I'm not sure the child process is doing anything or talking back to the main process.

Basically, if I run with ->go() the whole scan is done fairly quickly. If I run with goMultiProcessed(), nothing appears to happen and the process never finishes.
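
For reference, this is roughly how I'm starting the two runs. The process count is just what I happen to pass in, and the exact namespace of the modes class in this fork is an assumption on my part:

// Single-process run: finishes fairly quickly.
$crawler->go();

// Multi-process run: never seems to finish inside the Docker container.
// Assumption: the modes class is PHPCrawl\Enums\PHPCrawlerMultiProcessModes in this fork.
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);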

Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url

I'm crawling a site and there's an issue with 1 incorrect a-href-tag:
<a href="http://example โ€œOโ€ and 'I' Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">example.com</a>

This leads to this error:
Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url: ... in ./vendor/brittainmedia/phpcrawl/libs/Utils/PHPCrawlerUtils.php on line 51

Is there a way to turn this into a non-fatal error so I can just log it and continue processing?
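
As a stopgap (plain PHP, not a phpcrawl feature), I can at least catch the exception around the go() call so the script doesn't die with a fatal error, though that still aborts the crawl at the bad link rather than skipping it:

try {
    $crawler->go();
} catch (Exception $e) {
    // Prevents the fatal error, but the crawl itself still stops here,
    // so this only lets me log the malformed URL, not skip it.
    error_log('Crawl aborted: ' . $e->getMessage());
}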

PHP 8 readResponseContentChunk Multi-process

https://github.com/crispy-computing-machine/phpcrawl/blob/master/libs/PHPCrawlerHTTPRequest.php#L852

This needs to be replaced with working code:

// Initialise the response buffer up front so the non-chunked branch can append to it too
$response = '';

if (strpos($headers, 'Transfer-Encoding: chunked') !== false) {
    while (!feof($socket)) {
        // Read the chunk size (in hexadecimal)
        $chunkSizeHex = rtrim(fgets($socket));
        // Convert the chunk size to an integer
        $chunkSize = hexdec($chunkSizeHex);

        // If the chunk size is 0, it means we've reached the last chunk
        if ($chunkSize === 0) {
            break;
        }

        // Read the chunk data
        $chunkData = '';
        while ($chunkSize > 0) {
            $buffer = fread($socket, $chunkSize);
            $chunkData .= $buffer;
            $chunkSize -= strlen($buffer);
        }

        // Add the chunk data to the response
        $response .= $chunkData;

        // Read the trailing CRLF after the chunk data
        fgets($socket);
    }
} else {
    // If the response is not chunked, read the response normally
    while (!feof($socket)) {
        $response .= fread($socket, 128);
    }
}  

Method createPEMCertificate is in example.php but doesn't exist

Hi,

In your example I see a call to createPEMCertificate:

$crawler->createPEMCertificate($passPhrase, $certificateData);

But that method doesn't exist in the code.
How can I connect properly to an HTTPS site?
Because if I remove the call to createPEMCertificate, I get this error when running example.php:
Warning: stream_socket_client(): Peer certificate CN=`www.example.org' did not match expected CN=`12.123.123.123' in /vendor/brittainmedia/phpcrawl/libs/PHPCrawlerHTTPRequest.php on line 569
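
For context, my guess is that createPEMCertificate() in older releases generated a self-signed certificate and key with PHP's openssl extension and wrote them to a PEM file, roughly like the sketch below ($passPhrase and $certificateData are the variables from example.php; everything else is my assumption):

// Rough sketch of what createPEMCertificate() presumably did:
// generate a self-signed certificate plus private key and store both in one PEM file.
$key  = openssl_pkey_new(['private_key_bits' => 2048, 'private_key_type' => OPENSSL_KEYTYPE_RSA]);
$csr  = openssl_csr_new($certificateData, $key);
$cert = openssl_csr_sign($csr, null, $key, 365);

openssl_x509_export($cert, $certPem);
openssl_pkey_export($key, $keyPem, $passPhrase);

file_put_contents('crawler-cert.pem', $certPem . $keyPem);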

problem

I got this error when running example.php:

Fatal error: Uncaught Error: Class 'PHPCrawl\PHPCrawlerHTTPRequest' not found in C:\xampp\htdocs\bzdemo\libs\PHPCrawler.php on line 245

Unexpected 'PHPCrawlerStatus'

I initially tried version 0.9.9, but the same error occurred there.
I installed it through Composer.
I think PHP hates me.

Run via OSPanel.
PHP Version 7.3.

My folder structure looks like this:

  • public (index.php)
  • tmp
  • vendor

The code is taken from example.php.

index.php:

require_once '../vendor/autoload.php';

use PHPCrawl\PHPCrawler;
use PHPCrawl\PHPCrawlerDocumentInfo;

/**
 * Class MyCrawler
 */
class MyCrawler extends PHPCrawler
{
    /**
     * @param PHPCrawlerDocumentInfo $DocInfo
     * @return int|void
     */
    public function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>")..
        if (PHP_SAPI === 'cli') {
            $lb = "\n";
        } else {
            $lb = "<br />";
        }


        // Print the URL and the HTTP-status-Code
        echo 'Page requested: ' . $DocInfo->url . ' (' . $DocInfo->http_status_code . ')' . $lb;
        // Print the referring URL
        echo 'Referer-page: ' . $DocInfo->referer_url . $lb;
        // Print whether the content of the document was received or not
        if ($DocInfo->received == true) {
            echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
        } else {
            echo "Content not received" . $lb;
        }

        echo 'Error: ' . var_export($DocInfo->error_string, TRUE);

        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example
        echo $lb;
        flush();
    }
}

$crawler = new MyCrawler();

$crawler->setURL('https://google.com/');
$crawler->enableCookieHandling(true);
$crawler->setTrafficLimit(1000 * 1024);
$crawler->setWorkingDirectory("../tmp/");
$crawler->go();

And I catch the error: Parse error: syntax error, unexpected 'PHPCrawlerStatus' (T_STRING), expecting function (T_FUNCTION) or const (T_CONST) in D:\Software\OSPanel\domains\web_scrapping\vendor\brittainmedia\phpcrawl\libs\ProcessCommunication\PHPCrawlerStatusHandler.php on line 18

What's the problem?

Notice: Uninitialized string offset: 0 in libs/Utils/PHPCrawlerUtils.php

Starting at line 291 of PHPCrawlerUtils.php, and then repeated on lines 324 and 327, references to $link[0] occur; e.g.:
elseif ($link[0] === '/')

These are triggering an uninitialized string offset notice. Removing the [0] index removes the notice and things seem to work fine, but I don't want to modify your library locally if I don't have to. Am I missing something?

BTW, this is occurring just by running your example.php file.
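
For what it's worth, the notice only fires when $link is an empty string, so a guard like this (just a sketch, not a tested patch) would avoid indexing into an empty string while keeping the original check:

// Sketch: only read $link[0] when the string is non-empty;
// $link[0] on '' is what raises "Uninitialized string offset: 0".
elseif ($link !== '' && $link[0] === '/') {
    // ... existing handling for root-relative links ...
}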

$DocInfo->received returns false if the requested page returns a 301

I'm using this piece of code:

public function handleDocumentInfo($DocInfo) {
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true) {
        echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
    } else {
        echo "Content not received for url:" . $lb . $DocInfo->url . " (" . $DocInfo->http_status_code . ")" . " Referer-page: " . $DocInfo->referer_url;
        print_r($DocInfo);
    }
}

Then I see this in the output:
"Content not received for url: http://mysite.com/abcd.html (301) Referer-page: https://mysite.com/1234.html"
So I'm crawling a https site, but some internal url's are still pointing to the http-site.
Due to the redirect, $DocInfo->received doesn't return true.

How can I fix this, so that a redirect is handled just like any other 'normal' page?
Is there a way to replace "http://" with "https://", or can I update the code somewhere so that it follows the 301 redirect and processes the page the redirect leads to?
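
One thing I considered trying (assuming this fork still has the redirect setter from the original phpcrawl API, which I haven't verified) is telling the crawler to follow redirects until it reaches actual content before starting the crawl:

// Assumption: setFollowRedirectsTillContent() exists in this fork as in the
// original phpcrawl API; it should make the crawler follow the 301 and hand
// handleDocumentInfo() the target page instead of the redirect response itself.
$crawler->setFollowRedirectsTillContent(true);
$crawler->go();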

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned

I get this error when I call $report = $crawler->getProcessReport();

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned in vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php:91
Stack trace:
#0 vendor/brittainmedia/phpcrawl/libs/PHPCrawler.php(958): PHPCrawl\PHPCrawlerBenchmark::getElapsedTime('crawling_proces...')
#1 /myScript.php(197): PHPCrawl\PHPCrawler->getProcessReport()
#2 {main}
  thrown in /var/hpwsites/u_ruiten/website/html/webroot/pub/crawl/vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php on line 91

I'm using version 0.9.5.
What could be the cause of this?
