serp-spider / core
:spider: The PHP SERP Spider - A search engine scraper
Home Page: https://serp-spider.github.io/
License: Other
As discussed in #14, it would be useful to have URL normalization as described here: https://tools.ietf.org/html/rfc3986#section-6.2.2
The following resources look perfect for that:
@RubtsovAV You were talking about zend uri, but I said that using a full uri abstraction library has too many drawbacks and limitations for SERPS. Please take a look at sabre/uri: it does not provide a full uri implementation; instead it gives some tools to manipulate uris. It looks perfect for that.
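For reference, sabre/uri exposes plain functions rather than a full uri object model. A minimal sketch of what RFC 3986 normalization would look like with it (assuming the package is installed via composer; exact output may vary by version):

```php
<?php
// Sketch only: assumes sabre/uri is installed (composer require sabre/uri).
require 'vendor/autoload.php';

use function Sabre\Uri\normalize;
use function Sabre\Uri\resolve;

// RFC 3986 6.2.2 normalization: removes dot segments, normalizes case, etc.
echo normalize('http://example.org/a/../b/./c'), "\n"; // dot segments removed

// Resolving a relative reference against a base uri:
echo resolve('http://example.org/foo/', '../bar'), "\n";
```

The appeal here is exactly what was said above: these are standalone helpers, not a mandatory uri class hierarchy, so they can back UrlArchive without replacing it.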
Sorry, but I cannot understand how to run the first installation. I installed and configured Symfony, then added all the required entries to composer.json and ran composer update. What should I do next?
Thanks
Hello!
I love your implementation for a site getter and parser.
Do you have a boilerplate or any documentation that I could use to build a generic / alternative search engine scraper?
Otherwise I'm guessing I'd be best off breaking your project down to its most basic components and creating a generator?
Thanks in advance for your time.
Following the example in the README, I get:
Uncaught Error: Call to undefined method Serps\Core\Serp\ItemPosition::getType()
Currently the parse_str replacement does not parse array values (foo[]=a&foo[]=b&foo[]=c).
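For comparison, PHP's native parse_str does handle bracketed array values, which is the behavior the replacement would need to reproduce:

```php
<?php
// Native parse_str turns repeated foo[] keys into an array;
// a custom replacement needs to reproduce this.
parse_str('foo[]=a&foo[]=b&foo[]=c', $result);
var_dump($result['foo']); // array('a', 'b', 'c')
```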
Currently url:resolve will create a url of the same class, but we might want to create it as another class.
For instance we could generate a UrlArchive from a GoogleUrl.
Currently proxies are only designed to support proxies identified by an IP, not by a domain.
Proxy documentation, methods and properties must be updated to use host instead of ip.
Currently the http tests use httpbin.org, so they are not self-contained. Additionally, proxy testing is not very stable; a proxy should be started as part of the phpunit tests.
The way exceptions work was updated and must follow this guideline: http://serp-spider.github.io/documentation/search-engine/google/handle-errors/
All packages must be updated accordingly.
Hi, I have a problem getting the serp when I set a page number.
It just passes an empty value for the "start" parameter.
I found that the QueryParam class's queryItemToString method returns only the name when the value is a number.
Thank you for the great library anyway!
shinjiro nojima
It's very common when crawling to use proxies; however, it's also feasible to bind to a specific outgoing address when the server comes with multiple IPv4 or IPv6 subnets:
https://stackoverflow.com/questions/2425651/select-outgoing-ip-for-curl-request
I guess we can edit http-client-curl/src/CurlClient.php and serps/core/src/Core/Http/Proxy.php to add another case with proxy type 'iface', using the curl options below:
(CURLOPT_INTERFACE, 'IP');
(CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
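A minimal sketch of the idea with plain curl (CURLOPT_INTERFACE and CURLOPT_IPRESOLVE are real curl options; the 'iface' proxy type and the serps wiring are just the proposal, and the address is a placeholder):

```php
<?php
// Sketch only: bind an outgoing curl request to one of the server's
// addresses. CURLOPT_INTERFACE accepts an interface name, an IP or a
// host name; 203.0.113.10 is a documentation-range placeholder.
$ch = curl_init('https://httpbin.org/ip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_INTERFACE, '203.0.113.10');
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4); // force IPv4 resolution
$body = curl_exec($ch);
curl_close($ch);
```

In the serps integration this pair of options would presumably be set by CurlClient whenever a Proxy of the hypothetical 'iface' type is configured.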
If you just follow the README:
$ composer require serps/serps
[InvalidArgumentException]
Could not find package serps/serps at any version for your minimum-stability (stable). Check the package spelling or your minimum-stability
require [--dev] [--prefer-source] [--prefer-dist] [--no-progress] [--no-suggest] [--no-update] [--no-scripts] [--update-no-dev] [--update-with-dependencies] [--ignore-platform-reqs] [--prefer-stable] [--prefer-lowest] [--sort-packages] [-o|--optimize-autoloader] [-a|--classmap-authoritative] [--apcu-autoloader] [--] [<packages>]...
Solved by adding the requirements directly to composer.json:
"serps/core": "*@alpha",
"serps/http-client-curl": "*@alpha",
"serps/search-engine-google": "*@alpha"
The Url base structure has a design issue: the constructor is frozen because it is defined in the base interface.
The constructor must be removed from the interface and, instead, a static build method will be implemented.
Urls will also be creatable from an array (as returned by parse_url).
In addition, the previously unsupported port, user and pass components are also implemented.
resolve and resolveAsString are now 2 distinct methods; the reason is that resolve could return either a url instance or a string, making things hard to manage in some edge cases.
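A hypothetical sketch of the direction described above (the interface and method names are illustrative, not the final API):

```php
<?php
// Illustrative only: what removing the constructor from the base
// interface in favor of static factories might look like.
interface UrlInterface
{
    // No __construct() here: implementations stay free to define their own.
    public static function build(
        $scheme = null, $host = null, $port = null,
        $user = null, $pass = null, $path = null,
        $query = null, $fragment = null
    );

    /** Build a url from the array shape returned by parse_url(). */
    public static function fromArray(array $parts);
}
```

With the constructor out of the interface, GoogleUrl and UrlArchive can each keep their own constructor signatures while still being buildable through a uniform entry point.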
Media interface should be able to do the following:
$media->getMimeType(); // image/png
$media->save('filename.%extension'); // filename.png
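The %extension placeholder would have to be derived from the mime type. A small sketch of that mapping step (the helper name and extension table are hypothetical):

```php
<?php
// Hypothetical helper: resolve 'filename.%extension' from a mime type.
function resolveMediaFilename($template, $mimeType)
{
    $map = ['image/png' => 'png', 'image/jpeg' => 'jpg', 'image/gif' => 'gif'];
    $ext = isset($map[$mimeType]) ? $map[$mimeType] : 'bin';
    return str_replace('%extension', $ext, $template);
}

echo resolveMediaFilename('filename.%extension', 'image/png'); // filename.png
```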
An error manager should be available for parsers. The reason is that the way errors are managed depends on the user of the library.
Sometimes people want the script execution to fail, sometimes people just want the value to be null, and that will still depend on the kind of error.
The idea is to use an error manager that can recognize the error type and act according to the detected error.
Example with google client:
Additionally, everything will go into an error bag containing the name/attributes of the error, the description of the error, a trace of the error (like exceptions), and the resolution chosen.
Since this is not repository specific, it will be implemented as a component and added as a dependency of the packages that need it. For instance we can define it globally for google with GoogleClient::registerParsingErrorHandler($errorHandler)
or per google client with $googleClient->setErrorHandler($errorHandler);
other packages can then do the same.
The way errors will be detected is still unclear; the obvious option would be namespaced identifiers of the form:
NAMESPACE::ITEM::DETAILS
e.g. GOOGLE::PARSING::CLASSICAL::TITLE
with wildcard matching such as GOOGLE::PARSING::*
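A hedged sketch of how the namespaced matching could drive such a handler (registerParsingErrorHandler/setErrorHandler are the names from the proposal above; the handler class itself is hypothetical):

```php
<?php
// Hypothetical error manager: matches namespaced error names, with '*'
// as a wildcard, and decides how the parser should react.
class ParsingErrorHandler
{
    private $rules = [];

    public function on($pattern, callable $action)
    {
        // 'GOOGLE::PARSING::*' becomes a regex matching any detail level.
        $regex = '/^' . str_replace('\*', '.*', preg_quote($pattern, '/')) . '$/';
        $this->rules[$regex] = $action;
    }

    public function handle($errorName)
    {
        foreach ($this->rules as $regex => $action) {
            if (preg_match($regex, $errorName)) {
                return $action($errorName);
            }
        }
        throw new \RuntimeException("Unhandled parsing error: $errorName");
    }
}

$handler = new ParsingErrorHandler();
// Null out a missing title instead of failing:
$handler->on('GOOGLE::PARSING::CLASSICAL::TITLE', function () { return null; });
// Fail hard on any other google parsing error:
$handler->on('GOOGLE::PARSING::*', function ($name) {
    throw new \RuntimeException($name);
});
```

Rules are checked in registration order, so the more specific pattern has to be registered before the wildcard one.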
I'm pretty impressed with what you have here. However, I've noticed that the new "Featured Snippets" for all search results are being included as the Rank #1 serp (via the provided getOnPagePosition method). There should instead be a custom method for determining whether a Classical result is actually a Featured Snippet (I'm currently seeing this as a Rank #0 result across the web, and the featured-snippet result is usually repeated within the top ten serp results). I just came across your library today, so I haven't really gotten into it very much, but I'll likely be trying to implement a custom parsing callback function to try and identify those featured snippets in the meantime. Hopefully this feature enhancement will rank with you. :) Thanks.
Hello,
I sometimes get this error:
PHP Notice: Uninitialized string offset: 1 in ./spider/vendor/serps/core/src/Core/Url/UrlArchiveTrait.php on line 208
Is this repository maintained? I can't see or find any contact information to ask you personally.
https://github.com/serp-spider/core/blob/v0.1.x/src/Core/Captcha/AsyncCaptchaSolvingCallback.php#L44
This uses a mix of seconds and microseconds (that are actually milliseconds). Make everything use milliseconds.
I got this error:
Fatal error: Declaration of DOMElement::getAttribute(string $qualifiedName) must be compatible with Serps\Core\Dom\DomNodeInterface::getAttribute($name) in .../.../vendor/serps/core/src/Core/Dom/DomElement.php on line 0
Hello good afternoon
Error: Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given
Following the example you provided in
https://serp-spider.github.io/documentation/overview/
If I use the following script
use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;
$googleUrl = 'https://www.google.com.br/';
$proxy = '';
$googleClient = new GoogleClient(new CurlClient());
$googleClient->query($googleUrl, $proxy);
dump($googleClient);
and the following error is returned
Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 20 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 33
If I use the following script
use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;
$googleUrl = 'https://www.google.com.br/';
$proxy = '';
$googleClient = new GoogleClient();
$googleClient->query($googleUrl, $proxy);
dump($googleClient);
the following error is returned
Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::query() must be an instance of Serps\SearchEngine\Google\GoogleUrlInterface, string given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 22 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 48
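For reference, both fatal errors point at the same two things: the constructor wants a BrowserInterface (not a raw CurlClient) and query() wants a GoogleUrlInterface (not a plain url string). Based on the documented overview example, the call would look roughly like this (the user agent string is just an example, and exact signatures may differ between versions):

```php
<?php
require 'vendor/autoload.php';

use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;
use Serps\SearchEngine\Google\GoogleClient;
use Serps\SearchEngine\Google\GoogleUrl;

// Wrap the http client in a browser, as the constructor signature requires.
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0';
$browser = new Browser(new CurlClient(), $userAgent);
$googleClient = new GoogleClient($browser);

// query() expects a GoogleUrlInterface, not a string.
$googleUrl = new GoogleUrl('google.com.br');
$googleUrl->setSearchTerm('simpsons');

$response = $googleClient->query($googleUrl);
```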
Hi @gsouf
Three days ago we pulled the latest code.
For the last 3 days we have been getting an error in UrlArchiveTrait while calling this function: $response->getNaturalResults();
It throws the error
ErrorException in UrlArchiveTrait.php line 322:
class_implements(): Class string does not exist and could not be loaded
Are you not facing this error, or did you make some changes in the last 3 days that fix it?
Cookies and cookie jars should be serializable, so that we can save them and use them in other scripts.
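A minimal sketch of the intended round trip, using plain json (the cookie array shape and any import method on the jar are hypothetical here):

```php
<?php
// Hypothetical: persist cookies between script runs via json.
$cookies = [
    ['name' => 'NID', 'value' => 'abc', 'domain' => '.google.com', 'expires' => 1893456000],
];

// Script 1: save the cookie data.
file_put_contents('cookies.json', json_encode($cookies));

// Script 2: restore it and feed it back into a cookie jar,
// e.g. something like $jar->import($restored);
$restored = json_decode(file_get_contents('cookies.json'), true);
```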
Phantomjs supports screenshots, but maybe wkhtmltopdf or some alternative could be used from any http adapter to make screenshots from the dom.