serp-spider / core
:spider: The PHP SERP Spider - A search engine scraper
Home Page: https://serp-spider.github.io/
License: Other
As discussed in #14, it would be useful to have URL normalization as described here: https://tools.ietf.org/html/rfc3986#section-6.2.2
The following resources look perfect for that:
@RubtsovAV You were talking about zend uri, but I said that using a full uri abstraction library has too many drawbacks and limitations for SERPS. Please take a look at sabre/uri: it does not provide a full uri implementation; instead it gives some tools to manipulate uris. It looks perfect for that.
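For reference, sabre/uri exposes plain functions rather than a full uri object model. A minimal sketch of what RFC 3986 normalization would look like with it (assuming the package is installed via composer; exact output may vary by version):

```php
<?php
// Sketch only: assumes sabre/uri is installed (composer require sabre/uri).
require 'vendor/autoload.php';

use function Sabre\Uri\normalize;
use function Sabre\Uri\resolve;

// RFC 3986 6.2.2 normalization: removes dot segments, normalizes case, etc.
echo normalize('http://example.org/a/../b/./c'), "\n"; // dot segments removed

// Resolving a relative reference against a base uri:
echo resolve('http://example.org/foo/', '../bar'), "\n";
```

The appeal here is exactly what was said above: these are standalone helpers, not a mandatory uri class hierarchy, so they can back UrlArchive without replacing it.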
Sorry, but I cannot understand how to run the first installation. I installed and configured Symfony, then added all the required entries to composer.json and ran composer update. What should I do next?
Thanks
Hello!
I love your implementation for a site getter and parser.
Do you have a boilerplate or any documentation that I could use to build a generic / alternative search engine scraper?
Otherwise I'm guessing I'd be best off breaking your project down to its most basic components and creating a generator?
Thanks in advance for your time.
Following the example in the README, I get:
Uncaught Error: Call to undefined method Serps\Core\Serp\ItemPosition::getType()
Currently the parse_str replacement does not parse array values (foo[]=a&foo[]=b&foo[]=c).
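For comparison, PHP's native parse_str does handle bracketed array values, which is the behavior the replacement would need to reproduce:

```php
<?php
// Native parse_str turns repeated foo[] keys into an array;
// a custom replacement needs to reproduce this.
parse_str('foo[]=a&foo[]=b&foo[]=c', $result);
var_dump($result['foo']); // array('a', 'b', 'c')
```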
Currently url:resolve will create a url of the same class, but we might want to create it as another class.
For instance we could generate a UrlArchive from a GoogleUrl.
Currently proxies are only designed to support proxies identified by an IP, not by a domain.
Proxy documentation, methods and properties must be updated to use host instead of ip.
Currently the http tests use httpbin.org, so they are not self-contained. Additionally, proxy testing is not very stable; a proxy should be started as part of the phpunit tests.
The way exceptions work was updated and must follow this guideline: http://serp-spider.github.io/documentation/search-engine/google/handle-errors/
All packages must be updated accordingly.
Hi, I have a problem getting the serp when I set a page number.
It just passes an empty value for the "start" parameter.
I found that the QueryParam class's queryItemToString method returns only the name when the value is a number.
Thank you for the great library anyway!
shinjiro nojima
It's very common when crawling to use proxies; however, it's also feasible to bind to a specific outgoing address when the server comes with multiple IPv4 or IPv6 subnets:
https://stackoverflow.com/questions/2425651/select-outgoing-ip-for-curl-request
I guess we can edit http-client-curl/src/CurlClient.php and serps/core/src/Core/Http/Proxy.php to add another case with proxy type 'iface', using the curl options below:
(CURLOPT_INTERFACE, 'IP');
(CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
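A minimal sketch of the idea with plain curl (CURLOPT_INTERFACE and CURLOPT_IPRESOLVE are real curl options; the 'iface' proxy type and the serps wiring are just the proposal, and the address is a placeholder):

```php
<?php
// Sketch only: bind an outgoing curl request to one of the server's
// addresses. CURLOPT_INTERFACE accepts an interface name, an IP or a
// host name; 203.0.113.10 is a documentation-range placeholder.
$ch = curl_init('https://httpbin.org/ip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_INTERFACE, '203.0.113.10');
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4); // force IPv4 resolution
$body = curl_exec($ch);
curl_close($ch);
```

In the serps integration this pair of options would presumably be set by CurlClient whenever a Proxy of the hypothetical 'iface' type is configured.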
If you just follow the README:
$ composer require serps/serps
[InvalidArgumentException]
Could not find package serps/serps at any version for your minimum-stability (stable). Check the package spelling or your minimum-stability
require [--dev] [--prefer-source] [--prefer-dist] [--no-progress] [--no-suggest] [--no-update] [--no-scripts] [--update-no-dev] [--update-with-dependencies] [--ignore-platform-reqs] [--prefer-stable] [--prefer-lowest] [--sort-packages] [-o|--optimize-autoloader] [-a|--classmap-authoritative] [--apcu-autoloader] [--] [<packages>]...
Solved by adding the requirements directly to composer.json:
"serps/core": "*@alpha",
"serps/http-client-curl": "*@alpha",
"serps/search-engine-google": "*@alpha"
The Url base structure has a design issue: the constructor is frozen because it is defined in the base interface.
The constructor must be removed from the interface and, instead, a static build method will be implemented.
Urls will also be creatable from an array (as returned by parse_url).
In addition, the previously unsupported port, user and pass components are also implemented.
resolve and resolveAsString are now 2 distinct methods; the reason is that resolve could return either a url instance or a string, making things hard to manage in some edge cases.
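A hypothetical sketch of the direction described above (the interface and method names are illustrative, not the final API):

```php
<?php
// Illustrative only: what removing the constructor from the base
// interface in favor of static factories might look like.
interface UrlInterface
{
    // No __construct() here: implementations stay free to define their own.
    public static function build(
        $scheme = null, $host = null, $port = null,
        $user = null, $pass = null, $path = null,
        $query = null, $fragment = null
    );

    /** Build a url from the array shape returned by parse_url(). */
    public static function fromArray(array $parts);
}
```

With the constructor out of the interface, GoogleUrl and UrlArchive can each keep their own constructor signatures while still being buildable through a uniform entry point.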
Media interface should be able to do the following:
$media->getMimeType(); // image/png
$media->save('filename.%extension'); // filename.png
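The %extension placeholder would have to be derived from the mime type. A small sketch of that mapping step (the helper name and extension table are hypothetical):

```php
<?php
// Hypothetical helper: resolve 'filename.%extension' from a mime type.
function resolveMediaFilename($template, $mimeType)
{
    $map = ['image/png' => 'png', 'image/jpeg' => 'jpg', 'image/gif' => 'gif'];
    $ext = isset($map[$mimeType]) ? $map[$mimeType] : 'bin';
    return str_replace('%extension', $ext, $template);
}

echo resolveMediaFilename('filename.%extension', 'image/png'); // filename.png
```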
An error manager should be available for parsers. The reason is that the way errors are managed depends on the user of the library.
Sometimes people want the script execution to fail, sometimes people just want the value to be null, and that will still depend on the kind of error.
The idea is to use an error manager that can recognize the error type and act according to the detected error.
Example with google client:
Additionally, everything will go into an error bag containing the name/attributes of the error, the description of the error, a trace of the error (like exceptions), and the resolution chosen.
Since this is not repository specific, it will be implemented as a component and added as a dependency of the packages that need it. For instance we can define it globally for google with GoogleClient::registerParsingErrorHandler($errorHandler)
or per google client with $googleClient->setErrorHandler($errorHandler);
other packages can then do the same.
The way errors will be detected is still unclear; the obvious option would be namespaced identifiers of the form:
NAMESPACE::ITEM::DETAILS
e.g. GOOGLE::PARSING::CLASSICAL::TITLE
with wildcard matching such as GOOGLE::PARSING::*
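A hedged sketch of how the namespaced matching could drive such a handler (registerParsingErrorHandler/setErrorHandler are the names from the proposal above; the handler class itself is hypothetical):

```php
<?php
// Hypothetical error manager: matches namespaced error names, with '*'
// as a wildcard, and decides how the parser should react.
class ParsingErrorHandler
{
    private $rules = [];

    public function on($pattern, callable $action)
    {
        // 'GOOGLE::PARSING::*' becomes a regex matching any detail level.
        $regex = '/^' . str_replace('\*', '.*', preg_quote($pattern, '/')) . '$/';
        $this->rules[$regex] = $action;
    }

    public function handle($errorName)
    {
        foreach ($this->rules as $regex => $action) {
            if (preg_match($regex, $errorName)) {
                return $action($errorName);
            }
        }
        throw new \RuntimeException("Unhandled parsing error: $errorName");
    }
}

$handler = new ParsingErrorHandler();
// Null out a missing title instead of failing:
$handler->on('GOOGLE::PARSING::CLASSICAL::TITLE', function () { return null; });
// Fail hard on any other google parsing error:
$handler->on('GOOGLE::PARSING::*', function ($name) {
    throw new \RuntimeException($name);
});
```

Rules are checked in registration order, so the more specific pattern has to be registered before the wildcard one.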
I'm pretty impressed with what you have here. However, I've noticed that the new "Featured Snippets" for all search results are being included as the Rank #1 serp (via the provided getOnPagePosition method). There should instead be a custom method for determining whether a Classical result is actually a Featured Snippet (I'm currently seeing this as a Rank #0 result across the web, and the featured-snippet result is usually repeated within the top ten serp results). I just came across your library today, so I haven't really gotten into it very much, but I'll likely be trying to implement a custom parsing callback function to try and identify those featured snippets in the meantime. Hopefully this feature enhancement will rank with you. :) Thanks.
Hello,
I sometimes get this error:
PHP Notice: Uninitialized string offset: 1 in ./spider/vendor/serps/core/src/Core/Url/UrlArchiveTrait.php on line 208
Is this repository maintained? I can't see or find any contact information to ask you personally.
https://github.com/serp-spider/core/blob/v0.1.x/src/Core/Captcha/AsyncCaptchaSolvingCallback.php#L44
This uses a mix of seconds and microseconds (that are actually milliseconds). Make everything use milliseconds.
I got this error:
Fatal error: Declaration of DOMElement::getAttribute(string $qualifiedName) must be compatible with Serps\Core\Dom\DomNodeInterface::getAttribute($name) in .../.../vendor/serps/core/src/Core/Dom/DomElement.php on line 0
Hello good afternoon
Error: Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given
Following the example you provided in
https://serp-spider.github.io/documentation/overview/
If I use the following script
use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;
$googleUrl = 'https://www.google.com.br/';
$proxy = '';
$googleClient = new GoogleClient(new CurlClient());
$googleClient->query($googleUrl, $proxy);
dump($googleClient);
and the following error is returned
Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 20 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 33
If I use the following script
use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;
$googleUrl = 'https://www.google.com.br/';
$proxy = '';
$googleClient = new GoogleClient();
$googleClient->query($googleUrl, $proxy);
dump($googleClient);
the following error is returned
Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::query() must be an instance of Serps\SearchEngine\Google\GoogleUrlInterface, string given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 22 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 48
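For reference, both fatal errors point at the same two things: the constructor wants a BrowserInterface (not a raw CurlClient) and query() wants a GoogleUrlInterface (not a plain url string). Based on the documented overview example, the call would look roughly like this (the user agent string is just an example, and exact signatures may differ between versions):

```php
<?php
require 'vendor/autoload.php';

use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;
use Serps\SearchEngine\Google\GoogleClient;
use Serps\SearchEngine\Google\GoogleUrl;

// Wrap the http client in a browser, as the constructor signature requires.
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0';
$browser = new Browser(new CurlClient(), $userAgent);
$googleClient = new GoogleClient($browser);

// query() expects a GoogleUrlInterface, not a string.
$googleUrl = new GoogleUrl('google.com.br');
$googleUrl->setSearchTerm('simpsons');

$response = $googleClient->query($googleUrl);
```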
Hi @gsouf
Three days ago we pulled the latest code.
For the last 3 days we have been getting an error in UrlArchiveTrait while calling this function: $response->getNaturalResults();
It throws the error
ErrorException in UrlArchiveTrait.php line 322:
class_implements(): Class string does not exist and could not be loaded
Are you not facing this error, or did you make some changes in the last 3 days that fix it?
Cookies and cookie jars should be serializable, so that we can save them and use them in other scripts.
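A minimal sketch of the intended round trip, using plain json (the cookie array shape and any import method on the jar are hypothetical here):

```php
<?php
// Hypothetical: persist cookies between script runs via json.
$cookies = [
    ['name' => 'NID', 'value' => 'abc', 'domain' => '.google.com', 'expires' => 1893456000],
];

// Script 1: save the cookie data.
file_put_contents('cookies.json', json_encode($cookies));

// Script 2: restore it and feed it back into a cookie jar,
// e.g. something like $jar->import($restored);
$restored = json_decode(file_get_contents('cookies.json'), true);
```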
Phantomjs supports screenshots, but maybe wkhtmltopdf or some alternative could be used from any http adapter to make screenshots from the dom.