
core's People

Contributors

atefbb, gsouf, hughsaffar, kurorido, rubtsovav, shiftas


core's Issues

Url normalization

As discussed in #14, it would be useful to have url normalization as described here: https://tools.ietf.org/html/rfc3986#section-6.2.2

The following resources look perfect for that:

@RubtsovAV You were talking about zend uri, but I said that using a full uri abstraction library has too many drawbacks and limitations for SERPS. Please have a look at sabre/uri: it does not provide a full uri implementation, but instead gives some tools to manipulate uris. It looks perfect for that.
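
For reference, a minimal sketch of what sabre/uri provides for this, assuming the package is installed with composer (the exact normalized output may vary slightly by version):

use function Sabre\Uri\normalize;

// Lower-cases the scheme and host and removes dot segments,
// along the lines of RFC 3986 section 6.2.2.
echo normalize('HTTP://Example.COM/foo/./bar/../baz');
// prints something like: http://example.com/foo/baz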

First configuration

Sorry, but I cannot work out how to do the initial installation. I installed and configured Symfony, then added all the configuration entries to composer.json and ran composer update. What should I do next?

Thanks

Scraper Boilerplate

Hello!

I love your implementation for a site getter and parser.

Do you have a boilerplate or any documentation that I could use to build a generic / alternative search engine scraper?

Otherwise I'm guessing I'd be best off breaking your project down to its most basic components and creating a generator?

Thanks in advance for your time.

Proxy::getIp should be Proxy::getHost

Currently proxies are only meant to support an IP address, not a domain name.

Proxy documentation, methods and properties must be updated to use host instead of ip.

Make http test internal

Currently the http tests use httpbin.org, so they are not self-contained. Additionally, proxy testing is not very stable; a proxy should be started together with the phpunit tests.
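
A rough sketch of what "internal" could look like, using PHP's built-in web server from a phpunit bootstrap (the tests/fixtures path and the port are illustrative assumptions):

// tests/bootstrap.php
$port = 8989;
$docRoot = __DIR__ . '/fixtures';

// Start a local fixture server so the tests no longer hit httpbin.org.
$server = proc_open(
    sprintf('php -S 127.0.0.1:%d -t %s', $port, escapeshellarg($docRoot)),
    [1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $pipes
);

usleep(200000); // give the server a moment to boot
register_shutdown_function(function () use ($server) {
    proc_terminate($server); // stop it when phpunit exits
});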

problem to get serp when setting page number

Hi, I have a problem getting the SERP when I set a page number.
The "start" parameter just ends up empty.

I found that the QueryParam class's queryItemToString method returns only the name when the value is a number type.
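
For illustration, a hypothetical version of the fix; the real QueryParam internals and accessor names may differ:

// QueryParam::queryItemToString(), hypothetical rewrite
public function queryItemToString()
{
    $value = $this->getValue(); // assumed accessor

    if (null === $value || '' === $value) {
        return $this->getName();
    }

    // Cast so that numeric values such as start=10 are kept instead of dropped.
    return $this->getName() . '=' . urlencode((string) $value);
}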

Thank you for the great library anyway!

shinjiro nojima

Allow selecting the server IPv4 or IPv6 address for outgoing requests

It's very common to use proxies when it comes to crawling,
however it's also feasible when the server comes with multiple IPv4 or IPv6 subnets:
https://stackoverflow.com/questions/2425651/select-outgoing-ip-for-curl-request

I guess we can edit http-client-curl/src/CurlClient.php and serps/core/src/Core/Http/Proxy.php
to add another case with a proxy type of 'iface',

while using the curl options below:

(CURLOPT_INTERFACE, 'IP');
(CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
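
For illustration, a minimal standalone curl example using those options (the URL and the local IP are placeholders):

$handle = curl_init('https://www.google.com/search?q=example');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_INTERFACE, '203.0.113.10');    // local IP, interface name or host name to send from
curl_setopt($handle, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4); // force IPv4 (CURL_IPRESOLVE_V6 for IPv6)
$body = curl_exec($handle);
curl_close($handle);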

Could not find package serps/serps at any version for your minimum-stability (stable)

If you just follow the README:

$ composer require serps/serps

  [InvalidArgumentException]
  Could not find package serps/serps at any version for your minimum-stability (stable). Check the package spelling or your minimum-stability

require [--dev] [--prefer-source] [--prefer-dist] [--no-progress] [--no-suggest] [--no-update] [--no-scripts] [--update-no-dev] [--update-with-dependencies] [--ignore-platform-reqs] [--prefer-stable] [--prefer-lowest] [--sort-packages] [-o|--optimize-autoloader] [-a|--classmap-authoritative] [--apcu-autoloader] [--] [<packages>]...

Solved by adding it all directly to composer.json:

      "serps/core": "*@alpha",
        "serps/http-client-curl": "*@alpha",
        "serps/search-engine-google": "*@alpha"

Review of url structure

The base url structure has a design issue: the constructor is frozen because it is defined in the base interface.
The constructor must be removed from the interface, and instead a static build method will be implemented.

Urls will also be creatable from an array (as returned by parse_url).

In addition, the currently unsupported port, user and pass components will also be implemented.
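
A hypothetical sketch of what the revised structure could look like (method names are illustrative, not the final API):

interface UrlInterface
{
    // No constructor here, so implementations remain free to define their own.

    /** Build a url from a string. */
    public static function build($urlString);

    /** Build a url from parse_url()-style parts, including port, user and pass. */
    public static function fromArray(array $parts);
}

// e.g. SomeUrl::fromArray(parse_url('https://user:pass@example.com:8080/path?q=1'));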

introduce resolveAsString

resolve and resolveAsString are now two distinct methods. The reason is that resolve could previously return either a url instance or a string, which made things hard to manage in some edge cases.
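
Illustrative signatures for the split (the interface name is hypothetical; only the return types matter here):

interface UrlResolverInterface
{
    /** @return self    always a url instance */
    public function resolve($url);

    /** @return string  always a plain string */
    public function resolveAsString($url);
}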

Add a parser error manager

An error manager should be available for parsers. The reason is that the way errors are managed depends on the user of the library.

Sometimes people want script execution to fail, sometimes people just want the value to be null, and that will also depend on what kind of error occurred.

The idea would be to use an error manager that can recognize the error type and act according to the detected error.

Example with google client:

  • we want to throw exceptions when an error happens while parsing a classical result, but not for other result types
  • but we want to return null when the parser fails to extract the title

Additionally, everything will go in an error bag containing the name/attributes of the error, the description of the error, a trace of the error (like exceptions), and the resolution chosen.

This is very repository specific, so it will be implemented as a component and added as a dependency of the packages that need it. For instance, we can define it globally for google with GoogleClient::registerParsingErrorHandler($errorHandler), or per google client with $googleClient->setErrorHandler($errorHandler); other packages can then do the same.

The way errors will be detected is still unclear; the obvious option would be:

  • redis-like keys NAMESPACE::ITEM::DETAILS, e.g. GOOGLE::PARSING::CLASSICAL::TITLE,
    matched with a pattern such as GOOGLE::PARSING::* (a sketch follows the list)
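
For illustration only, a hypothetical sketch of that key-matching idea; none of these class or method names exist in SERPS yet:

class ParsingErrorHandler
{
    /** @var array pattern => callable, checked in registration order */
    private $rules = [];

    public function on($pattern, callable $action)
    {
        $this->rules[$pattern] = $action;
    }

    public function handle($key, \Exception $error)
    {
        foreach ($this->rules as $pattern => $action) {
            if (fnmatch($pattern, $key)) {
                return $action($error);
            }
        }
        throw $error; // no matching rule: fail loudly by default
    }
}

// Usage idea, matching the google example above:
// $handler->on('GOOGLE::PARSING::*::TITLE', function ($e) { return null; });
// $handler->on('GOOGLE::PARSING::CLASSICAL::*', function ($e) { throw $e; });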

Featured Snippets being included in Classical ranking

I'm pretty impressed with what you have here. However, I've noticed that the new "Featured Snippets" for all search results are being included as the Rank #1 serp (via the provided getOnPagePosition method). There should instead be a custom method for determining whether a Classical result is actually a Featured Snippet (I'm currently seeing this as a Rank #0 result across the web, and the featured-snippet result is usually repeated within the top ten serp results). I just came across your library today, so I haven't really gotten into it very much, but I'll likely be trying to implement a custom parsing callback function to try and identify those featured snippets in the meantime. Hopefully this feature enhancement will rank with you. :) Thanks.

Maintained?

Is this repository maintained? I can't see or find any contact information to ask you personally.

Review urls structure

  • Think about a common interface instead of UrlArchive
  • Review the relation between urls and urls that can generate requests (consider moving GoogleUrl::generateRequest away)
  • Review the url factory (currently the constructor is used, that's not flexible)
  • SerpUrlInterface should extend some UrlInterface

Error: Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given

Hello good afternoon

Error: Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given

Following the example you provided in

https://serp-spider.github.io/documentation/overview/

If I use the following script

use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;

$googleUrl = 'https://www.google.com.br/';
$proxy = '';

$googleClient = new GoogleClient(new CurlClient());

$googleClient->query($googleUrl, $proxy);

dump($googleClient);

and the following error is returned

Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::__construct() must be an instance of Serps\Core\Browser\BrowserInterface, instance of Serps\HttpClient\CurlClient given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 20 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 33

If I use the following script

use Serps\SearchEngine\Google\GoogleClient;
use Serps\HttpClient\CurlClient;

$googleUrl = 'https://www.google.com.br/';
$proxy = '';

$googleClient = new GoogleClient();

$googleClient->query($googleUrl, $proxy);

dump($googleClient);

the following error is returned

Catchable fatal error: Argument 1 passed to Serps\SearchEngine\Google\GoogleClient::query() must be an instance of Serps\SearchEngine\Google\GoogleUrlInterface, string given, called in /home/marcio/dados/public_html/git/manual/google-appengine/index.php on line 22 and defined in /home/marcio/dados/public_html/git/manual/google-appengine/vendor/serps/search-engine-google/src/GoogleClient.php on line 48
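
Based on the type hints in the two fatal errors and the documentation overview linked above, the client expects a BrowserInterface and a GoogleUrlInterface rather than a raw CurlClient and a string. A sketch along these lines should work; the exact Browser constructor arguments may differ between alpha versions:

use Serps\Core\Browser\Browser;
use Serps\HttpClient\CurlClient;
use Serps\SearchEngine\Google\GoogleClient;
use Serps\SearchEngine\Google\GoogleUrl;

$userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60 Safari/537.36';

$browser = new Browser(new CurlClient(), $userAgent);
$googleClient = new GoogleClient($browser);

$googleUrl = new GoogleUrl();            // google.com by default in the documented examples
$googleUrl->setSearchTerm('simpsons');

$response = $googleClient->query($googleUrl);
dump($response->getNaturalResults());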

class_implements(): Class string does not exist and could not be loaded

Hi @gsouf,
Three days ago we pulled the latest code.
For the last three days we have been getting an error in UrlArchiveTrait while calling this function: $response->getNaturalResults();

It throws the error:
ErrorException in UrlArchiveTrait.php line 322:
class_implements(): Class string does not exist and could not be loaded

Are you not facing this error, or did you make some changes in the last three days that fix this?

Make cookie serializable

Cookies and cookie jars should be serializable, so that we can save them and reuse them in other scripts.
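
A rough sketch of the intended usage once this is possible (getCookieJar() is an assumed accessor, not a confirmed API):

// First script: persist the jar after some requests have been made.
$cookieJar = $browser->getCookieJar();
file_put_contents('cookies.dat', serialize($cookieJar));

// Later script: restore it and keep browsing with the same cookies.
$cookieJar = unserialize(file_get_contents('cookies.dat'));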

Add a way to make screenshots

PhantomJS supports screenshots, but maybe wkhtmltopdf or an alternative could be used with any http adapter to make a screenshot from the DOM.
