
gocrawl's Introduction

gocrawl

gocrawl is a polite, slim and concurrent web crawler written in Go.

For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.

Features

  • Full control over the URLs to visit, inspect and query (using a pre-initialized goquery document)
  • Crawl delays applied per host
  • Obedience to robots.txt rules (using the robotstxt.go library)
  • Concurrent execution using goroutines
  • Configurable logging
  • Open, customizable design providing hooks into the execution logic

Installation and dependencies

gocrawl depends on the following userland libraries:

  • goquery
  • purell
  • robotstxt.go

It requires Go 1.1+ because of its indirect dependency on golang.org/x/net/html. To install:

go get github.com/PuerkitoBio/gocrawl

To install a previous version, you have to git clone https://github.com/PuerkitoBio/gocrawl into your $GOPATH/src/github.com/PuerkitoBio/gocrawl/ directory, then run (for example) git checkout v0.3.2 to check out a specific version, and go install to build and install the Go package.
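
Since v1.1.0 the project uses Go modules (see the changelog below), so in a module-aware setup a specific tagged version can presumably also be requested directly, for example:

go get github.com/PuerkitoBio/gocrawl@v1.1.0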

Changelog

  • 2021-05-19 : Use Go modules for dependencies. Tag v1.1.0.
  • 2019-07-22 : Use pre-compiled matchers for goquery (thanks @mikefaraponov). Tag v1.0.1.
  • 2016-11-20 : Fix log message so that it prints enqueued URLs (thanks @oherych). Tag as v1.0.0.
  • 2016-05-24 : Set the *URLContext.SourceURL() and *URLContext.NormalizedSourceURL() to the original URL on redirections (see #55). Thanks to github user @tmatsuo.
  • 2016-02-24 : Always use Options.UserAgent to make requests, use Options.RobotUserAgent only for robots.txt policy matching. Lint and vet the code a bit, better godoc documentation.
  • 2014-11-06 : Change import paths of net/html to golang.org/x/net/html (see https://groups.google.com/forum/#!topic/golang-nuts/eD8dh3T9yyA).
  • v0.4.1 : now go-getable, since goquery is go-getable too.
  • v0.4.0 : BREAKING CHANGES major refactor, API changes:
      • Use an *URLContext structure as first argument to all Extender interface functions that are called in the context of an URL, instead of a simple *url.URL pointer that was sometimes normalized, sometimes not.
      • Remove the EnqueueOrigin enumeration flag. It wasn't even used by gocrawl, and it is a kind of state associated with the URL, so this feature is now generalized with the next bullet...
      • Add a State for each URL, so that the crawling process can associate arbitrary data with a given URL (for example, the ID of a record in the database). Fixes issue #14.
      • More idiomatic use of errors (ErrXxx variables, Run() now returns an error, removing the need for the EndReason enum).
      • Much simplified Filter() function. It now only returns a bool indicating if it should be visited or not. The HEAD request override feature is provided by the *URLContext structure, and can be set anywhere. The priority feature was unimplemented and has been removed from the return values; if it gets implemented it will probably be via the *URLContext structure too.
      • Start, Run, Visit, Visited and the EnqueueChan all work with the empty interface type for the URL data. While this is an inconvenience for compile-time checks, it allows for more flexibility regarding the state feature. Instead of always forcing a map[string]interface{} type even when no state is needed, gocrawl supports various types.
      • Some other internal changes, better tests.
  • v0.3.2 : Fix the high CPU usage when waiting for a crawl delay.
  • v0.3.1 : Export the HttpClient variable used by the default Fetch() implementation (see issue #9).
  • v0.3.0 : BEHAVIOR CHANGE filter done with normalized URL, fetch done with original, non-normalized URL (see issue #10).
  • v0.2.0 : BREAKING CHANGES rework extension/hooks.
  • v0.1.0 : Initial release.

Example

From example_test.go:

// Only enqueue the root and paths beginning with an "a"
var rxOk = regexp.MustCompile(`https?://duckduckgo\.com(/a.*)?$`)

// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
    gocrawl.DefaultExtender // Will use the default implementation of all but Visit and Filter
}

// Override Visit for our need.
func (x *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    // Use the goquery document or res.Body to manipulate the data
    // ...

    // Return nil and true - let gocrawl find the links
    return nil, true
}

// Override Filter for our need.
func (x *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
}

func ExampleCrawl() {
    // Set custom options
    opts := gocrawl.NewOptions(new(ExampleExtender))

    // You should always set your robot name so that it looks for the most
    // specific rules possible in robots.txt.
    opts.RobotUserAgent = "Example"
    // and reflect that in the user-agent string used to make requests,
    // ideally with a link so site owners can contact you if there's an issue
    opts.UserAgent = "Mozilla/5.0 (compatible; Example/1.0; +http://example.com)"

    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogAll

    // Play nice with ddgo when running the test!
    opts.MaxVisits = 2

    // Create crawler and start at root of duckduckgo
    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("https://duckduckgo.com/")

    // Remove "x" before Output: to activate the example (will run on go test)

    // xOutput: voluntarily fail to see log output
}

API

Gocrawl can be described as a minimalist web crawler (hence the "slim" tag, at ~1000 sloc), providing the basic engine upon which to build a full-fledged indexing machine with caching, persistence and staleness detection logic, or to use as-is for quick and easy crawling. Gocrawl itself does not attempt to detect staleness of a page, nor does it implement a caching mechanism. If an URL is enqueued to be processed, gocrawl makes a request to fetch it (provided it is allowed by robots.txt - hence the "polite" tag). There is no prioritization among the URLs to process; gocrawl assumes that all enqueued URLs must be visited at some point, and that the order in which they are visited is unimportant.

However, it does provide plenty of hooks and customizations. Instead of trying to do everything and impose a way to do it, it offers ways to manipulate and adapt it to anyone's needs.

As usual, the complete godoc reference can be found here.

Design rationale

The major use-case behind gocrawl is to crawl some web pages while respecting the constraints of robots.txt policies and while applying a good web citizen crawl delay between each request to a given host. Hence the following design decisions:

  • Each host spawns its own worker (goroutine) : This makes sense since it must first read its robots.txt data, and only then proceed sequentially, one request at a time, with the specified delay between each fetch. There are no constraints inter-host, so each separate worker can crawl independently.
  • The visitor function is called on the worker goroutine : Again, this is ok because the crawl delay is likely bigger than the time required to parse the document, so this processing will usually not penalize the performance.
  • Edge cases with no crawl delay are supported, but not optimized : In the rare but possible event when a crawl with no delay is needed (e.g.: on your own servers, or with permission outside busy hours, etc.), gocrawl accepts a null (zero) delay, but doesn't provide optimizations. That is, there is no "special path" in the code where visitor functions are de-coupled from the worker, or where multiple workers can be launched concurrently on the same host. (In fact, if this case is your only use-case, I would recommend not to use a library at all - since there is little value in it -, and simply use Go's standard libs and fetch at will with as many goroutines as are necessary.)
  • Extender interface provides the means to write a drop-in, fully encapsulated behaviour : An implementation of Extender can radically enhance the core library, with caching, persistence, different fetching strategies, etc. This is why the Extender.Start() method is somewhat redundant with the Crawler.Run() method: Run allows calling the crawler as a library, while Start makes it possible to encapsulate the logic required to select the seed URLs into the Extender. The same goes for Extender.End() and the return value of Run.

Although it could probably be used to crawl a massive amount of web pages (after all, this is fetch, visit, enqueue, repeat ad nauseam!), most realistic (and um... tested!) use-cases should be based on a well-known, well-defined limited bucket of seeds. Distributed crawling is your friend, should you need to move past this reasonable use.

Crawler

The Crawler type controls the whole execution. It spawns worker goroutines and manages the URL queue. There are two helper constructors:

  • NewCrawler(Extender) : Creates a crawler with the specified Extender instance.
  • NewCrawlerWithOptions(*Options) : Creates a crawler with a pre-initialized *Options instance.

The one and only public function is Run(seeds interface{}) error, which takes a seeds argument (the base URLs used to start crawling) that can be expressed in a number of different ways. It ends when there are no more URLs waiting to be visited, or when the Options.MaxVisits number is reached. It returns an error, which is ErrMaxVisits if this setting is what caused the crawling to stop.

The various types that can be used to pass the seeds are the following (the same types apply for the empty interfaces in `Extender.Start(interface{}) interface{}`, `Extender.Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)` and in `Extender.Visited(*URLContext, interface{})`, as well as the type of the `EnqueueChan` field):
  • string : a single URL expressed as a string
  • []string : a slice of URLs expressed as strings
  • *url.URL : a pointer to a parsed URL object
  • []*url.URL : a slice of pointers to parsed URL objects
  • map[string]interface{} : a map of URLs expressed as strings (for the key) and their associated state data
  • map[*url.URL]interface{} : a map of URLs expressed as parsed pointers to URL objects (for the key) and their associated state data

For convenience, the types gocrawl.S and gocrawl.U are provided as equivalent to the map of strings and map of URLs, respectively (so that, for example, the code can look like gocrawl.S{"http://site.com": "some state data"}).
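
For illustration, here is a hedged sketch of seeding a crawl with state attached to each URL via gocrawl.S (the URLs and state values are placeholders, ExampleExtender is the type from the example above, and the log package is assumed to be imported):

c := gocrawl.NewCrawler(new(ExampleExtender))
if err := c.Run(gocrawl.S{
    "http://site.com/":      "some state data", // state can be of any type
    "http://site.com/about": 42,                // e.g. a database record ID
}); err != nil {
    log.Println(err) // ErrMaxVisits if the MaxVisits option ended the crawl
}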

The Options type is detailed in the next section, and it offers a single constructor, NewOptions(Extender), which returns an initialized options object with defaults and the specified Extender implementation.

Hooks and customizations

The Options type provides the hooks and customizations offered by gocrawl. All but Extender are optional and have working defaults, but the UserAgent and RobotUserAgent options should be set to a custom value fitting for your project.

  • UserAgent : The user-agent string used to fetch the pages. Defaults to the Firefox 15 on Windows user-agent string. Should be changed to contain a reference to your robot's name and a contact link (see the example).

  • RobotUserAgent : The robot's user-agent string used to find a matching policy in the robots.txt file. Defaults to Googlebot (gocrawl vM.m) where M.m is the major and minor version of gocrawl. This should always be changed to a custom value such as the name of your project (see the example). See the robots exclusion protocol (full specification as interpreted by Google here) for details about the rule-matching based on the robot's user agent. It is good practice to include contact information in the user agent should the site owner need to contact you.

  • MaxVisits : The maximum number of pages visited before stopping the crawl. Probably more useful for development purposes. Note that the Crawler will send its stop signal once this number of visits is reached, but workers may be in the process of visiting other pages, so when the crawling stops, the number of pages visited will be at least MaxVisits, possibly more (worst case is MaxVisits + number of active workers). Defaults to zero, no maximum.

  • EnqueueChanBuffer : The size of the buffer for the Enqueue channel (the channel that allows the extender to arbitrarily enqueue new URLs in the crawler). Defaults to 100.

  • HostBufferFactor : The factor (multiplier) for the size of the workers map and the communication channel when SameHostOnly is set to false. When SameHostOnly is true, the Crawler knows exactly the required size (the number of different hosts based on the seed URLs), but when it is false, the size may grow exponentially. By default, a factor of 10 is used (size is set to 10 times the number of different hosts based on the seed URLs).

  • CrawlDelay : The time to wait between each request to the same host. The delay starts as soon as the response is received from the host. This is a time.Duration type, so it can be specified with 5 * time.Second for example (which is the default value, 5 seconds). If a crawl delay is specified in the robots.txt file, in the group matching the robot's user-agent, by default this delay is used instead. Crawl delay can be customized further by implementing the ComputeDelay extender function.

  • WorkerIdleTTL : The idle time-to-live allowed for a worker before it is cleared (its goroutine terminated). Defaults to 10 seconds. The crawl delay is not part of idle time, this is specifically the time when the worker is available, but there are no URLs to process.

  • SameHostOnly : Limit the URLs to enqueue only to those links targeting the same host, which is true by default.

  • HeadBeforeGet : Asks the crawler to issue a HEAD request (and a subsequent RequestGet() extender method call) before making the eventual GET request. This is set to false by default. See also the URLContext structure explained below.

  • URLNormalizationFlags : The flags to apply when normalizing the URL using the purell library. The URLs are normalized before being enqueued and passed around to the Extender methods in the URLContext structure. Defaults to the most aggressive normalization allowed by purell, purell.FlagsAllGreedy.

  • LogFlags : The level of verbosity for logging. Defaults to errors only (LogError). Can be a set of flags (e.g. LogError | LogTrace).

  • Extender : The instance implementing the Extender interface. This implements the various callbacks offered by gocrawl. Must be specified when creating a Crawler (or when creating an Options to pass to NewCrawlerWithOptions constructor). A default extender is provided as a valid default implementation, DefaultExtender. It can be used by embedding it as an anonymous field to implement a custom extender when not all methods need customization (see the example above).

The Extender interface

This last option field, Extender, is crucial in using gocrawl, so here are the details for each callback function required by the Extender interface.

  • Start : Start(seeds interface{}) interface{}. Called when Run is called on the crawler, with the seeds passed to Run as argument. It returns the data that will be used as actual seeds, so that this callback can control which seeds are processed by the crawler. See the various supported types for more information. By default, this is a passthrough, it returns the data received as argument.

  • End : End(err error). Called when the crawling ends, with the error or nil. This same error is also returned from the Crawler.Run() function. By default, this method is a no-op.

  • Error : Error(err *CrawlError). Called when a crawling error occurs. Errors do not stop the crawling execution. A CrawlError instance is passed as argument. This specialized error implementation includes - among other interesting fields - a Kind field that indicates the step where the error occurred, and an *URLContext field identifying the processed URL that caused the error. By default, this method is a no-op.

  • Log : Log(logFlags LogFlags, msgLevel LogFlags, msg string). The logging function. By default, prints to the standard error (Stderr), and outputs only the messages with a level included in the LogFlags option. If a custom Log() method is implemented, it is up to you to validate if the message should be considered, based on the level of verbosity requested (i.e. if logFlags&msgLevel == msgLevel ...), since the method always gets called for all messages.

  • ComputeDelay : ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration. Called by a worker before requesting a URL. Arguments are the host's name (the normalized form of the *url.URL.Host), the crawl delay information (includes delays from the Options struct, from the robots.txt, and the last used delay), and the last fetch information, so that it is possible to adapt to the current responsiveness of the host. It returns the delay to use.
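
For illustration, here is a hedged sketch of a custom ComputeDelay that ignores the provided delay and fetch information and simply adds random jitter to a fixed base delay (JitterExtender and the durations are made up for the example; the time and math/rand packages are assumed to be imported):

type JitterExtender struct {
    gocrawl.DefaultExtender
}

func (e *JitterExtender) ComputeDelay(host string, di *gocrawl.DelayInfo, lastFetch *gocrawl.FetchInfo) time.Duration {
    base := 2 * time.Second
    jitter := time.Duration(rand.Int63n(int64(time.Second)))
    return base + jitter // wait between 2s and 3s before the next request to this host
}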

The remaining extension functions are all called in the context of a given URL, so their first argument is always a pointer to an URLContext structure. So before documenting these methods, here is an explanation of all URLContext fields and methods:

  • HeadBeforeGet bool : This field is initialized with the global setting from the crawler's Options structure. It can be overridden at any time, though to be useful it should be done before the call to Fetch, where the decision to make a HEAD request or not is made.
  • State interface{} : This field holds the arbitrary state data associated with the URL. It can be nil or a value of any type.
  • URL() *url.URL : The getter method that returns the parsed URL in non-normalized form.
  • NormalizedURL() *url.URL : The getter method that returns the parsed URL in normalized form.
  • SourceURL() *url.URL : The getter method that returns the source URL in non-normalized form. Can be nil for seeds or URLs enqueued via the EnqueueChan.
  • NormalizedSourceURL() *url.URL : The getter method that returns the source URL in normalized form. Can be nil for seeds or URLs enqueued via the EnqueueChan.
  • IsRobotsURL() bool : Indicates if the current URL is a robots.txt URL.
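
For illustration, here is a hedged sketch of how these fields might be used (StatefulExtender, the .pdf check and the record-ID state are made up for the example; imports of strings, log, net/http and goquery are assumed):

type StatefulExtender struct {
    gocrawl.DefaultExtender
}

func (x *StatefulExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    // Ask for a HEAD request on PDF-looking URLs, before Fetch gets called.
    if strings.HasSuffix(ctx.NormalizedURL().Path, ".pdf") {
        ctx.HeadBeforeGet = true
    }
    return !isVisited
}

func (x *StatefulExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    // Recover the arbitrary state attached to this URL when it was enqueued, if any.
    if id, ok := ctx.State.(int); ok {
        log.Printf("visiting %s (record ID %d)", ctx.URL(), id)
    }
    return nil, true
}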

With this out of the way, here are the other Extender functions:

  • Fetch : Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error). Called by a worker to request the URL. The DefaultExtender.Fetch() implementation uses the public HttpClient variable (a custom http.Client) to fetch the pages without following redirections, instead returning a special error (ErrEnqueueRedirect) so that the worker can enqueue the redirect-to URL. This enforces the whitelisting by the Filter() of every URL fetched by the crawling process. If headRequest is true, a HEAD request is made instead of a GET. Note that as of gocrawl v0.3, the default Fetch implementation uses the non-normalized URL.
Internally, gocrawl sets its http.Client's `CheckRedirect()` function field to a custom implementation that follows redirections for robots.txt URLs only (since a redirect on robots.txt still means that the site owner wants us to use these rules for this host). The worker is aware of the `ErrEnqueueRedirect` error, so if a non-robots.txt URL asks for a redirection, `CheckRedirect()` returns this error, and the worker recognizes this and enqueues the redirect-to URL, stopping the processing of the current URL. It is possible to provide a custom `Fetch()` implementation based on the same logic. Any `CheckRedirect()` implementation that returns a `ErrEnqueueRedirect` error will behave this way - that is, the worker will detect this error and will enqueue the redirect-to URL. See the source files ext.go and worker.go for details.

The `HttpClient` variable being public, it is possible to customize it so that it uses another `CheckRedirect()` function, or a different `Transport` object, etc. This customization should be done prior to starting the crawler. It will then be used by the default `Fetch()` implementation, or it can also be used by a custom `Fetch()` if required. Note that this client is shared by all crawlers in your application. Should you need different http clients per crawler in the same application, a custom `Fetch()` using a private `http.Client` instance should be provided.
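
For example, here is a hedged sketch of such a customization, done before any crawler is started (the timeout values are arbitrary):

// Swap the Transport of the shared client used by the default Fetch().
// This affects every crawler in the process; the client's CheckRedirect
// logic is untouched since it lives on the client, not on the transport.
gocrawl.HttpClient.Transport = &http.Transport{
    ResponseHeaderTimeout: 30 * time.Second,
}
gocrawl.HttpClient.Timeout = 2 * time.Minute // overall per-request cap
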
  • RequestGet : RequestGet(ctx *URLContext, headRes *http.Response) bool. Indicates if the crawler should proceed with a GET request based on the HEAD request's response. This method is only called if a HEAD was requested (based on the *URLContext.HeadBeforeGet field). The default implementation returns true if the HEAD response status code was 2xx.

  • RequestRobots : RequestRobots(ctx *URLContext, robotAgent string) (data []byte, request bool). Asks whether the robots.txt URL should be fetched. If false is returned as second value, the data value is considered to be the robots.txt cached content, and is used as such (if it is empty, it behaves as if there was no robots.txt). The DefaultExtender.RequestRobots implementation returns nil, true.

  • FetchedRobots : FetchedRobots(ctx *URLContext, res *http.Response). Called when the robots.txt URL has been fetched from the host, so that it is possible to cache its content and feed it back to future RequestRobots() calls. By default, this is a no-op.

  • Filter : Filter(ctx *URLContext, isVisited bool) bool. Called when deciding if a URL should be enqueued for visiting. It receives the *URLContext and a bool "is visited" flag, indicating if this URL has already been visited in this crawling execution. It returns a bool flag ordering gocrawl to visit (true) or ignore (false) the URL. Even if the function returns true to enqueue the URL for visiting, the normalized form of the URL must still comply with these rules:

  1. It must be an absolute URL

  2. It must have a http/https scheme

  3. It must have the same host if the SameHostOnly flag is set

    The DefaultExtender.Filter implementation returns true if the URL has not been visited yet (the visited flag is based on the normalized version of the URLs), false otherwise.

  • Enqueued : Enqueued(ctx *URLContext). Called when a URL has been enqueued by the crawler. An enqueued URL may still be disallowed by a robots.txt policy, so it may end up not being fetched. By default, this method is a no-op.

  • Visit : Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool). Called when visiting a URL. It receives the URL context, a *http.Response response object, along with a ready-to-use *goquery.Document object (or nil if the response body could not be parsed). It returns the links to process (see above for the possible types), and a bool flag indicating if gocrawl should find the links itself. When this flag is true, the harvested return value is ignored and gocrawl searches the goquery document for links to enqueue. When false, the harvested data is enqueued, if any. The DefaultExtender.Visit implementation returns nil, true so that links from a visited page are automatically found and processed.

  • Visited : Visited(ctx *URLContext, harvested interface{}). Called after a page has been visited. The URL context and the URLs found during the visit (either by the Visit function or by gocrawl) are passed as argument. By default, this method is a no-op.

  • Disallowed : Disallowed(ctx *URLContext). Called when an enqueued URL gets denied access by a robots.txt policy. By default, this method is a no-op.

Finally, by convention, if a field named EnqueueChan with the very specific type of chan<- interface{} exists and is accessible on the Extender instance, this field will get set to the enqueue channel, which accepts the expected types as data for URLs to enqueue. This data will then be processed by the crawler as if it had been harvested from a visit. It will trigger calls to Filter() and, if allowed, will get fetched and visited.

The DefaultExtender structure has a valid EnqueueChan field, so if it is embedded as an anonymous field in a custom Extender structure, this structure automatically gets the EnqueueChan functionality.

This channel can be useful to arbitrarily enqueue URLs that would otherwise not be processed by the crawling process. For example, if an URL raises a server error (status code 5xx), it could be re-enqueued in the Error() extender function, so that another fetch is attempted.
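
For illustration, here is a hedged sketch of that pattern (the Ctx field name on CrawlError is an assumption based on the description above; a real implementation would also bound the number of retries):

type RetryExtender struct {
    gocrawl.DefaultExtender // provides a valid EnqueueChan field
}

func (e *RetryExtender) Error(err *gocrawl.CrawlError) {
    // Push the failing URL back through the enqueue channel; it will go
    // through Filter() again before any new fetch is attempted.
    if err.Ctx != nil {
        e.EnqueueChan <- err.Ctx.URL()
    }
}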

Thanks

  • Richard Penman
  • Dmitry Bondarenko
  • Markus Sonderegger
  • @lionking6792

License

The BSD 3-Clause license.

gocrawl's People

Contributors

fwang2002, goodsign, h8liu, juliend2, lionking6792, minarc, mna, moredure, nanolab, oherych


gocrawl's Issues

Redirect + normalization problem

Hi Martin!

Currently I'm having a problem, but I'm not sure what I should focus on or whether it is a complex of problems. I'll try to explain what I'm encountering and I'd be thankful if you leave some comments on that, because maybe it's not even a bug.

Okay, for example, let's crawl 'http://golang.org' : if you look at the golang.org source code, you'll see links like: /pkg/, /doc/, etc.

These links are getting resolved to absolute and normalized by gocrawl, so for example for /pkg/ I get 'http://golang.org/pkg' (the default purell flag is 'all greedy', so I lose the trailing slash).

If you visit 'http://golang.org/pkg' (even just using your browser) you'll see that it redirects you to '/pkg/' (just where the initial link goes).

First problem

And here is the first problem, depicted by a piece of the gocrawl log (I removed unnecessary log parts):

enqueue: http://golang.org/pkg
...
worker 1 - popped: http://golang.org/pkg
...
worker 1 - redirect to /pkg/
...
receive url /pkg/
ignore on absolute policy: /pkg

So it seems that the redirected URL doesn't get resolved to an absolute one like the original was. I checked your code and saw resolving logic only in worker.processLinks, if I'm not mistaken. So it seems that resolving is missing somewhere in the redirect logic.

Second problem

Even if the redirect URL would get resolved to an absolute one, it still gets normalized if I don't change URLNormalizationFlags. So the trailing slash would still be always removed (We see that in log we 'receive /pkg/' and 'ignore /pkg') and thus we'll be infinitely redirected, because golang.org/pkg redirects to golang.org/pkg/ (and after normalization it gets to golang.org/pkg and it redirects to ....).

My temporary solution

I've temporarily solved that just by avoiding any slash-related logic, so I've set

opts.URLNormalizationFlags = purell.FlagsAllGreedy & (^purell.FlagRemoveTrailingSlash)

and everything went fine.

Fix proposal and discussion

Maybe some other normalization flag should be chosen as the default?

Or maybe it is even better to change the strategy a bit:

  • Pass a normalized URL to filter,
  • After Filter returned 'true', fetch the original URL as-is

Personally I like the latter, because this way I exclude the situation that normalization changes URL and website gives something different for the modified one (like a redirect to the original again).

What do you think? Tell me if I'm missing something here.

Redirecting to same normalized URL skips this URL

Hi,

I am using gocrawl to browse my blog and extract all links.

It turns out that URLs are redirected to their "/" version.
For example, /post/tizen-cli-application-ligne-commandes is redirected to /post/tizen-cli-application-ligne-commandes/

When the worker processes the URL, it finds the redirection and, as documented, skips the current URL and enqueues the redirected one.
However, this enqueued URL fails at the next filtering step because both the original URL and the redirected one have the same normalized URL, which is already present in the visited map.

Should this be addressed at the worker level by checking whether both normalized URLs are identical before following the redirection?

Jérémy

URL normalization bug?

I instantiated a crawler gocrawl.NewCrawlerWithOptions(opts) with opts.URLNormalizationFlags = purell.FlagsUsuallySafeNonGreedy. The url I'm returning in Visit is http://www.example.com/new/?start=60 (parsed with url.Parse) but the enqueue log says it's http://www.example.com/new/. I checked but purell is giving the correct answer, without stripping the query part.

Some piece of code is eating the query part between Visits, but I couldn't figure it out.

Non test based example

Is there an example of using this package in a standalone main.go program? I am interested in using this package to walk some of our data sources using linked open data patterns.

Having some issues converting the test example in the readme to a standalone app...

thanks

Crawl timeout

I would like the crawls to stop after 1 hour, because some websites have an infinite amount of URLs.
So I think it would be great to have a CrawlTimeout option, or an Extender function like ShouldFinishCrawl() bool.

Note: I use it with SameHostOnly = true, to crawl only one site.

Redirects break Filter logic

Currently, if a website performs redirects, I may end up visiting the same URL multiple times, even if the Filter returns false for visited URLs. I may also end up visiting URLs not allowed by Filter, and URLs from other domains (even if the SameHostOnly option is set).

When following a redirect: isVisited is true when Visit() has never been called

Minimum working example:

package main

import (
    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
    "log"
    "net/http"
)

type ExampleExtender struct {
    gocrawl.DefaultExtender
}

func (this *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    log.Printf("Visiting: %s", ctx.NormalizedURL())
    return nil, true
}

func (this *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    if (isVisited) {
        log.Printf("Already visited: %s", ctx.NormalizedURL())
        return false
    } else {
        return true
    }
}

func main() {
    opts := gocrawl.NewOptions(new(ExampleExtender))
    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://google.co.uk/")
}

Just in case Google changes their redirect system (note that the seed URL redirects from http://google.co.uk/ -> http://www.google.co.uk/):

$ curl -Is http://google.co.uk | head -n 2
HTTP/1.1 301 Moved Permanently
Location: http://www.google.co.uk/

Finally, the output from the minimum working example above:

$ mwe
2014/09/27 21:40:12 Already visited: http://google.co.uk

The problem is that isVisited is true when Visit() hasn't been called yet -- note that the Visiting: ... debug output has not occurred. With logging:

$ mwe
2014/09/27 21:51:35 init() - seeds length: 1
2014/09/27 21:51:35 init() - host count: 1
2014/09/27 21:51:35 robot user-agent: Googlebot (gocrawl v0.4)
2014/09/27 21:51:35 worker 1 launched for host google.co.uk
2014/09/27 21:51:35 enqueue: http://google.co.uk/robots.txt
2014/09/27 21:51:35 enqueue: http://google.co.uk/
2014/09/27 21:51:35 worker 1 - waiting for pop...
2014/09/27 21:51:35 worker 1 - popped: http://google.co.uk/robots.txt
2014/09/27 21:51:35 worker 1 - waiting for crawl delay
2014/09/27 21:51:35 worker 1 - using crawl-delay: 5s
2014/09/27 21:51:35 worker 1 - popped: http://google.co.uk/
2014/09/27 21:51:35 worker 1 - waiting for crawl delay
2014/09/27 21:51:40 worker 1 - using crawl-delay: 5s
2014/09/27 21:51:40 worker 1 - redirect to http://www.google.co.uk/
2014/09/27 21:51:40 worker 1 - waiting for pop...
2014/09/27 21:51:40 receive url(s) to enqueue [0xc20816e040]
2014/09/27 21:51:40 Already visited: http://google.co.uk
2014/09/27 21:51:40 ignore on filter policy: http://google.co.uk
2014/09/27 21:51:40 sending STOP signals...
2014/09/27 21:51:40 waiting for goroutines to complete...
2014/09/27 21:51:40 worker 1 - stop signal received.
2014/09/27 21:51:40 worker 1 - worker done.
2014/09/27 21:51:40 crawler done.

I haven't tested this too thoroughly (yet), so I'm not sure whether this happens with all redirects or if it's only an issue when the seed is a redirect as in this case. It's causing me an issue because the program I'm writing will exit almost instantly if the seed URL is a redirect, because my filter uses !isVisited just like the default behaviour.

I plan on looking into this myself, but thought I'd post an issue too in case you happen to read it quickly and know what might be happening. Or maybe this behaviour is by design! Cheers for the library, by the way, I'm finding it very useful :)

HeadBeforeGet() error

I'm wanting to use gocrawl for an upcoming project, however I need to use the HeadBeforeGet() method to make sure I only download text/html pages, and don't end up downloading 300mb video files. HeadBeforeGet looks like the ideal way to achieve this, however I always received the following with that option activated:

2014/04/14 19:49:29 Unsolicited response received on idle HTTP channel starting with "\x1f"; err=<nil>

Simplified Code:

package main

import (
    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
    "log"
    "net/http"
    "time"
)

type ExampleExtender struct {
    gocrawl.DefaultExtender // Will use the default implementation of all but Visit()
}

func (this *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    log.Println(ctx.URL().String())
    return nil, true
}

func main() {
    opts := gocrawl.NewOptions(new(ExampleExtender))
    opts.CrawlDelay = 1 * time.Second
    // opts.LogFlags = gocrawl.LogTrace
    opts.HeadBeforeGet = true

    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://jakeaustwick.me")
}

Any idea what's going on?

Support for base tag

I don't know if this issue belongs in purell (if so, please let me know), but I think it would be nice to support sites that use the <base> tag. An example of such practice can be found in my own website's HTML.

Because as it is right now, if I crawl the aforementioned website with gocrawl, it won't use the base's href value to properly treat the relative tags it will see, example:

If the crawler is currently in http://www.juliendesrosiers.com/2012/11/20/wordpress-plugins-for-complex-websites , it will see the portfolio's relative link (href="portfolio") and try to go here instead of here.

I would like to create a pull request for this feature, so if you have any suggestion for the implementation, I'll take it. Thanks!

Forbidden 403

I want to crawl a website that uses Cloudflare. When I run your package, it fails with:

ERROR status code for https://www.example.com/: 403 Forbidden

httpClient.Transport

Currently I'm having a little problem with changing http client transport. It is caused by two facts:

  • httpClient (ext.go:120) is private
  • DefaultExtender.Fetch uses it

In my case I needed to change the standard transport to add timeouts:

client.Transport = &http.Transport{
            Dial: func(netw, addr string) (net.Conn, error) {
                deadline := time.Now().Add(ext.Info.Timeout)
                c, err := net.DialTimeout(netw, addr, ext.Info.Timeout)
                if err != nil {
                    return nil, err
                }
                c.SetDeadline(deadline)
                return c, nil
            },
        },

In the current situation I ended up copy-pasting your default 'Fetch' code (as-is) and the 'CheckRedirect' part of your httpClient (as-is). I also ended up copy-pasting your 'isRobotsTxt' func :)

But I think it would be better to allow somehow to change just parts of this logic, without rewriting it all.

For example, if we talk about Transport, it could be OK to add a Transport field to Options and to use it in the instantiation code (probably then it should be moved from package-level vars to DefaultExtender):

httpClient = &http.Client{Transport: options.Transport,  CheckRedirect: func(req *http.Request, via []*http.Request) error {
    // For robots.txt URLs, allow up to 10 redirects, like the default http client.
    // Rationale: the site owner explicitly tells us that this specific robots.txt
    // should be used for this domain.
    if isRobotsTxtUrl(req.URL) {
        if len(via) >= 10 {
            return errors.New("stopped after 10 redirects")
        }
        return nil
    }

    // For all other URLs, do NOT follow redirections, the default Fetch() implementation
    // will ask the worker to enqueue the new (redirect-to) URL. Returning an error
    // will make httpClient.Do() return a url.Error, with the URL field containing the new URL.
    return &EnqueueRedirectError{"redirection not followed"}
}}

I'm not sure if it is already in your package reorganization plans, I thought that I should submit it just in case.

Re-enqueue URL if error on Fetch

If an error occurs when Fetching an URL (status 500, or something), offer a way to re-enqueue this URL in the crawler, possibly after a delay.

examples_test.go run error

package main

import (
"github.com/PuerkitoBio/gocrawl"
"github.com/PuerkitoBio/goquery"
"net/http"
"net/url"
"regexp"
"time"
)

// Only enqueue the root and paths beginning with an "a"
var rxOk = regexp.MustCompile(`http://duckduckgo\.com(/a.*)?$`)

// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type ExampleExtender struct {
gocrawl.DefaultExtender // Will use the default implementation of all but Visit() and Filter()
}

// Override Visit for our need.
func (this *ExampleExtender) Visit(res *http.Response, doc *goquery.Document) ([]*url.URL, bool) {
// Use the goquery document or res.Body to manipulate the data
// ...

// Return nil and true - let gocrawl find the links
return nil, true

}

// Override Filter for our need.
func (this *ExampleExtender) Filter(u *url.URL, src *url.URL, isVisited bool, origin gocrawl.EnqueueOrigin) (bool, int, gocrawl.HeadRequestMode) {
// Priority (2nd return value) is ignored at the moment
return rxOk.MatchString(u.String()), 0, gocrawl.HrmDefault
}

func main() {
// Set custom options
opts := gocrawl.NewOptions(new(ExampleExtender))
opts.CrawlDelay = 1 * time.Second
opts.LogFlags = gocrawl.LogError | gocrawl.LogInfo

// Play nice with ddgo when running the test!
opts.MaxVisits = 2

// Create crawler and start at root of duckduckgo
c := gocrawl.NewCrawlerWithOptions(opts)
c.Run("http://duckduckgo.com/")

// Remove "x" before Output: to activate the example (will run on go test)

// xOutput: voluntarily fail to see log output

}

2013/01/16 11:19:37 robot user-agent: Googlebot (gocrawl v0.2)
2013/01/16 11:19:37 worker 1 launched for host duckduckgo.com
2013/01/16 11:19:37 worker 1 - waiting for pop...
2013/01/16 11:19:37 worker 1 - popped: http://duckduckgo.com/robots.txt
2013/01/16 11:19:37 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:37 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:38 worker 1 - waiting for pop...
2013/01/16 11:19:38 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:38 worker 1 - waiting for pop...
2013/01/16 11:19:38 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:38 worker 1 - waiting for pop...
2013/01/16 11:19:38 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:38 worker 1 - waiting for pop...
2013/01/16 11:19:38 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:38 worker 1 - waiting for pop...
2013/01/16 11:19:38 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:38 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com
2013/01/16 11:19:39 worker 1 - using crawl-delay: 1s
2013/01/16 11:19:39 worker 1 - waiting for pop...
2013/01/16 11:19:39 worker 1 - popped: http://duckduckgo.com

Concurent fetching on the same host

Hello!

Is it possible to make the crawler fetch pages from the same host concurrently? I need to crawl one host that has 100k pages. The existing implementation creates one worker that fetches sequentially, one page at a time, which is too slow.

Any advice? I can't find where I can configure that. Please point me to where I could patch the code for this.

High CPU usage while waiting for crawl delay

While waiting for the CrawlDelay we are stuck in the default's else branch, busy waiting: runtime.Gosched() is executed over and over again without any need.

(issue submitted via email by M. Sonderegger)

Can I have a Crawler that blocks waiting for more urls on EnqueueChan?

My usage scenario for the crawler is different from the one-shot run in the sample test case. I'd like to instantiate a Crawler and pass nil to Run, since the argument doesn't help with continuously feeding the Crawler; I would then use EnqueueChan to constantly feed seeds. Digging through the sources, it looks like termination of Run based on EnqueueChan relies solely on len(EnqueueChan), which would require me to make sure that EnqueueChan is always non-empty to avoid restarting the crawling process.

Test case "InvalidSeed" fails on Go1.1

Go1.1 changes the way URLs are parsed, so that segments like #foo produce a valid *url.URL. I need to change the assertions for Go1.1 to make it pass, while keeping it this way for Go1.0.x.

Keep User-Agent the same across requests

I think RobotUserAgent may well be able to be deprecated.
User-Agent header of a crawler is expected to be the same if it is for the same purpose of crawling (like for text search indexing.)
gocrawl seems to change the User-Agent header when accessing robots.txt.
Also it includes Googlebot even though it is not Googlebot, which makes site owners feel strange.

Sorry for posting multiple issues at once.
No offense intended, but I believe these ideas are useful to make this popular crawler even politer.

invalid memory

When I run my gocrawl-based crawler, it suddenly stops and I get the message:
panic: runtime error: invalid memory address or nil pointer dereference
How can I fix it?

Exclude Images

Currently gocrawl visits image links (.jpg, etc.). I feel a useful addition would be a config option to filter out these URLs by default.

I understand this could be done with a custom Filter() method; it just seems common enough to be a built-in option.

Thoughts?

Filter out image file but not count

In Filter() I can only get the URL. I would detect whether it's an image by Content-Type, but by the time I get the Content-Type, that page has already been counted. Is it possible to do something like: "Crawl the first 100 pages with Content-Type equal to text/html"?

robots.txt encoding error

I'm using gocrawl to spider http://www.baidu.com/ site as example:

package main

import (
    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
    "net/http"
    "time"
)

// Create the Extender implementation, based on the gocrawl-provided DefaultExtender,
// because we don't want/need to override all methods.
type UrlExtender struct {
    gocrawl.DefaultExtender // Will use the default implementation of all but Visit() and Filter()
}

func (this *UrlExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    return nil, true
}

func (this *UrlExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited && true
}

func main() {
    // Set custom options
    opts := gocrawl.NewOptions(new(UrlExtender))
    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogEnqueued
    opts.SameHostOnly = false
    opts.MaxVisits = 50

    // Play nice with ddgo when running the test!
    opts.MaxVisits = 5

    // Create crawler and start at root of duckduckgo
    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://www.baidu.com")
}

Then will got error with:

robotstxt from bytes:3:9: illegal UTF-8 encoding
robotstxt from bytes:3:11: illegal UTF-8 encoding
robotstxt from bytes:3:14: illegal UTF-8 encoding
robotstxt from bytes:3:15: illegal UTF-8 encoding
robotstxt from bytes:3:16: illegal UTF-8 encoding
robotstxt from bytes:3:18: illegal UTF-8 encoding
robotstxt from bytes:3:19: illegal UTF-8 encoding
robotstxt from bytes:3:21: illegal UTF-8 encoding
....

How can I disable this error output, or just bypass robots.txt parsing?

Pattern for gather via end function

I am working with the goCrawl package (github.com/PuerkitoBio/gocrawl) and have a basic grawler working that dives into JSON data.

I use this to crawl sites in geoscience for structured data. It's working well so far but I have an issue.

Within the Visit function I make a call via

go writeLinks(jmap["href"].(string), c)

calling

func writeLinks(msg string, c chan string) {
    fmt.Printf("=====================> Writing link from channel %s \n", msg)
}

This works well if I want to write to the screen or to a file. However, if I want to simply gather the data into a struct (for example) and then access this collected data at the end, from the End function, how would I do that?

I am not familiar enough with Go to see how to do that. Perhaps it's not so easy to do, and in fact my main use case is likely covered with the simple channel call to something like a writeLinks function.

Obviously the concept of a global struct is bad on many (all) levels. What I am trying to see is if there is a way to gather the data collected during the crawl.
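
Roughly, what I have in mind is something like this (a rough sketch; the LinkCollector type and its fields are made up, and imports of sync, fmt, net/http, goquery and gocrawl are assumed):

type LinkCollector struct {
    gocrawl.DefaultExtender

    mu    sync.Mutex
    links []string
}

func (c *LinkCollector) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    // Visit runs on worker goroutines (one per host), so guard the shared slice.
    c.mu.Lock()
    c.links = append(c.links, ctx.URL().String())
    c.mu.Unlock()
    return nil, true
}

func (c *LinkCollector) End(err error) {
    // The crawl is over when End is called, so the collected data can be read here.
    fmt.Printf("collected %d links (err = %v)\n", len(c.links), err)
}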

Any suggestions on an approach?

User-Agent is lost on redirect

While looking though my logs today I found this

blog.benjojo.co.uk 54.166.28.206 - - 2015-09-11T14:35:23Z "GET /robots.txt HTTP/1.1" 301 377 "" "Googlebot (gocrawl v0.4)" 2244194f10332390
blog.benjojo.co.uk 54.166.28.206 - - 2015-09-11T14:35:25Z "GET /robots.txt HTTP/1.1" 200 771 "http://blog.benjojo.co.uk/robots.txt" "Go 1.1 package http" 22441959ce3523c0

So I found this repo through the first User-Agent, but the 2nd one shows that the User-Agent is not put back in on the CheckRedirect part of the HTTP request.

This can lead to confusion and blocking from systems that kill default user agents due to abuse associated with them

Logger interface

Currently Options.Logger is the only way to get log from gocrawl. It has standard type log.Logger, which in my opinion is not the best choice here.

The point is that, in the current situation, I cannot find out the log level that gocrawl used to log a message. So I cannot perform the basic tasks which I would like to do with the log:

  • Separate log file for errors
  • Critical errors are sent to my email
  • Info is written to standard output and Error -- to stderr
  • etc.

Personally I use Seelog library for that, and I would be happy to map:

  • Your LogError -> my log.Error(...)
  • Your LogInfo -> my log.Info(...)
  • Your LogEnqueued, LogIgnored -> my log.Debug(...)
  • etc.

Currently I cannot do that in a proper way because I don't get log level anywhere. Currently I do a 'dirty hack': I get your messages, analyze them, and try to find keywords like 'ERROR', 'enqueued', 'visited', etc. and guess log level depending on them.

But I would be more happy to see something like this in Options:

LogFunc func(level LogFlags, msg string)

You could provide a default implementation for that :

func DefaultLogFunc(logger *log.Logger) func(LogFlags, string) {
    return func(level LogFlags, msg string) {
        ... your default implementation as it is now ...
    }
}

And this way the ones who just want to use logger as it is now could write

opts.LogFunc = gocrawl.DefaultLogFunc(log.New())

But also I could write my mapper func:

opts.LogFunc = myLogFunc

...

func myLogFunc(level LogFlags, msg string) {
    switch level {
    case gocrawl.LogError:
        seelog.Error(msg) 
    case gocrawl.LogInfo:
        seelog.Info(msg)
   ... etc ...
    }
}

I feel this is a valid proposal because log level is frequently used to determine what to do with log message and log.Logger interface doesn't include it in its func declarations.

What do you think?

Best way to crawl several sites

I've been having very good results with this package. Now I am interested in using it to search several sites (a couple dozen) and to store results in my REDIS (which is working well).

However, what is the best way to approach searching this many sites? A simple collection/map of the sites will not work with it, will it? I.e., I can't do the following, can I?

type Map1 map[string]string
    m := Map1{"boo": "http://www.algaebase.org/", "foo": "http://data.oceandrilling.org", "bar":"http://animaldiversity.ummz.umich.edu/", "buz":"http://giantsstudio.com"}

    for k, v := range m {
        fmt.Println("k:", k, "v:", v)
        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run(v)
    }

processLinks miss <iframe src="">

in worker.go

// Scrape the document's content to gather all links
func (this *worker) processLinks(doc *goquery.Document) (result []*url.URL) {
    urls := doc.Find("a[href]").Map(func(_ int, s *goquery.Selection) string {
        val, _ := s.Attr("href")
        return val
    })

......

how to get it?

like this?

    ahref := doc.Find("a[href]").Map(func(_ int, s *goquery.Selection) string {
        val, _ := s.Attr("href")
        return val
    })

    iframe := doc.Find("iframe[src]").Map(func(_ int, s *goquery.Selection) string {
        val, _ := s.Attr("src")
        return val
    })

    frame := doc.Find("frame[src]").Map(func(_ int, s *goquery.Selection) string {
        val, _ := s.Attr("src")
        return val
    })

    urls := append(ahref, iframe...)
    urls = append(urls, frame...)

Additional pre-fetch step: HEAD request

This is a debatable proposal, which seems logical to me, but we could discuss it, maybe some other ways to do the same are better.

The problem

The problem is: often I decide to visit a URL or not, depending on its header fields such as 'Content-Type' or 'Content-Length'. So, when deciding to visit or not, I would like to send HEAD request to a resource, then check, whether it is a good 'text/html' or it is some big pdf file/image. Only after I see that its content is worth downloading, I would like to visit it.

Currently, I implement that in Extender.Filter, where I create my own HEAD request, get the results and return false in case it has the wrong content or something like that.

The proposal

I think that this step is pretty common to crawlers (imo, 'Filter' decision is based on content-type even more frequently than on 'from/u' parameters) and it deserves a 'get-head' option in Options and a hook in the Extender. Or, maybe it can be added as a parameter to the Filter itself:

Filter(u *url.URL, from *url.URL, isVisited bool, head *http.Header ) (enqueue bool, priority int)

In the last case, the 'head' will be nil if the 'get-head' option is disabled.

What do you think about it? Thanks!

Panic Error When crawler

I have a problem: when my gocrawl-based crawler runs, it suddenly stops with the error message Get link: EOF, exit status 1. When I check this URL manually, the site is working.

Also, why does my crawler visit the same link more than once? This is my config:

const (
  DEPTH = 0
)
opts := gocrawl.NewOptions(new(ExampleExtender))
  opts.CrawlDelay = 10
  opts.LogFlags = gocrawl.LogNone
  opts.RobotUserAgent = "Google"
  opts.UserAgent ="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0"
  c := gocrawl.NewCrawlerWithOptions(opts)
  c.Run(gocrawl.S{
  "https://www.someurl1.com/": DEPTH, 
  "http://www.someurl2.co.id/" : DEPTH, 
  "http://www.someurl3.co.id/" : DEPTH, 
  "https://www.someurl4.com/": DEPTH, 
  "https://www.someurl5.com/": DEPTH, 
  "http://www.someurl6.com/": DEPTH, 
  "https://www.someurl7.com/": DEPTH, 
  "http://someurl8.com/": DEPTH, 
  "http://www.someurl9.com/": DEPTH, 
  "http://www.someurl10.com/": DEPTH, 
  "http://www.someurl11.com/": DEPTH,
  })

Same host policy issue

If a seed URL of www.host.com is provided, the normalized host will be saved in the "hosts" slice (used to control the same host policy). Under the default settings, this means that "host.com" is the normalized host.

If the page www.host.com/page1 is visited, the Visit() method is called using the non-normalized URL. If this Visit() returns some new URLs to crawl, those will be enqueued by passing the (non-normalized) source URL as reference, i.e. new URL=www.host.com/page2, SourceURL=www.host.com/page1.

In this case, the same host policy is enforced by comparing the normalized form of the host of the new URL (host.com) to the non-normalized form of the host of the source URL (www.host.com), and rejects the new URL due to the same host policy. It should compare with a normalized form of the source URL.

Default user-agent is likely going to be blocked in most cases

First of all, why (by default) have two user agents?

You are using

Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2 for page crawling and then Googlebot (gocrawl v0.4) for when you are grabbing robots.txt from what I can see.

Impersonating googlebot is already playing with fire against systems that detect and block things that claim to be Google Bot.

But also, why have two different ones? Does this actually gain anything? Why not ship a user agent that explains that your program is a crawler - it's better to be honest than to have a WAF detect that your request isn't what it says it is.

I understand these are defaults, but encouraging people to use these User-Agent strings by default IMO is a bad idea.

There is a bug in popchannel.go

In the stack method:

func (this popChannel) stack(cmd ...*workerCommand) {
    toStack := cmd
    for {
        select {
        case this <- toStack:
            return
        case old := <-this:
            toStack = append(old, cmd...) // should be toStack = append(old, toStack...)
        }
    }
}

As the comment says, something will be lost when the loop runs multiple times.

Max Body Size

While crawling through sites, sometimes you hit something like a video file - or just an extremely large response. This usually wouldn't be a problem; however, because gocrawl uses a single goroutine per host, this can lock the entire thing up for a long time.

My suggestion is that rather than using ioutil.ReadAll() (https://github.com/PuerkitoBio/gocrawl/blob/master/worker.go#L331), gocrawl could support a configuration option to only read the first N bytes - and then just proceed.
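
Roughly, the idea is something like this (a sketch; maxBodySize would be the hypothetical new option, and io plus io/ioutil are assumed to be imported):

// Read at most maxBodySize bytes of the response body instead of the whole thing.
body, err := ioutil.ReadAll(io.LimitReader(res.Body, maxBodySize))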

The problem I foresee is that maybe this would require an API change, because in your Visit function you would have to receive a variable which told you whether the entire body was downloaded or not. The only way around this that I can think of that would not require an API change would be to pass everything forward to a VisitBodyNotCompleted function, however this isn't as ideal.

How open are you to breaking API changes, or can you think of a way around this?

What are you thoughts on supporting this?

Thanks,
Jake

Allow storage of arbitrary state data with an URL

When a page is crawled, some data is extracted. Sometimes, the complete data on a given piece of information is split across many pages. It may be necessary to store some state when crawling a page so that when a "child" page is crawled, this information is available.

For example, a page /author is crawled and information on the author is saved in a DB, with an ID. The URL /author/book1 is then enqueued, but if this page is crawled in a stateless way, it has no way to link the information back to the previously crawled author (it could find the author name in the book page, but let's pretend it's not there, or even if it was, there are maybe many authors with the same name, or there might be a typo, etc.).

Not sure yet if this should be managed by gocrawl or not. Should seed URLs also be allowed to have state? How much of a pain will it be to implement, complexify the API?

method for new robotstxt FromResponse vs. FromBytes

in worker.go

I think this would be better; it handles the response when the GET for robots.txt returns a 404 code:

func (this *worker) getRobotsTxtGroup(ctx *URLContext, b []byte, res *http.Response) (g *robotstxt.Group) {
    var data *robotstxt.RobotsData
    var e error

    if res != nil {
        data, e = robotstxt.FromResponse(res)
        // Error or not, the robots.txt has been fetched, so notify
        this.opts.Extender.FetchedRobots(ctx, res)
    } else {
        data, e = robotstxt.FromBytes(b)
    }

redirects don't have a SourceURL value

I ran into a scenario where http://foobar.com does a 301 to https://foobar.com. The default flags normalize to http, which is fine for URLs that serve the same content (not redirects), so I didn't want to turn off that normalization flag. Instead I was going to compare the scheme/path/host of the source URL to the current URL in Filter. The issue is that https://foobar.com does not have a source URL of http://foobar.com, which it seems to me it should.

is there another preferred way to do what I'm checking that I'm not thinking of?
