
geziyor's Introduction

Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring, and automated testing.


Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8
  • Proxy management (Single, Round-Robin, Custom)

See scraper Options for all custom settings.
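
As a hedged sketch of how several of these features map onto scraper settings (the field names below are assumptions read from the Options struct; verify them against the scraper Options documentation):

package main

import (
    "fmt"
    "time"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
            fmt.Println(r.StatusCode)
        },
        // Assumed Options fields for concurrency and politeness:
        ConcurrentRequests:          50,          // global concurrency limit
        ConcurrentRequestsPerDomain: 8,           // per-domain concurrency limit
        RequestDelay:                time.Second, // constant delay between requests
        RequestDelayRandomize:       true,        // randomize the delay around RequestDelay
        UserAgent:                   "my-bot/1.0",
    }).Start()
}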

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

See tests for more usage examples.

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. If you want to make concurrent requests over 256, you need to increase limits. Read this for more.
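
For example, to raise the limit for the current shell session before running your crawler (the exact value is illustrative):

ulimit -n 10240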

Making Normal Requests

Initial requests start with the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After a response is read, ParseFunc func(g *geziyor.Geziyor, r *client.Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
    	g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS rendered requests can be made using the GetRendered method. By default, Geziyor starts a browser using the locally installed Chrome. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is Goquery's Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed by Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc will be nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()

Exporting Data

You can export data automatically using exporters. Just send data to the Geziyor.Exports channel. See the export package for the available exporters.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
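
Custom exporters are also possible. A minimal sketch, assuming the export.Exporter interface is a single Export method that drains the exports channel and returns an error (check the export package for the exact signature in your version):

import "fmt"

// PrintExporter is a hypothetical custom exporter that writes each
// exported item to stdout instead of a file.
type PrintExporter struct{}

// Export drains the exports channel; the signature is assumed from
// the built-in JSON/CSV exporters.
func (e *PrintExporter) Export(exports chan interface{}) error {
    for item := range exports {
        fmt.Printf("%+v\n", item)
    }
    return nil
}

It would then be registered just like the built-in ones: Exporters: []export.Exporter{&PrintExporter{}}.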

Custom Requests - Passing Metadata To Callbacks

You can create custom requests with client.NewRequest.

Pass that request to geziyor.Do(request, callback):

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        req, _ := client.NewRequest("GET", "https://httpbin.org/anything", nil)
        req.Meta["key"] = "value"
        g.Do(req, g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println("This is our data from request: ", r.Request.Meta["key"])
    },
}).Start()

Proxy - Use proxy per request

If you want to use a proxy for your requests and you have a single proxy, you can just set the HTTP_PROXY and HTTPS_PROXY environment variables, and Geziyor will use them.

You can also use a round-robin proxy per request by setting the ProxyFunc option to client.RoundRobinProxy, or to any custom proxy selection function you want. See client/proxy.go for how to implement such a custom proxy selection function.

Proxies can be HTTP, HTTPS and SOCKS5.

Note: if you use the http scheme for a proxy, it will be used for HTTP requests but not for HTTPS requests.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs:         []string{"http://httpbin.org/anything"},
    ParseFunc:         parseFunc,
    ProxyFunc:         client.RoundRobinProxy("http://some-http-proxy.com", "https://some-https-proxy.com", "socks5://some-socks5-proxy.com"),
}).Start()
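
For reference, a sketch of what a custom proxy selection function could look like, assuming ProxyFunc shares the signature of http.Transport's Proxy field, func(*http.Request) (*url.URL, error), as client/proxy.go suggests (confirm against your version):

import (
    "math/rand"
    "net/http"
    "net/url"
)

// randomProxy returns a hypothetical ProxyFunc that picks a random
// proxy from the given list for every request. It assumes at least
// one proxy URL parses successfully.
func randomProxy(proxies ...string) func(*http.Request) (*url.URL, error) {
    parsed := make([]*url.URL, 0, len(proxies))
    for _, p := range proxies {
        if u, err := url.Parse(p); err == nil {
            parsed = append(parsed, u)
        }
    }
    return func(*http.Request) (*url.URL, error) {
        return parsed[rand.Intn(len(parsed))], nil
    }
}

It would then be passed as ProxyFunc: randomProxy("http://proxy1:8080", "socks5://proxy2:1080").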

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s

geziyor's People

Contributors

albertbronsky, cristoper, deepsourcebot, dependabot[bot], glaslos, harnnless, isacikgoz, musabgultekin, nmelis, walker088


geziyor's Issues

Comparison to GoColly

Hi @musabgultekin ,

What do you intend to do to differentiate this from GoColly?
Do you have any plans for the crawler to listen to a queue, or some form of roadmap?

Enhancement : Execute Chrome & Get Chrome BrowserEndpoint

For a project I must execute Chrome with headless mode = false.
I haven't found any option to execute Chrome and set this value to false.

So I wrote this code to do it:

	opts := append(chromedp.DefaultExecAllocatorOptions[:],
		chromedp.Flag("headless", false),
		chromedp.Flag("disable-gpu", false),
		chromedp.Flag("enable-automation", false),
		chromedp.Flag("disable-extensions", true),
		chromedp.Flag("remote-debugging-port", "9222"),
	)
	allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
	defer cancel()
	ctx, cancel := chromedp.NewContext(
		allocCtx,
		chromedp.WithLogf(log.Printf),
	)
	defer cancel()
	if err := chromedp.Run(ctx); err != nil {
		log.Fatal(err)
	}

	// /json/version returns JSON containing a webSocketDebuggerUrl parameter
	var result map[string]interface{}
	resp, err := http.Get("http://localhost:9222/json/version")
	if err != nil {
		log.Fatal(err)
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}

	crawler := geziyor.NewGeziyor(&geziyor.Options{
		ParseFunc:       quotesParse,
		BrowserEndpoint: result["webSocketDebuggerUrl"].(string),
	})
......

out of control RAM usage

I've got a script that clicks every link, and then every link on the resulting pages, and it quickly gets out of hand in terms of memory usage (40+ GB) before crashing. Any suggestions as to where it's getting out of control? Storing millions of requests shouldn't take that much RAM, in my view.

Exporters should accept io.Writer

Exporters can currently be initialized with a file name to write to. It would be more flexible if, instead (or in addition, to prevent breaking changes), they could accept an io.Writer interface to write to. This would make it easy to use the existing exporters to write to stdout, for example.

client: use external http.Client or use external cookies

I have already retrieved cookies with a different http.Client and would like to use these cookies or that http.Client for crawling. How do I set this up? In StartRequestsFunc I tried:

g.Client.Client = httpClient
g.Client.SetCookies(baseURL, cookies)
g.Client.Jar.SetCookies(baseURL, cookies)

In all three cases I was redirected to the login screen. Did I set the cookies wrong, or is it usually done a different way? With httpClient the retrieval of the site works, just without JS, which I'm trying to fix with your awesome library. As baseURL I passed the base URL of the website one time, and the exact URL I am querying another time.

EDIT: I might have to check whether it's due to a different UserAgent.

Binary and interface

How about creating an HTTP or CLI interface and distributing the package as a binary for non-Go developers?

How to get response error other than HTTP errors

Hi,

How can I get response errors other than HTTP errors (StatusCode), like timeouts, address not found, or the website not being reachable?
For example

	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://www.1b4f.com/"},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println(string(r.Body))
		},
	}).Start()

Log output :

2019/12/10 15:00:21 Scraping Started
2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
2019/12/10 15:00:21 Retrying: http://www.1b4f.com/
2019/12/10 15:00:21 Response error: Get http://www.1b4f.com/: dial tcp: lookup www.1b4f.com: no such host
2019/12/10 15:00:21 Scraping Finished

I want to store the site URL and error in a database, e.g. ("http://www.1b4f.com/", "dial tcp: lookup www.1b4f.com: no such host").

Fix License

Current license contains this line:

Copyright [yyyy] [name of copyright owner]

Closing square bracket in JSON export

Hello,
In the file json.go, which is responsible for the JSON export, you open the file with the O_APPEND flag, so the closing square bracket is never written to the file in place of the last comma, and this error is not handled.

Callback method not called all the time while using js rendering

I am facing this weird issue where data goes missing randomly for some pages. Upon debugging, I found that the callback method is not called every time, though a Get request is made for all the links given. There is no error logged, so I am not sure how to give you more data points. The only option is to attach a debugger and see if I can replicate it.

Render js not working

        url := "https://www.jd.co.th/product/1.2_160677.html"
	geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered(url, g.Opt.ParseFunc)
		},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println("[x] bakatest ", string(r.Body))
		},
	}).Start()

The beauty of geziyor

Hello,

I'm doing a small interview with those who have been rising in Go work and have begun to make a name for themselves, and wanted to discuss where the very creative and stunning name 'geziyor' came from, and what prompted the creative talent to use it. Is it a personal preference, or was it a sudden thought that came to mind?

It's quite provocative and sends a strong battle-cry feeling when said, as stated by one of our interns, and we wanted to know if it has a kind of medieval take. Have you thought of other names previously, and if so, what were the other choices?

Is geziyor an online identity, or does it stem further into a construction of meaning and brought together by the talent working behind it?

Thank you,
Ouch

proxy support?

Hello, is it possible to somehow work with proxies in your library? I need every request to work through a proxy

Recursive Exports / Native return channels

I found it quite common to have recursive / nested scraping.

├── a
│   ├── itemA
│   └── foldA
│       └── itemB
└── b
    ├── itemC
    └── foldA
        └── itemA

Total result being something like:

{
  "a": [
    {
      "title": "itemA",
      "author": "Foo Bar",
      "contents": "asdjnasknd"
    },
    {
      "title": "foldA",
      "children": [
        {
          "title": "itemB",
          "author": "Foo Baz",
          "contents": "afgdgasknd"
        }
      ]
    }
  ],
  "b": [
    {
      "title": "itemC",
      "author": "Foo Bar",
      "contents": "odjfoij"
    },
    {
      "title": "foldA",
      "children": [
        {
          "title": "itemA",
          "author": "Foo Baz",
          "contents": "alsd"
        }
      ]
    }
  ]
}

Problem is, as soon as you pass something to g.Do(), you have no way of hearing back from the function.

context deadline exceeded

I'm trying to scrape 3242 webpages, but I'm getting response: Get "https://www.typeform.com/templates/t/course-evaluation-form-template/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) for a lot of URLs.

Any advice?

Using chromedp and geziyor

Hello,
I'm trying to scrape a site that requires javascript and I can't seem to find a solid example of chromedp being used with geziyor that works.

found packages geziyor (geziyor.go) and main (test.go) in /Users/geziyor go

I get this issue when testing out this code.

package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
	"github.com/geziyor/geziyor/export"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		ParseFunc: quotesParse,
		Exporters: []export.Exporter{&export.JSON{}},
	}).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
	r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
		g.Exports <- map[string]interface{}{
			"text":   s.Find("span.text").Text(),
			"author": s.Find("small.author").Text(),
		}
	})
	if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
		g.Get(r.JoinURL(href), quotesParse)
	}
}

having issue while installing via go get

root@machine:~# go get github.com/geziyor/geziyor
github.com/geziyor/geziyor/export

src/github.com/geziyor/geziyor/export/csv.go:47:15: val.MapRange undefined (type reflect.Value has no field or method MapRange)

Proxy Management Not supported

Geziyor does not provide any interface for integrating proxies. It does provide request middlewares, but the object that can be manipulated in the middleware does not have proxy-related configuration. It would be great if that could be supported as well.

Unable to stop geziyor and close browser after scraping finishes

I'm currently using geziyor with chromedp/headless-shell. I'm connecting via a remote URL.

limit := 1000
urls := []string{....}
geziyor.NewGeziyor(&geziyor.Options{
	StartRequestsFunc: func(g *geziyor.Geziyor) {
		for i := 0; i < limit; i++ {
			g.GetRendered(urls[i], g.Opt.ParseFunc)
		}
	},
	ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
		if r.StatusCode == 404 {
			fmt.Println("Page not found")
			return
		}
		// Scraping some stuff...
	},
	RobotsTxtDisabled: true,
	BrowserEndpoint:   result["webSocketDebuggerUrl"].(string),
}).Start()

Issue
I'm unable to stop the geziyor program after scraping finishes. I'm also not able to close the browser (I don't know if you handle it in the background) once scraping finishes, so I get these errors in chromedp/headless-shell:
[0813/133906.020892:WARNING:resource_bundle.cc(1048)] locale resources are not loaded
[0813/135451.951241:ERROR:broker_posix.cc(46)] Received unexpected number of handles
headless-shell error screenshot

While in geziyor, I get these error messages
request getting rendered: could not dial "ws://127.0.0.1:9222/devtools/browser/d224d555-fbd7-491a-86c0-332edb8f2975": context deadline exceeded
geziyor error screenshot

Any advice on the issue?

Panic during GetRendered call

Trying to parse a webpage using GetRendered results in the following error, though Get works fine.

runtime error: invalid memory address or nil pointer dereference
goroutine 117 [running]:
runtime/debug.Stack(0xc0006c9d48, 0x16846e0, 0x1d708a0)
	/usr/local/opt/go/libexec/src/runtime/debug/stack.go:24 +0x9d
github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc0001ce000)
	/Users/xxx/golang/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:303 +0x57
panic(0x16846e0, 0x1d708a0)
	/usr/local/opt/go/libexec/src/runtime/panic.go:679 +0x1b2
main.parseProductPage(0xc0001ce000, 0xc008049350)
	/Users/xxx/golang/src/main.go:39 +0xa1
github.com/geziyor/geziyor.(*Geziyor).do(0xc0001ce000, 0xc000393340, 0x177afe0)
	/Users/xxx/golang/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:264 +0x283
created by github.com/geziyor/geziyor.(*Geziyor).Do
	/Users/xxx/golang/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:232 +0xa8

LogDisabled scope of action needs to be restricted to Geziyor only

Hi,

Geziyor's log disabling feature needs to be restricted to Geziyor itself. Using the LogDisabled option currently makes it impossible to silence Geziyor's logs while still using Go's built-in logger in your own code.

I believe this bit of code is the culprit since it's discarding everything fed to the built-in logger:

geziyor/geziyor.go

Lines 119 to 124 in 129402d

// Logging
if opt.LogDisabled {
	log.SetOutput(ioutil.Discard)
} else {
	log.SetOutput(os.Stdout)
}

Best regards,
E. Lundin.

Are there any plan to add supports for a POST request?

Hi there, I was using the project for a personal crawler. After navigating the source code, I've realized that the only way to send a POST request might be implementing a StartRequestsFunc (let me know if I'm wrong lol) which manipulates the HTTP client directly, e.g.:

func postToUrl(url string, body io.Reader) {
	geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			req, _ := client.NewRequest("POST", url, body)
			g.Do(req, nil)
		},
	}).Start()
}

I haven't tried this approach yet, but I'd like to know if that's the proper way to send requests other than GET. Or is there any plan to add other implementations, or an official example of a POST request?

Unable to get content dynamically generated by js

Hello, the webpage I want to crawl is dynamically generated by JS, so I'm trying to use Geziyor to get the content I want, and I have two questions.
First, I used the official example to get the src attribute of the video tag, and it was still unable to print the attribute value (code below).
Second, can content dynamically generated by JS be fetched with multiple concurrent requests, the same way as static pages?

package main

import (
	"fmt"
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		// set UserAgent
		UserAgent: "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1",
		//
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://m.douyu.com/100", g.Opt.ParseFunc)
		},
		// Handle response
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			if src, ok := r.HTMLDoc.Find("#html5player-video").Attr("src"); ok {
				fmt.Println("src is:", src)
			}
		},
	}).Start()
}

Memory leak (goroutines)

At the moment, each request that is queued up is first pushed into a goroutine, which is then blocked until space appears in the queue. However, if you're crawling very large websites with lots of links, this causes goroutines to build up exponentially, at ~8 KB a pop (one per link in the "queue").

I attempted to fix this on my side with a semaphore which blocks requests from being added to the queue, but since the middlewares can cancel the request, I'm not exposed to any kind of response with which to release that semaphore, and I run into deadlocks.

In short, I feel this needs a queue in the core rather than spinning up tens of thousands of goroutines. If you try crawling, say, Wikipedia with 100 connections, you'll see memory usage accelerate incredibly quickly (into GBs in seconds).

What I was essentially doing on my side was adding a semaphore (very similar to your existing semaphore) but having it block at the point of URL addition/response consumption, before the goroutine is created, which would solve the issue if it weren't for my deadlocks(!)

Let me know if I can add any more context! :-)

Question: Pass through data

What's the best way to implement pass-through data? E.g., from the top level I would like to pass tags which could be used while writing to a file.
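
One way to do this with the documented API is the request Meta map (see "Custom Requests - Passing Metadata To Callbacks" above); a sketch:

geziyor.NewGeziyor(&geziyor.Options{
	StartRequestsFunc: func(g *geziyor.Geziyor) {
		req, _ := client.NewRequest("GET", "https://httpbin.org/anything", nil)
		req.Meta["tags"] = []string{"daily", "eu"} // top-level pass-through data
		g.Do(req, g.Opt.ParseFunc)
	},
	ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
		// The tags travel with the request and are available when exporting.
		g.Exports <- map[string]interface{}{
			"url":  r.Request.URL.String(),
			"tags": r.Request.Meta["tags"],
		}
	},
	Exporters: []export.Exporter{&export.JSON{}},
}).Start()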

Scrape URLs, then visit them.

I'm looking for:

  • Parse URLs
  • Visit each parsed URL
  • Parse data from the visited page.

For example,

  • Get book URLs from Goodreads
  • Visit those pages
  • Get the books' data from the visited pages.

This is possible with colly; I wonder if it's possible with geziyor.
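
This two-stage pattern is expressible with the documented g.Get / JoinURL API by chaining callbacks; a sketch (the selectors are illustrative, not taken from Goodreads):

func parseList(g *geziyor.Geziyor, r *client.Response) {
	// Stage 1: collect detail-page URLs and visit each one.
	r.HTMLDoc.Find("a.book-link").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			g.Get(r.JoinURL(href), parseDetail)
		}
	})
}

func parseDetail(g *geziyor.Geziyor, r *client.Response) {
	// Stage 2: extract data from each visited page.
	g.Exports <- map[string]interface{}{
		"title": r.HTMLDoc.Find("h1").Text(),
	}
}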

runtime error: invalid memory address or nil pointer dereference

I just ran the basic example and got this error

code:

package main

import (
	"fmt"

	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
		},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println(r.HTMLDoc.Find("title").Text())
		},
		//BrowserEndpoint: "ws://localhost:3000",
	}).Start()
}

error:

Scraping Started
Crawled: (200) <GET https://httpbin.org/anything>
runtime error: invalid memory address or nil pointer dereference goroutine 40 [running]:
runtime/debug.Stack()
        C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x65
github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc00016cdc0)
        C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x45
panic({0x111dc60, 0x17f7d60})
        C:/Program Files/Go/src/runtime/panic.go:838 +0x207
main.main.func2(0xc00014a1c8?, 0xc000409d10?)
        C:/Users/Marshall/Desktop/gezi/main.go:16 +0x18
github.com/geziyor/geziyor.(*Geziyor).do(0xc00016cdc0, 0xc0001524b0, 0x12350c8)
        C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:262 +0x235
created by github.com/geziyor/geziyor.(*Geziyor).Do
        C:/Users/Marshall/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xd2

Scraping Finished

Any advice?

go get geziyor has error

The problem is that go get geziyor relies on github.com/knq/sysutil. The module's path was modified, resulting in the error message:

github.com/knq/sysutil parsing go.mod
module declares its path as: github.com/chromedp/sysutil
but was required as: github.com/knq/sysutil

This should be a simple correction.
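
Until the dependency is updated upstream, one workaround is a replace directive in your go.mod (the target version shown here is illustrative):

replace github.com/knq/sysutil => github.com/chromedp/sysutil v1.0.0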

rendered JS not working

Even though I copied the text from the documentation without any changes, I still get a runtime error.

Here is my code (main.go just calls it):

import (
	"fmt"

	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func FindMenu() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://www.google.com", g.Opt.ParseFunc)
		},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println(string(r.Body))
		},
		//BrowserEndpoint: "ws://localhost:3000",
	}).Start()

}

and here is the error:

Scraping Started
assignment to entry in nil map goroutine 22 [running]:
runtime/debug.Stack()
        C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x65
github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc0000e74a0)
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x45
panic({0x134cd40, 0x1552d50})
        C:/Program Files/Go/src/runtime/panic.go:884 +0x212
net/textproto.MIMEHeader.Set(...)
        C:/Program Files/Go/src/net/textproto/header.go:22
net/http.Header.Set(...)
        C:/Program Files/Go/src/net/http/header.go:40
github.com/geziyor/geziyor/client.ConvertMapToHeader(0xc00022a680?)
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:297 +0x125
github.com/geziyor/geziyor/client.(*Client).doRequestChrome(0xc0003777d0, 0xc0000ac5f0)
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:237 +0x6ca
github.com/geziyor/geziyor/client.(*Client).DoRequest(0xc0003777d0, 0xc0000ac5f0)
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:96 +0x33
github.com/geziyor/geziyor.(*Geziyor).do(0xc0000e74a0, 0xc0000ac5f0, 0x1488ad0)
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:246 +0x12e
created by github.com/geziyor/geziyor.(*Geziyor).Do
        C:/Users/simeo/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xd2

Scraping Finished

I'm sorry if the error is on my side.

cannot create context from nil parent

I'm pretty new to Go, so I'm not exactly sure why this is happening. The program is pretty simple.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
)

func main() {
	var result map[string]interface{}
	resp, err := http.Get("http://localhost:9222/json/version")
	if err != nil {
		log.Fatal(err)
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}

	geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://growth.cx", g.Opt.ParseFunc)
		},
		ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println(string(r.Body))
		},
		BrowserEndpoint:   result["webSocketDebuggerUrl"].(string),
		RobotsTxtDisabled: true,
	}).Start()
}

I created a Dockerfile like this

FROM golang:1.15-buster as builder

WORKDIR /app

COPY go.* ./
RUN go mod download

COPY . ./

RUN go build -v -o server

FROM chromedp/headless-shell:latest

COPY --from=builder /app/server /app/server

CMD ["/app/server"]

When docker exec -it container-name bash and run ./app/server, I get the following error

Scraping Started
cannot create context from nil parent goroutine 42 [running]:
runtime/debug.Stack(0xc000077a78, 0xb3dbe0, 0xd08cd0)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/geziyor/geziyor.(*Geziyor).recoverMe(0xc000228fa0)
        /go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:310 +0x57
panic(0xb3dbe0, 0xd08cd0)
        /usr/local/go/src/runtime/panic.go:969 +0x1b9
context.WithCancel(0x0, 0x0, 0xc00006cc00, 0x0, 0x203000)
        /usr/local/go/src/context/context.go:234 +0x165
github.com/chromedp/chromedp.NewRemoteAllocator(0x0, 0x0, 0xc0000fa5a0, 0x49, 0xd0, 0xd0, 0x203000)
        /go/pkg/mod/github.com/chromedp/[email protected]/allocate.go:505 +0x3f
github.com/geziyor/geziyor/client.(*Client).doRequestChrome(0xc000033f40, 0xc0002a2200, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:165 +0xd0
github.com/geziyor/geziyor/client.(*Client).DoRequest(0xc000033f40, 0xc0002a2200, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:84 +0x5c
github.com/geziyor/geziyor.(*Geziyor).do(0xc000228fa0, 0xc0002a2200, 0xc77390)
        /go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:249 +0x165
created by github.com/geziyor/geziyor.(*Geziyor).Do
        /go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:231 +0xa8

Scraping Finished

Is this my mistake or something wrong with geziyor?
Please advise.

Brave browser doesn't work with geziyor

I have Brave browser installed, and as far as I know Brave is built on top of Chrome. So it should work, right?

It doesn't run without BrowserEndpoint, and if I specify the browser endpoint I get this:

Scraping Started
interface conversion: interface {} is nil, not string goroutine 54 [running]:
runtime/debug.Stack()
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/debug/stack.go:24 +0x68
github.com/geziyor/geziyor.(*Geziyor).recoverMe(0x1400032a000)
	/Users/arjen/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:307 +0x40
panic({0x103573c60, 0x14000320b70})
	/opt/homebrew/Cellar/go/1.18.2/libexec/src/runtime/panic.go:838 +0x204
github.com/chromedp/chromedp.detectURL({0x1033fcc7f, 0x14})
	/Users/arjen/go/pkg/mod/github.com/chromedp/[email protected]/util.go:72 +0x2ac
github.com/chromedp/chromedp.NewRemoteAllocator({0x103640748?, 0x140001b4008?}, {0x1033fcc7f, 0x14})
	/Users/arjen/go/pkg/mod/github.com/chromedp/[email protected]/allocate.go:513 +0x48
github.com/geziyor/geziyor/client.(*Client).doRequestChrome(0x14000312080, 0x14000342000)
	/Users/arjen/go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:176 +0x70
github.com/geziyor/geziyor/client.(*Client).DoRequest(0x14000312080, 0x14000342000)
	/Users/arjen/go/pkg/mod/github.com/geziyor/[email protected]/client/client.go:96 +0x30
github.com/geziyor/geziyor.(*Geziyor).do(0x1400032a000, 0x14000342000, 0x1036362b0)
	/Users/arjen/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:246 +0xe0
created by github.com/geziyor/geziyor.(*Geziyor).Do
	/Users/arjen/go/pkg/mod/github.com/geziyor/[email protected]/geziyor.go:228 +0xc8

Scraping Finished

can't use http2

Test http2 with:

geziyor.NewGeziyor(&geziyor.Options{
	StartURLs: []string{"https://http2.pro/api/v1"},
	ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
		fmt.Println(string(r.Body))
	},
	RobotsTxtDisabled: true,
}).Start()

Get {"http2":0,"protocol":"HTTP\/1.1","push":0,"user_agent":"Geziyor 1.0"}

May related to golang/go#17051 ?

problem installing geziyor

go get -u github.com/geziyor/geziyor
go: go.mod file not found in current directory or any parent directory.
	'go get' is no longer supported outside a module.
	To build and install a command, use 'go install' with a version,
	like 'go install example.com/cmd@latest'
	For more information, see https://golang.org/doc/go-get-install-deprecation
	or run 'go help get' or 'go help install'.
ubuntu2204@ubuntu2204:~/goscrape$ go get go.mod
go: go.mod file not found in current directory or any parent directory.
	'go get' is no longer supported outside a module.
	To build and install a command, use 'go install' with a version,
	like 'go install example.com/cmd@latest'
	For more information, see https://golang.org/doc/go-get-install-deprecation
	or run 'go help get' or 'go help install'.
ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor
go: 'go install' requires a version when current directory is not in a module
	Try 'go install github.com/geziyor/geziyor@latest' to install the latest version
ubuntu2204@ubuntu2204:~/goscrape$ go install github.com/geziyor/geziyor@latest
go: downloading github.com/geziyor/geziyor v0.0.0-20220429000531-738852f9321d
go: downloading golang.org/x/time v0.0.0-20220411224347-583f2d630306
go: downloading github.com/chromedp/chromedp v0.8.0
go: downloading github.com/PuerkitoBio/goquery v1.8.0
go: downloading github.com/chromedp/cdproto v0.0.0-20220428002153-285dfb42699c
go: downloading golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4
go: downloading golang.org/x/text v0.3.7
go: downloading github.com/go-kit/kit v0.12.0
go: downloading github.com/prometheus/client_golang v1.12.1
go: downloading github.com/temoto/robotstxt v1.1.2
go: downloading github.com/andybalholm/cascadia v1.3.1
go: downloading github.com/beorn7/perks v1.0.1
go: downloading github.com/cespare/xxhash/v2 v2.1.2
go: downloading github.com/golang/protobuf v1.5.2
go: downloading github.com/prometheus/client_model v0.2.0
go: downloading github.com/cespare/xxhash v1.1.0
go: downloading github.com/prometheus/common v0.34.0
go: downloading github.com/prometheus/procfs v0.7.3
go: downloading google.golang.org/protobuf v1.28.0
go: downloading github.com/VividCortex/gohistogram v1.0.0
go: downloading github.com/matttproud/golang_protobuf_extensions v1.0.1
go: downloading golang.org/x/sys v0.0.0-20220422013727-9388b58f7150
package github.com/geziyor/geziyor is not a main package

google-chrome: executable file not found in $PATH

Issue:

I get an error when I start my service on the server. Locally, on my machine, everything works so far.

request getting rendered: exec: "google-chrome": executable file not found in $PATH

Code

main.go

// ...
	crawler := geziyor.NewGeziyor(&geziyor.Options{
		StartRequestsFunc: func(g *geziyor.Geziyor) {
			g.GetRendered("https://www.google.com/", g.Opt.ParseFunc)
		},
		Exporters: []export.Exporter{&export.JSON{}},
	})
	
	crawler.Start()
// ...

Dockerfile

# -- Stage 1 -- #
FROM golang:1.16-alpine as builder
WORKDIR /app

COPY . .
RUN go build -mod=readonly -o bin/service

# -- Stage 2 -- #
FROM alpine

# Install any required dependencies.
RUN apk --no-cache add ca-certificates

WORKDIR /root/

COPY --from=builder /app/bin/service /usr/local/bin/

CMD ["service"]

Question

I assume I need additional dependencies on my server for geziyor to run smoothly, for example headless Chrome?
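
The alpine final stage contains no browser, so GetRendered has nothing to exec. One option, following the headless-shell Dockerfile shown in an earlier issue above, is to base the final image on chromedp/headless-shell and set BrowserEndpoint. A hedged alternative sketch is installing Chromium into the alpine stage; since the error suggests geziyor execs a google-chrome binary, a symlink is added (package and binary paths vary by distro):

# -- Stage 2 -- #
FROM alpine

# Chromium supplies the browser binary needed for rendered requests;
# the symlink name matches the binary geziyor appears to exec.
RUN apk --no-cache add ca-certificates chromium \
 && ln -sf /usr/bin/chromium-browser /usr/local/bin/google-chrome

WORKDIR /root/

COPY --from=builder /app/bin/service /usr/local/bin/

CMD ["service"]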

Cookie cutters and declarative scraping

Many websites can be scraped using standard CSS selection without defining fancy Go code to do it. For this, I still like goscrape's "structured scraper" approach. Ref:

https://github.com/andrew-d/goscrape#goscrape

And here is how its scraping is defined declaratively:

https://github.com/andrew-d/goscrape/blob/d89ba4ccc7f78429613f2a71bc7703c8faf9e8c9/_examples/scrape_hn.go#L15-L26

	config := &scrape.ScrapeConfig{
		DividePage: scrape.DividePageBySelector("tr:nth-child(3) tr:nth-child(3n-2):not([style='height:10px'])"),

		Pieces: []scrape.Piece{
			{Name: "title", Selector: "td.title > a", Extractor: extract.Text{}},
			{Name: "link", Selector: "td.title > a", Extractor: extract.Attr{Attr: "href"}},
			{Name: "rank", Selector: "td.title[align='right']",
				Extractor: extract.Regex{Regex: regexp.MustCompile(`(\d+)`)}},
		},

		Paginator: paginate.BySelector("a[rel='nofollow']:last-child", "href"),
	}

Hope geziyor can do declarative scraping using predefined cookie cutters like the above as well.

Need help understanding geziyor usage

First of all, thanks for making this module! I needed a scraper that supports JS rendering and so far it's been great!

One thing I don't understand, however, is how I can re-use a geziyor instance to make multiple requests. Every example I've seen, from either this repo or repos using the package, uses one geziyor instance to create one request, with a ParseFunc like the one shown in your example:

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}

So my question is, how can I now make a second request to a different URL? Do I have to create a new geziyor instance using a different ParseFunc? That seems like an inefficient way of doing it.

Another question: how would I return data to the function calling ParseFunc? I've only seen examples printing to stdout or to a file. I want to return a variable.

I hope I made myself clear 😊.
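
On the first question: a single geziyor instance can issue any number of requests, and each g.Get call can take its own callback; a sketch using only the documented API (quotesParse is the function from the example above):

geziyor.NewGeziyor(&geziyor.Options{
	StartRequestsFunc: func(g *geziyor.Geziyor) {
		g.Get("http://quotes.toscrape.com/", quotesParse)
		g.Get("https://httpbin.org/anything", func(g *geziyor.Geziyor, r *client.Response) {
			fmt.Println(string(r.Body)) // a second URL with a different callback
		})
	},
	Exporters: []export.Exporter{&export.JSON{}},
}).Start()

On the second question: since Start() blocks until scraping finishes, one common pattern is to have the ParseFunc closure append results to a variable captured from the caller (guarded by a mutex when requests run concurrently) and read it after Start() returns.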
