
Comments (13)

Evert-Arends commented on May 18, 2024

@TheUltimateCookie Did you manage to find a solution? I'm in the same boat sadly...

from geziyor.

musabgultekin commented on May 18, 2024

Are you connecting to the headless browser from outside of the container? You should be inside; according to the chromedp library docs, it should be inside.

Evert-Arends commented on May 18, 2024

musabgultekin commented on May 18, 2024

I've just pushed a possible fix and updated the chromedp library. Can you test again?

Evert-Arends commented on May 18, 2024

How would I go about trying that? Is it an official release?
go get -u && go mod tidy and a rerun give me the same issues.

Should I clone the project and link locally from my source? I have never done this before, my apologies.

Thanks for listening, and love the project.

musabgultekin commented on May 18, 2024

How many requests are you making concurrently? If you're on a device with 16 GB of memory, you can probably make around 100.
Note that @TheUltimateCookie tries to make 1000 concurrently in the original message, which is not possible on current regular PCs.

You can make sure you're using the fixed version with this command:

go get github.com/geziyor/geziyor@738852f9321de26c193ae88a9b2fb4d6aebb6540

Also, there could be some kind of Windows issue. Are you using Windows?

What exactly is the error you're getting? context deadline exceeded?

Evert-Arends commented on May 18, 2024

I'm using Linux. I try to crawl on an HTTP basis (submit a URL and it gets crawled; it should not be more than 1 per second at most). Right now it's just 1 URL in the main() func.

PC specs should be fine:
CPU: AMD Ryzen Threadripper 2920X (24) @ 3.500GHz
Memory: 10567MiB / 31954MiB

This is the output: https://berms.onlyfans.je/41d7d013-1bcf-44c5-9aac-4493f96e5312.png

I use pkill main to quit the app now

musabgultekin commented on May 18, 2024

Oh, that's weird. I've just tried on Ubuntu and it worked. Are you using BrowserEndpoint and connecting to the chromedp/headless-shell docker image?

Evert-Arends commented on May 18, 2024

I'm not, here's the options:

a := geziyor.NewGeziyor(&geziyor.Options{
			RobotsTxtDisabled: true,
			UserAgent:         "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
			StartRequestsFunc: func(g *geziyor.Geziyor) {
				req, err := client.NewRequest("GET", createUrl(id), nil)
				if err != nil {
					return
				}
				req.Header.Set("Cookie", "cookie: mid=YcSa0AAEAAG4kXzdhE8C-Coo; iid=-9F82-416-9087-A0B96564FD5F; csrftoken=kgN3CetChNHD54JKA2g4pCsO8pkZ; ds_user_id=; sessionid=")
				req.Rendered = false
				g.Do(req, g.Opt.ParseFunc)
			},
			ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
				reader := ioutil.NopCloser(bytes.NewBuffer(r.Body))
				err := json.NewDecoder(reader).Decode(&video)
				if err != nil {
					fmt.Println(err)
				}
				parseVid(w, video)
				return
			},
		})
		a.Start()

I have Chromium installed, like the readme asked. No Docker or any other stuff involved. The crawl is successful, the process just does not die.

Edit: OS Linux 5.15.32-1-MANJARO

musabgultekin commented on May 18, 2024

Hmm, unfortunately I don't even know what's happening here. I would recommend starting the headless docker image and using BrowserEndpoint.
Btw, I have Google Chrome installed and not Chromium. I don't know if that'll fix the issue anyway.

Evert-Arends commented on May 18, 2024

Thank you for taking the time to talk to me about this issue. This is for a hobby project so I'm not stressing; it's just rather weird to me that this happens.

I think I have found a bug? If I set a BrowserEndpoint and it's incorrect, it just ignores it. To me it feels like this should fail, right?

docker run -d -p 9222:9222 --rm --name headless-shell chromedp/headless-shell

I forgot to change the endpoint to 9222 in the code and wrote 3000 like in the README. The scraping still works, which means it isn't using the Docker one. I fixed the endpoint, but I don't believe it's using it. I removed Chromium, so now I believe it's using Brave's instance. Do you have an idea where I can check that?
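
One way to check which browser is actually listening on an endpoint is to query the Chrome DevTools HTTP endpoint `/json/version`, which reports the browser build and its WebSocket debugger URL. The sketch below assumes the port mapping from the docker command above (9222); `probeDevTools`, `parseVersion` and `versionInfo` are hypothetical helper names, not part of geziyor or chromedp.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// versionInfo holds the two fields of /json/version that matter here.
type versionInfo struct {
	Browser              string `json:"Browser"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// parseVersion decodes a /json/version response body and rejects
// anything that does not look like a DevTools answer.
func parseVersion(data []byte) (versionInfo, error) {
	var info versionInfo
	if err := json.Unmarshal(data, &info); err != nil {
		return info, fmt.Errorf("not DevTools JSON: %w", err)
	}
	if info.WebSocketDebuggerURL == "" {
		return info, errors.New("response has no webSocketDebuggerUrl")
	}
	return info, nil
}

// probeDevTools asks the DevTools HTTP endpoint which browser is listening.
func probeDevTools(baseURL string) (versionInfo, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(baseURL + "/json/version")
	if err != nil {
		return versionInfo{}, fmt.Errorf("nothing reachable at %s: %w", baseURL, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return versionInfo{}, fmt.Errorf("unexpected status %s from %s", resp.Status, baseURL)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return versionInfo{}, err
	}
	return parseVersion(body)
}

func main() {
	// Assumed address: the port published by the docker run command.
	info, err := probeDevTools("http://localhost:9222")
	if err != nil {
		fmt.Println("endpoint check failed:", err)
		return
	}
	fmt.Println("browser:", info.Browser)
	fmt.Println("websocket:", info.WebSocketDebuggerURL)
}
```

Running the same probe against port 3000 should fail if nothing is listening there, which would confirm that a wrong endpoint is being silently ignored rather than used.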

EDIT:

I found this question about a similar topic:
https://stackoverflow.com/questions/67696969/get-selenium-to-work-with-brave-browser-on-linux

One of the commenters said:

"The question is why is close() not working in brave when headless, it is working for chrome!"

So I think I found the headless issue. If anyone googles this they might end up here, so that's a win. It doesn't explain why Docker does not work and why it prioritizes the headless Brave instead of Docker.

musabgultekin commented on May 18, 2024

Oh! I just realised you used req.Rendered = false in your code. That makes geziyor not use Chrome rendering at all!

Evert-Arends commented on May 18, 2024

Interesting, thanks. When Rendered = true, it adds HTML to my JSON response, which I really, really do not want. However, I'll find a workaround for that.

Even when using Docker and Rendered = true, it still refuses to die after a quit signal. I'll build something with subprocess that executes pkill appname as a workaround for now.

