Comments (13)
@TheUltimateCookie Did you manage to find a solution? I'm in the same boat sadly...
from geziyor.
Are you connecting to the headless browser from outside the container? According to the chromedp library docs, you should be connecting from inside it.
from geziyor.
I've just pushed a possible fix and updated the chromedp library. Can you test again?
from geziyor.
How would I go about trying that? Is it an official release?
go get -u && go mod tidy
and a rerun gives me the same issues.
Should I clone the project and link locally from my source? I have never done this before, my apologies.
Thanks for listening, and love the project.
from geziyor.
How many requests are you making concurrently? On a device with 16 GB of memory, you can probably make around 100.
Note that @TheUltimateCookie tries to make 1000 concurrent requests in the original message, which is not feasible on a typical PC today.
You can make sure you are using the fixed version with this command:
go get github.com/geziyor/geziyor@738852f9321de26c193ae88a9b2fb4d6aebb6540
Also, there could be some kind of Windows issue. Are you using Windows?
What exactly is the error you're getting? context deadline exceeded?
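To illustrate the point about capping concurrency, here is a minimal standard-library sketch of a bounded fan-out using a buffered channel as a semaphore. It is not geziyor's own mechanism; `runLimited`, the URL list, and the no-op work function are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// runLimited calls work once per URL while never running more than limit
// calls at the same time. limit must be positive. It returns the highest
// number of simultaneous calls that was observed.
func runLimited(urls []string, limit int, work func(url string)) int {
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this call ends

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work(u)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return peak
}

func main() {
	// Hypothetical workload: 1000 placeholder URLs, at most 100 in flight.
	urls := make([]string, 1000)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/page/%d", i)
	}
	peak := runLimited(urls, 100, func(url string) {
		// A real crawler would fetch the URL here.
	})
	fmt.Println("peak concurrent calls:", peak)
}
```

The cap is enforced before each goroutine starts, so memory use grows with the limit rather than with the total number of URLs.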
from geziyor.
I'm using Linux. I crawl on an HTTP-request basis (a URL is submitted and it gets crawled; it should not be more than 1 per second at most). Right now it's just 1 URL in the main() func.
PC specs should be fine:
CPU: AMD Ryzen Threadripper 2920X (24) @ 3.500GHz
Memory: 10567MiB / 31954MiB
This is the output: https://berms.onlyfans.je/41d7d013-1bcf-44c5-9aac-4493f96e5312.png
I use pkill main to quit the app now.
from geziyor.
Oh, that's weird. I've just tried on Ubuntu and it worked. Are you using BrowserEndpoint and connecting to the chromedp/headless-shell Docker image?
from geziyor.
I'm not, here's the options:
a := geziyor.NewGeziyor(&geziyor.Options{
	RobotsTxtDisabled: true,
	UserAgent:         "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
	StartRequestsFunc: func(g *geziyor.Geziyor) {
		req, err := client.NewRequest("GET", createUrl(id), nil)
		if err != nil {
			return
		}
		req.Header.Set("Cookie", "cookie: mid=YcSa0AAEAAG4kXzdhE8C-Coo; iid=-9F82-416-9087-A0B96564FD5F; csrftoken=kgN3CetChNHD54JKA2g4pCsO8pkZ; ds_user_id=; sessionid=")
		req.Rendered = false
		g.Do(req, g.Opt.ParseFunc)
	},
	ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
		reader := ioutil.NopCloser(bytes.NewBuffer(r.Body))
		err := json.NewDecoder(reader).Decode(&video)
		if err != nil {
			fmt.Println(err)
		}
		parseVid(w, video)
		return
	},
})
a.Start()
I have Chromium installed, like the README asked. No Docker or any other stuff involved. The crawl is successful; the process just does not die.
Edit: OS Linux 5.15.32-1-MANJARO
from geziyor.
Hmm, unfortunately I don't even know what's happening here. I would recommend starting the headless Docker image and using BrowserEndpoint.
By the way, I have Google Chrome installed, not Chromium. I don't know if that will fix the issue anyway.
from geziyor.
Thank you for taking the time to talk to me about this issue. This is for a hobby project, so I'm not stressing; it's just rather weird to me that this happens.
I think I have found a bug. If I set a BrowserEndpoint and it's incorrect, it just ignores it. To me it feels like this should fail, right?
docker run -d -p 9222:9222 --rm --name headless-shell chromedp/headless-shell
I forgot to change the endpoint to 9222 in the code and wrote 3000, like in the README. The scraping still works, which means it isn't using the Docker one. I fixed the endpoint, but I don't believe it's using it. I also removed Chromium, so now I believe it's using Brave's instance. Do you have an idea where I can check that?
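One way to check which browser, if any, answers on a given endpoint is to query the `/json/version` resource that Chrome-based browsers expose on their remote-debugging port. The sketch below uses only the Go standard library; `versionURL` and `browserVersion` are hypothetical helper names, and it assumes the endpoints from this thread (ports 3000 and 9222 on localhost). It does not show which browser geziyor itself ends up using, only what is listening on each port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// versionURL turns a DevTools endpoint such as ws://localhost:9222 into the
// HTTP URL of the /json/version resource on the same host and port.
func versionURL(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("endpoint %q has no host", endpoint)
	}
	scheme := "http"
	if u.Scheme == "wss" || u.Scheme == "https" {
		scheme = "https"
	}
	return scheme + "://" + u.Host + "/json/version", nil
}

// browserVersion asks the endpoint which browser is listening there.
func browserVersion(endpoint string) (string, error) {
	target, err := versionURL(endpoint)
	if err != nil {
		return "", err
	}
	httpClient := &http.Client{Timeout: 3 * time.Second}
	resp, err := httpClient.Get(target)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	var info struct {
		Browser string `json:"Browser"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", err
	}
	return info.Browser, nil
}

func main() {
	// The two ports mentioned in this thread; adjust as needed.
	for _, endpoint := range []string{"ws://localhost:3000", "ws://localhost:9222"} {
		browser, err := browserVersion(endpoint)
		if err != nil {
			fmt.Printf("%s: not reachable: %v\n", endpoint, err)
			continue
		}
		fmt.Printf("%s: %s\n", endpoint, browser)
	}
}
```

If the port from the code reports "not reachable" while scraping still succeeds, the requests are not going through that endpoint.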
EDIT:
I found this question about a similar topic:
https://stackoverflow.com/questions/67696969/get-selenium-to-work-with-brave-browser-on-linux
One of the commenters said:
"The question is why is close() not working in brave when headless, it is working for chrome!"
So I think I found the headless issue. If anyone googles this they might end up here, so that's a win. It doesn't explain why Docker does not work and why the headless Brave instance is prioritized over Docker.
from geziyor.
Oh! I just realised you used req.Rendered = false in your code. That makes geziyor not use the rendered Chrome at all!
from geziyor.
Interesting, thanks. When Rendered = true, it adds html to my json response, which I really really do not want. However I'll find a workaround for that.
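A possible workaround for the HTML wrapped around the JSON, sketched with the standard library only. It assumes the rendered body puts the JSON text inside a single `<pre>` element, which is how Chrome typically displays a raw JSON document; `extractJSON` and the sample body are hypothetical and not part of geziyor.

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
	"strings"
)

// extractJSON returns the text inside the first <pre> element of a rendered
// page, with HTML entities decoded. The boolean is false when no <pre>
// element is found.
func extractJSON(rendered string) (string, bool) {
	start := strings.Index(rendered, "<pre")
	if start < 0 {
		return "", false
	}
	tagEnd := strings.Index(rendered[start:], ">")
	if tagEnd < 0 {
		return "", false
	}
	bodyStart := start + tagEnd + 1
	end := strings.LastIndex(rendered, "</pre>")
	if end < bodyStart {
		return "", false
	}
	return html.UnescapeString(rendered[bodyStart:end]), true
}

func main() {
	// Hypothetical rendered body, shaped like a browser's view of a JSON URL.
	rendered := `<html><head></head><body><pre style="word-wrap: break-word;">{"title":"a &amp; b"}</pre></body></html>`

	raw, ok := extractJSON(rendered)
	if !ok {
		fmt.Println("no <pre> element found")
		return
	}
	var video map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &video); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(video["title"])
}
```

If the endpoint returns plain JSON without rendering, keeping Rendered = false avoids the wrapper altogether.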
Even when using Docker and Rendered = true, it still refuses to die after a quit signal. I'll build something with subprocess that executes pkill appname as a workaround for now.
from geziyor.