I am trying to serialize the node and store it in redis ( Crawling a website and using

Problem in serializing the Node. about dryscrape HOT 6 CLOSED

bibiboot commented on September 27, 2024

Problem in serializing the Node.

from dryscrape.

Comments (6)

niklasb commented on September 27, 2024

Hello and thanks for taking the time to write this up. The issue tracker is absolutely the right place to ask; issues can be bug reports, feature requests or general questions.

Now, the Node class is just a thin wrapper around a node inside the current DOM of the webkit_server, which is an external process. A Node object basically just contains a numerical ID to identify the node and a reference to the open socket connection to the server. Thus, the state of a node depends on the internal browser state of the webkit_server, which is not reproducible, let alone serializable.

So unless you can share a socket with your worker threads/processes, there is no sensible way to serialize a node. If sharing the socket is an option, you'll have to handle synchronization by yourself, as webkit_server obviously is not thread-safe (it wouldn't make sense to be thread-safe either, seeing that webkit allows no parallel processing).

If you really want to go down that road (I don't see why you would, you could just as well design your program as a sequential algorithm), I strongly discourage implementing __getstate__ and __setstate__, as they cannot be implemented sensibly for the Node class. Instead, you should write a custom function that transforms a node into whatever serializable representation you need.
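
A minimal sketch of such a custom function, assuming all you need later is the node's tag, text, XPath and a few attributes. The function name `serialize_node` and the chosen fields are hypothetical; the calls `tag_name()`, `text()`, `path()` and `get_attr()` are what I believe the webkit_server Node exposes, so check them against your installed version.

```python
import json


def serialize_node(node, attrs=("href", "id", "class")):
    """Copy the plain-data parts of a node into a JSON string.

    The result is a snapshot only: it carries no node ID and no socket,
    so it cannot be turned back into a live Node.
    """
    snapshot = {
        "tag": node.tag_name(),   # assumed Node method
        "text": node.text(),      # assumed Node method
        "xpath": node.path(),     # assumed Node method, XPath of the node
        "attrs": {name: node.get_attr(name) for name in attrs},
    }
    return json.dumps(snapshot)
```

The resulting string can be stored in redis like any other value. To get a live node again, you would have to reload the page in a session and look the node up by the stored XPath.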

Maybe if you provide more information about what you want to achieve, I can be of more help :)

bibiboot commented on September 27, 2024

Thanks for explaining it so nicely. My basic purpose was to store the nodes at different points in the crawl and restart crawling from there (these points would act as restore points), thereby avoiding repetitive code and execution. Anyway, I will stick to a sequential implementation.

Do you recommend any way to keep restore points during crawling?

niklasb commented on September 27, 2024

Now that's definitely not possible. Webkit uses mutable state, which is the logical thing to do in a real browser context (the whole idea is for dryscrape to act like a real browser). What you describe would require a browser engine based on immutable state or copy-on-write-like techniques. I know of no such engine, and even if one existed, it would probably be a lot slower and less compatible than webkit.

So what you probably want to do is to write an explicit restore procedure yourself. Maybe it's as simple as loading a URL, maybe you also need to click some buttons afterwards. What you can also try is to save the page HTML and current URL as strings and call set_html in your restore procedure. Both of those approaches would have the advantage that you can reproduce the same state in multiple sessions. Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way.
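
A rough sketch of the second approach, assuming the session offers `url()`, `body()` and `set_html(html, url)`. The helper names `save_restore_point` and `restore` are made up for illustration, and JavaScript state or cookies are not captured here.

```python
import json


def save_restore_point(session):
    """Capture the current URL and page HTML as one JSON string."""
    return json.dumps({
        "url": session.url(),    # assumed session method
        "html": session.body(),  # assumed session method
    })


def restore(session, blob):
    """Load a saved snapshot into a session, which may be a new one."""
    state = json.loads(blob)
    # set_html is assumed to accept the HTML plus the URL to report for it
    session.set_html(state["html"], state["url"])
```

Because the snapshot is an ordinary string, it can be kept in redis and restored into any number of independent sessions.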

bibiboot commented on September 27, 2024

You wrote, "Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way," but you also said that the webkit-server session is not thread-safe. Following your guidance, I am now storing the URL and the cookie as the restore point, but please clarify how parallelism works with webkit-server.

It would be a great help, as the scraping is very slow.

bibiboot commented on September 27, 2024

I have acted on your suggestion and now use the URL + cookie as the restore point, but if I run two instances of the client code in parallel, the object is shared between them. What am I missing here?

niklasb commented on September 27, 2024

Sorry, I think I don't understand the question. What I meant by Webkit not being thread-safe is that you can't access a single session from multiple threads. Of course you can have multiple threads with each one using its own session. That's like having several browser windows open.
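
One way to arrange this, sketched under assumptions: each worker thread lazily creates its own session and never hands it to another thread. The names `session_for_this_thread`, `fetch` and `scrape_all` are hypothetical, and `make_session` stands for whatever creates a session in your code. The session is assumed to offer `visit(url)` and `body()`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage: every thread sees its own "session" attribute.
_local = threading.local()


def session_for_this_thread(make_session):
    """Return this thread's private session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = make_session()
    return _local.session


def fetch(make_session, url):
    """Visit one URL using the calling thread's own session."""
    session = session_for_this_thread(make_session)
    session.visit(url)           # assumed session method
    return url, session.body()   # assumed session method


def scrape_all(make_session, urls, workers=4):
    """Fetch all URLs in parallel and return a dict of url -> HTML."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda u: fetch(make_session, u), urls))
```

With dryscrape this would presumably be called as `scrape_all(dryscrape.Session, urls)`. Sessions are reused across URLs within a thread but never shared between threads, so no extra locking is needed.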
