Comments (6)

niklasb commented on September 27, 2024

Hello, and thanks for taking the time to write this up. The issue tracker is absolutely the right place to ask: issues can be bug reports, feature requests, or general questions.

Now, the Node class is just a thin wrapper around a node inside the current DOM of the webkit_server, which is an external process. A Node object basically just contains a numerical ID to identify the node and a reference to the open socket connection to the server. Thus, the state of a node depends on the internal browser state of the webkit_server, which is not reproducible, let alone serializable.

So unless you can share a socket with your worker threads/processes, there is no sensible way to serialize a node. If sharing the socket is an option, you'll have to handle synchronization yourself, as webkit_server is not thread-safe (it wouldn't make sense for it to be thread-safe either, seeing that WebKit allows no parallel processing).

If you really want to go down that road (I don't see why you would, you could just as well design your program as a sequential algorithm), I strongly discourage implementing __getstate__ and __setstate__, as they cannot be implemented sensibly for the Node class. Instead, you should write a custom function that transforms a node into whatever serializable representation you need.
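A minimal sketch of such a custom function, in case it helps. It copies a few values out of the live node into a plain dict that `json` can handle. It assumes the node offers `tag_name()`, `text()` and `path()` (an XPath locating the node); treat those method names as assumptions to check against your installed version. The helper names `node_to_record` and `dump_nodes` are made up for illustration.

```python
import json


def node_to_record(node):
    """Copy the data we care about out of a live node into a plain dict.

    Assumption: the node exposes tag_name(), text() and path().
    The result holds only strings, so it no longer depends on the
    socket connection or on the browser state.
    """
    return {
        "tag": node.tag_name(),
        "text": node.text(),
        "xpath": node.path(),
    }


def dump_nodes(nodes):
    """Serialize several nodes to one JSON string."""
    return json.dumps([node_to_record(node) for node in nodes])
```

Note that the record is a snapshot of data, not a handle: it cannot be turned back into the same live node unless the page is in the same state again and the stored XPath still matches.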

Maybe if you provide more information about what you want to achieve, I can be of more help :)

from dryscrape.

bibiboot commented on September 27, 2024

Thanks for explaining it to me so nicely. My basic purpose was to store the nodes at different points during crawling and resume crawling from there (these points would act as restore points), thereby avoiding repetitive code and execution. Anyway, I will stick to a sequential implementation.

Can you recommend any way to keep restore points during crawling?

niklasb commented on September 27, 2024

Now that's definitely not possible. Webkit uses mutable state, which is the logical thing to do in a real browser context (the whole idea is for dryscrape to act like a real browser). What you describe would require a browser engine based on immutable state or copy-on-write-like techniques. I know of no such engine, and even if one existed, it would probably be a lot slower and less compatible than webkit.

So what you probably want to do is to write an explicit restore procedure yourself. Maybe it's as simple as loading a URL, maybe you also need to click some buttons afterwards. What you can also try is to save the page HTML and current URL as strings and call set_html in your restore procedure. Both of those approaches have the advantage that you can reproduce the same state in multiple sessions. Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way.
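A rough sketch of the second approach, saving the HTML and URL as strings. It assumes the session offers `url()` and `body()` for reading the current state and that `set_html` accepts the HTML followed by the URL; please verify those signatures against your installed version. `RestorePoint` and the helper functions are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RestorePoint:
    """Plain strings only, so it can be stored or sent to another process."""
    url: str
    html: str


def save_restore_point(session):
    # Assumption: session.url() and session.body() return the current
    # URL and the current page HTML as strings.
    return RestorePoint(url=session.url(), html=session.body())


def restore(session, point):
    # Assumption: set_html takes the HTML first and the URL second.
    session.set_html(point.html, point.url)


def to_json(point):
    return json.dumps(asdict(point))


def from_json(text):
    return RestorePoint(**json.loads(text))
```

Because a restore point is just two strings, the same one can be loaded into any number of sessions. It only brings back the markup, so anything else the page relied on (for example cookies) would have to be saved and restored separately.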

bibiboot commented on September 27, 2024

You wrote "Remember that you can have several distinct sessions, so you can even parallelize the scraping process this way", but you also said that a webkit_server session is not thread-safe. Following your guidance, I am now storing the URL and the cookie as a restore point, but please clarify how parallelism works with webkit_server.

It would be a great help, as the scraping is very slow.

bibiboot commented on September 27, 2024

I have acted on your suggestion and now use the URL + cookie as the restore point, but if I run two instances of the client code in parallel, the object is shared between them. Could you tell me what I am missing here?

niklasb commented on September 27, 2024

Sorry, I think I don't understand the question. What I meant by WebKit not being thread-safe is that you can't access a single session from multiple threads. Of course you can have multiple threads, each one using its own session. That's like having several browser windows open.
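One way to arrange "one session per thread" could look like the sketch below. It keeps each session in thread-local storage so that no session object is ever touched by two threads. `make_session` is a hypothetical factory you pass in (for example something that returns a new `dryscrape.Session()`), and the sketch assumes the session has `visit(url)` and `body()`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each worker thread gets its own attribute namespace on this object.
_local = threading.local()


def _session_for_this_thread(make_session):
    """Create the session lazily, once per worker thread."""
    if not hasattr(_local, "session"):
        _local.session = make_session()
    return _local.session


def scrape_one(make_session, url):
    session = _session_for_this_thread(make_session)
    session.visit(url)           # assumption: loads the page
    return url, session.body()   # assumption: returns the page HTML


def scrape_all(make_session, urls, workers=4):
    """Fetch all URLs with at most `workers` threads, one session each."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda url: scrape_one(make_session, url), urls))
```

If two workers still appear to share an object, it is worth checking that the session is created inside the worker (as above) and not once at module level and then handed to every thread.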
