
hcomicfetcher's People

Contributors: pharaun

hcomicfetcher's Issues

Website auth

Should be able to employ the same auth pipeline trick as the karmator auth bit.

Implement a form of middleware that takes a conduit or pipes Pipe to listen in on the in/out HTTP stream and handle the site-specific auth bits.
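
A rough sketch of what such a middleware could look like with pipes; the request type and both function arguments are placeholders, not anything that exists in the codebase yet:

    -- Hypothetical middleware: a Pipe sitting in the request stream that
    -- rewrites requests for sites needing auth (e.g. injecting a session cookie).
    import Pipes

    authMiddleware :: Monad m
                   => (req -> Bool)  -- does this request target a site needing auth?
                   -> (req -> req)   -- attach the credentials / session cookie
                   -> Pipe req req m r
    authMiddleware needsAuth addAuth = for cat $ \req ->
        yield (if needsAuth req then addAuth req else req)

It would then just be composed into the existing producer/consumer chain with >->, between the parser output and the fetch queue.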

Convert all callback-based parsers to pipeline parsers once this style of parsing has proven itself.

What the subject said.

    -- Pipeline parser
    case parse of
        CallbackParser cp -> runEffect $ (chanProducer toReturn) >-> (toPipeline cp) >-> (chanConsumer toFetch)
        PipelineParser pp -> runEffect $ (chanProducer toReturn) >-> pp >-> (chanConsumer toFetch)

There are two styles of parser in play, the interpreter and the callback-based parser (well, technically a third, a raw pipeline one).

So we should look into converting all of the callback parsers to pipelined parsers, mostly so that we can have actual termination, because the callback-style parsers have poor termination characteristics.
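
For reference, a sketch of what a toPipeline-style adapter could look like, assuming a callback parser boils down to a function from one fetched reply to the next batch of fetch requests (the real CallbackParser type may carry more than this):

    import Pipes

    -- Wrap a per-reply callback as a Pipe; it now terminates cleanly when the
    -- upstream producer of replies is exhausted.
    toPipeline :: Monad m => (reply -> [request]) -> Pipe reply request m r
    toPipeline cp = for cat $ \reply -> each (cp reply)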

Fix up the timing issue with the pipe parser.

It exits, which is great; however, it will for example exit early (i.e. before the queue is empty), which on its own is perfectly fine. The main gotcha is that we need a neat way of waiting until the queue is empty and then getting confirmation from the web-worker that it IS done downloading/storing/whatever.

This may lead to having a separate worker/queue for images.
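
One possible shape for that confirmation (just a sketch, none of this exists yet): a shared in-flight counter that the parser bumps for every URL it enqueues and that the web-worker only decrements once the download and the store have both finished, so the main thread can block until it drops to zero.

    import Control.Concurrent.STM

    -- Count of URLs that are queued, downloading, or being written to disk.
    newtype InFlight = InFlight (TVar Int)

    newInFlight :: IO InFlight
    newInFlight = InFlight <$> newTVarIO 0

    enqueueWork, finishWork :: InFlight -> STM ()
    enqueueWork (InFlight t) = modifyTVar' t (+ 1)         -- parser: queued one more URL
    finishWork  (InFlight t) = modifyTVar' t (subtract 1)  -- worker: downloaded AND stored

    -- Blocks (via STM retry) until every piece of work has been confirmed done.
    waitAllDone :: InFlight -> IO ()
    waitAllDone (InFlight t) = atomically (readTVar t >>= check . (== 0))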

Make the codebase actually robustly multi-threaded.

Right now it's dual-queue but single-threaded, so it's not exactly the most useful setup of code.

What would be ideal is to have as many parsers as possible, so that we can crunch the incoming data and then submit more URLs to the queue to fetch.

Then on the downloader side, we should look into how to do a fan-out in which we have timers for each site/sub-domain, so that we don't flood any single server but are still able to have multiple independent downloads going at the same time.
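
A sketch of that fan-out under these assumptions: one queue and one worker thread per host, each with its own politeness delay, so distinct sites download in parallel but no single server gets hammered (the names and the 2-second delay are illustrative only):

    import           Control.Concurrent (forkIO, threadDelay)
    import           Control.Concurrent.STM
    import           Control.Monad (forever, void, when)
    import qualified Data.Map.Strict as Map

    type Host = String
    type Url  = String

    -- Route a URL to the (possibly freshly spawned) worker for its host.
    dispatch :: TVar (Map.Map Host (TBQueue Url)) -> (Url -> IO ()) -> Host -> Url -> IO ()
    dispatch workers download host url = do
        (q, isNew) <- atomically $ do
            m <- readTVar workers
            case Map.lookup host m of
                Just q  -> return (q, False)
                Nothing -> do
                    q <- newTBQueue 64
                    writeTVar workers (Map.insert host q m)
                    return (q, True)
        -- First URL for this host: spawn its dedicated, rate-limited worker.
        when isNew $ void $ forkIO $ forever $ do
            u <- atomically (readTBQueue q)
            download u
            threadDelay 2000000  -- ~2 seconds between hits to the same host
        atomically (writeTBQueue q url)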

Backtracking CPS

This is an extension of issue #3 with a more in-depth analysis of the problem.

Basically there are several ways a website can be parsed for the relevant information, but often there are several levels of nesting, and it is often easier to just parse entries in bulk and then process them.

For example:

On site 1 we parse out a list of all books (volume 1 ... x) and put it into the queue; after the fetcher returns the content, we are able to parse the episodes (chapter 1 ... x) and again toss all of these back onto the queue; then we repeat until we finally parse out the actual image link and store it to disk.

This approach works, but it forces you to break up your code into a "callback-alike" mode, in which you do one or more related units of work and then toss the result into a queue with a Tag specifying what needs to happen next with it. This doesn't feel like the nicest way to code things up.

So I thought: why not a form of CPS, in which control passes back to the fetcher every time we ask it to fetch a webpage/image, so that it can do the fetch and then return control to the code.

Now this would work fantastically for code that is very linear, i.e. I parse Volume, then Chapter, then Page in order, store the result to disk, then repeat for the next one. But it does not work very well for what I refer to as "chunky parsing", in which you parse up a batch of volumes, then a batch of chapters from each volume...

Doing it via CPS would then force the user to keep track of each batch and handle the backtracking manually, which seems rather awkward and somewhat self-defeating, since the whole point of CPS was to simplify the code and make it nicer.

So logically the next idea is some form of backtracking CPS (this allows us to tackle it either depth-first or breadth-first, execution-wise). The idea behind backtracking CPS is that it naturally supports both linear and chunky parsing: for linear parsing there is minimal to no backtracking needed, and for the chunky ones it would work like this:

We would first parse out a list of books; then, instead of putting them all on the queue, we fetch the first one and store the rest, and hop right onto the chapter parsing; there we fetch the first chapter and store the rest, then hop onto the page parse. When the page parse is done, we backtrack to the chapter parse and repeat, then eventually backtrack all the way up to the volumes...

This seems to support both the extremely simple page-to-page webcomic and the simpler linear webcomic with a linear volume -> chapter -> page parse, and it also supports the stickiest case: the chunky parser, which parses a big batch and then spawns more batches from each batch, so the queue length explodes.
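
A toy sketch of that idea (all the types and names are made up for illustration): a parse step either finishes or hands the driver a batch of URLs plus a continuation, and the driver descends into the first URL immediately while parking the siblings to backtrack to later.

    type Url     = String
    type Content = String

    -- One step of a parse: either done, or "fetch this batch and continue with k".
    data Step
        = Done
        | Fetch [Url] (Content -> Step)

    -- Depth-first driver: dive into the first URL of every batch, keep the rest
    -- as pending (url, continuation) pairs, and backtrack once a subtree ends.
    runDepthFirst :: (Url -> IO Content) -> Step -> IO ()
    runDepthFirst fetchIO = go []
      where
        go pending Done = case pending of
            []            -> pure ()           -- nothing left to backtrack to
            (url, k) : ps -> descend url k ps  -- backtrack to the next sibling
        go pending (Fetch []     _) = go pending Done
        go pending (Fetch (u:us) k) = descend u k ([(u', k) | u' <- us] ++ pending)

        descend url k pending = do
            content <- fetchIO url
            go pending (k content)

A linear parse only ever produces one-element batches, so pending stays empty; a breadth-first variant would append the parked siblings to the back of pending instead of the front.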

Look into CPS for making the parser stuff much nicer

Right now we have lots of back and forth, passing an opaque object between the fetcher and the parser. We should be able to do some form of CPS in which we can seamlessly parse in what looks like a single function, and at each junction point where we need to fetch something from the network, the parser waits for the data to be streamed back before resuming.
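
pipes' Client/Server machinery already gives roughly this shape. A minimal sketch (the URLs and the fake fetcher are placeholders): the parser reads as one linear function and suspends at every request until the fetcher responds.

    import Pipes.Core

    type Url  = String
    type Page = String

    -- The parser: plain sequential code; each `request` hands control to the
    -- fetcher and resumes once the page body comes back.
    parseComic :: Monad m => Client Url Page m ()
    parseComic = do
        index   <- request "http://example.com/index"  -- placeholder URL
        chapter <- request (head (lines index))        -- pretend line 1 is the chapter link
        _page   <- request (head (lines chapter))      -- ...and line 1 of that is the image
        return ()

    -- The fetcher: answers each request; here a stub that fakes the page body.
    fetcher :: Monad m => Url -> Server Url Page m r
    fetcher url = do
        next <- respond ("<html>" ++ url ++ "</html>")  -- respond returns the NEXT request
        fetcher next

    -- Wire them together; runEffect drives the whole conversation.
    main :: IO ()
    main = runEffect (fetcher +>> parseComic)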

CBZ & CBR support.

This may be a bit of a complicated feature to implement at times, depending on how a page is parsed.

  1. The bulk page parse will probably be handled quite well, i.e. the BFS-style page parse, because we will have a nice stream of just images toward the end, and we can use the chapter/volume change as the trigger to start a new cbz archive. Also this will usually be translated into DFS, so we should have a clear signal upon fetching a new chapter, for example (see the sketch after this list).

  2. The streaming page parser will probably be more of a challenge: we may need to open a cbz file for archiving and keep it open in an LRU cache of open archives, and once we reach a certain limit on open files, start closing the least recently used ones; alternatively, we can close one early if we get a clear signal that the next chapter/volume is starting.
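
For the archive side itself, a minimal sketch assuming the zip-archive package (a .cbz is just a zip of images), flushing one chapter's accumulated pages once the parse signals a chapter boundary; writeChapterCbz is an illustrative name, not existing code:

    import qualified Codec.Archive.Zip as Zip
    import qualified Data.ByteString.Lazy as BL

    -- Pack one chapter's pages (name inside the archive, image bytes) into a .cbz.
    writeChapterCbz :: FilePath -> [(FilePath, BL.ByteString)] -> IO ()
    writeChapterCbz cbzPath pages =
        BL.writeFile cbzPath (Zip.fromArchive archive)
      where
        -- toEntry wants the path inside the zip, a modification time, and the bytes.
        archive = foldr Zip.addEntryToArchive Zip.emptyArchive
            [ Zip.toEntry name 0 bytes | (name, bytes) <- pages ]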

libxml2 parsing -> DOM

20:12:29 <solatis> there goes my elegant parsing code... :)
23:22:22 <solatis> https://github.com/alphaHeavy/libxml-conduit
23:22:28 <solatis> someone beat us to it :)
23:22:35 <solatis> it's not on hackage, but it works pretty well
23:23:21 <solatis> that essentially transforms the awful interface of LibXML.SAX to a nice conduit stream of events
23:24:46 <solatis> and it uses https://hackage.haskell.org/package/xml-types-0.3.4/docs/Data-XML-Types.html for element types
23:24:51 <solatis> which is pretty nice too :)
20:57:14 <solatis> in case you're there, it seems like the complete ecosystem for making a DOM based on libxml2 is already out there
20:57:24 <solatis> just not everything is properly documented / on hackage, but it works
20:57:38 <solatis> https://github.com/alphaHeavy/libxml-conduit
20:57:50 <solatis> that converts a stream of bytes to Data.Xml.Types 
20:58:09 <solatis> then you have a 'proper' stream of XML event types (EventBeginDocument, etc)
20:58:57 <solatis> and guess what we have here:
20:59:00 <solatis> https://hackage.haskell.org/package/xml-conduit-1.2.3/docs/Text-XML.html#g:6
20:59:17 <solatis> fromEvents :: MonadThrow m => Consumer EventPos m Document
20:59:30 <solatis> that one converts those XML events to a Document
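
Putting those pieces together, a sketch of the events-to-Document half using xml-conduit's own stream parser as the event source (libxml-conduit would slot in as the producer of EventPos values instead; its exact API isn't assumed here):

    import           Conduit (runConduitRes, sourceFile, (.|))
    import           Data.Default (def)
    import qualified Text.XML as X
    import qualified Text.XML.Stream.Parse as P

    -- bytes -> Data.XML.Types events -> Text.XML Document
    loadDoc :: FilePath -> IO X.Document
    loadDoc path = runConduitRes $
           sourceFile path        -- raw bytes (a file here; could be the HTTP body)
        .| P.parseBytesPos def    -- ByteString -> stream of EventPos
        .| X.fromEvents           -- EventPos stream -> Document (the fromEvents above)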

Cache Invalidation

Right now the cache is extremely dumb: it just caches all content forever, hashed by... URL, I believe, to disk.

Should update it to a newer form of cache.

  1. Hash the url for static content (images) and store that
  2. Hash the content + url + timestamp for "dynamic" content (html) and store that.

Now for html pages we have the timestamp, so if we refetch we can pull from the cache until it expires; once it expires, we fetch a new version off the site and compare its content hash, and if it's the same we don't bother processing it. Now if it's different, there are two approaches:

  1. Find out the difference and process it
  2. Just process the whole thing, and rely on the fact that if it's existing static content we can check the URL hash to see whether we have already processed it before, and skip it.
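
A sketch of how those two entry kinds and the revalidation decision could be laid out (all names are illustrative, and hashBytes is a placeholder for a real digest such as SHA-256):

    import qualified Data.ByteString.Lazy as BL
    import           Data.Time (UTCTime)

    data CacheEntry
        = StaticEntry  { urlHash :: String }            -- images: keyed by URL hash only
        | DynamicEntry { urlHash     :: String          -- html: URL + content hash + expiry
                       , contentHash :: String
                       , expiresAt   :: UTCTime
                       }

    data Action = UseCache | Refetch | Process | Skip
        deriving Show

    -- refetched is Nothing before we have hit the network, Just body afterwards.
    revalidate :: UTCTime -> Maybe BL.ByteString -> CacheEntry -> Action
    revalidate now refetched entry = case entry of
        StaticEntry{}  -> UseCache                                -- static content never expires
        DynamicEntry{} -> case refetched of
            Nothing
                | now < expiresAt entry -> UseCache               -- still fresh, skip the network
                | otherwise             -> Refetch                -- expired: fetch a new copy
            Just body
                | hashBytes body == contentHash entry -> Skip     -- unchanged, don't reprocess
                | otherwise                           -> Process  -- changed: process the new version
      where
        hashBytes :: BL.ByteString -> String
        hashBytes = show . BL.length  -- placeholder, NOT a real hash; swap in SHA-256 or similar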

Some sort of source list

Probably we can do some sort of quasiquotation or TH and parse a list, something like:

FetcherType Status url

Where you put down the name of a Fetcher type (for the larger aggregation comic sites), the status (i.e. is it still ongoing, i.e. do you want to check for updates or not), and then the root URL of a particular comic.

Then find a neat way to look through the list for entries that are still active, check their cached page/template to see if there are any updates, and fetch more if more exists.
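
A sketch of the plain-text version of that list (a quasiquoter/TH version would build the same values at compile time); the FetcherType constructors here are made up:

    import Data.Maybe (mapMaybe)
    import Text.Read (readMaybe)

    data FetcherType = MangaAggregator | Webcomic  -- hypothetical fetcher backends
        deriving (Read, Show, Eq)

    data Status = Ongoing | Completed
        deriving (Read, Show, Eq)

    data Source = Source
        { sourceFetcher :: FetcherType
        , sourceStatus  :: Status
        , sourceUrl     :: String
        } deriving Show

    -- One "FetcherType Status url" line per comic.
    parseSource :: String -> Maybe Source
    parseSource line = case words line of
        [f, s, url] -> Source <$> readMaybe f <*> readMaybe s <*> pure url
        _           -> Nothing

    -- Everything still worth polling for updates.
    activeSources :: String -> [Source]
    activeSources = filter ((== Ongoing) . sourceStatus) . mapMaybe parseSource . lines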
