
hcomicfetcher's People

Contributors: pharaun

hcomicfetcher's Issues

Website auth

Should be able to employ the same auth pipeline trick as the karmator auth bit.

Implement a form of middleware that takes a conduit or pipes Pipe to listen in on the in/out HTTP stream and handle the site-specific auth bits.
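
A rough sketch of what such a middleware could look like with pipes; the request type and both function arguments are placeholders, not anything that exists in the codebase yet:

    -- Hypothetical middleware: a Pipe sitting in the request stream that
    -- rewrites requests for sites needing auth (e.g. injecting a session cookie).
    import Pipes

    authMiddleware :: Monad m
                   => (req -> Bool)  -- does this request target a site needing auth?
                   -> (req -> req)   -- attach the credentials / session cookie
                   -> Pipe req req m r
    authMiddleware needsAuth addAuth = for cat $ \req ->
        yield (if needsAuth req then addAuth req else req)

It would then just be composed into the existing producer/consumer chain with >->, between the parser output and the fetch queue.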

Convert all callback-based parsers to pipeline parsers once this style of parsing has proven itself.

What the subject said.

    -- Pipeline parser
    case parse of
        CallbackParser cp -> runEffect $ (chanProducer toReturn) >-> (toPipeline cp) >-> (chanConsumer toFetch)
        PipelineParser pp -> runEffect $ (chanProducer toReturn) >-> pp >-> (chanConsumer toFetch)

There are two styles of parser in play, the interpreter and the callback-based parser (well, technically a third, a raw pipeline one).

So we should look into converting all of the callback parsers to pipelined parsers, mostly so that we can have actual termination, because the callback-style parsers have poor termination characteristics.
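
For reference, a sketch of what a toPipeline-style adapter could look like, assuming a callback parser boils down to a function from one fetched reply to the next batch of fetch requests (the real CallbackParser type may carry more than this):

    import Pipes

    -- Wrap a per-reply callback as a Pipe; it now terminates cleanly when the
    -- upstream producer of replies is exhausted.
    toPipeline :: Monad m => (reply -> [request]) -> Pipe reply request m r
    toPipeline cp = for cat $ \reply -> each (cp reply)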

Fix up the timing issue with the pipe parser.

It exits, which is great; however, it will for example exit early (i.e. before the queue is empty), which on its own is perfectly fine. The main gotcha is that we need a neat way of waiting until the queue is empty and then getting confirmation from the web-worker that it IS done downloading/storing/whatever.

This may lead to having a separate worker/queue for images.
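
One possible shape for that confirmation (just a sketch, none of this exists yet): a shared in-flight counter that the parser bumps for every URL it enqueues and that the web-worker only decrements once the download and the store have both finished, so the main thread can block until it drops to zero.

    import Control.Concurrent.STM

    -- Count of URLs that are queued, downloading, or being written to disk.
    newtype InFlight = InFlight (TVar Int)

    newInFlight :: IO InFlight
    newInFlight = InFlight <$> newTVarIO 0

    enqueueWork, finishWork :: InFlight -> STM ()
    enqueueWork (InFlight t) = modifyTVar' t (+ 1)         -- parser: queued one more URL
    finishWork  (InFlight t) = modifyTVar' t (subtract 1)  -- worker: downloaded AND stored

    -- Blocks (via STM retry) until every piece of work has been confirmed done.
    waitAllDone :: InFlight -> IO ()
    waitAllDone (InFlight t) = atomically (readTVar t >>= check . (== 0))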

Make the codebase actually robustly multi-threaded.

Right now it's dual-queue but single-threaded, so it's not exactly the most useful setup of code.

What would be ideal is to have as many parsers as possible, so that we can crunch the incoming data and then submit more URLs to the queue to fetch.

Then on the downloader side, we should look into how to do a fan-out in which we have timers for each site/sub-domain, so that we don't flood any single server but are still able to have multiple independent downloads going at the same time.
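
A sketch of that fan-out under these assumptions: one queue and one worker thread per host, each with its own politeness delay, so distinct sites download in parallel but no single server gets hammered (the names and the 2-second delay are illustrative only):

    import           Control.Concurrent (forkIO, threadDelay)
    import           Control.Concurrent.STM
    import           Control.Monad (forever, void, when)
    import qualified Data.Map.Strict as Map

    type Host = String
    type Url  = String

    -- Route a URL to the (possibly freshly spawned) worker for its host.
    dispatch :: TVar (Map.Map Host (TBQueue Url)) -> (Url -> IO ()) -> Host -> Url -> IO ()
    dispatch workers download host url = do
        (q, isNew) <- atomically $ do
            m <- readTVar workers
            case Map.lookup host m of
                Just q  -> return (q, False)
                Nothing -> do
                    q <- newTBQueue 64
                    writeTVar workers (Map.insert host q m)
                    return (q, True)
        -- First URL for this host: spawn its dedicated, rate-limited worker.
        when isNew $ void $ forkIO $ forever $ do
            u <- atomically (readTBQueue q)
            download u
            threadDelay 2000000  -- ~2 seconds between hits to the same host
        atomically (writeTBQueue q url)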

Backtracking CPS

This is an extension of issue #3 with a more in-depth analysis of the problem.

Basically there are several ways a website can be parsed for the relevant information, but often there are several levels of nesting, and it is often easier to just parse entries in bulk and then process them.

For example:

On site 1 we parse out a list of all books (volume 1 ... x) and put it into the queue; after the fetcher returns the content, we are able to parse the episodes (chapter 1 ... x) and again toss all of these back onto the queue; then we repeat until we finally parse out the actual image link and store it to disk.

This approach works, but it forces you to break up your code into a "callback-alike" mode, in which you do one or more related units of work and then toss the result into a queue with a Tag specifying what needs to happen next with it. This doesn't feel like the nicest way to code things up.

So I thought: why not a form of CPS, in which control passes back to the fetcher every time we ask it to fetch a webpage/image, so that it can do the fetch and then return control to the code.

Now this would work fantastically for code that is very linear, i.e. I parse Volume, then Chapter, then Page in order, store the result to disk, then repeat for the next one. But it does not work very well for what I refer to as "chunky parsing", in which you parse up a batch of volumes, then a batch of chapters from each volume...

Doing it via CPS would then force the user to keep track of each batch and handle the backtracking manually, which seems rather awkward and somewhat self-defeating, since the whole point of CPS was to simplify the code and make it nicer.

So logically the next idea is some form of backtracking CPS (this allows us to tackle it either depth-first or breadth-first, execution-wise). The idea behind backtracking CPS is that it naturally supports both linear and chunky parsing: for linear parsing there is minimal to no backtracking needed, and for the chunky ones it would work like this:

We would first parse out a list of books; then, instead of putting them all on the queue, we fetch the first one and store the rest, and hop right onto the chapter parsing; there we fetch the first chapter and store the rest, then hop onto the page parse. When the page parse is done, we backtrack to the chapter parse and repeat, then eventually backtrack all the way up to the volumes...

This seems to support both the extremely simple page-to-page webcomic and the simpler linear webcomic with a linear volume -> chapter -> page parse, and it also supports the stickiest case: the chunky parser, which parses a big batch and then spawns more batches from each batch, so the queue length explodes.
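
A toy sketch of that idea (all the types and names are made up for illustration): a parse step either finishes or hands the driver a batch of URLs plus a continuation, and the driver descends into the first URL immediately while parking the siblings to backtrack to later.

    type Url     = String
    type Content = String

    -- One step of a parse: either done, or "fetch this batch and continue with k".
    data Step
        = Done
        | Fetch [Url] (Content -> Step)

    -- Depth-first driver: dive into the first URL of every batch, keep the rest
    -- as pending (url, continuation) pairs, and backtrack once a subtree ends.
    runDepthFirst :: (Url -> IO Content) -> Step -> IO ()
    runDepthFirst fetchIO = go []
      where
        go pending Done = case pending of
            []            -> pure ()           -- nothing left to backtrack to
            (url, k) : ps -> descend url k ps  -- backtrack to the next sibling
        go pending (Fetch []     _) = go pending Done
        go pending (Fetch (u:us) k) = descend u k ([(u', k) | u' <- us] ++ pending)

        descend url k pending = do
            content <- fetchIO url
            go pending (k content)

A linear parse only ever produces one-element batches, so pending stays empty; a breadth-first variant would append the parked siblings to the back of pending instead of the front.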

Look into CPS for making the parser stuff much nicer

Right now we have lots of back and forth, passing an opaque object between the fetcher and the parser. We should be able to do some form of CPS in which we can seamlessly parse in what looks like a single function, and at each junction point where we need to fetch something from the network, the parser waits for the data to be streamed back before resuming.
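
pipes' Client/Server machinery already gives roughly this shape. A minimal sketch (the URLs and the fake fetcher are placeholders): the parser reads as one linear function and suspends at every request until the fetcher responds.

    import Pipes.Core

    type Url  = String
    type Page = String

    -- The parser: plain sequential code; each `request` hands control to the
    -- fetcher and resumes once the page body comes back.
    parseComic :: Monad m => Client Url Page m ()
    parseComic = do
        index   <- request "http://example.com/index"  -- placeholder URL
        chapter <- request (head (lines index))        -- pretend line 1 is the chapter link
        _page   <- request (head (lines chapter))      -- ...and line 1 of that is the image
        return ()

    -- The fetcher: answers each request; here a stub that fakes the page body.
    fetcher :: Monad m => Url -> Server Url Page m r
    fetcher url = do
        next <- respond ("<html>" ++ url ++ "</html>")  -- respond returns the NEXT request
        fetcher next

    -- Wire them together; runEffect drives the whole conversation.
    main :: IO ()
    main = runEffect (fetcher +>> parseComic)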

CBZ & CBR support.

This may be a bit of a complicated feature to implement at times, depending on how a page is parsed.

  1. The bulk page parse will probably be handled quite well, i.e. the BFS-style page parse, because we will have a nice stream of just images toward the end, and we can use the chapter/volume change as the trigger to start a new cbz archive. Also this will usually be translated into DFS, so we should have a clear signal upon fetching a new chapter, for example (see the sketch after this list).

  2. The streaming page parser will probably be more of a challenge: we may need to open a cbz file for archiving and keep it open in an LRU cache of open archives, and once we reach a certain limit on open files, start closing the least recently used ones; alternatively, we can close one early if we get a clear signal that the next chapter/volume is starting.
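
For the archive side itself, a minimal sketch assuming the zip-archive package (a .cbz is just a zip of images), flushing one chapter's accumulated pages once the parse signals a chapter boundary; writeChapterCbz is an illustrative name, not existing code:

    import qualified Codec.Archive.Zip as Zip
    import qualified Data.ByteString.Lazy as BL

    -- Pack one chapter's pages (name inside the archive, image bytes) into a .cbz.
    writeChapterCbz :: FilePath -> [(FilePath, BL.ByteString)] -> IO ()
    writeChapterCbz cbzPath pages =
        BL.writeFile cbzPath (Zip.fromArchive archive)
      where
        -- toEntry wants the path inside the zip, a modification time, and the bytes.
        archive = foldr Zip.addEntryToArchive Zip.emptyArchive
            [ Zip.toEntry name 0 bytes | (name, bytes) <- pages ]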

libxml2 parsing -> DOM

20:12:29 <solatis> there goes my elegant parsing code... :)
23:22:22 <solatis> https://github.com/alphaHeavy/libxml-conduit
23:22:28 <solatis> someone beat us to it :)
23:22:35 <solatis> it's not on hackage, but it works pretty well
23:23:21 <solatis> that essentially transforms the awful interface of LibXML.SAX to a nice conduit stream of events
23:24:46 <solatis> and it uses https://hackage.haskell.org/package/xml-types-0.3.4/docs/Data-XML-Types.html for element types
23:24:51 <solatis> which is pretty nice too :)
20:57:14 <solatis> in case you're there, it seems like the complete ecosystem for making a DOM based on libxml2 is already out there
20:57:24 <solatis> just not everything is properly documented / on hackage, but it works
20:57:38 <solatis> https://github.com/alphaHeavy/libxml-conduit
20:57:50 <solatis> that converts a stream of bytes to Data.Xml.Types 
20:58:09 <solatis> then you have a 'proper' stream of XML event types (EventBeginDocument, etc)
20:58:57 <solatis> and guess what we have here:
20:59:00 <solatis> https://hackage.haskell.org/package/xml-conduit-1.2.3/docs/Text-XML.html#g:6
20:59:17 <solatis> fromEvents :: MonadThrow m => Consumer EventPos m Document
20:59:30 <solatis> that one converts those XML events to a Document
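
Putting those pieces together, a sketch of the events-to-Document half using xml-conduit's own stream parser as the event source (libxml-conduit would slot in as the producer of EventPos values instead; its exact API isn't assumed here):

    import           Conduit (runConduitRes, sourceFile, (.|))
    import           Data.Default (def)
    import qualified Text.XML as X
    import qualified Text.XML.Stream.Parse as P

    -- bytes -> Data.XML.Types events -> Text.XML Document
    loadDoc :: FilePath -> IO X.Document
    loadDoc path = runConduitRes $
           sourceFile path        -- raw bytes (a file here; could be the HTTP body)
        .| P.parseBytesPos def    -- ByteString -> stream of EventPos
        .| X.fromEvents           -- EventPos stream -> Document (the fromEvents above)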

Cache Invalidation

Right now the cache is extremely dumb: it just caches all content forever, hashed by... URL, I believe, to disk.

Should update it to a newer form of cache.

  1. Hash the url for static content (images) and store that
  2. Hash the content + url + timestamp for "dynamic" content (html) and store that.

Now for html pages we have the timestamp, so if we refetch we can pull from the cache until it expires; once it expires, we fetch a new version off the site and compare its content hash, and if it's the same we don't bother processing it. Now if it's different, there are two approaches:

  1. Find out the difference and process it
  2. Just process the whole thing, and rely on the fact that if it's existing static content we can check the URL hash to see whether we have already processed it before, and skip it.
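
A sketch of how those two entry kinds and the revalidation decision could be laid out (all names are illustrative, and hashBytes is a placeholder for a real digest such as SHA-256):

    import qualified Data.ByteString.Lazy as BL
    import           Data.Time (UTCTime)

    data CacheEntry
        = StaticEntry  { urlHash :: String }            -- images: keyed by URL hash only
        | DynamicEntry { urlHash     :: String          -- html: URL + content hash + expiry
                       , contentHash :: String
                       , expiresAt   :: UTCTime
                       }

    data Action = UseCache | Refetch | Process | Skip
        deriving Show

    -- refetched is Nothing before we have hit the network, Just body afterwards.
    revalidate :: UTCTime -> Maybe BL.ByteString -> CacheEntry -> Action
    revalidate now refetched entry = case entry of
        StaticEntry{}  -> UseCache                                -- static content never expires
        DynamicEntry{} -> case refetched of
            Nothing
                | now < expiresAt entry -> UseCache               -- still fresh, skip the network
                | otherwise             -> Refetch                -- expired: fetch a new copy
            Just body
                | hashBytes body == contentHash entry -> Skip     -- unchanged, don't reprocess
                | otherwise                           -> Process  -- changed: process the new version
      where
        hashBytes :: BL.ByteString -> String
        hashBytes = show . BL.length  -- placeholder, NOT a real hash; swap in SHA-256 or similar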

Some sort of source list

Probably we can do some sort of quasiquotation or TH and parse a list, something like:

FetcherType Status url

Where you put down the name of a Fetcher type (for the larger aggregation comic sites), the status (i.e. is it still ongoing, i.e. do you want to check for updates or not), and then the root URL of a particular comic.

Then find a neat way to look through the list for entries that are still active, check their cached page/template to see if there are any updates, and fetch more if more exists.
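
A sketch of the plain-text version of that list (a quasiquoter/TH version would build the same values at compile time); the FetcherType constructors here are made up:

    import Data.Maybe (mapMaybe)
    import Text.Read (readMaybe)

    data FetcherType = MangaAggregator | Webcomic  -- hypothetical fetcher backends
        deriving (Read, Show, Eq)

    data Status = Ongoing | Completed
        deriving (Read, Show, Eq)

    data Source = Source
        { sourceFetcher :: FetcherType
        , sourceStatus  :: Status
        , sourceUrl     :: String
        } deriving Show

    -- One "FetcherType Status url" line per comic.
    parseSource :: String -> Maybe Source
    parseSource line = case words line of
        [f, s, url] -> Source <$> readMaybe f <*> readMaybe s <*> pure url
        _           -> Nothing

    -- Everything still worth polling for updates.
    activeSources :: String -> [Source]
    activeSources = filter ((== Ongoing) . sourceStatus) . mapMaybe parseSource . lines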
