
mx-scraper's Introduction

mx-scraper

Notice: This project has been migrated to Deno starting from v4.0.0 because pkg has been deprecated.

Usage

The FlareSolverr proxy feature requires FlareSolverr.

For the headless/headful browser feature, you may want to install the appropriate Chrome or Firefox version here. If a browser is already available locally, you can set it in the configuration file.

mx-scraper --help --verbose
mx-scraper --infos
mx-scraper -h -v
mx-scraper --show-plugins -v
mx-scraper --show-plugins -v -cs
mx-scraper --search-plugin -v http://link/to/a/title
mx-scraper --auto --fetch http://link/to/a/title
mx-scraper --plugin <PLUGIN_NAME> --fetch-all title1 title2 title3
mx-scraper --auto --fetch-all --download --parallel http://link/to/title1 http://link/to/title2
mx-scraper --auto --download --parallel --fetch-file list.txt --meta-only
mx-scraper -a -d -pa -ff list.txt -mo
mx-scraper -a -d -pa -ff list.txt
mx-scraper -v -d --load-plan danbooru.yaml --plan-params TAG=bocchi_the_rock! "TITLE=Bocchi The Rock"

Development

Setup

  1. Download Deno
  2. Install Puppeteer
  3. Install FlareSolverr, or update the configuration file if an instance is already available on your network
  4. Run the following commands to make sure everything is working:
# Cache dependencies
deno cache --config=./src/config.json --lock-write ./src/main.ts
# Testing (some tests require Flaresolverr)
deno test -A --config=./src/config.json ./tests
# Running (dev)
deno run -A --config=./src/config.json ./src/main.ts --infos
# Compiling
# deno compile -A --output mx-scraper --config=./src/config.json ./src/main.ts --is_compiled_binary

Playground

The HtmlParser engine can be used through a local GraphQL client, which is very useful if you want to understand how a web page is generated. A server can be spawned with the --dev-parser flag (available by default at http://localhost:3000/graphql).



mx-scraper's Issues

Implement headless mode

Description

The engine should be able to render pages that are dynamically rendered by JavaScript.

Implementation

  • Use Puppeteer or JSDOM
if (useProxy) {
  return html_generated_by_flaresolverr;
}
// begin: new rendering branch
if (this.renderHTML) {
  return html_generated_by_puppeteer_or_jsdom;
} else {
  return html_generated_by_a_simple_request;
}
// end
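
A minimal sketch of the Puppeteer branch, assuming deno-puppeteer as the dependency (the function name and options are hypothetical, not the project's actual code):

import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

async function renderWithPuppeteer(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // wait until the network is mostly idle so JS-injected content is present
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}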

Use a set to avoid redundant values when fetching a list

Description

Use a set to avoid redundant values when fetching a list; duplicates could result in undefined behavior when the same book is downloaded
concurrently (a minimal fix is sketched below).

Reproduction

  • Repeat a title as many times as you want
    mx-scraper -a -d -pa -fa 123 123 123 (....) 123
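
A minimal sketch of the fix (hypothetical variable names): deduplicate the target list with a Set before scheduling any downloads.

const targets = ["123", "123", "123"];
const unique = [...new Set(targets)]; // ["123"]
// each unique title is now fetched and downloaded exactly once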

Inject Cookies with a custom browser extension

Description

mx-scraper --plugin <name> .... --load-cookies
# will wait until the browser extension makes a POST request containing the cookies
# the cookies will be stored inside `binary_folder/cookies.json`
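
A minimal sketch of the receiving side (the port and overall flow are hypothetical): wait for a single POST from the extension with Deno.serve, then persist the payload.

const cookies = await new Promise<unknown>((resolve) => {
  const server = Deno.serve({ port: 8720 }, async (req) => {
    if (req.method !== "POST") return new Response("waiting", { status: 405 });
    resolve(await req.json());
    queueMicrotask(() => server.shutdown()); // stop listening after one payload
    return new Response("ok");
  });
});
await Deno.writeTextFile("cookies.json", JSON.stringify(cookies, null, 2));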

Use external config

  • Create a config file to override environment.ts
  • If such a file does not exist, create a new file based on the contents of environment.ts
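
A minimal sketch of the fallback logic (DEFAULT_CONFIG is an assumed export of environment.ts):

import { DEFAULT_CONFIG } from "./environment.ts"; // assumed export

async function loadConfig(path = "./config.json") {
  try {
    return JSON.parse(await Deno.readTextFile(path));
  } catch {
    // no external config yet: seed one from the built-in defaults
    await Deno.writeTextFile(path, JSON.stringify(DEFAULT_CONFIG, null, 2));
    return DEFAULT_CONFIG;
  }
}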

Pipe stream bug

Make sure that everything is terminated before moving the folder from 'temp' to 'download'.
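
A minimal sketch of one way to guarantee this (all names are hypothetical): await the pipe itself, which only resolves once the stream has ended and the file is closed, before renaming.

declare const pageUrl: string, tempPath: string, tempDir: string, downloadDir: string;

const res = await fetch(pageUrl);
const file = await Deno.open(tempPath, { write: true, create: true });
await res.body!.pipeTo(file.writable); // resolves once the stream ends and the file is closed
await Deno.rename(tempDir, downloadDir); // only now is the move safe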

CONNTimeout on Ehentai plugin

Description

URLs seem to be valid only for a certain period of time.


Reproduction

  • Try to download any gallery that has a lot of pages
  • Try to download any gallery with a slow internet connection

Add basic documentation

  • Variables/Attributes/Fields names
  • Function names
  • Constant names
  • Using const as much as possible
  • Plugin state (init, config, usage)
  • Request state (Proxy, session_id)

Implement advanced HTML Parser on top of Cheerio

Description

  • Implement advanced HTML Parser on top of Cheerio
    Concept idea:
const parser = createParser();
let r1 = parser.use(html).select('div>a').where('attr.href : my_link & text : %download%');
let r2 = parser.use(html).select('img').where('(attr.src : %.jpg | attr.src : %.gif) & attr.src : @reg /cat(.*)?/i');
let r3 = parser.use(html).select('img').where('attr.src : @not %.gif');

Implementation

  1. Create a lexer/tokenizer to generate tokens (a sketch of this step follows below)
  2. Construct an AST from the tokens
  3. Evaluate the AST
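
A minimal sketch of step 1 (the token shapes are hypothetical, not the project's actual lexer), tokenizing a where expression such as 'attr.src : %.jpg | text : %download%':

type Token =
  | { kind: "colon" | "and" | "or" | "lparen" | "rparen" }
  | { kind: "ident" | "pattern" | "regex"; value: string };

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const c = input[i];
    if (/\s/.test(c)) { i++; continue; }
    if (c === ":") { tokens.push({ kind: "colon" }); i++; continue; }
    if (c === "&") { tokens.push({ kind: "and" }); i++; continue; }
    if (c === "|") { tokens.push({ kind: "or" }); i++; continue; }
    if (c === "(") { tokens.push({ kind: "lparen" }); i++; continue; }
    if (c === ")") { tokens.push({ kind: "rparen" }); i++; continue; }
    if (c === "/") {
      // regex literal: /body/flags
      let j = i + 1;
      while (j < input.length && input[j] !== "/") j++;
      j++; // past the closing slash
      while (j < input.length && /[a-z]/i.test(input[j])) j++; // flags
      tokens.push({ kind: "regex", value: input.slice(i, j) });
      i = j;
      continue;
    }
    // identifier (attr.src, text, @reg, @not) or %-pattern, up to a delimiter
    let j = i;
    while (j < input.length && !/[\s:&|()]/.test(input[j])) j++;
    const word = input.slice(i, j);
    tokens.push(
      word.includes("%")
        ? { kind: "pattern", value: word }
        : { kind: "ident", value: word },
    );
    i = j;
  }
  return tokens;
}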

Generic scraper, query planner

Description

Implement a generic scraper to allow users to write basic scripts

mx-scraper --from-plan query_plan.yaml arg1=val1 arg2=val2 ...

A plan can also be used inside a plugin; the CLI parameters can be specified programmatically.

// some plugin's getBook
const plan = QueryPlan.load("plan.yaml").withParams({
  arg1: "val1",
  arg2: "val2"
});
const book = plan.run((url, index, total, err) => {
  console.log("Parsing", url, index, "/", total, err ? "[Fail]": "[Ok]");
});
assert(book != null);
return book;

Implementation

The idea is to define a fetch plan inside a YAML file.

Case 1: retrieving pictures on a page

version: 1.0.0
target: https://some.link/to/pics
title: Pokemon
filter: 
  select: "img"
  where: "attr.src: %.%  & attr.alt: %Pikachu%"

Case 2: Retrieving pictures on multiple links with the same filter

version: 1.0.0
target: [ "http://link.one", "http://link.two", "http://link.three" ]
filter: 
  select: img
  where: "attr.src: %.jpg"

Case 3: Plan a generic scraper

version: 1.0.0
target: https://some.link?foo={cli_param}&album={counter1}&page={counter2}
title: Some title for the batch (can be inferred from the page)

# applied to each generated target
filter: 
  select: a
  where: "attr.class: %pic"
  linkFrom: href
  followLink:
    filter: "img"
    where: "attr.src: %.jpg"

iterate:
  counter1:
    range: [1, 10]
    onError: break
    each:
      counter2:
        range: [1, 5]
        onError: continue

Precompute a hash to ensure uniqueness of a folder

Description

Unwanted behavior occurs when we download two titles that share the same 70-character prefix, the truncation length defined in Utils.ts/cleanFolderName(title: string).

Reproduction

  • Find titles with the same names or the same 70-character prefixes but different ids
    Example:
    Using eh as the source, try downloading 999636/2760b1ad01 and 2138126/646617f10a at the same time; both have target titles that share the same 70-character prefix, thus triggering an overwrite race.
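
A minimal sketch of a fix (a hypothetical helper; the truncation is a rough stand-in for cleanFolderName): append a short SHA-256 digest of the full title and id so equal prefixes no longer collide.

async function uniqueFolderName(title: string, id: string): Promise<string> {
  const prefix = title.replace(/[^\w \-]/g, "").slice(0, 70); // stand-in for cleanFolderName
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(`${title}:${id}`),
  );
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `${prefix}_${hex.slice(0, 8)}`; // short hash suffix keeps folders unique
}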

Save/Cache metadata in a query_cache folder when using Fetch

Description

  • When a download fails, retrieving the metadata again can take time depending on the plugin
  • Metadata caching would save time and bandwidth depending on the plugin (pagination, captcha, Cloudflare, ...)

Implementation

  • The command --no-cache | -nc is expected to ignore the cached metadata
    Example:
    The command mx-scraper -a -d -f some_book_id
  • Generates a file base_dir/_metacache/{cache_filename}.json such that cache_filename <- sha256(some_book_id + plugin.title)
  • The +plugin.title part is very important because there can be books with the same identifier on different sources
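
A minimal sketch of the lookup (all names are hypothetical; sha256Hex stands for a digest helper like the one sketched in the previous issue):

declare const baseDir: string; // base download directory
declare function sha256Hex(input: string): Promise<string>; // assumed helper

interface Plugin {
  title: string;
  fetchMetadata(id: string): Promise<unknown>; // assumed plugin method
}

async function getMetadata(bookId: string, plugin: Plugin, noCache: boolean) {
  const path = `${baseDir}/_metacache/${await sha256Hex(bookId + plugin.title)}.json`;
  if (!noCache) {
    try {
      return JSON.parse(await Deno.readTextFile(path)); // cache hit
    } catch { /* cache miss: fall through and fetch */ }
  }
  const meta = await plugin.fetchMetadata(bookId);
  await Deno.writeTextFile(path, JSON.stringify(meta));
  return meta;
}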

EHPlugin bug, wrong page index

Description

  • eh pagination starts at index 0.
  • The plugin incorrectly assumes pagination starts at index 1, so the first page it requests is actually page 2.

Reproduction

  • Try downloading {target_url}/g/1837012/a6795d5783
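
A minimal sketch of the corrected loop (galleryUrl, totalPages, and the p query parameter are hypothetical stand-ins inferred from the URL pattern):

declare const galleryUrl: string;
declare const totalPages: number;

// eh pages are 0-indexed, so the loop must start at p = 0, not 1
for (let p = 0; p < totalPages; p++) {
  const pageUrl = `${galleryUrl}/?p=${p}`; // assumed query-parameter format
  // fetch and parse pageUrl ...
}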

Separate mx-scraper-core and the plugins

Separating the core engine from the plugins and QueryPlans into different repos might be a good idea, since the engine does not make any assumptions about what the target endpoint looks like as long as it contains images.

Add a command to show error stacktrace

Description

For convenience, adding a CLI flag to dynamically override the config file would be really helpful when testing.
Example:

  • mx-scraper -things-that-fail --error-stack
  • mx-scraper -things-that-fail -es

Mangatown scraper

Mangatown is relatively easy to scrape, so this should be one of the top priorities.

Add DownloadPlanner

Description

The current downloader looks a bit hacky; refactoring would make debugging and adding new features a lot easier (cookie injection, concurrent page download, collecting errors).

Implementation

Rough idea of how it should look

// Book { chapters, title } -> DownloadPlanner { book, downloads }
// note: book objects are retrieved from the plugins
const planners = books.map((book) => DownloadPlanner.from(book));
const batchDownloader = DownloadBatch.from(planners);

// batchDownloader.flatten(); <- merge book chapters into one (ex: CH_1/1.jpg => CH_1_1.jpg)
// batchDownloader.default(); <- default behavior

batchDownloader.download({ parallel: true })
  .then(({ total }) => console.log(total, "downloaded"))
  .catch(({ ok, _fail, total, errors }) => console.log(`${ok} / ${total} downloaded\n`, errors.join('\n')));

// will not write logs if unspecified
batchDownloader.writeLogs({ infos: "info.log", error: "error.log" });

// events <-> planners <-> books
const events: Events[] = batchDownloader.events();
events[0].onDownload((planner, chapterInfo, pageInfo) => {
  console.log(`[ok] download ${chapterInfo.title} (${chapterInfo.index}) > page ${pageInfo.index}`);
});
events[0].onError((planner, chapterInfo, pageInfo, error) => {
  console.log(`[fail] download ${chapterInfo.bookTitle} > ${chapterInfo.title} (${chapterInfo.index}) > page ${pageInfo.index}`);
  console.log(`   url ${pageInfo.url} > ${error}`);
});

Download in batches to avoid overloading

Description

  • The UI renders incorrectly when there are too many downloads running in parallel.
  • When too many requests are created at once, the host machine risks an IP ban.
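
A minimal sketch of the batching idea (a hypothetical generic helper): await each slice before starting the next, so at most limit downloads are in flight.

async function downloadInBatches<T>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += limit) {
    // at most `limit` downloads run concurrently
    await Promise.all(items.slice(i, i + limit).map(worker));
  }
}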
