
mx-scraper's Introduction

mx-scraper

Notice: This project has been migrated to Deno starting from v4.0.0 because pkg has been deprecated.

Usage

The FlareSolverr proxy feature requires FlareSolverr.

For the headless/headful browser feature, you may want to install the appropriate Chrome or Firefox version here. If a browser is already available locally, you can set it in the configuration file.

mx-scraper --help --verbose
mx-scraper --infos
mx-scraper -h -v
mx-scraper --show-plugins -v
mx-scraper --show-plugins -v -cs
mx-scraper --search-plugin -v http://link/to/a/title
mx-scraper --auto --fetch http://link/to/a/title
mx-scraper --plugin <PLUGIN_NAME> --fetch-all title1 title2 title3
mx-scraper --auto --fetch-all --download --parallel http://link/to/title1 http://link/to/title2
mx-scraper --auto --download --parallel --fetch-file list.txt --meta-only
mx-scraper -a -d -pa -ff list.txt -mo
mx-scraper -a -d -pa -ff list.txt
mx-scraper -v -d --load-plan danbooru.yaml --plan-params TAG=bocchi_the_rock! "TITLE=Bocchi The Rock"

Development

Setup

  1. Download Deno
  2. Install Puppeteer
  3. Install FlareSolverr, or update the configuration file if an instance is already available on your network
  4. Run the following commands to make sure everything is working:
# Cache dependencies
deno cache --config=./src/config.json --lock-write ./src/main.ts
# Testing (some tests require Flaresolverr)
deno test -A --config=./src/config.json ./tests
# Running (dev)
deno run -A --config=./src/config.json ./src/main.ts --infos
# Compiling
# deno compile -A --output mx-scraper --config=./src/config.json ./src/main.ts --is_compiled_binary

Playground

The HtmlParser engine can be used through a local GraphQL client, which is very useful if you want to understand how a web page is generated. A server can be spawned with the --dev-parser flag (available by default at http://localhost:3000/graphql).



mx-scraper's Issues

Implement headless mode

Description

The engine should be able to render pages that are dynamically rendered by JavaScript.

Implementation

  • Use Puppeteer or JSDOM
if (useProxy) {
  return html_generated_by_flaresolverr;
}
// begin: new rendering branch
if (this.renderHTML) {
  return html_generated_by_puppeteer_or_jsdom;
} else {
  return html_generated_by_a_simple_request;
}
// end
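
A minimal sketch of the Puppeteer branch, assuming deno-puppeteer as the dependency (the function name and options are hypothetical, not the project's actual code):

import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

async function renderWithPuppeteer(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // wait until the network is mostly idle so JS-injected content is present
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}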

Use a set to avoid redundant values when fetching a list

Description

Use a set to avoid redundant values when fetching a list; duplicates could result in undefined behavior when the same book is downloaded
concurrently (a minimal fix is sketched below).

Reproduction

  • Repeat a title as many times as you want
    mx-scraper -a -d -pa -fa 123 123 123 (....) 123
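
A minimal sketch of the fix (hypothetical variable names): deduplicate the target list with a Set before scheduling any downloads.

const targets = ["123", "123", "123"];
const unique = [...new Set(targets)]; // ["123"]
// each unique title is now fetched and downloaded exactly once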

Inject Cookies with a custom browser extension

Description

mx-scraper --plugin <name> .... --load-cookies
# will wait until the browser extension makes a POST request containing the cookies
# the cookies will be stored inside `binary_folder/cookies.json`
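
A minimal sketch of the receiving side (the port and overall flow are hypothetical): wait for a single POST from the extension with Deno.serve, then persist the payload.

const cookies = await new Promise<unknown>((resolve) => {
  const server = Deno.serve({ port: 8720 }, async (req) => {
    if (req.method !== "POST") return new Response("waiting", { status: 405 });
    resolve(await req.json());
    queueMicrotask(() => server.shutdown()); // stop listening after one payload
    return new Response("ok");
  });
});
await Deno.writeTextFile("cookies.json", JSON.stringify(cookies, null, 2));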

Use external config

  • Create a config file to override environment.ts
  • If such a file does not exist, create a new file based on the contents of environment.ts
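
A minimal sketch of the fallback logic (DEFAULT_CONFIG is an assumed export of environment.ts):

import { DEFAULT_CONFIG } from "./environment.ts"; // assumed export

async function loadConfig(path = "./config.json") {
  try {
    return JSON.parse(await Deno.readTextFile(path));
  } catch {
    // no external config yet: seed one from the built-in defaults
    await Deno.writeTextFile(path, JSON.stringify(DEFAULT_CONFIG, null, 2));
    return DEFAULT_CONFIG;
  }
}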

Pipe stream bug

Make sure that everything is terminated before moving the folder from 'temp' to 'download'.
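
A minimal sketch of one way to guarantee this (all names are hypothetical): await the pipe itself, which only resolves once the stream has ended and the file is closed, before renaming.

declare const pageUrl: string, tempPath: string, tempDir: string, downloadDir: string;

const res = await fetch(pageUrl);
const file = await Deno.open(tempPath, { write: true, create: true });
await res.body!.pipeTo(file.writable); // resolves once the stream ends and the file is closed
await Deno.rename(tempDir, downloadDir); // only now is the move safe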

CONNTimeout on Ehentai plugin

Description

URLs seem to be valid only for a certain period of time.


Reproduction

  • Try to download any gallery that has a lot of pages
  • Try to download any gallery with a slow internet connection

Add basic documentation

  • Variables/Attributes/Fields names
  • Function names
  • Constant names
  • Using const as much as possible
  • Plugin state (init, config, usage)
  • Request state (Proxy, session_id)

Implement advanced HTML Parser on top of Cheerio

Description

  • Implement advanced HTML Parser on top of Cheerio
    Concept idea:
const parser = createParser();
let r1 = parser.use(html).select('div>a').where('attr.href : my_link & text : %download%');
let r2 = parser.use(html).select('img').where('(attr.src : %.jpg | attr.src : %.gif) & attr.src : @reg /cat(.*)?/i');
let r3 = parser.use(html).select('img').where('attr.src : @not %.gif');

Implementation

  1. Create a lexer/tokenizer to generate tokens (a sketch of this step follows below)
  2. Construct an AST from the tokens
  3. Evaluate the AST
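
A minimal sketch of step 1 (the token shapes are hypothetical, not the project's actual lexer), tokenizing a where expression such as 'attr.src : %.jpg | text : %download%':

type Token =
  | { kind: "colon" | "and" | "or" | "lparen" | "rparen" }
  | { kind: "ident" | "pattern" | "regex"; value: string };

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const c = input[i];
    if (/\s/.test(c)) { i++; continue; }
    if (c === ":") { tokens.push({ kind: "colon" }); i++; continue; }
    if (c === "&") { tokens.push({ kind: "and" }); i++; continue; }
    if (c === "|") { tokens.push({ kind: "or" }); i++; continue; }
    if (c === "(") { tokens.push({ kind: "lparen" }); i++; continue; }
    if (c === ")") { tokens.push({ kind: "rparen" }); i++; continue; }
    if (c === "/") {
      // regex literal: /body/flags
      let j = i + 1;
      while (j < input.length && input[j] !== "/") j++;
      j++; // past the closing slash
      while (j < input.length && /[a-z]/i.test(input[j])) j++; // flags
      tokens.push({ kind: "regex", value: input.slice(i, j) });
      i = j;
      continue;
    }
    // identifier (attr.src, text, @reg, @not) or %-pattern, up to a delimiter
    let j = i;
    while (j < input.length && !/[\s:&|()]/.test(input[j])) j++;
    const word = input.slice(i, j);
    tokens.push(
      word.includes("%")
        ? { kind: "pattern", value: word }
        : { kind: "ident", value: word },
    );
    i = j;
  }
  return tokens;
}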

Generic scraper, query planner

Description

Implement a generic scraper to allow users to write basic scripts

mx-scraper --from-plan query_plan.yaml arg1=val1 arg2=val2 ...

A plan can also be used inside a plugin; the CLI parameters can be specified programmatically.

// some plugin's getBook
const plan = QueryPlan.load("plan.yaml").withParams({
  arg1: "val1",
  arg2: "val2"
});
const book = plan.run((url, index, total, err) => {
  console.log("Parsing", url, index, "/", total, err ? "[Fail]": "[Ok]");
});
assert(book != null);
return book;

Implementation

The idea is to define a fetch plan inside a YAML file.

Case 1: retrieving pictures on a page

version: 1.0.0
target: https://some.link/to/pics
title: Pokemon
filter: 
  select: "img"
  where: "attr.src: %.%  & attr.alt: %Pikachu%"

Case 2: Retrieving pictures on multiple links with the same filter

version: 1.0.0
target: [ "http://link.one", "http://link.two", "http://link.three" ]
filter: 
  select: img
  where: "attr.src: %.jpg"

Case 3: Plan a generic scraper

version: 1.0.0
target: https://some.link?foo={cli_param}&album={counter1}&page={counter2}
title: Some title for the batch (can be inferred from the page)

# applied to each generated target
filter: 
  select: a
  where: "attr.class: %pic"
  linkFrom: href
  followLink:
    filter: "img"
    where: "attr.src: %.jpg"

iterate:
  counter1:
    range: [1, 10]
    onError: break
    each:
      counter2:
        range: [1, 5]
        onError: continue

Precompute a hash to ensure uniqueness of a folder

Description

Unwanted behavior occurs when we download two titles that share the same 70-character prefix, the truncation length defined in Utils.ts/cleanFolderName(title: string).

Reproduction

  • Find titles with the same names or the same 70-character prefixes but different ids
    Example:
    Using eh as the source, try downloading 999636/2760b1ad01 and 2138126/646617f10a at the same time; both have target titles that share the same 70-character prefix, thus triggering an overwrite race.
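
A minimal sketch of a fix (a hypothetical helper; the truncation is a rough stand-in for cleanFolderName): append a short SHA-256 digest of the full title and id so equal prefixes no longer collide.

async function uniqueFolderName(title: string, id: string): Promise<string> {
  const prefix = title.replace(/[^\w \-]/g, "").slice(0, 70); // stand-in for cleanFolderName
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(`${title}:${id}`),
  );
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `${prefix}_${hex.slice(0, 8)}`; // short hash suffix keeps folders unique
}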

Save/Cache metadata in a query_cache folder when using Fetch

Description

  • When a download fails, retrieving the metadata again can take time depending on the plugin
  • Metadata caching would save time and bandwidth depending on the plugin (pagination, captcha, Cloudflare, ...)

Implementation

  • The command --no-cache | -nc is expected to ignore the cached metadata
    Example:
    The command mx-scraper -a -d -f some_book_id
  • Generates a file base_dir/_metacache/{cache_filename}.json such that cache_filename <- sha256(some_book_id + plugin.title)
  • The +plugin.title part is very important because there can be books with the same identifier on different sources
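
A minimal sketch of the lookup (all names are hypothetical; sha256Hex stands for a digest helper like the one sketched in the previous issue):

declare const baseDir: string; // base download directory
declare function sha256Hex(input: string): Promise<string>; // assumed helper

interface Plugin {
  title: string;
  fetchMetadata(id: string): Promise<unknown>; // assumed plugin method
}

async function getMetadata(bookId: string, plugin: Plugin, noCache: boolean) {
  const path = `${baseDir}/_metacache/${await sha256Hex(bookId + plugin.title)}.json`;
  if (!noCache) {
    try {
      return JSON.parse(await Deno.readTextFile(path)); // cache hit
    } catch { /* cache miss: fall through and fetch */ }
  }
  const meta = await plugin.fetchMetadata(bookId);
  await Deno.writeTextFile(path, JSON.stringify(meta));
  return meta;
}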

EHPlugin bug, wrong page index

Description

  • eh pagination starts at index 0.
  • The plugin incorrectly assumes pagination starts at index 1, so the first page it requests is actually page 2.

Reproduction

  • Try downloading {target_url}/g/1837012/a6795d5783
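
A minimal sketch of the corrected loop (galleryUrl, totalPages, and the p query parameter are hypothetical stand-ins inferred from the URL pattern):

declare const galleryUrl: string;
declare const totalPages: number;

// eh pages are 0-indexed, so the loop must start at p = 0, not 1
for (let p = 0; p < totalPages; p++) {
  const pageUrl = `${galleryUrl}/?p=${p}`; // assumed query-parameter format
  // fetch and parse pageUrl ...
}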

Separate mx-scraper-core and the plugins

Separating the core engine from the plugins and QueryPlans into different repos might be a good idea, since the engine does not make any assumptions about what the target endpoint looks like as long as it contains images.

Add a command to show error stacktrace

Description

For convenience, adding a CLI flag to dynamically override the config file would be really helpful when testing.
Example:

  • mx-scraper -things-that-fail --error-stack
  • mx-scraper -things-that-fail -es

Mangatown scraper

Mangatown is relatively easy to scrape, so this should be one of the top priorities.

Add DownloadPlanner

Description

The current downloader looks a bit hacky; refactoring would make debugging and adding new features a lot easier (cookie injection, concurrent page download, collecting errors).

Implementation

Rough idea of how it should look

// Book { chapters, title } -> DownloadPlanner { book, downloads }
// note: book objects are retrieved from the plugins
const planners = books.map((book) => DownloadPlanner.from(book));
const batchDownloader = DownloadBatch.from(planners);

// batchDownloader.flatten(); <- merge book chapters into one (ex: CH_1/1.jpg => CH_1_1.jpg)
// batchDownloader.default(); <- default behavior

batchDownloader.download({ parallel: true })
  .then(({ total }) => console.log(total, "downloaded"))
  .catch(({ ok, _fail, total, errors }) => console.log(`${ok} / ${total} downloaded\n`, errors.join('\n')));

// will not write logs if unspecified
batchDownloader.writeLogs({ infos: "info.log", error: "error.log" });

// events <-> planners <-> books
const events: Events[] = batchDownloader.events();
events[0].onDownload((planner, chapterInfo, pageInfo) => {
  console.log(`[ok] download ${chapterInfo.title} (${chapterInfo.index}) > page ${pageInfo.index}`);
});
events[0].onError((planner, chapterInfo, pageInfo, error) => {
  console.log(`[fail] download ${chapterInfo.bookTitle} > ${chapterInfo.title} (${chapterInfo.index}) > page ${pageInfo.index}`);
  console.log(`   url ${pageInfo.url} > ${error}`);
});

Download in batches to avoid overloading

Description

  • The UI renders incorrectly when there are too many downloads running in parallel.
  • When too many requests are created at once, the host machine risks an IP ban.
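
A minimal sketch of the batching idea (a hypothetical generic helper): await each slice before starting the next, so at most limit downloads are in flight.

async function downloadInBatches<T>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += limit) {
    // at most `limit` downloads run concurrently
    await Promise.all(items.slice(i, i + limit).map(worker));
  }
}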
