scriptor's People

Contributors

johanneskiesel, querela

Forkers

matuskvackaj

scriptor's Issues

Suggested improvements to snapshot script

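    for (const [i, u] of url.entries()) {
        // Note: in Playwright, BrowserContext.newPage() takes no options, so this
        // userAgent is likely ignored; it would have to be set when the context is created.
        const page = await browserContext.newPage({
            userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
        });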
        promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
            try {
                await page.waitForLoadState('load', { timeout: 10000 });
            } catch (ex) {}

            // Adapt viewport height to scroll height
            await page.waitForTimeout(500);
            await pages.adjustViewportToPage(page, optionsViewportAdjust);
            await page.waitForTimeout(1000);
            try {
                await page.waitForLoadState('networkidle', { timeout: 15000 });
            } catch (ex) {}

            // Take snapshot
            const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
            await pages.takeSnapshot(page, Object.assign(
                {path: path.join(outputDirectory, snapName)}, scriptOptions[SCRIPT_OPTIONS_SNAPSHOT]
            ));
        }));
    }

Nice idea to use domcontentloaded and then wait for load with timeout.

entrypoint.sh unescapes

Does not work:

sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{"url":"https://github.com/webis-de/scriptor"}'

Works:

sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{\"url\":\"https://github.com/webis-de/scriptor\"}'
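The difference between the two commands can be illustrated with a small sketch. It assumes (this is not confirmed by the report) that entrypoint.sh applies one extra round of shell-style unquoting to its arguments. The function unquoteOnce is a hypothetical, simplified stand-in for that step, not code from Scriptor.

```javascript
// Hypothetical, simplified stand-in for one extra round of shell-style unquoting:
// a backslash-escaped character is kept literally, a bare double quote is dropped.
function unquoteOnce(text) {
  let result = "";
  for (let i = 0; i < text.length; ++i) {
    if (text[i] === "\\" && i + 1 < text.length) {
      result += text[i + 1];
      ++i; // skip the escaped character
    } else if (text[i] !== '"') {
      result += text[i];
    }
  }
  return result;
}

// Returns whether the text parses as JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (error) {
    return false;
  }
}

// The two --input values from the commands above, as the container receives them.
const plainInput = '{"url":"https://github.com/webis-de/scriptor"}';
const escapedInput = '{\\"url\\":\\"https://github.com/webis-de/scriptor\\"}';

console.log(isValidJson(unquoteOnce(plainInput)));   // quotes are lost
console.log(isValidJson(unquoteOnce(escapedInput))); // quotes survive
```

Under this assumption, the unescaped input loses its double quotes before it is parsed, which would explain why only the escaped variant works.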

move part of --chain inside the container

  • mount the chain output directory, not just the run-specific output directory
  • write and update a file in the chain output directory that records the last output directory and how the following output directories should be named (perhaps with a setting that specifies the name of that file); the update should happen only after the run was successful
  • use that file to determine the input and output directory if it already exists

The idea is that scriptor can then be run several times with the same command line, automatically continuing where it left off. This will make it easier to use in Kubernetes.
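A minimal sketch of how such a state file could work. The file name chain-state.json, its fields, and the run-<n> directory naming are all assumptions made for illustration; none of them exist in Scriptor.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical file name; the issue suggests making this configurable.
const CHAIN_STATE_FILE_NAME = "chain-state.json";

// Reads the chain state, or returns null if the chain has not started yet.
function readChainState(chainDirectory) {
  const stateFile = path.join(chainDirectory, CHAIN_STATE_FILE_NAME);
  if (!fs.existsSync(stateFile)) {
    return null;
  }
  return JSON.parse(fs.readFileSync(stateFile, "utf8"));
}

// Determines input and output directory of the next run from the state file.
function planNextRun(chainDirectory) {
  const state = readChainState(chainDirectory);
  const runNumber = state === null ? 1 : state.lastRunNumber + 1;
  return {
    runNumber: runNumber,
    inputDirectory: state === null
      ? null
      : path.join(chainDirectory, state.lastOutputDirectory),
    outputDirectory: path.join(chainDirectory, `run-${runNumber}`)
  };
}

// Call only after the run was successful. Writes to a temporary file and
// renames it so that an interrupted write does not leave a broken state file.
function recordSuccessfulRun(chainDirectory, plan) {
  const state = {
    lastRunNumber: plan.runNumber,
    lastOutputDirectory: path.basename(plan.outputDirectory)
  };
  const stateFile = path.join(chainDirectory, CHAIN_STATE_FILE_NAME);
  const temporaryFile = stateFile + ".tmp";
  fs.writeFileSync(temporaryFile, JSON.stringify(state));
  fs.renameSync(temporaryFile, stateFile);
}
```

With this layout, repeating the same command would call planNextRun, execute the script, and call recordSuccessfulRun only on success, so a failed run is retried with the same directories.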

Update playwright

Just a reminder to update playwright before changing version to 1.0.0

Add Unit Tests

I first need to check unit test frameworks (should run on GitHub?)

Then I need to come up with a good method for testing runs with docker

And implementing tests will also take a lot of time

--show-browser is not working for docker

To show the browser that is started in the docker image in non-headless mode, I found these solutions:

  1. Just do not offer this: you have to run Scriptor without Docker if you need this (but I wanted to avoid running without Docker... now enforce it in some situations?).
  2. Mount the respective X resources. On Linux, it seems to be enough to mount certain directories in the container to share the display with the container. (does of course break for other operating systems)
  3. Install a VNC server and expose its port. Enlarges the image (but only by 10 MB it seems: not worth it to make a separate image... xvfb is already installed!). Requires a separate program to view the virtual screen (but I think all major OSes have one installed by default; for Ubuntu it is "remmina").

I decided on 3: it is the least fragile. And even allows remote (password-protected) access to the browser... might be useful at some point!

Note: I still have to test how the default window manager handles the separate window opened by page.pause(). It might be necessary to install a super-lightweight window manager, which might then cause me to reconsider putting everything into the standard image. After all, it would not be too complicated to add a suffix to the image tag on --show-browser.

async await

We still have some functions that are declared async but return a promise without awaiting it. It is probably most reasonable to await that promise in the return statement.
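The practical difference shows up with error handling. In this sketch (loadValue and both wrapper functions are made-up names for illustration), only the variant that awaits in the return statement lets its own try/catch see the rejection:

```javascript
// Made-up helper that always fails, standing in for any promise-returning call.
const loadValue = () => Promise.reject(new Error("load failed"));

// Declared async but returns the promise without awaiting it: the function
// has already returned when the promise rejects, so the catch block never runs.
async function withoutAwait() {
  try {
    return loadValue();
  } catch (error) {
    return "recovered";
  }
}

// Awaits in the return statement: the rejection is thrown inside the try
// block and handled by the local catch.
async function withAwait() {
  try {
    return await loadValue();
  } catch (error) {
    return "recovered";
  }
}
```

Here withAwait() resolves to "recovered", while withoutAwait() rejects with the original error and leaves handling to the caller.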

Clean up even on error

Though this is not what I originally intended, someone is now running the entrypoint several times in a row. To support this, it would be necessary to clean up a bit more after an error occurs.

The critical spot seems to be here:

const chainable = (true === await execution);
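One way to guarantee the cleanup is a try/finally around that line. This is only a sketch: runWithCleanup and the cleanUp callback are hypothetical names, and what exactly has to be cleaned up is not specified here.

```javascript
// Hypothetical wrapper: waits for the execution and always runs cleanUp,
// whether the execution resolved or rejected.
async function runWithCleanup(execution, cleanUp) {
  try {
    const chainable = (true === await execution);
    return chainable;
  } finally {
    // Runs on success and on error; the original error is rethrown afterwards.
    await cleanUp();
  }
}
```

The error still propagates to the caller, so the exit code can continue to signal failure while the entrypoint remains in a state where it can be run again.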

Readme: chaining

Depends on #2

Also includes: continuing a chain using the "start" option

Improve getHeight

document.body.scrollHeight is 0 for some pages, for example YouTube (overflow:scroll)
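A possible direction is to take the largest of several height measures instead of relying on the body alone. The function below is a sketch with a hypothetical name; it takes a document-like object so that it can be tried outside a browser. Whether these measures are sufficient for pages like YouTube has not been tested.

```javascript
// Hypothetical replacement for the height computation: returns the largest
// of several standard DOM height properties, or 0 if none is available.
function computePageHeight(doc) {
  const elements = [doc.body, doc.documentElement].filter(Boolean);
  const candidates = [];
  for (const element of elements) {
    candidates.push(element.scrollHeight, element.offsetHeight, element.clientHeight);
  }
  // Ignore missing (undefined) or non-numeric values.
  return Math.max(0, ...candidates.filter(Number.isFinite));
}

// Example with a stand-in for a page whose body reports a height of 0.
const fakeDocument = {
  body: { scrollHeight: 0, offsetHeight: 0, clientHeight: 0 },
  documentElement: { scrollHeight: 5000, offsetHeight: 4000, clientHeight: 720 }
};
console.log(computePageHeight(fakeDocument));
```

In the browser context, the same function could be called with the real document object.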

Readme: developing own scripts

Especially a suitable setup. How can we make this easy for IDEs? I now put Playwright as a dev dependency so that it is not downloaded when one just wants to install the executable (since it comes along with browsers, it is kind of big). But for IDE development I guess it should be downloaded? But if you develop without IDE you do not even need to have Node installed...

Includes also: files.js and pages.js

takeNodesSnapshot - page.evaluate: TypeError: Cannot read properties of undefined (reading 'id')

Possible error, not sure how it was raised exactly.

URL: https://lotustl.com/im/immortal-v4c72-zhong-shan-uses-a-sword-to-release-the-blade/?utm_source=rss&utm_medium=rss&utm_campaign=immortal-v4c72-zhong-shan-uses-a-sword-to-release-the-blade

node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

page.evaluate: TypeError: Cannot read properties of undefined (reading 'id')
    at traverse (eval at evaluate (:3:2389), <anonymous>:25:27)
    at eval (eval at evaluate (:3:2389), <anonymous>:86:5)
    at t.default.evaluate (<anonymous>:3:2412)
    at t.default.<anonymous> (<anonymous>:1:44)
    at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
    at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
    at module.exports._processOne (/script/Script-multi.js:115:19)
    at async module.exports.run (/script/Script-multi.js:54:24)
    at async Object.run (/scriptor/lib/scripts.js:64:31)

Note: I may need to run this URL on its own with the default Script to check reproducibility later.

And the curious thing is that I have wrapped the call of await pages.takeSnapshot(page, optionsSnapshot); in try { } catch (error) { } and it still killed my process.

Decompress gzip encoding within WARC

Maybe there is a Pywb option for this. Since we are storing the WARCs compressed either way, there is not much reason to have another layer of compression.

Readme: replay (did I even test that yet?)

It may also be useful to specify the pywb collection (or the WARC file?) directly: separate option or rather extending the replay option to take a JSON?

--replay '{"readonly":boolean,"collection":path,"type":"warc|pywb"}'
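If the replay option is extended to take JSON, its value could be validated along these lines. The function name, the error messages, and especially the default values are assumptions for this sketch, not decided behavior.

```javascript
const REPLAY_TYPES = ["warc", "pywb"];

// Hypothetical parser for the proposed --replay JSON value.
function parseReplayOption(value) {
  const parsed = JSON.parse(value);
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new TypeError("--replay expects a JSON object");
  }
  // The defaults below are placeholders chosen for illustration only.
  const options = Object.assign(
    { readonly: true, collection: null, type: "pywb" }, parsed);
  if (typeof options.readonly !== "boolean") {
    throw new TypeError("replay.readonly must be a boolean");
  }
  if (options.collection !== null && typeof options.collection !== "string") {
    throw new TypeError("replay.collection must be a path");
  }
  if (!REPLAY_TYPES.includes(options.type)) {
    throw new TypeError("replay.type must be one of: " + REPLAY_TYPES.join(", "));
  }
  return options;
}
```

A call such as parseReplayOption('{"readonly":false,"collection":"/input/collection","type":"warc"}') would return the three settings as an object, and an unknown type would be rejected before any browser is started.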

Single browser context

Goal: Simplify output directory structure and scripts

Method:

  • Single browser context
  • Single browser context directory on top-level
  • A scriptor script option to change the browser context directory
  • One still is able to run several browsers in sequence through chains (changing the script option for the next run)

takeNodesSnapshot - page.evaluate: TypeError: JSON.stringify is not a function

I'm honestly not sure why it happened, but the webpage did not have JSON.stringify defined. Maybe some scraping protection or the like.
It is reproducible: type JSON.stringify in the browser's developer console (Firefox) on the target URL and on any other page, and compare the results.

URL: http://www.zeonic-republic.net/?page_id=363

Stacktrace:

{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"id":"34790","url":"http://www.zeonic-republic.net/?page_id=363","msg":"snapshot","time":"2022-01-12T19:39:27.443Z","v":0}
{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"old":{"width":1280,"height":720},"new":{"width":1280,"height":7460},"msg":"pages.adjustViewportToPage","time":"2022-01-12T19:39:30.797Z","v":0}
node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

page.evaluate: TypeError: JSON.stringify is not a function
    at traverse (eval at evaluate (:3:2389), <anonymous>:53:23)
    at eval (eval at evaluate (:3:2389), <anonymous>:87:5)
    at t.default.evaluate (<anonymous>:3:2412)
    at t.default.<anonymous> (<anonymous>:1:44)
    at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
    at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
    at module.exports._processOne (/script/Script-multi.js:115:19)
    at async module.exports.run (/script/Script-multi.js:54:24)
    at async Object.run (/scriptor/lib/scripts.js:64:31)

Solution: Probably ignore it for now, since it seems to be the exception (pun intended) rather than the norm. Otherwise, call JSON.stringify in the Node.js context, not in the browser page. I am not sure about security issues that might arise, but hosters could theoretically redefine other functions, too. So, for now, I would keep it as is and just be aware of this issue.
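The alternative mentioned above could look roughly like this. The sketch assumes that the traversal can return plain data instead of a string; evaluateToJson and collectNodes are hypothetical names, and the page parameter only needs a Playwright-style evaluate method. Whether Playwright's own transfer of the result is affected by a page that redefines JSON has not been verified here.

```javascript
// Hypothetical helper: lets the page function return plain data and
// serializes it with the JSON object of the Node.js process.
async function evaluateToJson(page, collectNodes) {
  const data = await page.evaluate(collectNodes); // runs in the page
  return JSON.stringify(data);                    // runs in Node.js
}

// Stand-in for a page, used to try the helper without a browser:
// it simply calls the function locally.
const fakePage = { evaluate: async (pageFunction) => pageFunction() };

evaluateToJson(fakePage, () => ({ id: "root", children: [] }))
  .then((json) => console.log(json));
```

The page function must then avoid JSON itself and return only serializable values.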

More robust snapshot script

Version of @phoerious

waitForLoadStateWithTimeout should probably be moved into pages.js.

The waiting-and-resizing-part could also be moved into pages.js. Likely helpful also for other scripts.

This script also covers the case of crawling several URLs at once. Not sure yet whether to keep that part. Probably yes.

const fs = require("fs-extra");
const path = require("path");

const { AbstractScriptorScript, files, pages, log } = require("@webis-de/scriptor");

const NAME = "Snapshot";
const VERSION = "0.2.0";

const waitForLoadStateWithTimeout = async (page, event, timeout) => {
  try {
    return await page.waitForLoadState(event, { timeout: timeout });
  } catch (ex) {
    return null;
  }
}

module.exports = class extends AbstractScriptorScript {

  constructor() {
    super(NAME, VERSION);
  }

  async run(browserContexts, scriptDirectory, inputDirectory, outputDirectory) {
    const browserContext = browserContexts[files.BROWSER_CONTEXT_DEFAULT];

    // Script options
    const defaultScriptOptions = {
      viewportAdjust: {},
      snapshot: {
        screenshot: { timeout: 120000 }  // Screenshotting complex pages can take a very long time
      }
    };
    const requiredScriptOptions = [ "url" ];
    const scriptOptions = files.readOptions(files.getExisting(
      files.SCRIPT_OPTIONS_FILE_NAME, [ scriptDirectory, inputDirectory ]),
      defaultScriptOptions, requiredScriptOptions);
    log.info({options: scriptOptions}, "script.options");

    fs.writeJsonSync(path.join(outputDirectory, files.SCRIPT_OPTIONS_FILE_NAME), scriptOptions);

    // Load page(s)
    let url = scriptOptions["url"];
    if (typeof url === "string") {
      url = [url];
    }

    const promises = [];
    for (const [i, u] of url.entries()) {
      const page = await browserContext.newPage();
      promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
        await waitForLoadStateWithTimeout(page, "load", 10000);

        // Adjust viewport height to scroll height to trigger loading dynamic content
        await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);

        // Wait for three networkidle intervals to ensure dynamic content finished loading
        // Use a separate counter name so that the URL index i is not shadowed
        for (let wait = 0; wait < 3; ++wait) {
          await waitForLoadStateWithTimeout(page, "networkidle", 3500);
        }

        // Update viewport up to three times to accommodate layout changes and
        // to trigger further dynamic content
        let resizes = 0;
        while (resizes < 3 && await page.viewportSize().height !== await pages.getHeight(page)) {
          await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);
          await waitForLoadStateWithTimeout(page, "networkidle", 2500);
          await page.waitForTimeout(250);
          ++resizes;
        }

        // Take snapshot(s)
        const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
        await pages.takeSnapshot(page, Object.assign(
            { path: path.join(outputDirectory, snapName) }, scriptOptions["snapshot"]
        ));
      }));
    }
    await Promise.all(promises);

    return true;
  }
};
