webis-de / scriptor


Plug-and-play reproducible web analysis.

License: MIT License

Dockerfile 0.86% JavaScript 97.90% Shell 1.23%

nodejs browser playwright user-simulation web-archiving automation web-analysis

scriptor's Issues

Suggested improvements to snapshot script

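    for (const [i, u] of url.entries()) {
        // Note: in Playwright, BrowserContext.newPage() takes no options, so this
        // userAgent is likely ignored; it would have to be set on the context instead.
        const page = await browserContext.newPage({
            userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
        });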
        promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
            try {
                await page.waitForLoadState('load', { timeout: 10000 });
            } catch (ex) {}

            // Adapt viewport height to scroll height
            await page.waitForTimeout(500);
            await pages.adjustViewportToPage(page, optionsViewportAdjust);
            await page.waitForTimeout(1000);
            try {
                await page.waitForLoadState('networkidle', { timeout: 15000 });
            } catch (ex) {}

            // Take snapshot
            const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
            await pages.takeSnapshot(page, Object.assign(
                {path: path.join(outputDirectory, snapName)}, scriptOptions[SCRIPT_OPTIONS_SNAPSHOT]
            ));
        }));
    }

Nice idea to use domcontentloaded and then wait for load with timeout.

entrypoint.sh unescapes

Does not work:

sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{"url":"https://github.com/webis-de/scriptor"}'

Works:

sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{\"url\":\"https://github.com/webis-de/scriptor\"}'

Test Playwright's network replay

https://playwright.dev/docs/release-notes#network-replay

Unclear at the moment: how does this differ from pywb's output, besides being HAR instead of WARC?

If the same is recorded and one can convert HAR to WARC (https://github.com/webrecorder/har2warc), this might simplify Scriptor considerably.
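A minimal sketch of what such a test could look like, assuming Playwright 1.23 or later and a `browser` object obtained from `chromium.launch()`. The `recordHar` context option and `browserContext.routeFromHAR()` are Playwright's; the helper names and the file name `archive.har` are made up for illustration.

```javascript
const path = require("path");

// Hypothetical helper: context options that make Playwright write a HAR file.
function harRecordingOptions(outputDirectory) {
  return { recordHar: { path: path.join(outputDirectory, "archive.har") } };
}

// Record one page load. The HAR is only written when the context is closed.
async function recordToHar(browser, url, outputDirectory) {
  const context = await browser.newContext(harRecordingOptions(outputDirectory));
  const page = await context.newPage();
  await page.goto(url);
  await context.close();
}

// Replay: serve requests from the HAR and abort everything that is not in it.
async function replayFromHar(browser, url, outputDirectory) {
  const context = await browser.newContext();
  await context.routeFromHAR(path.join(outputDirectory, "archive.har"), { notFound: "abort" });
  const page = await context.newPage();
  await page.goto(url);
  return page;
}
```

Comparing a replayed page against a pywb replay of the same URL would be one way to see what the HAR misses.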

move part of --chain inside the container

mount the chain output directory, not just the run-specific output directory
write and update a file in the chain output directory that specifies the last output directory and how the following output directories should be named (perhaps using a setting that specifies the name of that file); the update should happen only after the run was successful
use that file to determine the input and output directory if it already exists

The idea is that Scriptor can then be run several times with the same command line, automatically continuing where it left off. This will make it easier to use in Kubernetes.
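The state file could be handled roughly as follows. This is only a sketch: the file name `chain.json`, the `run-N` directory naming, and both function names are assumptions, not existing Scriptor behavior.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical file name; the issue suggests making it configurable.
const CHAIN_STATE_FILE = "chain.json";

// Determine input and output directory of the next run from the state file.
// Without a state file this is the first run: no input, output is "run-0".
function nextRun(chainDirectory) {
  const stateFile = path.join(chainDirectory, CHAIN_STATE_FILE);
  const state = fs.existsSync(stateFile)
    ? JSON.parse(fs.readFileSync(stateFile, "utf8"))
    : { last: null, next: 0 };
  return {
    inputDirectory: state.last === null ? null : path.join(chainDirectory, state.last),
    outputDirectory: path.join(chainDirectory, `run-${state.next}`),
    state
  };
}

// Update the state file; call this only after the run was successful.
function commitRun(chainDirectory, run) {
  const state = { last: path.basename(run.outputDirectory), next: run.state.next + 1 };
  const stateFile = path.join(chainDirectory, CHAIN_STATE_FILE);
  // Write to a temporary file first so a crash never leaves a half-written state.
  fs.writeFileSync(stateFile + ".tmp", JSON.stringify(state));
  fs.renameSync(stateFile + ".tmp", stateFile);
}
```

Because the state only changes in `commitRun`, a failed run is simply retried with the same directories on the next invocation.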

Update playwright

Just a reminder to update Playwright before changing the version to 1.0.0

Add Unit Tests

I first need to check unit test frameworks (should they run on GitHub?)

Then I need to come up with a good method for testing runs with docker

And implementing tests will also take a lot of time

Check isScrolledToBottom

This probably needs a similar improvement to getHeight: https://github.com/webis-de/scriptor/blob/main/lib/pages.js#L177

Youtube seems to be a good page to check

some flag to overwrite an existing output directory

--show-browser is not working for docker

To show the browser that is started in the docker image in non-headless mode, I found these solutions:

1. Just do not offer this: you have to run Scriptor without Docker if you need this (but I wanted to avoid running without Docker... now enforce it in some situations?).
2. Mount the respective X resources. On Linux, it seems to be enough to mount certain directories in the container to share the display with the container. (This of course breaks on other operating systems.)
3. Install a VNC server and expose its port. Enlarges the image (but only by 10 MB it seems: not worth it to make a separate image... xvfb is already installed!). Requires a separate program to view the virtual screen (but I think all major OSes have one installed by default; for Ubuntu it is "remmina").

I decided on option 3: it is the least fragile. It even allows remote (password-protected) access to the browser, which might be useful at some point!

Note: I still have to test how the default window manager handles the separate window opened by page.pause(). It might be necessary to install a super-lightweight window manager, which might then cause me to reconsider putting everything into the standard image. After all, it would not be too complicated to add a suffix to the image tag on --show-browser.

snapshot: currently visible elements

async await

We still have some functions that are declared async but return a promise either way. It is probably most reasonable to await that promise in the return statement.
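The difference matters mostly inside try/catch. The following self-contained sketch (the function names are made up, `failing` stands in for any asynchronous operation) shows that a rejection bypasses the function's own catch block unless the promise is awaited in the return:

```javascript
// Stand-in for an asynchronous operation that fails.
const failing = () => Promise.reject(new Error("boom"));

// Returns the promise directly: the rejection passes the try/catch untouched
// and has to be handled by the caller.
async function withoutAwait() {
  try {
    return failing();
  } catch (error) {
    return "handled";
  }
}

// Awaits the promise in the return: the rejection is caught inside the function.
async function withAwait() {
  try {
    return await failing();
  } catch (error) {
    return "handled";
  }
}
```

Outside of try/catch/finally both variants behave the same for the caller, so there the change is mainly about consistency.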

Clean up even on error

Though this is not what I originally intended, someone is now running the entrypoint several times in a row. To support this, it would be necessary to clean up a bit more after an error occurs.

The critical spot seems to be here:

scriptor/lib/scripts.js, line 76 in 4a9329f:

const chainable = (true === await execution);
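One way to make sure the cleanup happens in both cases is a try/finally around that line. A sketch, assuming a hypothetical `cleanUp` callback that releases whatever the run acquired:

```javascript
// "execution" is the promise of the running script; "cleanUp" is a hypothetical
// asynchronous callback that releases whatever the run acquired.
async function awaitExecution(execution, cleanUp) {
  try {
    return (true === await execution);
  } finally {
    // Runs on success and on error; an error still propagates to the caller.
    await cleanUp();
  }
}
```

The caller would then use `const chainable = await awaitExecution(execution, cleanUp);` and still see the original error if the script failed.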

Readme: chaining

Depends on #2

Also includes: continuing a chain using the "start" option

Trace viewer title: context name?

New option in 1.17! https://playwright.dev/docs/api/class-tracing/#tracing-start-option-title

Screenshot animations: disabled

New in Playwright 1.20: https://playwright.dev/docs/release-notes#version-120

'Option animations: "disabled" rewinds all CSS animations and transitions to a consistent state'

Probably awesome, Playwright test now also has more features to compare screenshots...
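Using the option is a one-line change to the screenshot call. A sketch, assuming Playwright 1.20 or later; the function name is made up, and how this would be wired into Scriptor's snapshot options is left open:

```javascript
// Take a full-page screenshot with CSS animations and transitions disabled.
// "page" is a Playwright Page; "file" is the path of the image to write.
async function takeStillScreenshot(page, file) {
  return page.screenshot({ path: file, fullPage: true, animations: "disabled" });
}
```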

Improve getHeight

document.body.scroll is 0 for example for YouTube (overflow:scroll)
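A common workaround is to take the maximum over several height properties instead of relying on `document.body` alone. The sketch below is not the current code in pages.js; `doc` and `win` stand for the browser's `document` and `window`, and `tolerance` is a made-up parameter. It also includes the matching check for isScrolledToBottom. Whether this is sufficient for YouTube is untested; it will not help if the content scrolls inside a nested element.

```javascript
// Height of the page as the maximum of several candidates, so that a body
// reporting 0 does not win.
function getHeight(doc) {
  const body = doc.body;
  const root = doc.documentElement;
  return Math.max(
    body.scrollHeight, body.offsetHeight,
    root.scrollHeight, root.offsetHeight, root.clientHeight
  );
}

// The viewport has reached the bottom if its lower edge is within "tolerance"
// pixels of the page height (scroll offsets can be fractional).
function isScrolledToBottom(win, doc, tolerance = 1) {
  return win.scrollY + win.innerHeight >= getHeight(doc) - tolerance;
}
```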

Readme: developing own scripts

Especially a suitable setup. How can we make this easy for IDEs? I now put Playwright as a dev dependency so that it is not downloaded when one just wants to install the executable (since it comes along with browsers, it is kind of big). But for IDE development I guess it should be downloaded? But if you develop without IDE you do not even need to have Node installed...

Includes also: files.js and pages.js

takeNodesSnapshot - page.evaluate: TypeError: Cannot read properties of undefined (reading 'id')

Possible error, not sure how it was raised exactly.

URL: https://lotustl.com/im/immortal-v4c72-zhong-shan-uses-a-sword-to-release-the-blade/?utm_source=rss&utm_medium=rss&utm_campaign=immortal-v4c72-zhong-shan-uses-a-sword-to-release-the-blade

node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

page.evaluate: TypeError: Cannot read properties of undefined (reading 'id')
    at traverse (eval at evaluate (:3:2389), <anonymous>:25:27)
    at eval (eval at evaluate (:3:2389), <anonymous>:86:5)
    at t.default.evaluate (<anonymous>:3:2412)
    at t.default.<anonymous> (<anonymous>:1:44)
    at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
    at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
    at module.exports._processOne (/script/Script-multi.js:115:19)
    at async module.exports.run (/script/Script-multi.js:54:24)
    at async Object.run (/scriptor/lib/scripts.js:64:31)

Note: I may need to run this URL on its own with the default Script to check reproducibility later.

And the curious thing is that I have wrapped the call of await pages.takeSnapshot(page, optionsSnapshot); in try { } catch (error) { } and it still killed my process.

Decompress gzip encoding within WARC

Maybe there is a Pywb option for this. Since we are storing the WARCs compressed either way there is not much reason to have another layer of compression

Write a hash as an ID to an output directory as the last step in run completion

run hash file
file that contains the hash of the input directory
make output directory read-only
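The three points above could be sketched like this. The file names `input.sha256` and `run.sha256`, the choice of SHA-256, and the function names are assumptions. Only regular files and directories are considered, and only the top-level output directory loses its write permission.

```javascript
const crypto = require("crypto");
const fs = require("fs");
const path = require("path");

// List all files below a directory, sorted by name so the hash is stable.
function listFiles(directory, prefix = "") {
  return fs.readdirSync(directory, { withFileTypes: true })
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .flatMap((entry) => {
      const relative = path.posix.join(prefix, entry.name);
      return entry.isDirectory()
        ? listFiles(path.join(directory, entry.name), relative)
        : [relative];
    });
}

// SHA-256 over the relative names and contents of all files in the directory.
function hashDirectory(directory) {
  const hash = crypto.createHash("sha256");
  for (const relative of listFiles(directory)) {
    hash.update(relative);
    hash.update(fs.readFileSync(path.join(directory, relative)));
  }
  return hash.digest("hex");
}

// Last step of a successful run: write both hash files, then drop write
// permission on the output directory.
function sealRun(inputDirectory, outputDirectory) {
  fs.writeFileSync(path.join(outputDirectory, "input.sha256"), hashDirectory(inputDirectory));
  const runHash = hashDirectory(outputDirectory);
  fs.writeFileSync(path.join(outputDirectory, "run.sha256"), runHash);
  fs.chmodSync(outputDirectory, 0o555);
  return runHash;
}
```

Since the run hash is computed after `input.sha256` was written, it also ties the output to the input it was produced from.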

Readme: replay (did I even test that yet?)

It may also be useful to specify the pywb collection (or the WARC file?) directly: separate option or rather extending the replay option to take a JSON?

--replay '{"readonly":boolean,"collection":path,"type":"warc|pywb"}'
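Parsing such a value could look like the sketch below. The defaults (`readonly: false`, `collection: null`, `type: "pywb"`) and the function name are assumptions; how the existing plain `--replay` flag should map onto this is left open.

```javascript
// Parse the JSON value of --replay and validate the three proposed fields.
function parseReplayOption(value) {
  const defaults = { readonly: false, collection: null, type: "pywb" };
  if (value === undefined) {
    return defaults;
  }
  const parsed = JSON.parse(value); // throws a SyntaxError on invalid JSON
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new TypeError("--replay expects a JSON object");
  }
  const options = Object.assign({}, defaults, parsed);
  if (typeof options.readonly !== "boolean") {
    throw new TypeError("--replay: readonly must be a boolean");
  }
  if (!["warc", "pywb"].includes(options.type)) {
    throw new RangeError("--replay: type must be \"warc\" or \"pywb\"");
  }
  return options;
}
```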

Write docker run log also to output directory

Single browser context

Goal: Simplify output directory structure and scripts

Method:

Single browser context
Single browser context directory on top-level
A scriptor script option to change the browser context directory
One still is able to run several browsers in sequence through chains (changing the script option for the next run)

CLI-Override for browser.json

Change directory options for usage without NodeJS

There is the possibility to hide an option from the help in commander: https://github.com/tj/commander.js/#more-configuration

This might be appropriate here, so that one could still change it in case one needs to run scriptor-run locally.

takeNodesSnapshot - page.evaluate: TypeError: JSON.stringify is not a function

I'm honestly not sure why it happened, but the webpage did not have JSON.stringify defined. Maybe some scraping protection or whatever ...
It is reproducible: type JSON.stringify in the browser dev tools console (Firefox) for the target URL and compare with another one.

URL: http://www.zeonic-republic.net/?page_id=363

Stacktrace:

{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"id":"34790","url":"http://www.zeonic-republic.net/?page_id=363","msg":"snapshot","time":"2022-01-12T19:39:27.443Z","v":0}
{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"old":{"width":1280,"height":720},"new":{"width":1280,"height":7460},"msg":"pages.adjustViewportToPage","time":"2022-01-12T19:39:30.797Z","v":0}
node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

page.evaluate: TypeError: JSON.stringify is not a function
    at traverse (eval at evaluate (:3:2389), <anonymous>:53:23)
    at eval (eval at evaluate (:3:2389), <anonymous>:87:5)
    at t.default.evaluate (<anonymous>:3:2412)
    at t.default.<anonymous> (<anonymous>:1:44)
    at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
    at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
    at module.exports._processOne (/script/Script-multi.js:115:19)
    at async module.exports.run (/script/Script-multi.js:54:24)
    at async Object.run (/scriptor/lib/scripts.js:64:31)

Solution: Probably ignore it for now since it seems to be the exception (pun intended) rather than the norm. Otherwise, call JSON.stringify in the Node.js context, not in the browser page. Not sure about security issues that might arise. But hosters could theoretically redefine other functions, too. So, for now, I would keep it as is, and just know about this issue.
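The alternative could be sketched as follows: the function evaluated in the page returns a plain object, and the string is produced in Node.js. `collectNodes` is a hypothetical stand-in for the traversal in takeNodesSnapshot. Whether Playwright's own transfer of the result is independent of the page's JSON object has not been verified here.

```javascript
// "page" is a Playwright Page; "collectNodes" runs inside the page and must
// return plain, serializable data instead of a JSON string.
async function takeNodesSnapshotInNode(page, collectNodes) {
  const nodes = await page.evaluate(collectNodes);
  // Serialization happens here, with Node's JSON, not the page's.
  return JSON.stringify(nodes);
}
```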

More robust snapshot script

Version of @phoerious

waitForLoadStateWithTimeout should probably be moved into pages.js.

The waiting-and-resizing-part could also be moved into pages.js. Likely helpful also for other scripts.

This script also covers the case of crawling several URLs at once. Not sure yet whether to keep that part. Probably yes.

const fs = require("fs-extra");
const path = require("path");

const { AbstractScriptorScript, files, pages, log } = require("@webis-de/scriptor");

const NAME = "Snapshot";
const VERSION = "0.2.0";

const waitForLoadStateWithTimeout = async (page, event, timeout) => {
  try {
    return await page.waitForLoadState(event, { timeout: timeout });
  } catch (ex) {
    return null;
  }
}

module.exports = class extends AbstractScriptorScript {

  constructor() {
    super(NAME, VERSION);
  }

  async run(browserContexts, scriptDirectory, inputDirectory, outputDirectory) {
    const browserContext = browserContexts[files.BROWSER_CONTEXT_DEFAULT];

    // Script options
    const defaultScriptOptions = {
      viewportAdjust: {},
      snapshot: {
        screenshot: { timeout: 120000 }  // Screenshotting complex pages can take a very long time
      }
    };
    const requiredScriptOptions = [ "url" ];
    const scriptOptions = files.readOptions(files.getExisting(
      files.SCRIPT_OPTIONS_FILE_NAME, [ scriptDirectory, inputDirectory ]),
      defaultScriptOptions, requiredScriptOptions);
    log.info({options: scriptOptions}, "script.options");

    fs.writeJsonSync(path.join(outputDirectory, files.SCRIPT_OPTIONS_FILE_NAME), scriptOptions);

    // Load page(s)
    let url = scriptOptions["url"];
    if (typeof url === "string") {
      url = [url];
    }

    const promises = [];
    for (const [i, u] of url.entries()) {
      const page = await browserContext.newPage();
      promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
        await waitForLoadStateWithTimeout(page, "load", 10000);

        // Adjust viewport height to scroll height to trigger loading dynamic content
        await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);

        // Wait for three networkidle intervals to ensure dynamic content finished loading
        // (counter named n so it does not shadow the URL index i used for the snapshot name)
        for (let n = 0; n < 3; ++n) {
          await waitForLoadStateWithTimeout(page, "networkidle", 3500);
        }

        // Update viewport up to three times to accommodate layout changes and
        // to trigger further dynamic content
        let resizes = 0;
        while (resizes < 3 && page.viewportSize().height !== await pages.getHeight(page)) {
          await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);
          await waitForLoadStateWithTimeout(page, "networkidle", 2500);
          await page.waitForTimeout(250);
          ++resizes;
        }

        // Take snapshot(s)
        const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
        await pages.takeSnapshot(page, Object.assign(
            { path: path.join(outputDirectory, snapName) }, scriptOptions["snapshot"]
        ));
      }));
    }
    await Promise.all(promises);

    return true;
  }
};
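If the waiting and resizing parts are moved into pages.js, they could take roughly this shape. This is a sketch only: `adjustViewportUntilStable` is a made-up name, and `adjust` and `getHeight` are passed in as parameters to keep the example independent of pages.js (in the library they would be `adjustViewportToPage` and `getHeight`).

```javascript
// Wait for a load state, but give up silently after "timeout" milliseconds.
// Returns true if the state was reached and false otherwise.
async function waitForLoadStateWithTimeout(page, state, timeout) {
  try {
    await page.waitForLoadState(state, { timeout });
    return true;
  } catch (error) {
    return false;
  }
}

// Resize the viewport until it matches the page height or "maxResizes" is hit.
// Returns the number of resizes that were necessary.
async function adjustViewportUntilStable(page, adjust, getHeight, maxResizes = 3) {
  let resizes = 0;
  while (resizes < maxResizes && page.viewportSize().height !== await getHeight(page)) {
    await adjust(page);
    await waitForLoadStateWithTimeout(page, "networkidle", 2500);
    ++resizes;
  }
  return resizes;
}
```

Returning a boolean and a resize count instead of nothing lets a script log how stable a page was, which might help when debugging incomplete snapshots.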