webis-de / scriptor
Plug-and-play reproducible web analysis.
License: MIT License
for (const [i, u] of url.entries()) {
  const page = await browserContext.newPage({
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
  });
  promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
    try {
      await page.waitForLoadState('load', { timeout: 10000 });
    } catch (ex) {}
    // Adapt viewport height to scroll height
    await page.waitForTimeout(500);
    await pages.adjustViewportToPage(page, optionsViewportAdjust);
    await page.waitForTimeout(1000);
    try {
      await page.waitForLoadState('networkidle', { timeout: 15000 });
    } catch (ex) {}
    // Take snapshot
    const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
    await pages.takeSnapshot(page, Object.assign(
      {path: path.join(outputDirectory, snapName)}, scriptOptions[SCRIPT_OPTIONS_SNAPSHOT]
    ));
  }));
}
Nice idea to use domcontentloaded and then wait for load with timeout.
Does not work:
sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{"url":"https://github.com/webis-de/scriptor"}'
Works:
sudo docker run -it --rm --volume ${PWD}/out:/output ghcr.io/webis-de/scriptor:0.4.0 scriptor --input '{\"url\":\"https://github.com/webis-de/scriptor\"}'
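Which of the two forms works likely depends on how the calling shell treats the quotes in the --input value. If the container is started from Node.js, one way to sidestep shell quoting is to pass the JSON as a single argv element. The following is only a sketch: buildScriptorArgs is a hypothetical helper, not part of Scriptor, and the image tag and flags are copied from the commands above (without sudo and -it, since no terminal is attached).

```javascript
const path = require("path");

// Hypothetical helper (not part of Scriptor): builds the argv for `docker run`
// so that the JSON passed to --input is never parsed by a shell.
function buildScriptorArgs(url, outputDirectory) {
  const input = JSON.stringify({ url });
  return [
    "run", "--rm",
    "--volume", `${path.resolve(outputDirectory)}:/output`,
    "ghcr.io/webis-de/scriptor:0.4.0",
    "scriptor", "--input", input
  ];
}

const args = buildScriptorArgs("https://github.com/webis-de/scriptor", "out");
console.log(args[args.length - 1]);
```

The array could then be handed to child_process.execFile("docker", args, callback), which does not spawn a shell by default.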
https://playwright.dev/docs/release-notes#network-replay
Unclear at the moment: how does this differ from pywb's output, besides being HAR instead of WARC?
If the same is recorded and one can convert HAR to WARC (https://github.com/webrecorder/har2warc), this might simplify Scriptor considerably.
The idea is that scriptor can then be run several times with the same command line, automatically continuing where it left off. This will make it easier to use in Kubernetes.
Just a reminder to update Playwright before changing the version to 1.0.0.
I first need to check unit test frameworks (should run on GitHub?).
Then I need to come up with a good method for testing runs with Docker.
And implementing the tests will also take a lot of time.
This probably needs a similar improvement to getHeight: https://github.com/webis-de/scriptor/blob/main/lib/pages.js#L177
YouTube seems to be a good page to check.
To show the browser that is started in the docker image in non-headless mode, I found these solutions:
I decided on 3: it is the least fragile. And even allows remote (password-protected) access to the browser... might be useful at some point!
Note: I still have to test how the default window manager handles the separate window opened by page.pause(). It might be necessary to install a super-lightweight window manager, which then might cause me to reconsider putting everything into the standard image. After all, it would not be too complicated to add a suffix to the image tag on --show-browser.
We still have some functions that are declared async but return a promise either way. It is probably most reasonable to await that promise in the return statement.
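The difference matters mostly inside try/catch: a promise that is returned without await rejects only after the function has left its try block. A minimal sketch, with hypothetical function names that are not taken from Scriptor:

```javascript
// Stand-in for any promise-returning call that fails.
const failing = () => Promise.reject(new Error("boom"));

// Returns the promise directly: the rejection bypasses this catch block
// and reaches the caller instead.
async function withoutAwait() {
  try {
    return failing();
  } catch (error) {
    return "handled";
  }
}

// Awaits in the return: the rejection is caught inside the function.
async function withAwait() {
  try {
    return await failing();
  } catch (error) {
    return "handled";
  }
}

const outcomes = Promise.all([
  withoutAwait().then(() => "resolved", () => "rejected"),
  withAwait().then((value) => value, () => "rejected")
]);
outcomes.then(([a, b]) => console.log(a, b)); // prints "rejected handled"
```

Outside of try/catch both forms behave the same for the caller, so the change is mainly about keeping error handling local.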
Though this is not what I originally intended, someone is now running the entrypoint several times in a row. To support this, it would be necessary to clean up a bit more after an error occurs.
The critical spot seems to be here:
Line 76 in 4a9329f
Depends on #2
Also includes: continuing a chain using the "start" option
New option in 1.17! https://playwright.dev/docs/api/class-tracing/#tracing-start-option-title
New in playwright 1.20: https://playwright.dev/docs/release-notes#version-120
'Option animations: "disabled" rewinds all CSS animations and transitions to a consistent state'
Probably awesome, Playwright test now also has more features to compare screenshots...
document.body.scroll is 0 for example for YouTube (overflow:scroll)
Especially a suitable setup. How can we make this easy for IDEs? I have now put Playwright as a dev dependency so that it is not downloaded when one just wants to install the executable (since it comes along with browsers, it is kind of big). But for IDE development I guess it should be downloaded? But if you develop without an IDE, you do not even need to have Node installed...
Also includes: files.js and pages.js
Possible error, not sure how it was raised exactly.
node:internal/process/promises:246
triggerUncaughtException(err, true /* fromPromise */);
^
page.evaluate: TypeError: Cannot read properties of undefined (reading 'id')
at traverse (eval at evaluate (:3:2389), <anonymous>:25:27)
at eval (eval at evaluate (:3:2389), <anonymous>:86:5)
at t.default.evaluate (<anonymous>:3:2412)
at t.default.<anonymous> (<anonymous>:1:44)
at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
at module.exports._processOne (/script/Script-multi.js:115:19)
at async module.exports.run (/script/Script-multi.js:54:24)
at async Object.run (/scriptor/lib/scripts.js:64:31)
Note: I may need to run this URL on its own with the default Script to check reproducibility later.
And the curious thing is that I have wrapped the call of await pages.takeSnapshot(page, optionsSnapshot); in try { } catch (error) { } and it still killed my process.
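One possible explanation, consistent with the "fromPromise" marker in the trace, is a promise somewhere below the call that is started but not awaited: its rejection is then not tied to the awaited call and surfaces as an unhandled rejection. Whether that is what happened here is not confirmed. The sketch below uses hypothetical stand-ins (evaluate, snapshotFloating, snapshotAwaited) rather than Scriptor's functions, and installs an unhandledRejection listener only so the demonstration does not terminate itself.

```javascript
// Stand-in for a page.evaluate call that fails inside the browser.
const evaluate = () => Promise.reject(new Error("Cannot read properties of undefined"));

// Hypothetical buggy variant: starts the work but never awaits it.
async function snapshotFloating() {
  evaluate(); // floating promise, invisible to the caller's try/catch
}

// Fixed variant: the rejection propagates to the caller.
async function snapshotAwaited() {
  await evaluate();
}

async function guarded(snapshot) {
  try {
    await snapshot();
    return "no error seen";
  } catch (error) {
    return "caught";
  }
}

// Without this listener, recent Node.js versions end the process instead.
const escaped = [];
process.on("unhandledRejection", (reason) => escaped.push(reason.message));

const results = (async () => {
  const floating = await guarded(snapshotFloating);
  const awaited = await guarded(snapshotAwaited);
  // Let the event loop advance so the floating rejection gets reported.
  await new Promise((resolve) => setImmediate(resolve));
  return { floating, awaited, escaped: escaped.length };
})();
results.then((r) => console.log(r));
```

In the floating variant the try/catch sees no error at all, while the rejection is reported separately through the listener.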
Maybe there is a pywb option for this. Since we are storing the WARCs compressed either way, there is not much reason to have another layer of compression.
It may also be useful to specify the pywb collection (or the WARC file?) directly: separate option or rather extending the replay option to take a JSON?
--replay '{"readonly":boolean,"collection":path,"type":"warc|pywb"}'
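A rough sketch of how such a JSON value could be parsed and validated. parseReplayOption is a hypothetical function, and the defaults (readonly false, no collection, type "pywb") are assumptions made for illustration; only the key names come from the proposal above.

```javascript
// Hypothetical parser for the proposed --replay JSON value.
// The default values are assumptions, not Scriptor behavior.
function parseReplayOption(value) {
  const defaults = { readonly: false, collection: null, type: "pywb" };
  const parsed = JSON.parse(value);
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new TypeError("--replay expects a JSON object");
  }
  const options = { ...defaults, ...parsed };
  if (typeof options.readonly !== "boolean") {
    throw new TypeError('"readonly" must be a boolean');
  }
  if (options.collection !== null && typeof options.collection !== "string") {
    throw new TypeError('"collection" must be a path string');
  }
  if (!["warc", "pywb"].includes(options.type)) {
    throw new TypeError('"type" must be "warc" or "pywb"');
  }
  return options;
}

const replay = parseReplayOption('{"readonly":true,"collection":"/output/collection"}');
console.log(replay);
```

Rejecting unknown values early keeps a mistyped option from silently falling back to a default.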
Goal: Simplify output directory structure and scripts
Method:
scriptor
script option to change the browser context directory
There is the possibility to hide an option from the help in commander: https://github.com/tj/commander.js/#more-configuration
This might be appropriate here, so that one could still change it in case one needs to run scriptor-run locally.
I'm honestly not sure why it happened, but the webpage did not have JSON.stringify defined. Maybe some scraping protection or whatever ...
It is reproducible: type JSON.stringify in the browser dev tools console (Firefox) on the target URL and on another page for comparison.
URL: http://www.zeonic-republic.net/?page_id=363
Stacktrace:
{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"id":"34790","url":"http://www.zeonic-republic.net/?page_id=363","msg":"snapshot","time":"2022-01-12T19:39:27.443Z","v":0}
{"name":"scriptor","hostname":"scriptor-with-ceph-0466","pid":1,"level":30,"old":{"width":1280,"height":720},"new":{"width":1280,"height":7460},"msg":"pages.adjustViewportToPage","time":"2022-01-12T19:39:30.797Z","v":0}
node:internal/process/promises:246
triggerUncaughtException(err, true /* fromPromise */);
^
page.evaluate: TypeError: JSON.stringify is not a function
at traverse (eval at evaluate (:3:2389), <anonymous>:53:23)
at eval (eval at evaluate (:3:2389), <anonymous>:87:5)
at t.default.evaluate (<anonymous>:3:2412)
at t.default.<anonymous> (<anonymous>:1:44)
at takeNodesSnapshot (/scriptor/lib/pages.js:337:36)
at Object.takeSnapshot (/scriptor/lib/pages.js:258:25)
at module.exports._processOne (/script/Script-multi.js:115:19)
at async module.exports.run (/script/Script-multi.js:54:24)
at async Object.run (/scriptor/lib/scripts.js:64:31)
Solution: Probably ignore it for now, since it seems to be the exception (pun intended) rather than the norm. Otherwise, call JSON.stringify in the Node.js context, not in the browser page. I am not sure about security issues that might arise, but site operators could theoretically redefine other functions, too. So, for now, I would keep it as is and just be aware of this issue.
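The alternative can be illustrated without a browser. The sketch below uses Node's vm module as a stand-in for a page that has removed JSON.stringify: the "page" code hands back plain data and serialization happens with the host's own JSON object. In Scriptor the page-side part would run inside page.evaluate; whether that fully avoids the page's globals depends on how Playwright transfers the return value, which is not verified here.

```javascript
const vm = require("vm");

// Stand-in for a web page whose scripts have removed JSON.stringify.
const pageScript = `
  JSON.stringify = undefined;
  const node = { id: "n1", children: [{ id: "n2", children: [] }] };
  node; // plain data handed back to the host instead of a JSON string
`;

// Runs in a separate context with its own (now broken) JSON global.
const data = vm.runInNewContext(pageScript);

// Serialization happens with the host's untouched JSON object.
const serialized = JSON.stringify(data);
console.log(serialized);
```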
Version of @phoerious
waitForLoadStateWithTimeout should probably be moved into pages.js.
The waiting-and-resizing part could also be moved into pages.js. It is likely helpful for other scripts as well.
This script also covers the case of crawling several URLs at once. Not sure yet whether to keep that part. Probably yes.
const fs = require("fs-extra");
const path = require("path");
const { AbstractScriptorScript, files, pages, log } = require("@webis-de/scriptor");

const NAME = "Snapshot";
const VERSION = "0.2.0";

// Waits for the given load state, but treats a timeout as non-fatal
const waitForLoadStateWithTimeout = async (page, event, timeout) => {
  try {
    return await page.waitForLoadState(event, { timeout: timeout });
  } catch (ex) {
    return null;
  }
};

module.exports = class extends AbstractScriptorScript {

  constructor() {
    super(NAME, VERSION);
  }

  async run(browserContexts, scriptDirectory, inputDirectory, outputDirectory) {
    const browserContext = browserContexts[files.BROWSER_CONTEXT_DEFAULT];

    // Script options
    const defaultScriptOptions = {
      viewportAdjust: {},
      snapshot: {
        screenshot: { timeout: 120000 } // Screenshotting complex pages can take a very long time
      }
    };
    const requiredScriptOptions = [ "url" ];
    const scriptOptions = files.readOptions(files.getExisting(
      files.SCRIPT_OPTIONS_FILE_NAME, [ scriptDirectory, inputDirectory ]),
      defaultScriptOptions, requiredScriptOptions);
    log.info({options: scriptOptions}, "script.options");
    fs.writeJsonSync(path.join(outputDirectory, files.SCRIPT_OPTIONS_FILE_NAME), scriptOptions);

    // Load page(s)
    let url = scriptOptions["url"];
    if (typeof url === "string") {
      url = [url];
    }
    const promises = [];
    for (const [i, u] of url.entries()) {
      const page = await browserContext.newPage();
      promises.push(page.goto(u, { waitUntil: "domcontentloaded" }).then(async (resp) => {
        await waitForLoadStateWithTimeout(page, "load", 10000);

        // Adjust viewport height to scroll height to trigger loading dynamic content
        await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);

        // Wait for three networkidle intervals to ensure dynamic content finished loading
        // (counter named j to avoid shadowing the URL index i)
        for (let j = 0; j < 3; ++j) {
          await waitForLoadStateWithTimeout(page, "networkidle", 3500);
        }

        // Update viewport up to three times to accommodate layout changes and
        // to trigger further dynamic content
        let resizes = 0;
        while (resizes < 3 && await page.viewportSize().height !== await pages.getHeight(page)) {
          await pages.adjustViewportToPage(page, scriptOptions["viewportAdjust"]);
          await waitForLoadStateWithTimeout(page, "networkidle", 2500);
          await page.waitForTimeout(250);
          ++resizes;
        }

        // Take snapshot(s)
        const snapName = url.length > 1 ? `snapshot-${i}` : "snapshot";
        await pages.takeSnapshot(page, Object.assign(
          { path: path.join(outputDirectory, snapName) }, scriptOptions["snapshot"]
        ));
      }));
    }
    await Promise.all(promises);
    return true;
  }
};
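If the resizing part is moved into pages.js, the loop could be factored into a helper roughly like the one below. This is only a sketch: resizeUntilStable and its three callbacks are hypothetical names, standing in for page.viewportSize(), pages.getHeight(page), and the adjust-then-wait steps of the script above. The fake page is there only to make the example runnable without a browser.

```javascript
// Hypothetical helper: repeats adjust() until the viewport height matches the
// page height or maxResizes is reached, and returns the number of resizes.
async function resizeUntilStable({ getViewportHeight, getPageHeight, adjust }, maxResizes = 3) {
  let resizes = 0;
  while (resizes < maxResizes && getViewportHeight() !== await getPageHeight()) {
    await adjust();
    ++resizes;
  }
  return resizes;
}

// Fake page whose content grows once more after the first resize (lazy loading).
const fakePage = { viewport: 720, heights: [2000, 3500, 3500] };
const callbacks = {
  getViewportHeight: () => fakePage.viewport,
  getPageHeight: async () => fakePage.heights[0],
  adjust: async () => {
    fakePage.viewport = fakePage.heights[0];
    if (fakePage.heights.length > 1) fakePage.heights.shift();
  }
};

const done = resizeUntilStable(callbacks);
done.then((resizes) => console.log(resizes, fakePage.viewport)); // prints "2 3500"
```

Passing the measurements in as callbacks keeps the loop testable on its own, and the cap on resizes guards against pages that keep growing.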