webrecorder / archiveweb.page Goto Github PK

View Code? Open in Web Editor NEW

750.0 20.0 54.0 53.84 MB

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

Home Page: https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd

License: GNU Affero General Public License v3.0

Shell 0.25% JavaScript 99.28% HTML 0.47%

chromium extension web-archiving webrecorder archiving wacz

archiveweb.page's Introduction

ArchiveWeb.page Interactive Archiving Extension and Desktop App

ArchiveWeb.page is a JavaScript based system for high-fidelity web archiving directly in the browser. The system can be used as a Chrome/Chromium based browser extension and also as an Electron app.

The system creates, stores and replays high-fidelity web archives stored directly in the browser (via IndexedDB).

For more detailed info on how to use the extension (and the app when it is available), see the: ArchiveWeb.page User Guide

The initial app release is available on the Releases page

Architecture

The extension makes use of the Chrome debugging protocol to capture and save network traffic, and extends the ReplayWeb.page UI and the wabac.js service worker system for replay and storage.

The codebase for the extension and Electron app is shared, but they can be deployed in different ways.

Requirements Building

To develop ArchiveWeb.page, Node 12+ and Yarn are needed.

Using the Extension

The production version of the extension is published to the Chrome Web Store

For development, the extension can be installed from the wr-ext directory as an unpacked extension. If you want to make changes to the extension, it should be installed in this way. This will be a different version than the production version of the extension.

Clone this repo
Open the Chrome Extensions page (chrome://extensions).
Choose 'Load Unpacked Extension' and point to the ./wr-ext directory in your local copy of this repo.
Click the extension icon to show the extension popup, start recording, etc...

Development Workflow

For development, it is recommended to use the dev build of the extension:

Run yarn install and then yarn run build-dev
Run yarn run start-ext -- this will ensure the wr-ext directory is rebuilt after any changes to the source.
After making changes, the extension still needs to be reloaded in the browser. From the Chrome extensions page, click the reload button to load the latest version.
Click the extension icon to show the extension popup, start recording, etc... The dev build of the extension will be grey to differntiate from the production version.

Using the Electron App (in beta)

The Electron app version is in beta and the latest release can be downloaded from the Releases page

To run the Electron app development build:

Clone this repo.
Run yarn install and then yarn run build-dev
Run yarn run start-electron to start the app.

The Electron App version will open recording in a new window. It is is designed to support Flash, better support for IPFS sharing. However, it is still in development and may not work yet until the initial release is out.

Development workflow

After making changes, rerun yarn run build-dev and yarn run start-electron to load the app.

Standalone Build

To create a platform-specific binary, run:

yarn run pack

The standalone app will then be available in the ./dist/ directory.

archiveweb.page's People

Contributors

Stargazers

Watchers

archiveweb.page's Issues

Can't record google meet

In recorded it nerver ready for login to meeting chat group

Website crashes whole browser during recording

Right now, this website seems to be one the only one I know of where this happened to me. It should be easy to reproduce.

Go to https://www.letgo.com/
Start recording
Whole browser process crashes

I took a look at chrome debug log, couldn't find anything useful. The crash seems to happen fairly quick.

List of tested platforms:

Ubuntu 21.04 - Chromium 90.0.4430.93 - ArchiveWeb.page Extension 0.6.5
Windows 10 - ungoogled-cromium 90.0.4430.93 - ArchiveWeb.page Extension 0.6.5
Ubuntu 18.04.5 - Chromium 90.0.4430.72 - ArchiveWeb.page Extension 0.6.5
Windows 10 - ArchiveWeb.page App 0.6.4

Issue rendering filterable content

Hello!

Thanks for your great work on this project. We have recently switched from using Webrecorder desktop and I'm enjoying exploring archiveweb.page.

I hope this question isn't too specific, but hopefully it will be useful to others dealing with similar content.

I'm trying to capture the (I assume database driven) content at the bottom of these pages:

https://www.therenditionproject.org.uk/prisoners/index.html
https://www.therenditionproject.org.uk/documents/index.html
https://www.therenditionproject.org.uk/flights/renditions/index.html

They seem like they're capturing ok but when you play them back it loads the main page but not the filterable content at the bottom. Strangely the more complicated page here (https://www.therenditionproject.org.uk/flights/flight-database.html) does seem to work ok.

I have also tried using Conifer with the remote browsers that support flash and java but seem to have the same problem.

I've found a few instances on the internet archive where the content does render, though it seems to vary from crawl to crawl: https://web.archive.org/web/20210117163104/https:/www.therenditionproject.org.uk/prisoners/index.html

Anyway, I thought I'd flag it in case it's of interest.

All the best,

Jake

Where is the archive stored?

Where is the archive stored?
How do I change/specify another storage location?

Uncaught SyntaxError: Cannot use import statement outside a module

I'm trying to record this SquareSpace page https://www.martinwong.org/ It seems to record fine with ArchiveWeb.page but on playback I see this error in the devtools console, which does not appear when looking at the page natively:

Uncaught SyntaxError: Cannot use import statement outside a module

There don't appear to be any missing resources (404s) so I'm a bit confused about what is going on.

ZST File Support

I Tried To Open youtubedislikes_20211213070444_9307757f.1638107855.megawarc.warc.zst, But They Couldn't Open The File. Could You Please Support ZST Files? That'll Be Great.

specific webiste latin-like letters appear as garbled

I'm facing a problem when saving website from taobao.com (Simplified Chinese)

here is a sample link:
https://item.taobao.com/item.htm?id=585231745608

The left side of picture show below is correct one (normal view) and right side is the garbled one (archive by extension).

I think it is the website problem rather than the extension encoding, saving properly with Traditional Chinese / Japan / Korean from other website.

Also thank you for hard work this tool save my life.

I found new sample that causing the same issue. (Japanese text)

https://www.hmv.co.jp/artist_きまぐれオレンジ-ロード_000000000548287/item_きまぐれオレンジ☆ロード-あの日にかえりたい-【初回生産限定盤】-ピンク・カラーヴァイナル仕様-アナログレコード_11521901

Is exposing ArchiveWeb.page as a CLI / Node JS library feasible?

In the ArchiveBox community we often get users asking for higher-fidelity alternatives to ArchiveBox, and I always direct them to ArchiveWeb.page but oftentimes they come back saying they need something that works as a CLI command.

If ArchiveWeb.page could expose a oneshot $ archivewebpage 'https://example.com' -o example.com.warc CLI or nodejs interface (similar to how SingleFile exposes a CLI), I think a lot more people would be able to integrate it into their workflows and projects.

SingleFile can be launched from the command line by running it into a (headless) browser. It runs through Node.js as a standalone script injected into the web page instead of being embedded into a WebExtension. To connect to the browser, it can use Puppeteer or Selenium WebDriver. `single-file --

I'd love to hear your thoughts on this / see if it's as easy as SingleFile's approach to provide a CLI interface. If it's too much work / not something you'd be keen on building yourself, even providing an outline of how you think it could be implemented would be helpful (to get an idea of how hard it is / what modifications would need to be made).

I also have some selfish reasoning for wanting this ;) Given how stunningly good ArchiveWeb.page is at saving difficult content compared to all the other options, I've been drawn more and more to the idea of integrating it as an extractor plugin in ArchiveBox (similarly to how we did it with SingleFile).

The way ArchiveBox is architected allows us to use arbitrary 3rd party software as "plugins" to save pages, with varying levels of work required to integrate: Python libraries (easiest), JS/node (medium difficulty), and arbitrary binaries via shell commands (hardest). I've considered integrating pywb in the past, but I decided against it because I didn't want to cause any headaches / ill-will by trying to mix two already complicated pieces of software with significant feature and audience overlap.

Correct me if I'm wrong, but I think ArchiveWeb.page might be a better fit than pywb for plugin-style integration into ArchiveBox. It seems to excel at archiving individual pages or small groups of pages, and focuses less on curating huge collections of many archived pages (e.g. someone's entire browsing history over years). If it works well, I could see it becoming a significant part of the ArchiveBox's value proposition, and I'd be happy to market ArchiveWeb.page significantly in our README and docs, and share a portion of our donations with you to further support ArchiveWeb.page development.

Either way, I think providing a CLI interface would have many good use cases besides integration with ArchiveBox.

Related issues:

unable to replay imported HAR files

Description

HAR files appear to be imported successfully but when I attempt to replay them the page is blank. Inspecting via devtools, no HTML content is returned in the response and not seeing any errors in the console.

Environment

archiveweb.page v0.7.7
replayweb.page v1.5.9

HARs captured with:

Chrome 97
Firefox 96

Steps to Reproduce

In the browser, open the devtools and browse to a webpage. Once the page has loaded, export the content as a HAR.
Go to archiveweb.page, create a new archive for testing purposes.
Import the HAR into the archive then go the Pages view. You should see the page was imported along with related metadata (timestamp, title).
Open the page to attempt replay.

option to arhieve youtube video

App always generate the same random number

Is there any way to get the random number generator run normally? Maybe edit the date manually or something?

Video feedback-like effect when viewing a recorded Wayback Machine link

I know this is a pretty cursed thing to do but on both v0.6.7 extension and app, a weird interaction happens when you record any Wayback archived link and then view it.

Here is the link:
https://web.archive.org/web/20210518075817/https://twitter.com/jack/status/20

option to play archived youtube video when youtube browsing

will donate 23 USD in crypto for such embed feature

Option to archive everything as you go

I know this is somewhat outside of the project scope but I have a suggestion: I usually only ever think to archive something when it's already to late (the content is gone). So it would be great if it would be viable to simply have the recorder on all the time and archive every website visited.

I'm not sure how resource intense this extension is currently - without knowing a lot of technical detail I see two issues

Fetching URLs happens twice, once to view the page and once to archive it. I think in service workers you can intercept requests and maybe store the data directly? I'm sure it's not that simple.
The data storage would grow pretty large (a few gigabytes), not sure how well IndexedDB would handle that.
Deduplication of data. Is data stored by URL or by hash? for this it would probably be good to store by hash to prevent the same data from being stored many times

Archives missing after update from 0.5.11 to 0.6.0

As described in the title, I just updated to the most recent release and now after opening the program it tells me, I have no archives ...

I am on macOS 10.13.6, and just installed the update by replacing the previous version with the new one in the Applications folder. Are my archives still stored somewhere?

Is there a way to record all tabs simultaneously?

I need to make an archive that requires a login (I wish I could use Pywb but the OneLogin service has issues with it) and need to save a whole bunch of links, so pressing the start button over and over again would be very tedious.

Usability/Accessibility after selecting "Start Sharing" button

I created an archive in the app (version 0.7.7 from GitHub releases) and selected the "Start Sharing" button. This brings up a message box (title: "Start Sharing?") in the interface with a bit of text.

I began to read the text but after a few second, it disappears without any interaction from my end. If I click the button again, the same behavior occurs.

It might be difficult for some to completely read through this text before it is abruptly closed without any signal as to the reason.

SSL/TLS Certifcate error (on facebook and instagram)

I found this issue only on facebook and Instagram (but it could replicate in other domains that I don't know at the moment)
On www.google.com, web.whatsapp.com or other SSL/TLS sites, the problem does not seem to appear.

On Facebook, when I click "start" , the secure icon disappears and an "i" appears in its place. when clicking on the "i" a security error appears:

Same issue on Instagram:

This could be a problem in forensics activites :(

PS:
the problem with the certificate does not occur when the acquisition is not active.

Furthermore, there is no intermediate proxy

Incorrect handling of multiple headers

While implementing firefox support, I noticed the following:

The CDP Network.Headers interface returns headers as a JSON object of keys and values, while the webRequest API returns a list of keys and values.

HTTP Headers can appear multiple times though, so how does CDP handle multiple of the same header?

I tested this with a tiny test server, and when the response contains two values for the same header:

HTTP/1.1 200 OK
X-Powered-By: Express
Set-Cookie: foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Set-Cookie: baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 12
ETag: W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE"
Date: Sat, 06 Mar 2021 14:28:57 GMT
Connection: keep-alive
Keep-Alive: timeout=5

The following is stored in the database:

respHeaders:
Connection: "keep-alive"
Content-Length: "12"
Content-Type: "text/html; charset=utf-8"
Date: "Sat, 06 Mar 2021 14:28:57 GMT"
ETag: "W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE""
Keep-Alive: "timeout=5"
Set-Cookie: "foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT, baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT"
X-Powered-By: "Express"

The headers are concatted simply with comma. This seems to be "correct" for most headers, but not all of them. For set-cookie for example, the cookies are mangled and not parseable. This means that the resulting web archive will set mangled cookies.

I found some more info here:
https://stackoverflow.com/a/38406581/2639190

[Feature Request] Support for content blocker

A nice feature would be support for adblockers or content blockers such as uBlock Origin. Some sites have ads that take up too much storage (and screen real estate) for no good reason.

Improve Download UI

In extension/browser, the browser provides a download count, but in the electron app, there's no visible indicator of download progress. Now that larger downloads work better, should include an UI with progress bar, probably similar to IPFS upload progress, when download.

Not able to download Selected Archive & Feature request

Thanks for your help again, just like the title said.

I just think I can not do this since 0.6.0 (Maybe)

Also a question about archive Amazon product page:

https://www.amazon.co.jp/PSP-｢LocoRoco｣オリジナル・サウンドトラック-｢ロコロコのうた｣-ゲーム・ミュージック/dp/B000HEWJK8/

click the link inside the purple box, result in the highlighted with a red box pop up. But I can't figure out how to do this with the extension archive.

Importing archive file to an existing collection doesn't import correctly

The import archive file to existing collection feature is not working for me. I just get weird characters loaded instead of the actual website content. Happens in Chrome extension and downloaded app on macOS.

However, if I import the archive file and create a new collection, then the website content loads fine. I couldn't find this issue reported anywhere yet.

This is an imported archive file (.warc) from Conifer. Same thing happens with multiple different archive files for different websites.

App v0.7.4

Originally posted by @MattSuda in #18 (comment)

Can't download Web Archive

First, thanks a lot for publishing this extension, it makes archiving much more straightforward.

I tried to download a Web Archive totalling 2.56 GB as a wacz file. The download starts but then gets stuck and the network inspector shows a stream of failing requests, as you can see in the attached screenshots.

Downloads Tab:
Inspect > Network:

To see whether this is reproducible, I created a Test Web Archive in which I archived https://www.google.com and https://twitter.com. The archive size was 5.04 MB and it downloaded correctly.

Any idea what could be causing this?

archiveweb.page electron app: recording not possible when running with --remote-debugging-port

Recording seems not to work, when started with e.g. '--remote-debugging-port=9222'.
Therefore automated recording with selenium seems not possible.

Happens on Linux and Windows.

Firefox support and implementation details

So it would be great if this extension could support Firefox. You wrote on HN:

What prevents this from working on non-Chromium-based browsers?
At this point, mostly time constraints maintaining two very different implementations.

The archiving is done via the CDP Fetch domain (https://chromedevtools.github.io/devtools-protocol/tot/Fetch...), as it requires intercepting and sometimes modifying the response body of a request to make it more replayable.

Firefox doesn't current support this yet (https://bugzilla.mozilla.org/show_bug.cgi?id=1587426), although, it does have webRequest.StreamFilter instead (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...), which is lacking in Chromium.

It probably should be possible to achieve this functionality in Firefox, using this API, but would unfortunately require a new implementation that uses WebRequest instead of CDP. But probably worth looking into!

The archive replay using ReplayWeb.page should work in Firefox and Safari.

Edit: Another limitation on Firefox is lack of service worker support from extensions origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1344561). This is needed to view what you've archived in the extension. Would need to work around this issue somehow until that is supported, so probably a bit of work, unfortunately.

I tried to start implementing it, though I didn't get that far. So here's how I understand the structure of the code and my thoughts for now:

From the popup when pressing "record", in bg.js the startRecorder function is called, which creates a new BrowserRecorder. BrowserRecorder extends Recorder with a few browser-specific things (as opposed to ElectronRecorder). This recorder attaches the Chrome debugger to the currently focused tab, and adds handlers for a lot of different events:

Network.enable are used to intercept and save metadata of request into RequestResponseInfo objects. Finally, in the Network.loadingFinished handler the request info object is saved to the IndexedDB.

This functionality seems very similar to the https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest API that seems to be supported in Chrome as well. What's missing from that API here?
Fetch.enable blocks all requests, then uses Fetch.getResponseBody to get the request and response body of each network request and continues them. In some special cases, the responses are modified before being continued in rewriteResponse. These rewrites are mostly from wabac.js. Some of them make sense to me like change video resolution. But there's also stuff like JS rewrites. Are these run during capture or only during replay?

This functionality could maybe be replaced with the StreamFilter API in firefox. Seems simple to use, example at the bottom. Sadly not supported in Chrome, so would need separate code paths.
Media.enable for a single special case of handling some "kLoad" event. Seems to be related to watching for video / audio load events (chromium source code). Maybe could be replaced with just listening to the media events client side in the autofetcher.js injected script with JS Media events?
Page.enable. This is to handle page navigation events etc. These seem like they should all be supported by the normal webextension apis like webNavigation?

There seems to be four methods how requests are fetched:

The requests the page itself makes, intercepted by Fetch.getResponseBody
Some files that the webiste refers to (images etc) in the injected autofetcher.js script. The responses are thrown away since they are then intercepted by (1) as well.
In doAsyncFetchInBrowser a fetch script is injected into the page to request the full data. The data is also captured in (1). This is called for popup windows.
doAsyncFetchDirect. Used to work around partial (HTTP 206) responses, favicons, and the kLoad media event special case. The data here is directly captured and written to the DB instead of going through 1 like all the other methods.

Since 1-3 all use Fetch.getResponseBody, adding StreamFilter should make those work on Firefox. 4. directly uses the fetch api, which is supported in Firefox anyways.

For Firefox support, these things would need to be changed:

recorder.js and browser-recorder.js both are pretty tightly bound to the chrome debugger protocol. Those parts of the code would probably have to be abstracted out into a separate class.
The StreamFilter API would be an alternative implementation to the chrome debugger Fetch.* api, used to get and modify the actual payload / body of the request responses.
All the other chrome api calls could be investigated if they are actually needed or could be replaced with the chrome.webRequest api. Maybe not possible because that strips some security related headers?
Regarding the missing support for service workers in the extension, I've encountered that issue before, and I hope they'll fix that. I think for now the easiest workaround would be to just have the warc export button without the interactive browser. Or have a hosted website (or static html file) that can access the data from the extension via an API?

// example of StreamFilter
browser.webRequest.onBeforeRequest.addListener(
  listener,
  {urls: ["<all_urls>"], tabId: this.debuggee.tabId},
  ["blocking"]
);

function listener(details) {
  let filter = browser.webRequest.filterResponseData(details.requestId);

  let data = [];
  filter.ondata = event => {
    data.push(event.data);
  };

  filter.onstop = async event => {
    let blob = new Blob(data, {type: 'application/octet-stream'});
    let payload = await blob.arrayBuffer();
    // do filter stuff as in chrome code
    filter.write(payload);
    filter.close();
  };
}

Desktop app: "Play" and "Download CDX" buttons on https://play.date/pulp/ don't work

Nothing seems to happen (even in the network log in developer tools) when these buttons are pressed.

(Note that this webapp requires a free login)

Chrome >=94 breaks replay in extension.

Unfortunately, with Chrome >=94, the default csp can not be customized, and blocks all inline scripts, causing replay to break within the extension. The only possible workaround seems to be disable csp via the CDP for the tab, which is not ideal.

The only other workaround is to not replay within the extension, but within a separate domain, which is a lot of work, similar to what's described in #16 for firefox.

This has also been reported on Chromium bug tracker, but probably won't be fixed in time for Chrome 94 release:
https://bugs.chromium.org/p/chromium/issues/detail?id=1245803

Archives don't load

In the chrome extension, the "Current Web Archives" section is stuck on "Loading Archives...". Also, after (trying to) record a page, "View Recorded Page" brings you to a blank page.

Error: Cannot access a chrome-extension:// URL of different extension

On https://ofam.3dprostir.com/, I'm getting an error

_Sorry, there was an error starting recording on this page. Please try again or try a different page.

Error Details: Cannot access a chrome-extension:// URL of different extension

If the error persists, check the Common Errors and Issues page in the guide for known issues and possible solutions._

Another person confirmed the same error on the same page.

Consider refactoring into two libraries

This repo is becoming complex and now has several deployments, including the extension, electron app, and embedded record mode, and a CDP-protocol wrapper for recording

Perhaps it would make sense to refactor this into two libraries:

a core 'recorderlib' which consists of a service worker + CDP and fetch recording library + ipfs writing functionality that extends wabac.js and offers recording mode, using CDP and/or fetch, WACZ/WARC serialization, signing. Could be used in both browser, electron and plain node.
the ui, which extends replayweb.page, and use recorderlib to perform the recording. The various UI manifestations will continue to live here

Can't download images from archive

Hi,
I just discovered this extension and since I wanted to move my homepage and redo it on wordpress this was perfect. I have taken all the pages and then installed wordpress on it.

Now I have seen that every time I want to save the images always a network error is displayed. I can copy and paste the images, but not save. Is there any way I can save them so that I can download all the images at once using the Download all Images extension? There are quite a few images which is why copy and paste does not work. Or are the images stored somewhere on my PC where I can copy them all out at once?

Usage during capture page 404 when trying to archive an unsupported protocol

Hi, with the latest agregore having both ssb rendering and archiveweb.page added in extensions, I thought why not try archiving an ssb page.

Can't record this page.

Which is not unexpected, but I wondered what the user would do next, so clicked the ? icon. It directs to https://archiveweb.page/guide/usage/during-capture.html but the server returns 404.

Super excited to see both ssb rendering and webrecorder in agregore though! Great work on the initial release :)

Use .warc.gz extension for WARCs inside the downloaded WACZ

To avoid any confusion that the WARCs in the downloaded WACZ are compressed set the warc filename to data.warc.gz instead of data.warc

Document where the desktop application stores data

I am going to work with a number of larger collections that I want to store on a external drive. At what path does the desktop application store data? Ideally this could be user-defined, but for now the information of where that path is would be enough to use a mount.

Using the AppImage, I located these directories on my system, but couldn't find any warcs or wacz files inside?

despens@slice:~/.config$ ls -l | grep 'page'
drwx------  3 despens despens     3 Mär 26 07:56 archiveweb.page
drwx------ 15 despens despens    23 Apr 10 08:50 archivewebpage
drwx------  5 despens despens     8 Apr 10 08:48 ArchiveWeb.page

Without installing Chrome extension

I mostly using mobile, and nowadays more traffic come from mobile than desktop, as you already know Chrome android cannot install extension, I hope in the future creating WARC without installing anything will be possible.

Currently using Conifer but it record my username and collection name in the WARC metadata can you make anonymous archiving like in the old wbrc.io or webrecorder.io available again?

Saving page title incorrectly on specific website

I only found this issue on https://jp.mercari.com/ recently, due to its website design change.

it saved with hyperlink only instead with real page title, I can solve this by saved it over and over again, just randomly.

According to my search, it may cause by Shadow DOM, because this website also broke other Chrome extension, for example: Distill / Weava Highlighter / WorldBrain's Memex etc.

I appreciate your kind assistance again, Thank you.

Desktop app: do not fullscreen windows

When starting a capturing session, the desktop app opens a full screen window every time. Depending on the user's display parameters that might be undesirable—for instance on my 27" display it creates an enormous window covering everything 😉

All operating systems offer enough ways to manage windows, so users should be able to switch to fullscreen if they like, and the desktop app should ideally remember users' preferred window size.

The full screen button right next to the "back" navigation button has also thrown me off a few times when accidentally hitting it. I'd like to advocate to remove this button from the app.

extract files from the warc archive

Hi there,
i was wondering, how can i extract the files (images,....etc) from the warc archive?

Is it possible to download web archives as WARC files instead of WACZ?

Either warc or warc.gz?

Ability to hide entry from the page context menu

It would be nice to see an option to change visibility of the "View Web Archives" item.

Use Electron App with chromedriver/selenium

Hi,

It looks like the Electron App delivers best results in recording iframed youtube videos.
That's why I am looking for a way to control it via chromedriver/selenium. This would enable me to use it for crawl automation.

Many thanks and BR,
Stephan

Need a way to standardize IPFS storage write/read (eg. Wrap IPFS interfaces in js-ipfs-fetch?)

At the moment there seems to be several ways to interact with IPFS:

Initializing a js-ipfs node
Using js-ipfs-http-client either to a local gateway or some brave-specific stuff
Using web3.storage to publish to IPFS using CAR files
Using public gateways to download data from IPFS
Built in protocol handlers in Agregore

I propose unifying these under a fetch interface based on js-ipfs-fetch

I'm thinking we could have something along these lines:

Detect Agregore, return the global fetch() instance
Detect Brave, create a js-ipfs-fetch instance from either the window.ipfs object or by talking to the local gateway
Detect config options for a public gateway + web3.storage authentication tokens, wrap the two using a more limited fetch interface (no IPNS?)
If all else fails, set up a local js-ipfs node and pass it to js-ipfs-fetch

If I understand correctly, the main things we need is the following:

GET requests on IPFS URLs (with range support)
POST requests to publish several files into a directory and to get back a CID
Some sort of equivalent to pinning (less certain on how to best approach this point)

This post is more to track this idea for an eventual implementation.

WACZ download fails on Linux appimage version (was: Allow defining "pages" in archive)

If one creates a sizable file (e.g. a 60 GB WARC or WACZ), it can be incredibly challenging to find the "home" page of your archive. It would be useful to be able to configure this.

As an example, upon loading a 60 GB WARC into a local electron version of replayweb.page AppImage, it says in the Page tab,

0 Pages Found
----
No "Pages" are defined in this archive. Browser by URL.

A similar experience occurs in the old Webrecorder-Player AppImage.

Add a smart dedupe feature

If one is recording a site of video content (especially video content which repeats upon, say, a reload or clicking on the link again), the files become huge. Having the ability to intelligently deduplicate already copied info would be useful.

There are certainly use cases where one might want multiple copies of the same page, but as an example, here's a short scenario where one might not, very similar to my experience:

   ACT I :

You're a diligent young archivist and have been visiting a website called videosite.com and hold an account there (it isn't necessary, but having an account lets you visit more pages). This is a website with streaming educational video content loaded from .m3u8 playlists that is not easily crawled through other means.

You start recording, go to videosite.com, the homepage of the site, and log into your videosite.com account.

You click the first link on the page, which takes you to videosite.com/page/1. There are ten pages in total. You click play on a few of the videos and then when the video stream finishes buffering through, you click "Next" and go to videosite.com/page/2.

You repeat this a few times until you hit page 10, and then you click on a user profile. You notice that when you click one of his favorite videos, it takes you to videosite.com/page?id=9 which looks suspiciously like videosite.com/page/9. You then click play on all the videos on the page and let archiveweb.page take care of the rest. You click "next" and now you are at videosite.com/page?id=10, which again looks suspiciously like videosite.com/page/10.

You've seen all these videos, so you hit the back button, and end the recording.

  ACT II:

Oddly, the filesize is pretty big. Oh well. You export a .WARC file. You decide to open it up with your internet off and see if you can view the pages right. You click on videosite.com/page/1. All good. You click on videosite.com/page/4...and you see...nothing except an error message!

You think, "WTF? Page 4 wasn't saved? Well, wait, maybe if I go back to page 1 and then click through it will work."

It does. But the reverse does not. You only went forward. Never back. And when you go to the profile from earlier and click a different favorite video, it tries to load videosite.com/page?id=6 but it also isn't there.

  ACT III:

You decide to try to do this right. This is your favorite site to learn from. You go from page/1 to page/10. You go from page/10 to page/1. You go to page?id=1 to page?id=10 and back.

You play everything on each page just in case not playing it means it won't load later. You end the recording.

  ACT IV:

The file has now nearly quadrupled in size from your original. All the JS, the TS videos, the css, etc. for each page has been copied in quadruplicate. And there's nothing you can do except try again.

How to add new session to existing collection

Hi.
I've previously been using the Webrecorder and Webrecorder Player to record and replay webcollections on my Mac (Big Sur), but WebRecorder stopped working on my Mac, and I discovered that you had been busy :) I have a couple of collections that I add sessions to from time to time, but how is it possible to add sessions to existing collections (local WARC-files) with ArchiveWeb.page.app (desktop-version)? Am I missing something? Thanks! /Claus

Add ability to remove files/store them externally

I am recording the download process of a file. But I'd like to exclude the actual file from the WACZ, and but keep everything else (include the metadata about the removed file, such as HTTP response headers).

Support importing existing WARC/WACZ files

Support importing WARC and WACZ into new archives/collections
Support importing into existing collections.

To support full editing, the import will be a 'full import', where all data is fully stored in the db, slightly different than replayweb.page replay, where the data is loaded on-demand.

Archiveweb.page app crashes while recording specific page

Hi,

I tried to record the following page:

https://robertdavidsteele.com/arise-update-battle-mountain-nv-first-constitutional-county/

Unfortunately Webrecorder crashes while recording this page :-(

Please fix

Greetings

Steve

Support client-zip to have zip64 support and valid downloads of 4GB> WACZ files

Currently WACZ output is limited to around ~4GB size WACZ files, due to lack of zip64 in jszip.
Will need to switch to zip.js to be able to create larger WACZ files.