iftechfoundation / ifarchive-unbox

IF Archive Unboxing service

Home Page: https://unbox.ifarchive.org

License: MIT License

JavaScript 91.58% Dockerfile 1.33% Shell 7.09%

interactive-fiction

ifarchive-unbox's People

Contributors

Stargazers

Watchers

Forkers

dfabulich erkyrath

ifarchive-unbox's Issues

Use full timestamps rather than just the day

We should use full timestamps rather than just the day a file was updated, as there's a small chance that could cause issues.

Support unicode file names

I'm running into a locales issue. Asked about it here: https://stackoverflow.com/q/70388840/2854284

Doesn't support files with spaces

Can't handle zipped files with [, ], or ^ in the filename

You know how I said that I'd leave

await exec(`unzip -p ${zip_path} '${escape_shell_single_quoted(file_path)}' | file -i -`)

as-is until it caused a problem? It causes a problem.

In the logs, I see two examples:

Error: unzip|file error: caution: filename not matched:  platypus/options/Icon^M
Error: unzip|file error: caution: filename not matched:  4th1hrComp/agent_4F[1].A.taf

I believe both are caused by the shell getting confused by filenames.

I'm not interested in playing whack-a-mole with shell escapes. We need to use execFile().

(Reading the data and then writing it into a separate execFile('file') is okay.)
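A minimal sketch of that approach follows. It uses the synchronous child_process calls for brevity; a server would more likely use the promisified async variants. The names run_no_shell and identify_member are hypothetical, and the maxBuffer value is an assumption.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: run a program with an argument array (no shell involved),
// optionally feeding `input` to its stdin. Returns stdout as a Buffer.
function run_no_shell(command, args, input) {
    return execFileSync(command, args, {
        input,
        maxBuffer: 64 * 1024 * 1024, // assumed limit; tune for real archive members
    });
}

// Hypothetical replacement for the `unzip -p ... | file -i -` shell pipeline:
// read the member's bytes first, then hand them to a separate `file` process.
function identify_member(zip_path, file_path) {
    const data = run_no_shell('unzip', ['-p', zip_path, file_path]);
    return run_no_shell('file', ['-i', '-'], data).toString().trim();
}
```

Because no shell parses the arguments, quotes, spaces, and $ reach the child process unchanged. Note that, according to its man page, unzip itself treats member names as wildcard patterns, so a name containing [ may still need separate handling.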

Remove cache items when limits exceeded

We can't catch everything, so server redirects are still essential, but we can rewrite basic HTML links, script inclusions, image sources, etc., so that they point to the main domain rather than the subdomain.

Add script to purge file from cache

Make a CLI script that

gets the list of individual file URLs
shuts down the docker containers
purges the local cache entries (for both the app and nginx)
asks cloudflare to purge its cache
starts the containers

Tar and unzip seem to fail sometimes

list_contents() is failing on some files. Examples that I see:

infocom/compilers/inform6/library/old/inform_library61.tar.gz

Error: tar error: tar: A lone zero block at 533

games/pc/hallowee.zip

Error: Command failed: unzip -Z1 /home/data/cache/2obskzspcc.zip warning [/home/data/cache/2obskzspcc.zip]: 128 extra bytes at beginning or within zipfile (attempting to process anyway)

In both cases, messages appear on stderr but the files unpack correctly anyhow.

I think the correct path is to rely on the exit status rather than stderr. Messages on stderr should be logged, but should not throw errors.

There's a nuisance factor in that tar's exit status is 0 for success, 1 for error. unzip has a big list of exit statuses (see man page); it boils down to 0 for success, 1 for success-with-warnings, higher values for error. So we have to check those values separately for tar vs unzip.
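A rough sketch of the exit-status check described above. The threshold table encodes only what this issue states (tar: 0 is success; unzip: 0 or 1 is success). The names is_success and run_checked are hypothetical, and unknown commands are assumed to succeed only with status 0.

```javascript
const { spawnSync } = require('child_process');

// Highest exit status still treated as success, per the issue text.
const MAX_OK_STATUS = { tar: 0, unzip: 1 };

function is_success(command, status) {
    // Assumption: any command not listed above succeeds only with status 0.
    const max_ok = command in MAX_OK_STATUS ? MAX_OK_STATUS[command] : 0;
    // status is null when the process was killed by a signal.
    return Number.isInteger(status) && status <= max_ok;
}

function run_checked(command, args) {
    const result = spawnSync(command, args, { encoding: 'utf8' });
    if (result.error) throw result.error; // e.g. the command was not found
    // Messages on stderr are logged but never treated as failure on their own.
    if (result.stderr) console.warn(`${command} stderr: ${result.stderr.trim()}`);
    if (!is_success(command, result.status)) {
        throw new Error(`${command} failed with exit status ${result.status}`);
    }
    return result.stdout;
}
```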

Get file contents in batches when app starts

When the app first starts it spawns zip processes to get all the contents of all files in the cache. If there are lots of cached files, the processes fail. Could be running out of memory or something?

The app does start when there are fewer cached files (80 works), so I'm guessing that spawning the zip processes in batches will work. The server is only single-core anyway, so while a little parallel processing might help, running all of these at once wasn't helping in the first place. It was just simple code to write: Promise.all(files.map(...))
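One way to batch the work, as a sketch: map_in_batches is a hypothetical helper, and the batch size is something to tune rather than a known-good value.

```javascript
// Hypothetical helper: like Promise.all(items.map(fn)), but with at most
// `batch_size` calls in flight at any time. Results keep the input order.
async function map_in_batches(items, batch_size, fn) {
    const results = [];
    for (let i = 0; i < items.length; i += batch_size) {
        const batch = items.slice(i, i + batch_size);
        // Wait for the whole batch before starting the next one.
        results.push(...await Promise.all(batch.map(item => fn(item))));
    }
    return results;
}
```

Usage would look like `await map_in_batches(files, 20, list_contents)`, where 20 is an arbitrary illustrative value.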

Semi-smart "Start" button?

When showing a file list, if there's an index.html, we could have a prominent "Start" button that redirects to it. Similarly if there's exactly one .html file.

IFDB won't need this, but it would smooth out the experience of ifarchive.org links.
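The selection rule could be sketched like this. pick_start_file is a hypothetical name, and treating only a top-level index.html as the preferred match is an assumption.

```javascript
// Hypothetical helper: choose the file a "Start" button should open, or null.
// `contents` is the list of paths inside the archive.
function pick_start_file(contents) {
    // Assumption: only a top-level index.html counts as the obvious entry point.
    if (contents.includes('index.html')) return 'index.html';
    // Otherwise, offer a button only if there is exactly one .html file.
    const html_files = contents.filter(p => /\.html$/i.test(p));
    return html_files.length === 1 ? html_files[0] : null;
}
```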

HTML charset not identified correctly

A file like this is identified as us-ascii, when it really needs to be UTF-8: https://2p287be0si.unbox.ifarchive.org/2p287be0si/IFComp2015/Games/Cape/dist/index.html

It doesn't get identified as UTF-8 because the HTML page itself doesn't contain any non-ASCII characters, but the JS does (or higher characters get inserted by JS, I'm not sure exactly which.)

For HTML files we should check for a <meta charset> tag and use it if present.
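A sketch of such a check, assuming the file's bytes are available as a Buffer. detect_meta_charset is a hypothetical name, the regex is a simplification rather than a full HTML parser, and looking only at the first 1024 bytes is an assumption borrowed from how browsers prescan for encodings.

```javascript
// Hypothetical helper: return the charset declared in a <meta> tag, or null.
// Handles both <meta charset="..."> and the older
// <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
function detect_meta_charset(html_buffer) {
    // latin1 maps every byte to one character, so decoding cannot fail.
    const head = html_buffer.subarray(0, 1024).toString('latin1');
    const match = /<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(head);
    return match ? match[1].toLowerCase() : null;
}
```

When this returns null, the caller would fall back to whatever the existing detection reports.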

Set up caching

Set caching headers
Set up nginx cache

I'm not sure what a good caching time is - 1 day? more?

Document full API

Also document

json
search
open

Cache error pages for only 1 day

Error loading NMR.D$$

https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fagt%2Fnmr1.zip

https://unbox.ifarchive.org/b7iw91c5w/NMR1%20Orignal%20Play%20Distribution/NMR.D$$

500 error

Error: unzip|file error: caution: filename not matched: NMR1 Orignal Play Distribution/NMR.D1518

Try Cache-control: no-transform for Cloudflare

Cloudflare doesn't compress any of our IF storyfile formats, which is non-ideal, and there doesn't seem to be any way to add custom types to their compression list.

But adding a Cache-control: no-transform header might result in Cloudflare serving our gzipped files. We wouldn't get to take advantage of Cloudflare's brotli compression, but if it works that would definitely be a worthwhile trade-off.

Use <wbr> tags in file paths

Failure if path in the zip to be opened contains a space

For the file https://ifarchive.org/if-archive/games/twine/Paintball_Wizard.zip the main HTML file inside it is "Paintball Wizard/info.html". But the link

https://unbox.ifarchive.org/?url=https://ifarchive.org/if-archive/games/twine/Paintball_Wizard.zip&open=Paintball%20Wizard/info.html

fails with "NotFoundError: Unknown file: https://ifarchive.org/if-archive/games/twine/Paintball_Wizard.zip". Interestingly, the link

https://unbox.ifarchive.org/?url=https://ifarchive.org/if-archive/games/twine/Paintball_Wizard.zip&open=info.html

does work.

Compression

Add compression support. But in nginx or Node?

Way to indicate that Master-Index.xml has changed

Having thought about #48 for a while, I think it is worth having a way for the Archive to push a "please refetch" message to Unbox.

This will make life easier for the volunteers; they won't have to think about a five-minute polling delay.

I don't think this has to involve cache headers at all. Just a request we can make that triggers check_for_update() immediately. (The request doesn't have to wait for check_for_update() to finish though.)

I want at least a little bit of mischief protection, so this request should be a POST with a shared secret key in the form field. On the Archive side this will be launched from curl or wget running as root, immediately after a new Master-Index.xml is written.
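A sketch of the request handling, independent of any web framework. The form field name `key`, the status codes, and the function names are all assumptions; check_for_update is passed in as a callback so the sketch does not depend on the app's internals.

```javascript
const crypto = require('crypto');

// Compare secrets in constant time. Hashing first gives both buffers the
// same length, which crypto.timingSafeEqual requires.
function secret_matches(provided, expected) {
    const a = crypto.createHash('sha256').update(String(provided)).digest();
    const b = crypto.createHash('sha256').update(String(expected)).digest();
    return crypto.timingSafeEqual(a, b);
}

// Hypothetical handler: takes the request method and the raw
// application/x-www-form-urlencoded body, returns an HTTP status code.
function handle_refetch(method, body, expected_secret, check_for_update) {
    if (method !== 'POST') return 405;
    const key = new URLSearchParams(body).get('key'); // field name is an assumption
    if (key === null || !secret_matches(key, expected_secret)) return 403;
    // Start the check, but do not make the request wait for it to finish.
    Promise.resolve().then(check_for_update).catch(err => console.error(err));
    return 202;
}
```

On the Archive side this would correspond to something like a curl POST that sends the key as a form field.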

Incorrect index links to files whose names contain `#`

https://unbox.ifarchive.org/?url=/if-archive/games/pc/spanish/zoo.zip

This zip file contains files where the name contains #, e.g. M#ROCKY.1.

To repro: Go to https://unbox.ifarchive.org/?url=/if-archive/games/pc/spanish/zoo.zip and click on the link to M#ROCKY.1

Actual: The link points to https://unbox.ifarchive.org/0impc5w62r/M#ROCKY.1 i.e. https://unbox.ifarchive.org/0impc5w62r/M with a "fragment identifier" of #ROCKY.1

Expected: https://unbox.ifarchive.org/?url=/if-archive/games/pc/spanish/zoo.zip should link to https://unbox.ifarchive.org/0impc5w62r/M%23ROCKY.1 (which does work)
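A sketch of how index links could be built so that # and similar characters survive. encode_path is a hypothetical helper name.

```javascript
// Hypothetical helper: percent-encode each path segment but keep the
// slashes that separate directories inside the archive.
function encode_path(file_path) {
    return file_path.split('/').map(encodeURIComponent).join('/');
}
```

For example, `encode_path('M#ROCKY.1')` yields `M%23ROCKY.1`, which matches the link that is reported to work.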

HTML fragments displayed as plain text

Invalid HTML documents like the following currently receive a text/plain MIME type. There's probably no harm in serving them as text/html.

https://09rayftnzc.unbox.ifarchive.org/09rayftnzc/extras/The%20Bones%20of%20Rosalinda%20Hints%20and%20Walkthrough.html

Redirects aren't being cached

Redirects don't seem to be cached, even though I thought nginx would cache them by default.

Problem with Mentula Macanus

https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fspringthing%2F2011%2FMMA.zip links to https://unbox.ifarchive.org/27mjsmtnuh/MMA/Stiffy%20Makane-%20Apocolocyntosis.gblorb but I get a 500 error trying to view it.

RangeError [ERR_CHILD_PROCESS_STDIO_MAXBUFFER]: stdout maxBuffer length exceeded
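That error comes from child_process buffering the whole of stdout in memory. One possible way around it, sketched under the assumption that the response can be written as a stream, is to pipe the child's stdout instead of collecting it. stream_command is a hypothetical name.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: run a command and pipe its stdout into `destination`
// (any writable stream), so large files never hit a maxBuffer limit.
// Resolves with the exit code once the child and its stdio have closed.
function stream_command(command, args, destination) {
    return new Promise((resolve, reject) => {
        const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });
        child.on('error', reject);
        // end: false leaves it to the caller to close the destination.
        child.stdout.pipe(destination, { end: false });
        child.on('close', code => resolve(code));
    });
}
```

The simpler alternative is to pass a larger maxBuffer option to the existing call, at the cost of holding the whole file in memory.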

Tar paths starting with ./ confuse the app

Depending on how a tar file was created, the tar -tf output might look like

./README.txt
./image.jpeg

list_contents() has no problem with this, and the index page gets generated with links to /HASH/./README.txt etc. However, because of browser URL resolution, when the user clicks on the link, the request comes in as /HASH/README.txt. The contents list does not contain README.txt so we return an error.

I can see a couple of ways to deal with this:

list_contents() could strip initial ./ off the path.
When looking up the file (the details.contents.indexOf(file_path) call in app.js), we could do a fallback check for './'+file_path if the first lookup fails.

(This is a rare problem -- low priority. I first noticed it with http://ifarchive.org/if-archive/games/pc/mansion-19.2.tar.gz. That was when testing my fix for #38, so you won't be able to observe this until that fix is in. I haven't hunted for other cases.)
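The first option (stripping the leading ./ in list_contents()) could look roughly like this. Both function names are hypothetical, and the input is assumed to be the raw text output of tar -tf.

```javascript
// Hypothetical helper: drop any leading "./" from a tar entry path.
function normalize_tar_path(entry) {
    return entry.replace(/^(?:\.\/)+/, '');
}

// Hypothetical helper: turn `tar -tf` output into a list of normalized paths.
function parse_tar_listing(output) {
    return output
        .split('\n')
        .map(normalize_tar_path)
        .filter(Boolean); // drops blank lines and a bare "./" root entry
}
```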

open= should match an exact file name before looking in subfolders etc

From Slack:

Hmm. This Ishmael.zip has two index.html at different levels, the top-level one of which is what you want to launch the game. But doing the obvious thing in IFDB causes Unbox to say "BadRequestError: Filename is not unique".
Is there a way round this, with the things that you can do in an IFDB record? I didn't see one reading the unbox spec.
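A sketch of the lookup order the title asks for: try the exact path first, and only then look in subfolders. resolve_open is a hypothetical name, and the paths in the usage note are illustrative rather than the real contents of Ishmael.zip.

```javascript
// Hypothetical helper: resolve an open= value against the archive contents.
function resolve_open(contents, name) {
    // 1. An exact path match always wins, even if subfolders hold same-named files.
    if (contents.includes(name)) return name;
    // 2. Otherwise accept a file with that name in a subfolder, if it is unique.
    const candidates = contents.filter(p => p.endsWith('/' + name));
    if (candidates.length === 1) return candidates[0];
    if (candidates.length === 0) throw new Error('Unknown file');
    throw new Error('Filename is not unique');
}
```

With contents `['index.html', 'docs/index.html']`, `resolve_open(contents, 'index.html')` returns the top-level file instead of reporting an ambiguity.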

Support .tgz

And any other prevalent compressed files in the archive?

This URL fails (parentheses? spaces?)

https://unbox.ifarchive.org/?url=http://ifarchive.org/if-archive/games/competition2020/Games/Jay%20Schillings%20Edge%20of%20Chaos/Chaos%20%28Offline%20Play%29.zip

Error: Command failed: curl https://ifarchive.org/if-archive/games/competition2020/Games/Jay Schillings Edge of Chaos/Chaos (Offline Play).zip -o /home/data/cache/8yypxconu.zip -s -S -D -
curl: (3) URL using bad/illegal format or missing URL
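The command in the log shows the URL reaching curl with raw spaces. A sketch of one possible fix: normalize the URL with the WHATWG URL parser, which percent-encodes spaces in the path, and pass the arguments as an array so no shell splits them. curl_args and download are hypothetical names; the curl flags are copied from the logged command.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: build the curl argument list for a download.
function curl_args(raw_url, cache_path) {
    // new URL() throws on invalid input and percent-encodes spaces in the path.
    const href = new URL(raw_url).href;
    return [href, '-o', cache_path, '-s', '-S', '-D', '-'];
}

// Hypothetical wrapper: run curl without a shell.
function download(raw_url, cache_path, callback) {
    execFile('curl', curl_args(raw_url, cache_path), callback);
}
```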

War of the Willows percent encoding

This URL contains percent-encoded spaces https://ifarchive.org/if-archive/games/competition2015/The%20War%20of%20the%20Willows/willows-1.1.zip

When I copy and paste it to unbox, https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fcompetition2015%2FThe%2520War%2520of%2520the%2520Willows%2Fwillows-1.1.zip returns 400 “BadRequestError: Unknown file”

This link works, with plusses instead of percent encoding: https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fcompetition2015%2FThe+War+of+the+Willows%2Fwillows-1.1.zip

But the indexes page https://ifarchive.org/indexes/if-archive/games/competition2015/The%20War%20of%20the%20Willows/ links to the percent encoded version, so I think unbox should support it, too.
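One way to accept both forms, sketched under the assumption that the app looks files up by their decoded archive path. archive_path_key is a hypothetical name, and the base URL is assumed so that relative values such as /if-archive/... also parse.

```javascript
// Hypothetical helper: reduce a url= value to a decoded path that can be used
// as a lookup key, whether or not the caller percent-encoded it.
function archive_path_key(raw_url) {
    // The parser encodes raw spaces, so both input forms yield the same pathname.
    const { pathname } = new URL(raw_url, 'https://ifarchive.org');
    // Note: decodeURIComponent throws a URIError on malformed escapes like "%zz".
    return decodeURIComponent(pathname);
}
```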

When open= fails give a link to the zip-index page

Enable log rotating

I see that when you do docker-compose up --build, log info goes to stdout. This should go to /var/log/unbox/unbox.log.

Then we'd have to set up logrotate. Except I don't know exactly how to do that. For Apache, logrotate is configured to do "/etc/init.d/apache2 reload" after rotation so that the server closes and reopens its logging file handle. What is the equivalent here?

Load existing files during cache init

Error with symlinks

https://ifarchive.org/if-archive/games/mini-comps/spanish/retrocomp2004/orfeo2.zip exists

but https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fmini-comps%2Fspanish%2Fretrocomp2004%2Forfeo2.zip shows an error:

BadRequestError: Unknown file: https://ifarchive.org/if-archive/games/mini-comps/spanish/retrocomp2004/orfeo2.zip

find endsWith returns incorrect results

https://unbox.ifarchive.org/?url=https%3A%2F%2Fifarchive.org%2Fif-archive%2Fgames%2Fspringthing%2F2014%2FBearCreek.zip&find=Bear%20Creek.gblorb

This zip file contains two similar file names:

BearCreek/Bear Creek.gblorb
__MACOSX/BearCreek/._Bear Creek.gblorb

Expected: Since only the first file, Bear Creek.gblorb, matches the find parameter exactly, find should match that and redirect to it.
Actual: Find thinks that this is an ambiguous case and suggests both files as options
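The difference comes down to comparing whole file names rather than string suffixes. A sketch, with find_matches as a hypothetical name:

```javascript
const path = require('path');

// Hypothetical helper: return the archive paths whose final component is
// exactly `name`. A plain endsWith(name) check would also accept
// "._Bear Creek.gblorb", because that string ends with "Bear Creek.gblorb".
function find_matches(contents, name) {
    return contents.filter(p => path.posix.basename(p) === name);
}
```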

Mobile styles

Even though I'm using the same stylesheet as the main archive site, it is not responsive on mobile. Might have to change how the page is structured?