hrx's Introduction

Human Readable Archive (.hrx)

This is a specification for a plain-text, human-friendly format for defining multiple virtual text files in a single physical file, for situations when creating many physical files is undesirable, such as defining test cases for a text format.

Here's a sample HRX that contains two files:

<===> input.scss
ul {
  margin-left: 1em;
  li {
    list-style-type: none;
  }
}

<===> output.css
ul {
  margin-left: 1em;
}
ul li {
  list-style-type: none;
}

HRX files are always encoded in UTF-8.

Goals
- Non-Goals
Syntax
Semantics
- Extracting
Implementations

Goals

The HRX format is intended to make it easy to:

represent multiple chunks of plain-text data in a single physical file;
read, edit, and create by hand in a simple text editor;
read and modify programmatically, while generating easy-to-understand diffs for version control;
integrate well with tooling for syntax highlighting or other language-specific support for the individual files it contains.

Non-Goals

The HRX format is not intended as a wire format. The multipart/mixed format defined by MIME is already well-suited to that need.

The HRX format is not intended to faithfully represent any arbitrary directory structure. In order to ensure simplicity and human-readability, it intentionally lacks support for binary data or complex permissions.

Syntax

The syntax for a HRX file is as follows:

archive        ::= entry* comment?
 
entry          ::= comment? (file | directory)
comment        ::= boundary newline body
file           ::= boundary " "+ path newline body?
directory      ::= boundary " "+ path "/" newline+
boundary       ::= "<" "="+ ">" // must exactly match the first boundary in the archive
newline        ::= U+000A LINE FEED
body           ::= contents newline // no newline at the end of the archive (if the
                                    // archive ends in a body, all trailing
                                    // newlines are part of that body's contents)
contents       ::= any sequence of characters that neither begins with boundary nor
                   includes U+000A LINE FEED followed immediately by boundary
 
path           ::= path-component ("/" path-component)*
path-component ::= path-character+ // not equal to "." or ".."
path-character ::= any character other than U+0000 through U+001F, U+007F DELETE, U+002F
                   SOLIDUS, U+003A COLON, or U+005C REVERSE SOLIDUS

Each path in the file must be codepoint-for-codepoint unique. It's invalid for an initial subsequence of a path's components to be the same as another file's path.
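
As a rough illustration of the two rules above, the following Python sketch checks a list of entry paths as they are written after the boundary. The name `find_path_conflicts` and its return shape are hypothetical. It also assumes that a directory entry's path (without its trailing "/") may not duplicate a file's path, which is one reading of the uniqueness rule rather than something the text spells out.

```python
def find_path_conflicts(entries):
    """Return (path, reason) tuples for entries that break the uniqueness rules.

    `entries` holds paths as written in the archive; a trailing "/" marks a
    directory entry. This is a hypothetical helper, not part of any HRX library.
    """
    conflicts = []
    seen = set()        # every path, with the directory slash removed
    file_paths = set()  # only the paths that name files

    for entry in entries:
        is_dir = entry.endswith("/")
        path = entry[:-1] if is_dir else entry
        if path in seen:
            conflicts.append((entry, "duplicate path"))
            continue
        seen.add(path)
        if not is_dir:
            file_paths.add(path)

    # A path is invalid if any proper prefix of its components names a file.
    for path in sorted(seen):
        components = path.split("/")
        for i in range(1, len(components)):
            prefix = "/".join(components[:i])
            if prefix in file_paths:
                conflicts.append((path, f"parent {prefix!r} is a file"))
    return conflicts
```

For example, `find_path_conflicts(["dir", "dir/file"])` reports `dir/file`, because `dir` is already a file and so cannot also act as a parent directory.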

The length of a boundary must be consistent throughout a given HRX file. Longer or shorter boundaries may be included in content blocks, which means an author can always choose a boundary length that lets the content blocks contain arbitrary text.
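
One way to pick such a boundary is sketched below in Python. The function `choose_boundary` is hypothetical. It scans the bodies for boundary-like sequences at the start of a line and returns the shortest boundary whose length is unused. Starting at three "=" characters is only a stylistic assumption to match the sample above; the grammar allows a single "=".

```python
import re


def choose_boundary(bodies, min_equals=3):
    """Return a boundary that no line in `bodies` starts with.

    Hypothetical helper: `bodies` is an iterable of strings that will become
    file contents or comments in one archive.
    """
    used_lengths = set()
    for body in bodies:
        # With re.MULTILINE, "^" matches at the start of the string and
        # right after every line feed, which is where a boundary would count.
        for match in re.finditer(r"^<(=+)>", body, flags=re.MULTILINE):
            used_lengths.add(len(match.group(1)))

    length = min_equals
    while length in used_lengths:
        length += 1
    return "<" + "=" * length + ">"
```

A boundary-like sequence in the middle of a line is harmless, so only line starts are inspected.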

Content blocks don't contain the trailing newline in body. However, the last body in the file doesn't have a trailing newline, so if the file ends with a newline, that newline is part of the last content block.
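
The trailing-newline rule is easiest to see from the writing side. The Python sketch below serializes a mapping of file paths to contents. The name `write_hrx` is hypothetical, and the sketch handles files only (no comments or directories). Every body except the last gets one extra newline that is not part of its contents; the last body is written verbatim.

```python
def write_hrx(files, boundary="<===>"):
    """Serialize {path: contents} into HRX text (hypothetical helper)."""
    items = list(files.items())
    parts = []
    for index, (path, contents) in enumerate(items):
        # Contents may not start with the boundary or contain it at a line start.
        if contents.startswith(boundary) or ("\n" + boundary) in contents:
            raise ValueError(f"contents of {path!r} collide with the boundary")

        parts.append(f"{boundary} {path}\n")
        is_last = index == len(items) - 1
        if is_last:
            # Last body: no separator newline, so contents are written as-is.
            parts.append(contents)
        elif contents:
            # body ::= contents newline
            parts.append(contents + "\n")
        # A non-final empty file simply omits its optional body.
    return "".join(parts)
```

With this layout, a file whose contents end in a newline is followed by a blank line before the next boundary, as in the sample at the top of this document.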

Directories are distinguished from empty files by the trailing "/" in the path name.
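
The grammar and the notes above can be approximated with a small splitting parser. The Python sketch below is illustrative only. The name `parse_hrx` and the tuple layout are hypothetical. It does not validate paths, and it silently drops a comment at the end of the archive. It reads the boundary from the start of the text, splits wherever that exact boundary starts a line, and removes the single separator newline from every body except the last.

```python
import re


def parse_hrx(text):
    """Parse HRX text into a list of (path, contents, comment) tuples.

    Directories keep their trailing "/" and have contents None.
    """
    if text == "":
        return []
    first = re.match(r"<=+>", text)
    if first is None:
        raise ValueError("archive must start with a boundary")
    boundary = first.group(0)

    # Split on the boundary wherever it starts a line; drop the empty prefix.
    pattern = r"^" + re.escape(boundary)
    chunks = re.split(pattern, text, flags=re.MULTILINE)[1:]

    entries = []
    comment = None
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        header, newline, rest = chunk.partition("\n")
        if not newline:
            raise ValueError("boundary line must end with a newline")
        # Only the last body keeps every trailing newline.
        contents = rest if is_last else rest[:-1]

        if header == "":
            if comment is not None:
                raise ValueError("two comments in a row")
            comment = contents
        elif header.startswith(" "):
            path = header.lstrip(" ")
            if path.endswith("/"):
                if rest.strip("\n"):
                    raise ValueError(f"directory {path!r} has contents")
                entries.append((path, None, comment))
            else:
                entries.append((path, contents, comment))
            comment = None
        else:
            raise ValueError(f"malformed boundary line: {header!r}")
    return entries
```

Because the split happens at every line that starts with the boundary, a file header followed immediately by another boundary line is treated as an empty file.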

Semantics

A HRX file, represented as an archive, contains a sequence of files and/or directories, each represented as an entry. Each entry has an optional comment that's intended to allow an archive author to include documentation or notes relating to that entry. This comment may contain user-defined metadata that applies to the following file.

Each entry has a path which represents the relative path from the root of the archive to that entry. Each path is divided by U+002F SOLIDUS characters into separate path components. The components before the last one represent a nested series of directories; the last component represents the entry itself, which is either a file (for a file entry) or a directory (for a directory entry). Parent directories are implicitly assumed to exist if they have entries beneath them, even if they aren't explicitly written as directory entries.
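
To make the implicit-parent rule concrete, here is a short Python sketch that lists every directory an archive implies, whether or not it is written out. The helper name `implied_directories` is hypothetical, and it takes paths as written in the archive, with a trailing "/" on directory entries.

```python
def implied_directories(paths):
    """Return the sorted set of directories implied by the entry paths."""
    directories = set()
    for path in paths:
        is_dir = path.endswith("/")
        bare = path[:-1] if is_dir else path
        if is_dir:
            # An explicit directory entry names itself.
            directories.add(bare)
        components = bare.split("/")
        # Every proper prefix of the components is a parent directory.
        for i in range(1, len(components)):
            directories.add("/".join(components[:i]))
    return sorted(directories)
```

For instance, an archive containing only `a/b/c.txt` implies the directories `a` and `a/b`.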

Each path component is a string that represents the name of that component. Path component names are interpreted exactly as they appear in the archive file.
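
Since component names are taken verbatim, a consumer may want to check a path against the path grammar from the Syntax section before using it. The Python sketch below is one possible check. The name `is_valid_path` is hypothetical, and the function expects the path without a directory entry's trailing "/".

```python
# Characters the grammar excludes from path-character:
# U+0000 through U+001F, U+007F DELETE, U+003A COLON, U+005C REVERSE SOLIDUS.
# U+002F SOLIDUS is handled by splitting on it.
FORBIDDEN = {chr(code) for code in range(0x20)} | {"\x7f", ":", "\\"}


def is_valid_path(path):
    """Return True if `path` matches the HRX path grammar."""
    for component in path.split("/"):
        # path-component needs at least one character and may not be "." or "..".
        if component in ("", ".", ".."):
            return False
        if any(ch in FORBIDDEN for ch in component):
            return False
    return True
```

Rejecting empty components also rejects leading slashes, trailing slashes, and doubled slashes, none of which the grammar can produce.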

File entries with an associated body have their contents given by the body's contents block. File entries with no body have empty contents.

Extracting

Although HRX files are primarily intended to be used as a replacement for multiple physical files, they can be extracted to the physical filesystem. When a HRX file is extracted, the extraction process should (by default) create a directory named after the HRX file, with the extension ".hrx" removed.

The permissions of extracted files should match those of the original HRX file.
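
A minimal extraction sketch in Python follows. It assumes the archive has already been parsed into a mapping of paths to contents, with `None` marking a directory. The function `extract` and that mapping shape are hypothetical. It creates the output directory next to the HRX file and copies the HRX file's permission bits onto each extracted file. It does not validate the paths, so real code should reject ill-formed ones before writing to disk.

```python
import shutil
from pathlib import Path


def extract(hrx_path, entries):
    """Extract `entries` ({path: contents or None}) next to `hrx_path`.

    Returns the directory that was created.
    """
    hrx_path = Path(hrx_path)
    if hrx_path.suffix != ".hrx":
        raise ValueError("expected a file with the .hrx extension")

    # Directory named after the HRX file, with ".hrx" removed.
    root = hrx_path.with_suffix("")
    root.mkdir(parents=True, exist_ok=True)

    for path, contents in entries.items():
        target = root.joinpath(*path.rstrip("/").split("/"))
        if contents is None:
            target.mkdir(parents=True, exist_ok=True)
            continue
        # Parent directories exist implicitly, so create them on demand.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write bytes so no platform newline translation takes place.
        target.write_bytes(contents.encode("utf-8"))
        # Give the extracted file the same permission bits as the archive.
        shutil.copymode(hrx_path, target)
    return root
```

Permissions are copied to files only; applying a file mode such as 0644 to a directory would make it impossible to enter.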

Implementations

The following packages implement HRX in various languages. If you're writing an implementation, feel free to send a pull request to add it to the list!

Ruby (original/reference implementation)
Python
JavaScript
Rust
Java

Tooling

Tooling for a nicer experience writing HRX files:

VSCode Syntax

hrx's Issues

Testing data set

I recently had the idea of creating a plain text archive format, searched the web and found this project, which is precisely what I was looking for. I needed a Java implementation though and decided I would implement an I/O library for Java for hrx instead of rolling my own custom archive format. I just found the time to do it yesterday and here are the results: https://github.com/topobyte/hrx-java

I implemented a test case for each of the example files, however I'm not 100% sure my implementation behaves exactly as specified / the same as the ruby implementation concerning trailing newlines. I think it would be very useful to have a set of testing files as a reference for current and future implementing developers. As a start, how about creating an extracted version of all the example files, i.e. we could have a directory extracted-examples containing a directory X for each X.hrx from examples. I'm not too fluent in setting up ruby and gems, so I would appreciate if you could create such a reference data set by unpacking all examples into corresponding directories and pushing it here. Thanks!

Examples vs. syntax spec inconsistencies

Hi! I've recently been working on a Rust implementation for this, and found a few corner cases where the syntax from the README didn't match up with the examples:

The most straightforward one is probably the duplicate-despite-quotes.hrx part of example/invalid/duplicates.hrx: it looks like (syntax says naught, #1) quotes are no longer special-cased?

The next one I ran into was duplicate-files.hrx from that same file – that archive should, according to the spec, (a) be valid and (b) contain <======> file\n.
I think so due to the following: contents is defined as "any sequence of characters that does not include U+000A LINE FEED followed immediately by boundary", and file as boundary " "+ path newline body?.
Now, given a buffer containing

<======> file
A      BCD  EF
<======> file

We can see that the span from A to B matches boundary, C matches the spaces, the span from D to E matches path, and F matches newline. What is left is to match the optional body, which consists of the following:

<======> file

Note how this chunk doesn't start with U+000A LINE FEED, despite the line starting with boundary. This means that the file contents continue until EOF.

The third mismatched example plagues example/empty-file.hrx. Assuming the same symbols as before, we get (after the first comment)

<===> file1
A   BCD   EF
<===>
So is this one.
<===> file2

thereby hitting the first LF+boundary sequence on the line declaring file2 (my parser returns {file1: {cmt: "This file is empty.", ctnt: "<===>\nSo is this one."}, file2: { cmt: null, ctnt: "" }}, which I feel is correct, going solely by the syntax?).

My hunch as to why these weren't noticed earlier is due to the usage of splitting parsers (e.g. in hrx.js and hrx.py), which probably handle these examples as expected.

I'd be more than happy to submit a PR addressing these issues, if deemed valid :)

List of implementations

What do you think about having a list of implementations somewhere in the README?

I have found those so far:

Ruby: https://github.com/google/hrx-ruby
Python: https://github.com/rebeccajae/hrx.py
JavaScript: https://github.com/rebeccajae/hrx.js
Rust: https://github.com/nabijaczleweli/hrx.rs
Java: https://github.com/topobyte/hrx-java

Are there any more that you're aware of?
