The idea behind the name “funnel” is that it should allow writing code at a higher level of abstraction, which all gets compiled down to the lowest common denominator of bash scripting.
This currently has a few separate angles:
- A novel programming language transpiling into bash etc.
- with an effect-based type system.
- A virtualized filesystem and process execution environment to convert the knowledge of that type system directly into unheard-of performance.
- most importantly, I/O must not be arbitrarily buffered by the programming language.
- Preconceived ideas of process input and output should be thrown in the gutter beforehand.
- to allow arbitrary other tools to very performantly execute processes against a bazel remexec remote backend
- this would also be valuable within a remote backend, due to the improved performance from the VFS.
- Figure out how to architect a language which, to the maximum extent possible, doesn’t block other people from merging their code because it might break something else.
- This seems achievable in large part by maximizing concepts of composability (the goal) and orthogonality (the quest).
- This happens to be very similar in abstract to the advantage of the approach of simultaneous-productions over the Turing Machine.
- But “more tech isn’t the answer to social problems.”
- [ ] **The language will remain completely <<optional>>.**
- [ ] The language can parse all common shells and use all their functions and variables.
- [ ] Introduces the concept of <<telescoping-syntax>> from a language X to Y, which accepts any mixture of statements/expressions in X and in Y in the same context.
- [ ] Allows a wide variance of syntaxes and macros (TIMTOWDI) which may feel more or less reasonable to people from other languages.
- [ ] **<<transpiles>> to lowest-common-denominator bash.**
- [ ] transpiling to any of multiple <<backend shells>> in order to <<support multiple platforms>>.
- [ ] **<<types>> are trained upon ~--help~ output and/or shell completion scripts.**
- [ ] **consider a method using strace as well if neither are available or helpful!**
Canonicalize incredibly well-documented (read: well-typed!) shell tooling via the bootstrap project:
Instead of whatever’s on your developers’ machines, or whatever you can fudge into a single docker image.
Deeply document the minute and crucial differences between sh/bash/zsh/ksh/csh/tcsh/fish <<shs>>, even across versions, and respect them.
- [ ] fix longtime bash issues: [0/3]
- [ ] <<modifying the underlying file>> makes the script break at runtime.
- [ ] <<array and map access syntax>> (as well as <<loop syntax>>) is inscrutable (see the sketch after this list).
- [ ] canonicalize variable declarations: [0/2]
- [ ] <<immutable by default>>.
- [ ] introduce new keywords for array and map declarations.
- [ ] have a really simple (and therefore revolutionary) <<module system>>.
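For context on just how inscrutable the access syntax gets, here’s plain bash 4+ (nothing funnel-specific), where each sigil combination is load-bearing and means something different elsewhere:

declare -a arr=(one two three)
declare -A map=([a]=1 [b]=2)

echo "${arr[0]}"     # first element; bare `$arr` silently means this too
echo "${arr[@]}"     # all elements (word-split correctly only if quoted)
echo "${#arr[@]}"    # array length; contrast ${#arr[0]}, a *string* length
echo "${!map[@]}"    # map keys; `!` means "keys" here but "indirection" elsewhere
for i in "${!arr[@]}"; do  # looping over indices needs that same `!`
  echo "$i: ${arr[$i]}"
done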
Represent common bash patterns with the appropriate level of complexity!
- [ ] Binary vs text <<streams>> are differentiated.
- [ ] Arrays and map creation, access, and modification are typed.
- [ ] Massage a shell output with 0 lines to be an empty array, instead of ~( '' )~ (see the sketch after this list).
- [ ] Return codes vs file outputs are represented in the type system for each <<expression>>: [0/3]
- [ ] function calls
- [ ] subshells
- [ ] variable access and modification
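Here’s the 0-lines pitfall from the list above in plain bash, for reference:

lines=( "$(true)" )    # `true` prints nothing...
echo "${#lines[@]}"    # ...but this prints 1: the array is ( '' ), not ()

mapfile -t lines < <(true)   # bash 4+ produces the empty array you wanted
echo "${#lines[@]}"          # prints 0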
The TODO in this section is non-blocking!
- [ ] achieve /much/ stronger typing guarantees than are available in other languages that don’t have such fine requirements for I/O.
- possibly influenced by the way the Ammonite REPL converts filesystem and process actions into structured results that can be manipulated with a few implicits.
- [ ] Consider: what type of syntax will this lead to? e.g.:
- [ ] a perfectly general language definition, which the language itself is written in.
- [ ] a method of introducing spheres of a completely different sub-language.
- [ ] TODO: find that tweet discussion with nora and the other guy who described this!
A transpiler, standard library, and self-bootstrapping tool environment to write more portable and maintainable bash scripts.
The project owes immense inspiration to CoffeeScript, which demonstrated it was possible to write more-complex code with an extended feature set and still work in all browser environments through transpilation, which then inspired the incorporation of those exact features into JavaScript at large.
A more obvious detriment to portable bash scripting is simply not having the desired tools. For tools that users may have installed by default, there are still e.g. incompatibilities on macOS again (such as ~sed~ not accepting the ~-r~ flag). In general, though, versions of many shell tools may also be splintered across Linux distributions, depending on how often distributions update their toolchains and how often users upgrade their OS. This can lead to a tradeoff that developers make between maintaining a complete toolchain on all shell environments, and writing lowest-common-denominator bash again, without being able to use tools such as ~sed~ or ~grep~ as expected.
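One hedged workaround for the ~sed -r~ example above is a runtime feature probe instead of assuming GNU or BSD (modern GNU sed accepts both ~-E~ and ~-r~; macOS/BSD sed only accepts ~-E~); the ~sed_ere~ wrapper name here is invented:

if sed -E 's/x/y/' </dev/null >/dev/null 2>&1; then
  sed_ere() { sed -E "$@"; }
else
  sed_ere() { sed -r "$@"; }
fi
echo 'foo123' | sed_ere 's/[0-9]+/N/'   # prints fooN under either sed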
- [ ] PEX just in the vein of pex 2 (deleting its own custom resolve with ~spang~).
- [ ] Investigate whether PEX can be leveraged into a self-bootstrapping directory file for **multiple languages?**
It seems very unfortunate that “bash” is likely still synonymous with “shell” for many people, only because bash has also splintered in versions and feature sets across environments, and a lot of this may be due to the fact that macOS won’t update its preinstalled version of bash to 4 or higher, due to concerns about the GPLv3 license used for bash 4. This means users writing bash often have to manually write <<lowest-common-denominator>> bash scripts (scripts which must work on the lowest bash version they need to support, and can’t use new features) to ensure portability, which tends to make these scripts more difficult to write and maintain.
One thing that transpilation also allows you to do is insert an arbitrary amount of code before and/or after the compiled script itself. CoffeeScript, for example, will monkey-patch some array prototype methods before executing the script, to ensure that its compiled output will be able to rely on those array methods (see Prelude / Runtime). In our case, we can consider adding to that prelude a layer which ensures up-to-date versions of not just familiar tools like ~sed~ and ~grep~, but also extremely useful and portable tools such as GNU parallel (which isn’t very well-known, possibly due to not being installed by default (unlike ~xargs~, which is less featureful but does some of the same things)).
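A minimal sketch of what that prelude layer could look like (the ~ensure_tool~ name and behavior are hypothetical, not CoffeeScript’s):

ensure_tool() {
  local tool="$1"
  if ! command -v "$tool" >/dev/null 2>&1; then
    # A real prelude might instead download a pinned, platform-appropriate
    # build into a private bin directory here.
    echo "prelude: required tool '$tool' not found on PATH" >&2
    exit 1
  fi
}
ensure_tool sed
ensure_tool grep
ensure_tool parallel
# ...compiled script body follows...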
These portable tools (~sed~ and ~grep~) can be said to provide an unmatched level of type safety (similar to test maturity) as a result of their maintenance over decades. We should be able to canonicalize and have the compiler tell the user all of this, instead of leaving it as tribal knowledge. One of the most immediate ways to do this is to infer a real form of type safety from the help text and/or shell completions.
Also of note is that the CoffeeScript compiler will wrap the output in an anonymous function to ensure it won’t pollute the global JavaScript namespace. Analogously, we can also consider introducing a better module system to bash, and perhaps a package manager (?).
Separate from toolchains, many bash semantics can tend to confuse users, even experienced ones. Last week I learned that ~set -e~ doesn’t exit on a failed command if it’s within the body of a function! Many other shells such as zsh fix issues with e.g. variable declarations, but those other shells are even less likely to be installed by default. ShellCheck is often used in codebases to avoid these pitfalls, but custom checks may still have to be written – the pants repo required this separate check for broken ~readonly~ statements which don’t cause ~set -e~ to fail. This checking requires effort to maintain and still may be incomplete.
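Both pitfalls in one small demo (neither ~false~ aborts the script, despite ~set -e~):

#!/bin/bash
set -e

# 1. `set -e` is suppressed inside a function called as a condition:
fails() { false; echo "ran past a failure inside fails()"; }
if fails; then :; fi

# 2. `readonly var=$(cmd)` masks the command's exit status, because the
#    builtin itself returns 0 (this is ShellCheck's SC2155):
readonly output="$(false)"
echo "still running; output='${output}'"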
While ShellCheck can capture pitfalls and style errors, it seems that the number of pitfalls is so great that we might consider looking at a whitelisting approach instead – not allowing these pitfalls to be expressible at all, perhaps by writing a new language, which transpiles to lowest-common-denominator bash scripts!
- Mainly see ~zshexpn(1)~, and especially the extreme complexity and terseness of the “history expansion” section in particular.
- Being able to nest ~${${...}}~ is also a homogeneous and really neat interface!
  - Immediately becomes incomprehensible when stacked too far.
- If the same transformations can be composed across long command lines, and made safe (even faster), I think we would have built something good. <<spread-out-existing-expansion-techniques>>
- There is an extremely thorough dialog on ease of keybindings in XTerm on bash vs zsh because of their string handling mechanisms at ~xterm(1)~!!
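For reference, the nesting praised above in action (zsh-only; bash can’t nest parameter expansions at all):

s=prefix_middle_suffix
print ${${s#prefix_}%_suffix}         # -> middle
# one more level and it already turns opaque ((U) upcases):
print ${${${(U)s}#PREFIX_}%_SUFFIX}   # -> MIDDLE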
- Building on ~upc~, consider how performance analysis and benchmarking can change overnight if/when it’s not only possible to trace filesystem and network I/O vs RAM pressure vs CPU, but to directly /orchestrate/ it.
- Consider the expected/proposed/conjectured utility of a generic process execution engine in ~upc~.
  - ~upc~ was built on top of years of work to extract the process execution itself from the rest of the build tool, which has resulted in the fantastic ~process_executor~ debugging tool.
Pants, bazel, and other projects have been continuously collaborating on an extensible shared format for specifying a process execution request <<remexec>>. This is used in pants and bazel today to execute processes that create files for build tasks. As a testament to its reproducibility, multiple organizations rely on this API to homogeneously execute the same processes remotely, or to pull down a cached result of the same process execution (e.g. with a backend like Scoot).
- [ ] We should be able to produce, from such bash/zsh completion scripts, a form of these idempotent bazel remexec API-compatible ~Process~ execution requests, and very performantly execute them against a VFS.
A virtual file system using FUSE doesn’t incur too much overhead on Linux [citation needed]. However, a filesystem, by construction, can only use heuristics to optimize its performance (and that “performance” has many axes). What if we could know /exactly/ which files were about to be read/written at all times?
- [ ] If we knew every file that was going to be written by a process beforehand, we could allocate self-growing buffers for each of those paths, avoiding the need to allocate any resources in real time.
- [ ] If we knew the expected size of those future files, we could allocate the appropriate regions immediately.
- [ ] If we knew every file that needed to be read by a process beforehand, we could allocate (perhaps even pool) read-only buffers before the process executes.
- “fast enough IPC is just an FFI”
- eugene and zinc vfs: https://eed3si9n.com/cached-compilation-for-sbt

If this omniscience were achievable, we could expect our processes to run “faster than the speed of light”, i.e. faster than any conceivable heuristic model.
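A toy bash illustration of the preallocation idea from the list above, assuming the output paths and sizes are somehow known up front (this table is invented for the example):

declare -A expected_outputs=(
  [out/build.log]=1048576    # 1 MiB
  [out/result.bin]=4194304   # 4 MiB
)
for path in "${!expected_outputs[@]}"; do
  mkdir -p "$(dirname "$path")"
  # Reserve the full region before the process ever runs. fallocate is
  # Linux-only; `truncate -s` is the (sparse) portable fallback.
  fallocate -l "${expected_outputs[$path]}" "$path" 2>/dev/null \
    || truncate -s "${expected_outputs[$path]}" "$path"
done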
Parsing bash/zsh completion scripts (or obtaining them from e.g. ~--help~) should accomplish two goals:
- [ ] We can validate the types of arguments before running the script at all (see the sketch after this list).
- [ ] This should improve type safety automatically, in a way that can be run on the script before executing it at all.
- [ ] can shellcheck do this already?
- [ ] It should be relatively easy to write “stubs” (like mypy) which can fill in the blanks for hand-written scripts. <<mypy-stubs>>
- [ ] This would be an extremely natural place to start eventually developing a more thorough type inference system for shell scripts in general!
- [ ] This should either extend or integrate with shellcheck to provide real type safety for bash shells.
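A minimal sketch of the first goal above using nothing but ~--help~ scraping (the regex is a crude heuristic, not a real funnel component):

#!/bin/bash
tool=grep
mapfile -t known_flags < <("$tool" --help 2>&1 | grep -oE -- '--[a-zA-Z0-9][a-zA-Z0-9-]*' | sort -u)

validate_flag() {
  local flag="$1" known
  for known in "${known_flags[@]}"; do
    [[ "$flag" == "$known" ]] && return 0
  done
  echo "error: $tool does not appear to accept '$flag'" >&2
  return 1
}

validate_flag --extended-regexp   # ok: appears in `grep --help`
validate_flag --no-such-flag      # rejected without ever running grep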
[0/4] Build on top of existing UX investigations into high-performance interactive paging and/or serving!
- [ ] e.g. ~parallel~ (with both man and info pages!)
- [ ] See the docstring of ~small-temporary-file-directory~ (and the global ~files~ defgroup more generally):
(defcustom small-temporary-file-directory
  (if (eq system-type 'ms-dos) (getenv "TMPDIR"))
  "The directory for writing small temporary files.
If non-nil, this directory is used instead of `temporary-file-directory'
by programs that create small temporary files. This is for systems that
have fast storage with limited space, such as a RAM disk."
  :group 'files
  :initialize 'custom-initialize-delay
  :type 'directory)
- Realization: ~small-temporary-file-directory~, and more generally ~info(emacs)Top>Files>Saving/Backup~, results from “~ 20 years of UX work in calculating which backup pages should stay paged in or not”.
- [ ] this notably mirrors PEX’s ~--always-write-cache~ option.
--unzip, --no-unzip
    Whether or not the pex file should be unzipped before executing it. If the pex file will be run multiple times under a stable runtime PEX_ROOT the unzipping will only be performed once and subsequent runs will enjoy lower startup latency. [Default: do not unzip.]
--always-write-cache
    Always write the internally cached distributions to disk prior to invoking the pex source code. This can use less memory in RAM constrained environments. [Default: False]
--ignore-errors
    Ignore requirement resolution solver errors when building pexes and later invoking them. [Default: False]
- [ ] UNCLEAR can `man’ and `woman’, and especially `info’, actually be surpassed?
- [ ] learn `xref` commands and determine how to navigate between everything!
I don’t think anyone at all has been thinking about dynamic-io-control yet!!
contrast dynamic-io-control with what’s statically-known!
- While this project focuses on making process executions type-safe, cacheable, and extremely fast (<<statically-known>>), the ~learning-progress-bar~ project is more focused on tracing what happens /during/ an execution <<dynamic-io-control>>.
- Both projects:
- focus on “dropping in” to existing command-line invocations and tooling people have already set up (<<dropping-in>>),
- are intended to plug into a build tool.
- Output streaming can be safely delegated to ~learning-progress-bar~, while this one focuses much more on one-shot executions.
Motivating Example: the Fraunces open-source variable font
- [[https://github.com/cosmicexplorer/Fraunces/blob/56a435d9ddd4ea6e627b282fb6e4c7b8a6f8f561/sources/build.sh#L28-L71][See this highly commented code from my attempt to fix the larger issues with the build system for the Fraunces family of open-source variable fonts.]][fn:Fraunces]
Two workstreams
- [ ] The main thing i’m thinking hasn’t been investigated is the performance of a <<virtual-brute-force>> technique i.e. ripgrep searching a virtual filesystem containing only the logs you care about anyway.
- It feels like it can be viewed as a non-indexed database query of a document database.
- [ ] How can we analyze the CPU cache effects of this faux filesystem?
- But: should it still be considered “brute-force” at all if you’re able to manipulate the input so much?
- [ ] Relatedly, “indexing and memoizing arbitrarily complex grammars across runs” is a feature of my <<currently-secret-patent-pending-parsing-algorithm>> too (which is why i should publish as GPL v3 asap so we can use it).
- [X] With ~--json~ from ~rg --help | rg -A5 '--json'~, we have an <<asynchronous-parsing-database>> which can <<update-incrementally>> and <<expand-surrounding-context>> of a result
- [X] MVP
- *”i just saw a weird compiler error, i want to scan all the builds for other instances of this”?*
- [X] context
- see the ~./pants fetch~ proposal!
- Twitter EE had a huge amount of difficulty reliably accessing their own build logs.
- splunk is now making $$$
Both of these workstreams use the ripgrep json output API to provide streaming output.
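For reference, ~rg --json~ emits one JSON object per line (message types include ~begin~, ~match~, ~context~, ~end~, and ~summary~), so results can be consumed as a stream while the search is still running, e.g. filtered through ~jq~:

rg --json 'ERROR' build-logs/ \
  | jq -r 'select(.type == "match")
           | "\(.data.path.text):\(.data.line_number):\(.data.lines.text)"'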
In order to get more familiarity with the <<
[/] helm-rg!!!???
[0/1] UnionFS, pants compile logs
- [ ] MVP
- Use the ripgrep json output to search R E A L L Y F A S T for Pants compiler logs!
- ~rg --help | rg -A5 '--json'~ is helpful for context.
- along with UnionFS, we could create a whole document database
- by doing extremely fast/parallel searches via ~ripgrep~ along with a completely virtual mockup!
- (but <<locally-cacheable>> (or <<pairwise cacheable>>)) filesystem mockup!!!!
- (i have literally no clue about how filesystems cache things)
- but my impression is that read latency comes from:
- having to be an online system
- accepting arbitrary tree traversals
- and from swapping things in and out of memory according to faulty heuristics
- and that ~io_uring~ is great because it allows the kernel to avoid the false assumption that each IO operation is independent of any other
  - TODO: look at online discussions of ~io_uring~!
- if there’s value here, it seems to lie again in making use of some more omniscience about the input we want to serve to ripgrep in a unionfs.
what can “super super fast X” be used for?
- distributed asynchronous text scanning / parsing
- in a distributed process execution system, it allows a simple, configurable model of fetching process output
- OpenMP/MPI and ESPECIALLY SYCL!!!!
- try making the rust regex crate’s generated automata into distributed objects, which would enable massive parallelism across those very focused UnionFS chroots
Consider the extremely thoughtful and natural API of the Ammonite REPL.
- Unlike other shell-like environments, Ammonite has the type safety and well-documented standard library of Scala built-in.
Right now, the “funnel” language’s functionality will be exposed through a single executable, ~fun~.
- [ ] define command-line tools to control (such as ~sed~, ~parallel~, ~jq~, ~xmlstarlet~), and create a method to download them on all supported platforms.
- [ ] define “all supported platforms”.
- [ ] define a grammar (see the bash grammar).
- [ ] implement the transpiler.
- [ ] figure out whether/how this language can be smart enough to bootstrap itself (i.e. the compiler is written in it)
- ^!!!^
- [ ] begin to consider a module and package system for (portable) bash scripts
- [ ] want something that will work on existing bash/zsh code (e.g. if you put them in a special directory they can be specially required or loaded)?
  - the “Prelude”/“Runtime” for this (the shell script code that it loads) should have a function that is available to bash and zsh scripts that it loads which allows them to load something from the module system with similar ease!
- [ ] consider using any relevant parts of shellcheck!!
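A purely hypothetical sketch of the module-loading hook that prelude could expose (every name here is invented): resolve a module against a search path and ~source~ it at most once, so existing bash/zsh files become loadable modules.

declare -A __funnel_loaded_modules=()
funnel_require() {
  local name="$1" dir candidate
  [[ -n "${__funnel_loaded_modules[$name]:-}" ]] && return 0
  local -a dirs
  IFS=':' read -ra dirs <<< "${FUNNEL_MODULE_PATH:-./modules}"
  for dir in "${dirs[@]}"; do
    candidate="$dir/$name.sh"
    if [[ -f "$candidate" ]]; then
      # shellcheck disable=SC1090
      source "$candidate"
      __funnel_loaded_modules["$name"]=1
      return 0
    fi
  done
  echo "funnel_require: module '$name' not found" >&2
  return 1
}
funnel_require logging   # loads ./modules/logging.sh exactly once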
Whether to accept command lines using GNU-style (probably long) options, or BSD options (with different names and some missing functionality).
Whether to generate code for bash or for zsh. **The output of this compiler should be 100% compatible with code written for the output shell.**
The output of a compile should have some “prelude” or “runtime” which is some script to be evaluated containing e.g. convenience methods.
GPL v3 (or any later version)
[fn:Fraunces] To really underline why there’s such a searing need here:
- the build system isn’t even attempting to do anything too difficult with the font variability itself!
- It’s simply trying to convert its design into something that works canonically with an existing font file formatter (e.g. ~FontForge~, but idk whether that was even what was actually used?!).
#!/bin/bash
# NB: bash, not sh: this script uses arrays and `set -o pipefail`.
set -euxo pipefail
# Ensure this script is executed from within its own directory.
GIT_ROOT="$(git rev-parse --show-toplevel)"
cd "${GIT_ROOT}/sources"
# Only use this when necessary, as currently not all instances are defined in the VF designspace
# files. Generate the static designspace (referencing the csv) and the variable designspace file
# later; this might not be done dynamically.
# python ../mastering/scripts/generate_static_fonts_designspace.py
## Statics
static_fonts=(
  # 3 arguments per line.
  Roman/Fraunces_static.designspace ttf ../fonts/static/ttf
  Roman/Fraunces_static.designspace otf ../fonts/static/otf/
  Italic/FrauncesItalic_static.designspace ttf ../fonts/static/ttf/
  Italic/FrauncesItalic_static.designspace otf ../fonts/static/otf/
)
function get_static_instances_from_designspaces {
  ./extract_instances.sh {Roman,Italic}/*_static.designspace
}
# FIXME: This is a REALLY FANTASTIC CASE where shell scripting is EXCEEDINGLY difficult to work
# with, but JUST AS BAD AS THE PYTHON CODE IN fixNameTable.py and friends!!!! This is a *use case*!!
# NB: Especially take note of:
# (1) The hacky progress bar
# (2) The `stdbuf` unbuffering
# (3) The partial output redirection!
# (4) Being unable to use `xargs` or `parallel` with shell functions means recreating these
# ".../*_static.designspace" globs in get_static_instances_from_designspaces()!
# NB: Looking to address all of the above with https://github.com/cosmicexplorer/funnel
function generate_static_fonts {
  # This is really quick to calculate, and lets us know how much progress we're making!
  total_num_static_instances="$(get_static_instances_from_designspaces | wc -l)"
  echo "Generating Static fonts ($total_num_static_instances in total)"
  # (1) Process each .designspace XML file and output format in parallel with `xargs`.
  # (2) At this point, we're dealing with a ton of output, so we tee it to stderr so the user can
  # redirect to /dev/null if they don't need that finer-grained info.
  # (3) However on stdout, we filter for messages that describe successfully writing out a .otf or
  # .ttf file, and give a quick progress bar with percentage, since we know how *many* instances
  # we'll eventually need to write, even if we're not checking which exact ones those are.
  instances_processed=0
  printf '%s\n' "${static_fonts[@]}" \
    | 2>&1 stdbuf -i0 -o0 -e0 xargs -t -L 3 --max-procs=0 ./generate_font_instances.sh \
    | stdbuf -i0 -o0 -eL tee /dev/stderr \
    | sed -Ene 's#^INFO:fontmake.font_project:Saving (.*)$#\1#gp' \
    | while read -r just_saved_font; do
        instances_processed="$(( instances_processed + 1 ))"
        # NB: bash arithmetic is integer-only (no 100.0!), so scale by 100 *before* dividing.
        percent_complete="$(( instances_processed * 100 / total_num_static_instances ))"
        echo "${percent_complete}% complete: ${instances_processed}/${total_num_static_instances} (${just_saved_font})"
      done
}
time generate_static_fonts
# NB: the post-processing below is currently disabled by this early exit.
exit 0
echo "Post processing"
gftools fix-dsig -a ../fonts/static/ttf/*.ttf
gftools fix-hinting ../fonts/static/ttf/*.ttf
# NB: This script appears to be doing something incredibly complex that it absolutely should not be
# attempting to do on its own.
python ../mastering/scripts/fixNameTable.py ../fonts/static/ttf/*.ttf