The idea behind the name “funnel” is that it should allow writing code at a higher level of abstraction, which all gets compiled down to the lowest common denominator of bash scripting.
This currently has a few separate angles:
- A novel programming language transpiling into bash etc.
- with an effect-based type system.
- A virtualized filesystem and process execution environment to convert the knowledge of that type system directly into unheard-of performance.
- most importantly, I/O must not be arbitrarily buffered by the programming language.
- Preconceived ideas of process input and output should be thrown in the gutter beforehand.
- to allow arbitrary other tools to very performantly execute processes against a bazel remexec remote backend
- this would also be valuable within a remote backend, due to the improved performance from the VFS.
- Figure out how to architect a language which, to the maximum extent possible, doesn’t block other people from merging their code because it might break something else.
- This seems achievable in large part by maximizing concepts of composability (the goal) and orthogonality (the quest).
- This happens to be very similar in abstract to the advantage of the approach of simultaneous-productions over the Turing Machine.
- But “more tech isn’t the answer to social problems.”
- [ ] **The language will remain completely <<optional>>.**
- [ ] The language can parse all common shells and use all their functions and variables.
- [ ] Introduces the concept of <<telescoping-syntax>> from a language X to Y, which accepts any mixture of statements/expressions in X and in Y in the same context.
- [ ] Allows a wide variance of syntaxes and macros (TIMTOWDI) which may feel more or less reasonable to people from other languages.
- [ ] **<<transpiles>> to lowest-common-denominator bash.**
- [ ] transpiling to any of multiple <<backend shells>> in order to <<support multiple platforms>>.
- [ ] **<<types>> are trained upon ~--help~ output and/or shell completion scripts.**
- [ ] **consider a method using strace as well if neither are available or helpful!**
Canonicalize incredibly well-documented (read: well-typed!) shell tooling via the bootstrap project:
Instead of whatever’s on your developers’ machines, or whatever you can fudge into a single docker image.
Deeply document the minute and crucial differences between sh/bash/zsh/ksh/csh/tcsh/fish <<shs>>, even across versions, and respect them.
- [ ] fix longtime bash issues: [0/3]
- [ ] <<modifying the underlying file>> makes the script break at runtime.
- [ ] <<array and map access syntax>> (as well as <<loop syntax>>) is inscrutable (see the sketch after this list).
- [ ] canonicalize variable declarations: [0/2]
- [ ] <<immutable by default>>.
- [ ] introduce new keywords for array and map declarations.
- [ ] have a really simple (and therefore revolutionary) <<module system>>.
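For context on just how inscrutable the access syntax gets, here’s plain bash 4+ (nothing funnel-specific), where each sigil combination is load-bearing and means something different elsewhere:

declare -a arr=(one two three)
declare -A map=([a]=1 [b]=2)

echo "${arr[0]}"     # first element; bare `$arr` silently means this too
echo "${arr[@]}"     # all elements (word-split correctly only if quoted)
echo "${#arr[@]}"    # array length; contrast ${#arr[0]}, a *string* length
echo "${!map[@]}"    # map keys; `!` means "keys" here but "indirection" elsewhere
for i in "${!arr[@]}"; do  # looping over indices needs that same `!`
  echo "$i: ${arr[$i]}"
done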
Represent common bash patterns with the appropriate level of complexity!
- [ ] Binary vs text <<streams>> are differentiated.
- [ ] Arrays and map creation, access, and modification are typed.
- [ ] Massage a shell output with 0 lines to be an empty array, instead of ~( '' )~ (see the sketch after this list).
- [ ] Return codes vs file outputs are represented in the type system for each <<expression>>: [0/3]
- [ ] function calls
- [ ] subshells
- [ ] variable access and modification
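Here’s the 0-lines pitfall from the list above in plain bash, for reference:

lines=( "$(true)" )    # `true` prints nothing...
echo "${#lines[@]}"    # ...but this prints 1: the array is ( '' ), not ()

mapfile -t lines < <(true)   # bash 4+ produces the empty array you wanted
echo "${#lines[@]}"          # prints 0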
The TODO in this section is non-blocking!
- [ ] achieve /much/ stronger typing guarantees than are available in other languages that don’t have such fine requirements for I/O.
- possibly influenced by the way the Ammonite REPL converts filesystem and process actions into structured results that can be manipulated with a few implicits.
- [ ] Consider: what type of syntax will this lead to? e.g.:
- [ ] a perfectly general language definition, which the language itself is written in.
- [ ] a method of introducing spheres of a completely different sub-language.
- [ ] TODO: find that tweet discussion with nora and the other guy who described this!
A transpiler, standard library, and self-bootstrapping tool environment to write more portable and maintainable bash scripts.
The project owes immense inspiration to CoffeeScript, which demonstrated it was possible to write more-complex code with an extended feature set and still work in all browser environments through transpilation, which then inspired the incorporation of those exact features into JavaScript at large.
A more obvious detriment to portable bash scripting is simply not having the desired tools. For tools that users may have installed by default, there are still e.g. incompatibilities on macOS again (such as ~sed~ not accepting the ~-r~ flag). In general, though, versions of many shell tools may also be splintered across Linux distributions, depending on how often distributions update their toolchains and how often users upgrade their OS. This can lead to a tradeoff that developers make between maintaining a complete toolchain on all shell environments, and writing lowest-common-denominator bash again, without being able to use tools such as ~sed~ or ~grep~ as expected.
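One hedged workaround for the ~sed -r~ example above is a runtime feature probe instead of assuming GNU or BSD (modern GNU sed accepts both ~-E~ and ~-r~; macOS/BSD sed only accepts ~-E~); the ~sed_ere~ wrapper name here is invented:

if sed -E 's/x/y/' </dev/null >/dev/null 2>&1; then
  sed_ere() { sed -E "$@"; }
else
  sed_ere() { sed -r "$@"; }
fi
echo 'foo123' | sed_ere 's/[0-9]+/N/'   # prints fooN under either sed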
- [ ] PEX just in the vein of pex 2 (deleting its own custom resolve with ~spang~).
- [ ] Investigate whether PEX can be leveraged into a self-bootstrapping directory file for **multiple languages?**
It seems very unfortunate that “bash” is likely still synonymous with “shell” for many people, only because bash has also splintered in versions and feature sets across environments, and a lot of this may be due to the fact that macOS won’t update its preinstalled version of bash to 4 or higher, due to concerns about the GPLv3 license used for bash 4. This means users writing bash often have to manually write <<lowest-common-denominator>> bash scripts (scripts which must work on the lowest bash version they need to support, and can’t use new features) to ensure portability, which tends to make these scripts more difficult to write and maintain.
One thing that transpilation also allows you to do is insert an arbitrary amount of code before and/or after the compiled script itself. CoffeeScript, for example, will monkey-patch some array prototype methods before executing the script, to ensure that its compiled output will be able to rely on those array methods (see Prelude / Runtime). In our case, we can consider adding to that prelude a layer which ensures up-to-date versions of not just familiar tools like ~sed~ and ~grep~, but also extremely useful and portable tools such as GNU parallel (which isn’t very well-known, possibly due to not being installed by default (unlike ~xargs~, which is less featureful but does some of the same things)).
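A minimal sketch of what that prelude layer could look like (the ~ensure_tool~ name and behavior are hypothetical, not CoffeeScript’s):

ensure_tool() {
  local tool="$1"
  if ! command -v "$tool" >/dev/null 2>&1; then
    # A real prelude might instead download a pinned, platform-appropriate
    # build into a private bin directory here.
    echo "prelude: required tool '$tool' not found on PATH" >&2
    exit 1
  fi
}
ensure_tool sed
ensure_tool grep
ensure_tool parallel
# ...compiled script body follows...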
These portable tools (~sed~ and ~grep~) can be said to provide an unmatched level of type safety (similar to test maturity) as a result of their maintenance over decades. We should be able to canonicalize and have the compiler tell the user all of this, instead of leaving it as tribal knowledge. One of the most immediate ways to do this is to infer a real form of type safety from the help text and/or shell completions.
Also of note is that the CoffeeScript compiler will wrap the output in an anonymous function to ensure it won’t pollute the global JavaScript namespace. Analogously, we can also consider introducing a better module system to bash, and perhaps a package manager (?).
Separate from toolchains, many bash semantics can tend to confuse users, even experienced ones. Last week I learned that ~set -e~ doesn’t exit on a failed command if it’s within the body of a function! Many other shells such as zsh fix issues with e.g. variable declarations, but those other shells are even less likely to be installed by default. ShellCheck is often used in codebases to avoid these pitfalls, but custom checks may still have to be written – the pants repo required this separate check for broken ~readonly~ statements which don’t cause ~set -e~ to fail. This checking requires effort to maintain and still may be incomplete.
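Both pitfalls in one small demo (neither ~false~ aborts the script, despite ~set -e~):

#!/bin/bash
set -e

# 1. `set -e` is suppressed inside a function called as a condition:
fails() { false; echo "ran past a failure inside fails()"; }
if fails; then :; fi

# 2. `readonly var=$(cmd)` masks the command's exit status, because the
#    builtin itself returns 0 (this is ShellCheck's SC2155):
readonly output="$(false)"
echo "still running; output='${output}'"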
While ShellCheck can capture pitfalls and style errors, it seems that the number of pitfalls is so great that we might consider looking at a whitelisting approach instead – not allowing these pitfalls to be expressible at all, perhaps by writing a new language, which transpiles to lowest-common-denominator bash scripts!
- Mainly see ~zshexpn(1)~, and especially the extreme complexity and terseness of the “history expansion” section in particular.
- Being able to nest ~${${...}}~ is also a homogeneous and really neat interface!
  - Immediately becomes incomprehensible when stacked too far.
- If the same transformations can be composed across long command lines, and made safe (even faster), I think we would have built something good. <<spread-out-existing-expansion-techniques>>
- There is an extremely thorough dialog on ease of keybindings in XTerm on bash vs zsh because of their string handling mechanisms at ~xterm(1)~!!
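For reference, the nesting praised above in action (zsh-only; bash can’t nest parameter expansions at all):

s=prefix_middle_suffix
print ${${s#prefix_}%_suffix}         # -> middle
# one more level and it already turns opaque ((U) upcases):
print ${${${(U)s}#PREFIX_}%_SUFFIX}   # -> MIDDLE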
- Building on ~upc~, consider how performance analysis and benchmarking can change overnight if/when it’s not only possible to trace filesystem and network I/O vs RAM pressure vs CPU, but to directly /orchestrate/ it.
- Consider the expected/proposed/conjectured utility of a generic process execution engine in ~upc~.
  - ~upc~ was built on top of years of work to extract the process execution itself from the rest of the build tool, which has resulted in the fantastic ~process_executor~ debugging tool.
Pants, bazel, and other projects have been continuously collaborating on an extensible shared format for specifying a process execution request <<remexec>>. This is used in pants and bazel today to execute processes that create files for build tasks. As a testament to its reproducibility, multiple organizations rely on this API to homogeneously execute the same processes remotely, or to pull down a cached result of the same process execution (e.g. with a backend like Scoot).
- [ ] We should be able to produce, from such bash/zsh completion scripts, a form of these idempotent bazel remexec API-compatible ~Process~ execution requests, and very performantly execute them against a VFS.
A virtual file system using FUSE doesn’t incur too much overhead on Linux [citation needed]. However, a filesystem, by construction, can only use heuristics to optimize its performance (and that “performance” has many axes). What if we could know /exactly/ which files were about to be read/written at all times?
- [ ] If we knew every file that was going to be written by a process beforehand, we could allocate self-growing buffers for each of those paths, avoiding the need to allocate any resources in real time.
- [ ] If we knew the expected size of those future files, we could allocate the appropriate regions immediately.
- [ ] If we knew every file that needed to be read by a process beforehand, we could allocate (perhaps even pool) read-only buffers before the process executes.
- “fast enough IPC is just an FFI”
- eugene and zinc vfs: https://eed3si9n.com/cached-compilation-for-sbt

If this omniscience were achievable, we could expect our processes to run “faster than the speed of light”, i.e. faster than any conceivable heuristic model.
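A toy bash illustration of the preallocation idea from the list above, assuming the output paths and sizes are somehow known up front (this table is invented for the example):

declare -A expected_outputs=(
  [out/build.log]=1048576    # 1 MiB
  [out/result.bin]=4194304   # 4 MiB
)
for path in "${!expected_outputs[@]}"; do
  mkdir -p "$(dirname "$path")"
  # Reserve the full region before the process ever runs. fallocate is
  # Linux-only; `truncate -s` is the (sparse) portable fallback.
  fallocate -l "${expected_outputs[$path]}" "$path" 2>/dev/null \
    || truncate -s "${expected_outputs[$path]}" "$path"
done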
Parsing bash/zsh completion scripts (or obtaining them from e.g. ~--help~) should accomplish two goals:
- [ ] We can validate the types of arguments before running the script at all (see the sketch after this list).
- [ ] This should improve type safety automatically, in a way that can be run on the script before executing it at all.
- [ ] can shellcheck do this already?
- [ ] It should be relatively easy to write “stubs” (like mypy) which can fill in the blanks for hand-written scripts. <<mypy-stubs>>
- [ ] This would be an extremely natural place to start eventually developing a more thorough type inference system for shell scripts in general!
- [ ] This should either extend or integrate with shellcheck to provide real type safety for bash shells.
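A minimal sketch of the first goal above using nothing but ~--help~ scraping (the regex is a crude heuristic, not a real funnel component):

#!/bin/bash
tool=grep
mapfile -t known_flags < <("$tool" --help 2>&1 | grep -oE -- '--[a-zA-Z0-9][a-zA-Z0-9-]*' | sort -u)

validate_flag() {
  local flag="$1" known
  for known in "${known_flags[@]}"; do
    [[ "$flag" == "$known" ]] && return 0
  done
  echo "error: $tool does not appear to accept '$flag'" >&2
  return 1
}

validate_flag --extended-regexp   # ok: appears in `grep --help`
validate_flag --no-such-flag      # rejected without ever running grep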
[0/4] Build on top of existing UX investigations into high-performance interactive paging and/or serving!
- [ ] e.g. ~parallel~ (with both man and info pages!)
- [ ] See the docstring of ~small-temporary-file-directory~ (and the global ~files~ defgroup more generally):
(defcustom small-temporary-file-directory
  (if (eq system-type 'ms-dos) (getenv "TMPDIR"))
  "The directory for writing small temporary files.
If non-nil, this directory is used instead of `temporary-file-directory'
by programs that create small temporary files. This is for systems that
have fast storage with limited space, such as a RAM disk."
  :group 'files
  :initialize 'custom-initialize-delay
  :type 'directory)
- Realization: ~small-temporary-file-directory~, and more generally ~info(emacs)Top>Files>Saving/Backup~, results from “~ 20 years of UX work in calculating which backup pages should stay paged in or not”.
- [ ] this notably mirrors PEX’s ~--always-write-cache~ option.
--unzip, --no-unzip
    Whether or not the pex file should be unzipped before executing it. If the pex file will be run multiple times under a stable runtime PEX_ROOT the unzipping will only be performed once and subsequent runs will enjoy lower startup latency. [Default: do not unzip.]
--always-write-cache
    Always write the internally cached distributions to disk prior to invoking the pex source code. This can use less memory in RAM constrained environments. [Default: False]
--ignore-errors
    Ignore requirement resolution solver errors when building pexes and later invoking them. [Default: False]
- [ ] UNCLEAR can `man’ and `woman’, and especially `info’, actually be surpassed?
- [ ] learn `xref` commands and determine how to navigate between everything!
I don’t think anyone at all has been thinking about dynamic-io-control yet!!
contrast dynamic-io-control with what’s statically-known!
- While this project focuses on making process executions type-safe, cacheable, and extremely fast (<<statically-known>>), the ~learning-progress-bar~ project is more focused on tracing what happens /during/ an execution <<dynamic-io-control>>.
- Both projects:
- focus on “dropping in” to existing command-line invocations and tooling people have already set up (<<dropping-in>>),
- are intended to plug into a build tool.
- Output streaming can be safely delegated to ~learning-progress-bar~, while this one focuses much more on one-shot executions.
Motivating Example: the Fraunces open-source variable font
- [[https://github.com/cosmicexplorer/Fraunces/blob/56a435d9ddd4ea6e627b282fb6e4c7b8a6f8f561/sources/build.sh#L28-L71][See this highly commented code from my attempt to fix the larger issues with the build system for the Fraunces family of open-source variable fonts.]][fn:Fraunces]
Two workstreams
- [ ] The main thing i’m thinking hasn’t been investigated is the performance of a <<virtual-brute-force>> technique i.e. ripgrep searching a virtual filesystem containing only the logs you care about anyway.
- It feels like it can be viewed as a non-indexed database query of a document database.
- [ ] How can we analyze the CPU cache effects of this faux filesystem?
- But: should it still be considered “brute-force” at all if you’re able to manipulate the input so much?
- [ ] Relatedly, “indexing and memoizing arbitrarily complex grammars across runs” is a feature of my <<currently-secret-patent-pending-parsing-algorithm>> too (which is why i should publish as GPL v3 asap so we can use it).
- [X] With ~--json~ from ~rg --help | rg -A5 '--json'~, we have an <<asynchronous-parsing-database>> which can <<update-incrementally>> and <<expand-surrounding-context>> of a result
- [X] MVP
- *”i just saw a weird compiler error, i want to scan all the builds for other instances of this”?*
- [X] context
- see the ~./pants fetch~ proposal!
- Twitter EE had a huge amount of difficulty reliably accessing their own build logs.
- splunk is now making $$$
Both of these workstreams use the ripgrep json output API to provide streaming output.
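For reference, ~rg --json~ emits one JSON object per line (message types include ~begin~, ~match~, ~context~, ~end~, and ~summary~), so results can be consumed as a stream while the search is still running, e.g. filtered through ~jq~:

rg --json 'ERROR' build-logs/ \
  | jq -r 'select(.type == "match")
           | "\(.data.path.text):\(.data.line_number):\(.data.lines.text)"'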
In order to get more familiarity with the <<
[/] helm-rg!!!???
[0/1] UnionFS, pants compile logs
- [ ] MVP
- Use the ripgrep json output to search R E A L L Y F A S T for Pants compiler logs!
- ~rg --help | rg -A5 '--json'~ is helpful for context.
- along with UnionFS, we could create a whole document database
- by doing extremely fast/parallel searches via ~ripgrep~ along with a completely virtual mockup!
- (but <<locally-cacheable>> (or <<pairwise cacheable>>)) filesystem mockup!!!!
- (i have literally no clue about how filesystems cache things)
- but my impression is that read latency comes from:
- having to be an online system
- accepting arbitrary tree traversals
- and from swapping things in and out of memory according to faulty heuristics
- and that ~io_uring~ is great because it allows the kernel to avoid the false assumption that each IO operation is independent of any other
  - TODO: look at online discussions of ~io_uring~!
- if there’s value here, it seems to lie again in making use of some more omniscience about the input we want to serve to ripgrep in a unionfs.
what can “super super fast X” be used for?
- distributed asynchronous text scanning / parsing
- in a distributed process execution system, it allows a simple, configurable model of fetching process output
- OpenMP/MPI and ESPECIALLY SYCL!!!!
- try making the rust regex crate’s generated automata into distributed objects, which would enable massive parallelism across those very focused UnionFS chroots
Consider the extremely thoughtful and natural API of the Ammonite REPL.
- Unlike other shell-like environments, Ammonite has the type safety and well-documented standard library of Scala built-in.
Right now, the “funnel” language’s functionality will be exposed through a single executable, ~fun~.
- [ ] define command-line tools to control (such as ~sed~, ~parallel~, ~jq~, ~xmlstarlet~), and create a method to download them on all supported platforms.
- [ ] define “all supported platforms”.
- [ ] define a grammar (see the bash grammar).
- [ ] implement the transpiler.
- [ ] figure out whether/how this language can be smart enough to bootstrap itself (i.e. the compiler is written in it)
- ^!!!^
- [ ] begin to consider a module and package system for (portable) bash scripts
- [ ] want something that will work on existing bash/zsh code (e.g. if you put them in a special directory they can be specially required or loaded)?
  - the “Prelude”/“Runtime” for this (the shell script code that it loads) should have a function that is available to bash and zsh scripts that it loads which allows them to load something from the module system with similar ease!
- [ ] consider using any relevant parts of shellcheck!!
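A purely hypothetical sketch of the module-loading hook that prelude could expose (every name here is invented): resolve a module against a search path and ~source~ it at most once, so existing bash/zsh files become loadable modules.

declare -A __funnel_loaded_modules=()
funnel_require() {
  local name="$1" dir candidate
  [[ -n "${__funnel_loaded_modules[$name]:-}" ]] && return 0
  local -a dirs
  IFS=':' read -ra dirs <<< "${FUNNEL_MODULE_PATH:-./modules}"
  for dir in "${dirs[@]}"; do
    candidate="$dir/$name.sh"
    if [[ -f "$candidate" ]]; then
      # shellcheck disable=SC1090
      source "$candidate"
      __funnel_loaded_modules["$name"]=1
      return 0
    fi
  done
  echo "funnel_require: module '$name' not found" >&2
  return 1
}
funnel_require logging   # loads ./modules/logging.sh exactly once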
Whether to accept command lines using GNU-style (probably long) options, or BSD options (with different names and some missing functionality).
Whether to generate code for bash or for zsh. **The output of this compiler should be 100% compatible with code written for the output shell.**
The output of a compile should have some “prelude” or “runtime” which is some script to be evaluated containing e.g. convenience methods.
GPL v3 (or any later version)
[fn:Fraunces] To really underline why there’s such a searing need here:
- the build system isn’t even attempting to do anything too difficult with the font variability itself!
- It’s simply trying to convert its design into something that works canonically with an existing font file formatter (e.g. ~FontForge~, but idk whether that was even what was actually used?!).
#!/bin/bash
# NB: bash, not sh: this script uses arrays and `set -o pipefail`.
set -euxo pipefail
# Ensure this script is executed from within its own directory.
GIT_ROOT="$(git rev-parse --show-toplevel)"
cd "${GIT_ROOT}/sources"
# Only use this when necessary, as currently not all instances are defined in the VF designspace
# files. Generate the static designspace (referencing the csv) and the variable designspace file
# later; this might not be done dynamically.
# python ../mastering/scripts/generate_static_fonts_designspace.py
## Statics
static_fonts=(
  # 3 arguments per line.
  Roman/Fraunces_static.designspace ttf ../fonts/static/ttf
  Roman/Fraunces_static.designspace otf ../fonts/static/otf/
  Italic/FrauncesItalic_static.designspace ttf ../fonts/static/ttf/
  Italic/FrauncesItalic_static.designspace otf ../fonts/static/otf/
)
function get_static_instances_from_designspaces {
  ./extract_instances.sh {Roman,Italic}/*_static.designspace
}
# FIXME: This is a REALLY FANTASTIC CASE where shell scripting is EXCEEDINGLY difficult to work
# with, but JUST AS BAD AS THE PYTHON CODE IN fixNameTable.py and friends!!!! This is a *use case*!!
# NB: Especially take note of:
# (1) The hacky progress bar
# (2) The `stdbuf` unbuffering
# (3) The partial output redirection!
# (4) Being unable to use `xargs` or `parallel` with shell functions means recreating these
# ".../*_static.designspace" globs in get_static_instances_from_designspaces()!
# NB: Looking to address all of the above with https://github.com/cosmicexplorer/funnel
function generate_static_fonts {
  # This is really quick to calculate, and lets us know how much progress we're making!
  total_num_static_instances="$(get_static_instances_from_designspaces | wc -l)"
  echo "Generating Static fonts ($total_num_static_instances in total)"
  # (1) Process each .designspace XML file and output format in parallel with `xargs`.
  # (2) At this point, we're dealing with a ton of output, so we tee it to stderr so the user can
  # redirect to /dev/null if they don't need that finer-grained info.
  # (3) However on stdout, we filter for messages that describe successfully writing out a .otf or
  # .ttf file, and give a quick progress bar with percentage, since we know how *many* instances
  # we'll eventually need to write, even if we're not checking which exact ones those are.
  instances_processed=0
  printf '%s\n' "${static_fonts[@]}" \
    | 2>&1 stdbuf -i0 -o0 -e0 xargs -t -L 3 --max-procs=0 ./generate_font_instances.sh \
    | stdbuf -i0 -o0 -eL tee /dev/stderr \
    | sed -Ene 's#^INFO:fontmake.font_project:Saving (.*)$#\1#gp' \
    | while read -r just_saved_font; do
        instances_processed="$(( instances_processed + 1 ))"
        # NB: bash arithmetic is integer-only (no 100.0!), so scale by 100 *before* dividing.
        percent_complete="$(( instances_processed * 100 / total_num_static_instances ))"
        echo "${percent_complete}% complete: ${instances_processed}/${total_num_static_instances} (${just_saved_font})"
      done
}
time generate_static_fonts
# NB: the post-processing below is currently disabled by this early exit.
exit 0
echo "Post processing"
gftools fix-dsig -a ../fonts/static/ttf/*.ttf
gftools fix-hinting ../fonts/static/ttf/*.ttf
# NB: This script appears to be doing something incredibly complex that it absolutely should not be
# attempting to do on its own.
python ../mastering/scripts/fixNameTable.py ../fonts/static/ttf/*.ttf