**This package has been deprecated**; for more, see *the Future*, which leads to a new major version of Hollerith and to `icql-dba-hollerith` (TBD).
## Purpose

- Add a JSON and a BLOB column (called `vnr` and `vnr_blob` by default) to your table
- the JSON column encodes lists of integer numbers called Vectorial NumbeRs (VNRs)
- the BLOB column is a generated, indexed column holding a binary representation (provided by [hollerith-codec's `tng` module](https://github.com/loveencounterflow/hollerith-codec/blob/master/src/tng.coffee)) of the VNR
- the binary representation can be meaningfully sorted lexicographically by SQLite
- VNRs support scenarios where you want to transform ordered items such that a given line (item, record) may result in any number of result lines (items, records), while keeping the relative ordering intact
- e.g. you store lines of a text file as

  ```
  { nr: 1, text: 'helo world', }
  { nr: 2, text: 'fancy stuff here', }
  { nr: 3, text: 'that\'s all', }
  ```
- say you want to transform that into a one-word-per-line format in the same table; now you have

  ```
  { nr: 1, text: 'helo world', }
  { nr: ?, text: 'helo', }
  { nr: ?, text: 'world', }
  { nr: 2, text: 'fancy stuff here', }
  { nr: ?, text: 'fancy', }
  { nr: ?, text: 'stuff', }
  { nr: ?, text: 'here', }
  { nr: 3, text: 'that\'s all', }
  { nr: ?, text: 'that\'s', }
  { nr: ?, text: 'all', }
  ```
- For that to work, you'd have to (1) mark the original records as processed (not covered here), and (2) find a way to enumerate the new lines such that they come out correctly ordered.
- What's more, wouldn't it be great if the enumeration could somehow preserve the origin of each new record?
- We could just use consecutive numbers, but then we'd have to renumber all entries whenever a single record got added somewhere in the middle. One could also use floating-point numbers, but that soon gets messy.
-
Instead if we converted the
nr
field to a vector of numbers, we could do the following:{ vnr: [ 1, ], text: 'helo world', }
{ vnr: [ 1, 1, ], text: 'helo', }
{ vnr: [ 1, 2, ], text: 'world', }
{ vnr: [ 2, ], text: 'fancy stuff here', }
{ vnr: [ 2, 1, ], text: 'fancy', }
{ vnr: [ 2, 2, ], text: 'stuff', }
{ vnr: [ 2, 3, ], text: 'here', }
{ vnr: [ 3, ], text: 'that\'s all', }
{ vnr: [ 3, 1, ], text: 'that\'s', }
{ vnr: [ 3, 2, ], text: 'all', }
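The derivation of child VNRs shown above can be written out in plain JavaScript. This is a standalone sketch with no package API involved; the record shapes are simply the ones from the example:

```javascript
// Derive one-word-per-line records from the original lines by appending
// one 1-based position index per word to the parent's VNR.
const lines = [
  { vnr: [ 1 ], text: 'helo world' },
  { vnr: [ 2 ], text: 'fancy stuff here' },
  { vnr: [ 3 ], text: "that's all" },
];

const words = [];
for (const line of lines) {
  line.text.split(/\s+/).forEach((word, idx) => {
    // child VNR = parent VNR plus the word's position within the parent
    words.push({ vnr: [ ...line.vnr, idx + 1 ], text: word });
  });
}

// words[0] is { vnr: [ 1, 1 ], text: 'helo' },
// words[6] is { vnr: [ 3, 2 ], text: 'all' }
```

Because each child keeps its parent's VNR as a prefix, the origin of every derived record stays visible in its number.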
- As it stands, we can store the VNRs as JSON (that is, as text as far as SQLite is concerned). However, that is not bound to give us the intended ordering, because the lexicographic sorting of numbers-as-text is not numerically monotonic (`'1'` comes before `'10'`, which comes before `'2'`, etc.)
- This is why we turn the JSON lists into Binary Large OBjects (BLOBs) that SQLite will order correctly out of the box.
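The sorting pitfall is easy to demonstrate directly; JavaScript's default `Array.prototype.sort` compares elements as strings, just like SQLite compares text values:

```javascript
// Default sort compares as text: not numerically monotonic.
const asText = [ '1', '2', '10' ].sort();
// asText is [ '1', '10', '2' ] — '10' sorts before '2'

// An explicit numeric comparator gives the intended order:
const asNumbers = [ 1, 2, 10 ].sort((a, b) => a - b);
// asNumbers is [ 1, 2, 10 ]
```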
- We do this by adding a `vnr_blob` field:

  ```sql
  alter table myfiles add column vnr_blob blob generated always as ( vnr_encode( vnr ) ) virtual not null;
  ```
- The `vnr_blob` field is generated but not stored, meaning it will be re-generated on each `select` that includes it. That sounds inefficient, and it would be, were it not for an index we also add:

  ```sql
  create unique index myfiles_vnr_blob_idx on myfiles ( vnr_encode( vnr ) );
  ```
- Issuing `select * from myfiles order by vnr_blob;` now gives us this display:

  ```
  ┌──────────────────┬───────┬────────────────────────────────────────────────────────────────────┐
  │text              │vnr    │vnr_blob                                                            │
  ├──────────────────┼───────┼────────────────────────────────────────────────────────────────────┤
  │helo world        │[1]    │<Buffer 80 00 00 01 80 00 00 00 80 00 00 00 80 00 00 00 80 00 00 00>│
  │helo              │[1,1]  │<Buffer 80 00 00 01 80 00 00 01 80 00 00 00 80 00 00 00 80 00 00 00>│
  │world             │[1,2]  │<Buffer 80 00 00 01 80 00 00 02 80 00 00 00 80 00 00 00 80 00 00 00>│
  │fancy stuff here  │[2]    │<Buffer 80 00 00 02 80 00 00 00 80 00 00 00 80 00 00 00 80 00 00 00>│
  │fancy             │[2,1]  │<Buffer 80 00 00 02 80 00 00 01 80 00 00 00 80 00 00 00 80 00 00 00>│
  │stuff             │[2,2]  │<Buffer 80 00 00 02 80 00 00 02 80 00 00 00 80 00 00 00 80 00 00 00>│
  │here              │[2,3]  │<Buffer 80 00 00 02 80 00 00 03 80 00 00 00 80 00 00 00 80 00 00 00>│
  │that's all        │[3]    │<Buffer 80 00 00 03 80 00 00 00 80 00 00 00 80 00 00 00 80 00 00 00>│
  │that's            │[3,1]  │<Buffer 80 00 00 03 80 00 00 01 80 00 00 00 80 00 00 00 80 00 00 00>│
  │all               │[3,2]  │<Buffer 80 00 00 03 80 00 00 02 80 00 00 00 80 00 00 00 80 00 00 00>│
  └──────────────────┴───────┴────────────────────────────────────────────────────────────────────┘
  ```
- observe that all BLOBs in the above are of equal length. This is due to a restriction of the `better-sqlite3` API, which sadly is missing a way to define one's own collations / `cmp` functions
- While a VNR's elements are (after adjustments) encoded with `Buffer.writeUInt32BE()` and so can each cover a range from `-0x80000000` to `+0x7fffffff` (`-2_147_483_648` to `2_147_483_647`, meaning there's space for four billion of them), the length of VNRs is restricted by the BLOBs having a finite, constant length. In the above, we arranged for five places per VNR, which may or may not be enough for a given use case.
## Links

- VNRs
  - [hollerith-codec](https://github.com/loveencounterflow/hollerith-codec), actually using the much faster, more restricted [tng](https://github.com/loveencounterflow/hollerith-codec/blob/master/src/tng.coffee) version
  - tests and benchmarks
- documentation
  - API
  - DB structure