InterShop Lazy

Purpose

InterShop Lazy provides a way to (almost) transparently store results of costly computations in a table for later retrieval. See the demo, below.

Value Producers and the Cache

  • Lazy value producer: This is a function generated by calling LAZY.create_lazy_producer(). It will try to match its arguments against cached entries in table LAZY.cache; if it can't find an entry, it will call its associated eager value producer with the same arguments.

  • An eager value producer is a function that will either return a value to be inserted in the cache, or else insert zero or more values itself, depending on which solution is better suited to your use case.

  • Value producers are assumed to be 'immutable' functions in the PostgreSQL sense, i.e. 'pure functions' in the functional sense. In other words, it is assumed that the return value of a value producer will depend on nothing but its arguments. Therefore,

  • Functions that return random values or values that depend on implicit variables cannot be cached. But observe that caching file contents may be OK if it can be assumed that file contents can be treated as constant for the purposes of your application.

  • Updating rows in LAZY.cache is not allowed; only insertions and deletions are legal. (A rough sketch of the cache table's shape follows after this list.)

  • Because the lazy value producer must be able to insert records into the cache, it has to be marked volatile; this entails that PostgreSQL will not cache its results as it would try to do with functions marked immutable. That is, using InterShop Lazy forgoes one cache mechanism in favor of another, probably slower, one. The advantage of InterShop Lazy, though, is that data can be cached across sessions, which results of immutable functions cannot be. Be it said that none of the demos constitutes a valid use case as far as value production cost is concerned.
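
For orientation, here is a rough sketch of what the cache table looks like. This is not the library's actual DDL (the real table ships with InterShop Lazy); the column types and constraints shown are assumptions based on how the table is used throughout this README:

create table LAZY.cache (
  bucket  text   not null,  -- name of the bucket, by default the lazy producer's name
  key     jsonb  not null,  -- by default, a JSON array of the argument values
  value   text,             -- serialized result; may be null
  primary key ( bucket, key ) );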

API

Methods to Create Lazy Value Producers

LAZY.create_lazy_producer() (returns void) will create a function that uses table LAZY.cache to produce values in a lazy fashion. Its arguments are:

  • function_name (text)—name of the function to be created.
  • parameter_names (text[])—names of the arguments to the getter.
  • parameter_types (text[])—types of the arguments to the getter.
  • return_type (text)—return type of the getter; the cached value is cast to this type.
  • bucket (text default null)—name of the bucket; defaults to function_name.
  • get_key (text default null)—optional; by default, the key is a JSON list / object of the argument values.
  • get_update (text default null)—optional; exactly one of this and perform_update must be given.
  • perform_update (text default null)—optional; exactly one of this and get_update must be given.

Points to keep in mind:

  • All names used in calls to create_lazy_producer() will be used as-is without any kind of sanity check or quoting.
  • The same goes for the other arguments.
  • Usage of create_lazy_producer() is inherently unsafe; therefore, no untrusted data (such as data coming from a web form) should be used to call this function (although the function that create_lazy_producer() creates is itself deemed safe).

Helper Methods

  • LAZY.nullify( jsonb ) returns jsonb—Given null or any jsonb value, return null when the input is null or 'null'::jsonb (i.e. the null value of JSONB, which is distinct from SQL null), or the value itself otherwise. This method helps to prevent errors like cannot cast jsonb null to type x: instead of ( key->0 )::integer, write ( LAZY.nullify( key->0 ) )::integer to obtain SQL null for cases where a (subvalue of a) cache column might contain null values.
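
A minimal illustration (assuming a PostgreSQL version that provides direct jsonb-to-numeric casts, i.e. 11 or later):

-- select ( 'null'::jsonb )::integer;                 -- error: cannot cast jsonb null to type integer
select ( LAZY.nullify( 'null'::jsonb ) )::integer;    -- SQL null
select ( LAZY.nullify( '42'::jsonb   ) )::integer;    -- 42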

Complete Demo

The below code can be seen in action by running psql -f lazy.demo-1.sql and psql -f lazy.demo-2.sql.

Again, none of the below demos constitutes a valid use case as far as value production cost is concerned. To amortize the use of a lazy cache, the computation will typically involve things like reading from the file system or the network, or some kind of lengthy data aggregation.

In this demo, we want to write two lazy value producers that compute multiples of floats and sums of integers; for the first, we will use an eager value getter that will only produce one value per call; for the sums, we will have a look at an eager value inserter that guesses values that might be used in the future and inserts them into the cache table for later use. Additionally, we will set up custom cache views that make read-only access to cached values easier than looking at the cache table directly. So let's get started.

Demo I: Multiplying Integers

Step 1: Write an Eager Value Producer

The first thing to do when one wants to use lazy evaluation with InterShop Lazy is to provide a function that accepts arguments as required for the task at hand and that returns a value v of type T; that value will be serialized as v::text and later deserialized by casting back to T.

Of course, lazy evaluation only makes sense when one is dealing with costly computations, so typically an eager value producer would involve stuff like network access, sifting through huge tables or maybe reading in data files, that kind of IO- or CPU-heavy stuff. To keep things simple, let's just multiply numbers and throw in the quirk that an n of 13 will produce a null value for no obvious reason. For good measure, we also want to report any calls to the console, which is what the raise notice statement is for:

create function MYSCHEMA.compute_product( ¶n float, ¶factor float )
  returns float immutable called on null input language plpgsql as $$ begin
    raise notice 'MYSCHEMA.compute_product( %, % )', ¶n, ¶factor;
    if ( ¶n is null ) or ( ¶factor is null ) then return 0; end if;
    if ¶n != 13 then return ¶n * ¶factor; end if;
    return null; end; $$;

This function is called an eager value producer because it is expected to actually compute a result each time it gets called. Observe that we have defined it as immutable called on null input, meaning that it will be called even if one of its arguments is null; we exploit this to return 0 as the product whenever one of the factors is SQL null. Had we used strict instead, PostgreSQL would have eschewed calling the function at all and filled in a null, which may or may not be what you want.
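
A direct call (bypassing any caching) behaves like an ordinary function; for example:

select MYSCHEMA.compute_product( 6.3, 12 );   -- 75.6, plus a NOTICE on the console
select MYSCHEMA.compute_product( 1, null );   -- 0, because of `called on null input`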

Step 2: Create a Lazy Producer

Now that we have an eager value producer, let's define a lazy value producer that uses results from the LAZY.cache table where possible and manages updating it where values are missing. To do this, we have to call LAZY.create_lazy_producer():

select LAZY.create_lazy_producer(
  function_name   => 'MYSCHEMA.get_product',
  parameter_names => '{¶n,¶factor}',
  parameter_types => '{float,float}',
  return_type     => 'float',
  get_update      => 'MYSCHEMA.compute_product' );

Observe that the select statement will both create MYSCHEMA.get_product() and return the source text for that new function, which may end up in the console or wherever your SQL output goes, so you might want to wrap the call in a do block instead; spelled out with the same arguments as above:
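
do $$ begin
  perform LAZY.create_lazy_producer(
    function_name   => 'MYSCHEMA.get_product',
    parameter_names => '{¶n,¶factor}',
    parameter_types => '{float,float}',
    return_type     => 'float',
    get_update      => 'MYSCHEMA.compute_product' );
  end; $$;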

The first argument here is the name of the new function to be created; the next 3 arguments basically repeat the declarative part of that function (a future version of InterShop Lazy might auto-generate parameter_names, parameter_types and return_type).

There's just one more required argument, either get_update or perform_update; exactly one of these two must be set to the name of a function that either

  • in the case of get_update(), will return exactly one result for each set of inputs, or
  • in the case of perform_update(), may insert as many rows into LAZY.cache as it sees fit when called. In case perform_update() happens not to produce a row that matches the input arguments, a row with a null result will be auto-generated.

Step 3: Use the Lazy Value Producer

We're now ready to put our caching, lazy multiplicator device to use. For this, we set up a table of factors and update it with the computation results:

create table MYSCHEMA.fancy_products (
  n         float,
  factor    float,
  result    float );

insert into MYSCHEMA.fancy_products ( n, factor ) values
  ( 123,  456  ),
  ( 4,    12   ),
  ( 5,    12   ),
  ( 6,    12   ),
  ( 6,    12   ),
  ( 6,    12   ),
  ( 6,    12   ),
  ( 6,    12   ),
  ( 6.3,  12   ),
  ( 60,   3    ),
  ( 13,   13   ),
  ( 1,    null ),
  ( null, null ),
  ( null, 100  );

update MYSCHEMA.fancy_products set result = MYSCHEMA.get_product( n, factor );
select * from MYSCHEMA.fancy_products order by n, factor;
select * from LAZY.cache order by bucket, key;

This will produce the following output:

psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 123, 456 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 4, 12 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 5, 12 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 6, 12 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 6.3, 12 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 60, 3 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 13, 13 )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( 1, <NULL> )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( <NULL>, <NULL> )
psql:lazy.demo-1.sql:76: NOTICE:  MYSCHEMA.compute_product( <NULL>, 100 )
╔═════╤════════╤════════╗
║  n  │ factor │ result ║
╠═════╪════════╪════════╣
║   1 │      ∎ │      0 ║
║   4 │     12 │     48 ║
║   5 │     12 │     60 ║
║   6 │     12 │      ∎ ║
║   6 │     12 │      ∎ ║
║   6 │     12 │      ∎ ║
║   6 │     12 │     72 ║
║   6 │     12 │      ∎ ║
║ 6.3 │     12 │   75.6 ║
║  13 │     13 │      ∎ ║
║  60 │      3 │    180 ║
║ 123 │    456 │  56088 ║
║   ∎ │    100 │      0 ║
║   ∎ │      ∎ │      0 ║
╚═════╧════════╧════════╝

╔══════════════════════╤══════════════╤═══════╗
║        bucket        │     key      │ value ║
╠══════════════════════╪══════════════╪═══════╣
║ MYSCHEMA.get_product │ [null, null] │ 0     ║
║ MYSCHEMA.get_product │ [null, 100]  │ 0     ║
║ MYSCHEMA.get_product │ [1, null]    │ 0     ║
║ MYSCHEMA.get_product │ [4, 12]      │ 48    ║
║ MYSCHEMA.get_product │ [5, 12]      │ 60    ║
║ MYSCHEMA.get_product │ [6, 12]      │ 72    ║
║ MYSCHEMA.get_product │ [6.3, 12]    │ 75.6  ║
║ MYSCHEMA.get_product │ [13, 13]     │ ∎     ║
║ MYSCHEMA.get_product │ [60, 3]      │ 180   ║
║ MYSCHEMA.get_product │ [123, 456]   │ 56088 ║
╚══════════════════════╧══════════════╧═══════╝

The above shows that although some inputs were repeated in the fancy_products table, none of the repetitions led to additional calls to the eager producer or to additional entries in the cache.

Demo II: Summing Up, Speculatively

In this demo, let me demonstrate how to write a speculative value producer, that is, a producer that does more work when called than is strictly necessary in order to avoid getting called more often. Such a behavior might shave off some computation time when the act of computing a single value carries a high overhead cost that remains more or less constant no matter how many values are produced, and we either have a way to predict what other values might get requested in the future given a set of inputs, or we can make sure a given subdomain of values can be cached exhaustively. For example, there might be a configuration file that must be loaded and parsed in order to obtain a single setting from it; in such a case, it might be advantageous to cache all the settings from the file whenever any one setting is requested, so that future requests can be answered by a table lookup.

Step 1: Write a Speculative Eager Value Producer

The eager value producer of the first example was an immutable function that returns a single value. In order to do speculative caching, we'll need a volatile function that updates the cache directly. In its most basic form, such a cache updating function might look like this:

create function MYSCHEMA.insert_sums_single_row( ¶a integer, ¶b integer )
  returns void volatile called on null input language plpgsql as $$ begin
    raise notice 'MYSCHEMA.insert_sums( %, % )', ¶a, ¶b;
    insert into LAZY.cache ( bucket, key, value ) values
      ( 'yeah! sums!', jsonb_build_array( ¶a, ¶b ), ¶a + ¶b );
      -- ^^^^ bucket   ^^^^ key                     ^^^^^^^ value
    end; $$;

This is only the basic shape, however, and will only insert a single row which, as such, is not very speculative. Here is an improved version that guesses a few sums that might pop up in the future—namely, given [ a, b ], it will compute and cache the sums for [ a, b - 1 ], [ a, b ], and [ a, b + 1 ], minus those that already have a cache entry:

create function MYSCHEMA.insert_sums( ¶a integer, ¶b integer )
  returns void volatile called on null input language plpgsql as $$ begin
    raise notice 'MYSCHEMA.insert_sums( %, % )', ¶a, ¶b;
    insert into LAZY.cache ( bucket, key, value ) select
        'yeah! sums!'                                 as bucket,
        r2.key                                        as key,
        ¶a + r1.bb                                    as value
      from generate_series( ¶b - 1, ¶b + 1 )        as r1 ( bb  ),
      lateral jsonb_build_array( ¶a, r1.bb )        as r2 ( key )
      where not exists ( select 1 from LAZY.cache as r4
        where ( bucket = 'yeah! sums!' ) and ( r4.key = r2.key ) );
    end; $$;

Step 2: Create a Lazy Producer

Given insert_sums(), we can now create its lazy version, get_sum():

select LAZY.create_lazy_producer(
  function_name   => 'MYSCHEMA.get_sum',
  parameter_names => '{¶a,¶b}',
  parameter_types => '{integer,integer}',
  return_type     => 'integer',
  bucket          => 'yeah! sums!',
  perform_update  => 'MYSCHEMA.insert_sums' );

Step 3: Use the Lazy Value Producer

Just as in the first demo, let's use a table to store results. This time round, we generate the data:

create table MYSCHEMA.fancy_sums (
  a         integer,
  b         integer,
  result    integer );

insert into MYSCHEMA.fancy_sums ( a, b )
  select 7, b from generate_series( 1, 10 ) as x ( b );
update MYSCHEMA.fancy_sums set result = MYSCHEMA.get_sum( a, b );
select * from LAZY.cache order by bucket, key;

And this is the output; notice that while the cache has gained 11 entries, only 5 calls to the eager producer were necessary:

psql:lazy.demo-2.sql:72: NOTICE:  MYSCHEMA.insert_sums( 7, 1 )
psql:lazy.demo-2.sql:72: NOTICE:  MYSCHEMA.insert_sums( 7, 3 )
psql:lazy.demo-2.sql:72: NOTICE:  MYSCHEMA.insert_sums( 7, 5 )
psql:lazy.demo-2.sql:72: NOTICE:  MYSCHEMA.insert_sums( 7, 7 )
psql:lazy.demo-2.sql:72: NOTICE:  MYSCHEMA.insert_sums( 7, 9 )
╔═════════════╤═════════╤═══════╗
║   bucket    │   key   │ value ║
╠═════════════╪═════════╪═══════╣
║ yeah! sums! │ [7, 0]  │ 7     ║
║ yeah! sums! │ [7, 1]  │ 8     ║
║ yeah! sums! │ [7, 2]  │ 9     ║
║ yeah! sums! │ [7, 3]  │ 10    ║
║ yeah! sums! │ [7, 4]  │ 11    ║
║ yeah! sums! │ [7, 5]  │ 12    ║
║ yeah! sums! │ [7, 6]  │ 13    ║
║ yeah! sums! │ [7, 7]  │ 14    ║
║ yeah! sums! │ [7, 8]  │ 15    ║
║ yeah! sums! │ [7, 9]  │ 16    ║
║ yeah! sums! │ [7, 10] │ 17    ║
╚═════════════╧═════════╧═══════╝

Note that in case an eager, inserting value producer does not insert a value for a given requested key, the lazy value producer will auto-generate a row with a null value. This behavior may become configurable in a future version.

Bonus: Setting up Custom Cache Views

Since it may sometimes be useful to have a view on the data already cached by a given lazy producer, here's how to do it. It is basically straightforward: the key is by default formed by building a jsonb array of the arguments (although one can define one's own keying method); likewise, values are stored in serialized form, so they have to be converted back to the intended data type. This is not difficult to do, except that jsonb has its own null value which, unlike SQL null, cannot be cast; this is what LAZY.nullify() is for:

create view MYSCHEMA.products as ( select
      ( LAZY.nullify( key->0 ) )::float as n,
      ( LAZY.nullify( key->1 ) )::float as factor,
      ( value                  )::float as product
    from LAZY.cache
    where bucket = 'MYSCHEMA.get_product'
    order by n desc, factor desc );

Selecting from this view then yields:
╔═════╤════════╤═════════╗
║  n  │ factor │ product ║
╠═════╪════════╪═════════╣
║   ∎ │      ∎ │       0 ║
║   ∎ │    100 │       0 ║
║ 123 │    456 │   56088 ║
║  60 │      3 │     180 ║
║  13 │     13 │       ∎ ║
║ 6.3 │     12 │    75.6 ║
║   6 │     12 │      72 ║
║   5 │     12 │      60 ║
║   4 │     12 │      48 ║
║   1 │      ∎ │       0 ║
╚═════╧════════╧═════════╝
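
The same pattern works for the sums cache from Demo II. A sketch (the view name and column names here are our own choices, not part of the library):

create view MYSCHEMA.sums as ( select
      ( LAZY.nullify( key->0 ) )::integer as a,
      ( LAZY.nullify( key->1 ) )::integer as b,
      ( value                  )::integer as total
    from LAZY.cache
    where bucket = 'yeah! sums!'
    order by a, b );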

Points to Remember

  • Composite Values—Sometimes one will want to cache composite data types. This is fine with InterShop Lazy because composite user-defined types (create type xxx as ( field1 type1, field2 type2 )) are serializable to text just as atomic values are. However, when deserializing these values an extra pair of parentheses has to be used (as in (value::xxx).field1) because of restrictions in PostgreSQL's grammar. See lazy.demo-3.sql for an example, and the minimal sketch below.
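
A minimal sketch with a hypothetical composite type (the type and field names here are ours, not taken from the demos):

create type MYSCHEMA.point as ( x float, y float );
select ( '(3,4)'::MYSCHEMA.point ).x;   -- 3; note the extra pair of parentheses around the cast
-- against the cache, the same pattern would read e.g.:
-- select ( value::MYSCHEMA.point ).x from LAZY.cache where bucket = 'MYSCHEMA.get_point';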

Updates

  • v1.0.0—Values are now stored as text rather than as jsonb, the reason being that
    • all datatypes have a text serialization (this is mandatory as DB dumps could otherwise not be produced), but not all datatypes can be expressed in JSON in a straightforward way, meaning customized conversion functions mytype_from_jsonb() / mytype_to_jsonb() are needed where otherwise simple type casts myvalue::text / textvalue::mytype would be sufficient.
    • While querying jsonb values is much more flexible, one will typically search over keys, not results; where querying against result values is needed, one can still cast the text values (e.g. in a (materialized) view).

To Do

  • Documentation
  • Tests
  • feat:txtcache—consider optionally using a text-based cache
  • feat:hashidx—implement Hash indexes on bucket, key (only interested in equality, otherwise might use BTree) (dropped in preference for feat:remgetupdate, feat:dedicache)
  • feat:remgetupdate—remove parameter get_update, always use perform_update (rename to insert_rows or producer as function will not perform SQL updates but inserts)
  • feat:remgetkey—remove parameter get_key? This would naturally fall out of feat:dedicache
  • feat:dedicache—consider always generating a dedicated cache table with the appropriate columns as per value producer arguments, including index; in that case, may also move from using a serialized-value column to using any number of result columns (which could be indexed if need be, but should be done manually by user)
    • feat:dedicache/return_type—abolish, should use input_names, input_types, output_names, output_types, (or else key_names, ... value_names, ...). Input and output names become names for columns of dedicated cache table. Insofar maybe advantageous to adopt feat:dedicache/optcreatecache:never, instead require user sets up dedicated cache table, then just passes in name of that table, we can then derive input parameters from primary key of that table
    • feat:dedicache/optcreatecache—should cache table never / optionally / always be created automatically? Could require user to manually create cache, only provide generated function that simplifies using that cache. Start by implementing only case create_cache = false, allow create_cache = true later
    • feat:dedicache/docdeforder—document that the producer must be defined only after create_lazy_producer() has been called and the cache table established, otherwise the producer would refer to a non-existent DB object
