mentat's Introduction

Project Mentat

Project Mentat is a persistent, embedded knowledge base. It draws heavily on DataScript and Datomic.

This project was started by Mozilla, but is no longer being developed or actively maintained by them. Their repository was marked read-only; this fork is an attempt to revive and continue that interesting work. We owe the team at Mozilla more than words can express for inspiring us all, and for this project in particular.

Thank you.

Documentation


Motivation

Mentat is a flexible relational (not key-value, not document-oriented) store that makes it easy to describe, grow, and reuse your domain schema.

By abstracting away the storage schema, and by exposing change listeners outside the database (not via triggers), we hope to make domain schemas stable, and allow both the data store itself and embedding applications to use better architectures, meeting performance goals in a way that allows future evolution.

Data storage is hard

We've observed that data storage is a particular area of difficulty for software development teams:

  • It's hard to define storage schemas well. A developer must:

    • Model their domain entities and relationships.
    • Encode that model efficiently and correctly using the features available in the database.
    • Plan for future extensions and performance tuning.

    In a SQL database, the same schema definition defines everything from high-level domain relationships through to numeric field sizes in the same smear of keywords. It's difficult for someone unfamiliar with the domain to determine from such a schema what's a domain fact and what's an implementation concession — are all part numbers always 16 characters long, or are we trying to save space? — or, indeed, whether a missing constraint is deliberate or a bug.

    The developer must think about foreign key constraints, compound uniqueness, and nullability. They must consider indexing, synchronizing, and stable identifiers. Most developers simply don't do enough work in SQL to get all of these things right. Storage thus becomes the specialty of a few individuals.

    Which one of these is correct?

    {:db/id          :person/email
     :db/valueType   :db.type/string
     :db/cardinality :db.cardinality/many     ; People can have multiple email addresses.
     :db/unique      :db.unique/identity      ; For our purposes, each email identifies one person.
     :db/index       true}                    ; We want fast lookups by email.
    {:db/id          :person/friend
     :db/valueType   :db.type/ref
     :db/cardinality :db.cardinality/many}    ; People can have many friends.
    CREATE TABLE people (
      id INTEGER PRIMARY KEY,      -- Bug: because of the primary key, each person can have no more than 1 email.
      email VARCHAR(64)            -- Bug?: no NOT NULL, so a person can have no email.
                                   -- Bug: nobody will ever have a long email address, right?
    );
    CREATE TABLE friendships (
      person INTEGER REFERENCES people(id),  -- Bug?: no indexing, so lookups by friend or person will be slow.
      friend INTEGER REFERENCES people(id)   -- Bug: no compound uniqueness constraint, so we can have dupe friendships.
    );

    They both have limitations — the Mentat schema allows only for an open world (it's possible to declare friendships with people whose email isn't known), and requires validation code to enforce email string correctness — but we think that even such a tiny SQL example is harder to understand and obscures important domain decisions.

  • Queries are intimately tied to structural storage choices. That not only hides the declarative domain-level meaning of the query — it's hard to tell what a query is trying to do when it's a 100-line mess of subqueries and LEFT OUTER JOINs — but it also means a simple structural schema change requires auditing every query for correctness.

  • Developers often capture less event-shaped data than they perhaps should, simply because their initial requirements don't warrant it. It's quite common to later want to know when a fact was recorded, or in which order two facts were recorded (particularly for migrations), or on which device an event took place… or even that a fact was ever recorded and then deleted.

  • Common queries are hard. Storing values only once, upserts, complicated joins, and group-wise maxima are all difficult for non-expert developers to get right.

  • It's hard to evolve storage schemas. Writing a robust SQL schema migration is hard, particularly if a bad migration has ever escaped into the wild! Teams learn to fear and avoid schema changes, and eventually they ship a table called metadata, with three TEXT columns, so they never have to write a migration again. That decision pushes storage complexity into application code. (Or they start storing unversioned JSON blobs in the database…)

  • It's hard to share storage with another component, let alone share data with another component. Conway's Law applies: your software system will often grow to have one database per team.

  • It's hard to build efficient storage and querying architectures. Materialized views require knowledge of triggers, or the implementation of bottleneck APIs. Ad hoc caches are often wrong, are almost never formally designed (do you want a write-back, write-through, or write-around cache? Do you know the difference?), and often aren't reusable. The average developer, faced with a SQL database, has little choice but to build a simple table that tries to meet every need.

Comparison to DataScript

DataScript asks the question: "What if creating a database were as cheap as creating a Hashmap?"

Mentat is not interested in that. Instead, it's focused on persistence and performance, with very little interest in immutable databases/databases as values or throwaway use.

One might say that Mentat's question is: "What if a database could store arbitrary relations, for arbitrary consumers, without them having to coordinate an up-front storage-level schema?"

Consider this a practical approach to facts, to knowledge and its storage and access, much like SQLite is a practical RDBMS.

(Note that domain-level schemas are very valuable.)

Another possible question would be: "What if we could bake some of the concepts of CQRS and event sourcing into a persistent relational store, such that the transaction log itself were of value to queries?"

Some thought has been given to how databases as values — long-term references to a snapshot of the store at an instant in time — could work in this model. It's not impossible; it simply has different performance characteristics.

Just like DataScript, Mentat speaks Datalog for querying and takes additions and retractions as input to a transaction.

Unlike DataScript, Mentat exposes free-text indexing, thanks to SQLite/FTS.
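
To make that concrete, here is a minimal, hedged sketch in Rust of transacting additions and retractions and running a Datalog query through the top-level mentat crate. It assumes the Store API (Store::open, transact, q_once) roughly as exposed by the main crate, and that an empty path opens an in-memory SQLite database; exact names and signatures may differ between versions.

extern crate mentat;

use mentat::Store;

fn main() {
    // An empty path is assumed to open an in-memory SQLite database;
    // pass a file path for a persistent store.
    let mut store = Store::open("").expect("open store");

    // Transactions are EDN vectors of additions (:db/add) and retractions
    // (:db/retract). First define a tiny vocabulary, then assert a fact.
    store
        .transact(
            r#"[[:db/add "a" :db/ident       :person/name]
                [:db/add "a" :db/valueType   :db.type/string]
                [:db/add "a" :db/cardinality :db.cardinality/one]]"#,
        )
        .expect("transact vocabulary");
    store
        .transact(r#"[[:db/add "alice" :person/name "Alice"]]"#)
        .expect("assert a fact");
    // A retraction takes the same shape, e.g.
    // [[:db/retract <entid> :person/name "Alice"]].

    // Queries are Datalog, evaluated against the current state of the store.
    let results = store
        .q_once(r#"[:find ?name :where [?e :person/name ?name]]"#, None)
        .expect("run query");
    println!("{:?}", results);
}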

Comparison to Datomic

Datomic is a server-side, enterprise-grade data storage system. Datomic has a beautiful conceptual model. It's intended to be backed by a storage cluster, in which it keeps index chunks forever. Index chunks are replicated to peers, allowing it to run queries at the edges. Writes are serialized through a transactor.

Many of these design decisions are inapplicable to deployed desktop software; indeed, the use of multiple JVM processes makes Datomic's use in a small desktop app, or a mobile device, prohibitive.

Comparison to SQLite

SQLite is a traditional SQL database in most respects: schemas conflate semantic, structural, and datatype concerns, as described above; the main interface with the database is human-first textual queries; sparse and graph-structured data are 'unnatural', if not always inefficient; experimenting with and evolving data models are error-prone and complicated activities; and so on.

Mentat aims to offer many of the advantages of SQLite — single-file use, embeddability, and good performance — while building a more relaxed, reusable, and expressive data model on top.


Contributing

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

See CONTRIBUTING.md for further notes.

This project is very new, so we'll probably revise these guidelines. Please comment on an issue before putting significant effort in if you'd like to contribute.


Building

You first need to clone the project. To build and test it, we use Cargo.

To build all of the crates in the project, use:

cargo build

To run tests use:

# Run tests for everything.
cargo test --all

# Run tests for just the query-algebrizer folder (specify the crate, not the folder),
# printing debug output.
cargo test -p mentat_query_algebrizer -- --nocapture

For most cargo commands you can pass the -p argument to run the command just on that package. So, cargo build -p mentat_query_algebrizer will build just the "query-algebrizer" folder.
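
For example, to build just that crate:

cargo build -p mentat_query_algebrizer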

What are all of these crates?

We use multiple sub-crates for Mentat for four reasons:

  1. To improve incremental build times.
  2. To encourage encapsulation; writing extern crate feels worse than just using a module.
  3. To simplify the creation of targets that don't use certain features: e.g., a build with no syncing, or with no query system.
  4. To allow for reuse (e.g., the EDN parser is essentially a separate library).

So what are they?

Building blocks

edn

Our EDN parser. It uses rust-peg to parse EDN, which is Clojure/Datomic's richer alternative to JSON. edn's dependencies are all either for representing rich values (chrono, uuid, ordered-float) or for parsing (serde, peg).

In addition, this crate turns a stream of EDN values into a representation suitable to be transacted.
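
As a small illustration of what the parser does, here is a hedged sketch. It assumes the crate exposes a rust-peg-generated parse::value entry point and that parsed values implement Debug; those names are assumptions about this crate, not facts taken from this README.

extern crate edn;

fn main() {
    let input = r#"{:person/name "Alice", :person/age 29}"#;

    // Parse a single EDN value (the parse::value entry point is assumed here).
    match edn::parse::value(input) {
        Ok(value) => println!("parsed EDN value: {:?}", value),
        Err(e) => println!("parse error: {:?}", e),
    }
}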

mentat_core

This is the lowest-level Mentat crate. It collects together the following things:

  • Fundamental domain-specific data structures like ValueType and TypedValue.
  • Fundamental SQL-related linkages like SQLValueType. These encode the mapping between Mentat's types and values and their representation in our SQLite format.
  • Conversion to and from EDN types (e.g., edn::Keyword to TypedValue::Keyword).
  • Common utilities (some in the util module, and others that should be moved there or broken out) like Either, InternSet, and RcCounter.
  • Reusable lazy namespaced keywords (e.g., DB_TYPE_DOUBLE) that are used by mentat_db and EDN serialization of core structs.
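
As a tiny, hedged illustration of the ValueType/TypedValue pairing mentioned above, here is a sketch that assumes TypedValue variants such as Long and Boolean and a value_type() accessor; exact variant payloads and method names may differ between versions.

extern crate mentat_core;

use mentat_core::{TypedValue, ValueType};

fn main() {
    // Every concrete value knows which ValueType it belongs to.
    let age = TypedValue::Long(42);
    let active = TypedValue::Boolean(true);

    assert_eq!(age.value_type(), ValueType::Long);
    assert_eq!(active.value_type(), ValueType::Boolean);
}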

Types

mentat_query

This crate defines the structs and enums that are the output of the query parser and used by the translator and algebrizer. SrcVar, NonIntegerConstant, FnArg… these all live here.

mentat_query_sql

Similarly, this crate defines an abstract representation of a SQL query as understood by Mentat. This bridges between Mentat's types (e.g., TypedValue) and SQL concepts (ColumnOrExpression, GroupBy). It's produced by the algebrizer and consumed by the translator.

Query processing

mentat_query_algebrizer

This is the biggest piece of the query engine. It takes a parsed query, which at this point is independent of a database, and combines it with the current state of the schema and data. This involves translating keywords into attributes, abstract values into concrete values with a known type, and producing an AlgebraicQuery, which is a representation of how a query's Datalog semantics can be satisfied as SQL table joins and constraints over Mentat's SQL schema. An algebrized query is tightly coupled with both the disk schema and the vocabulary present in the store when the work is done.

mentat_query_projector

A Datalog query projects some of the variables in the query into data structures in the output. This crate takes an algebrized query and a projection list and figures out how to get values out of the running SQL query and into the right format for the consumer.

mentat_query_translator

This crate works with all of the above to turn the output of the algebrizer and projector into the data structures defined in mentat_query_sql.

mentat_sql

This simple crate turns those data structures into SQL text and bindings that can later be executed by rusqlite.

The data layer: mentat_db

This is a big one: it implements the core storage logic on top of SQLite. This crate is responsible for bootstrapping new databases, transacting new data, maintaining the attribute cache, and building and updating in-memory representations of the storage schema.

The main crate

The top-level main crate of Mentat assembles these component crates into something useful. It wraps up a connection to a database file and the associated metadata into a Store, and encapsulates an in-progress transaction (InProgress). It provides modules for programmatically writing (entity_builder.rs) and managing vocabulary (vocabulary.rs).
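
Here is a hedged sketch of that shape, assuming a Store::open / begin_transaction / InProgress::transact / commit API along the lines described above; exact signatures may vary between versions.

extern crate mentat;

use mentat::Store;

fn main() {
    let mut store = Store::open("people.db").expect("open store");

    // Group several writes into a single transaction via InProgress.
    {
        let mut in_progress = store.begin_transaction().expect("begin transaction");
        in_progress
            .transact(
                r#"[[:db/add "e" :db/ident       :person/email]
                    [:db/add "e" :db/valueType   :db.type/string]
                    [:db/add "e" :db/cardinality :db.cardinality/many]]"#,
            )
            .expect("transact vocabulary");
        in_progress.commit().expect("commit");
        // If the InProgress were dropped without commit(), the pending work
        // would be discarded (an assumption about rollback-on-drop behaviour).
    }
}

The entity_builder.rs and vocabulary.rs modules mentioned above provide programmatic ways to build entities and manage vocabulary without writing EDN strings by hand.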

Syncing

Sync code lives, for referential reasons, in a crate named tolstoy. This code is a work in progress; the current state is a proof-of-concept implementation that relies largely on the internal transactor to make progress and comes with basic support for timelines. See Tolstoy's documentation for details.

The command-line interface

This is under tools/cli. It's essentially an external consumer of the main mentat crate. This code is ugly, but it mostly works.


SQLite dependencies

Mentat uses partial indices, which are available in SQLite 3.8.0 and higher. It relies on correlation between aggregate and non-aggregate columns in the output, which was added in SQLite 3.7.11.

It also uses FTS4, which is a compile-time option.

By default, Mentat specifies the "bundled" feature for rusqlite, which uses a relatively recent version of SQLite. If you want to link against the system version of SQLite, omit "bundled_sqlite3" from Mentat's features.

[dependencies.mentat]
version = "0.6"
# System sqlite is known to be new.
default-features = false

License

Project Mentat is currently licensed under the Apache License v2.0. See the LICENSE file for details.

mentat's People

Contributors

bgrins, cdbfoster, cpdean, dependabot-preview[bot], dependabot[bot], eoger, ferjm, gburd, grigoryk, joewalker, jsantell, maweki, mwatts, ncalexan, rnewman, sc13-bioinf, thomcc, victorporof


mentat's Issues

Asserting a schema attribute must require presence of minimal set of attributes

These assertions are legit:

[[:db/add "testProp" :db/ident :testProp]]

and:

[[:db/add "testProp" :db/ident :testProp]
 [:db/add "testProp" :db/valueType :db.type/string]
 [:db/add "testProp" :db/cardinality :db.cardinality/one]]

However, assertion sets which are missing one of the required attributes, such as:

[[:db/add "testProp" :db/ident :testProp]
 [:db/add "testProp" :db/valueType :db.type/string]]

should be illegal.

Similarly,

[[:db/add "testProp" :db/valueType :db.type/string]
 [:db/add "testProp" :db/cardinality :db.cardinality/one]]

should also be illegal.

Original by @grigoryk

[build] Publish Mentat Android SDK to jcenter/bintray

android-components has a pleasant setup for publishing to jcenter/bintray. I have work in progress on doing this for Mentat, using exactly the same setup; see https://github.com/mozilla/mentat/tree/build-android-sdk. There are some details to work out with getting Rust and the Android NDK configured correctly in a Docker image, but I have no doubt that this can be achieved.

This ticket tracks getting the "final mile" worked out: once we have things building, getting credentials to publish to the right namespace. I'd prefer org.mozilla.mentat, but I don't feel too strongly about it.

@pocmo, can you help me get the TC secrets needed to publish like a-c does? @csadilek tells me that we're publishing to your (private?) bintray repository, which obviously isn't a long-term solution, but also isn't a terrible short-term solution for us.

Original by @ncalexan

Expand internal representation of an Attribute to represent idea of absence

Vague problem statement:

To better support retracting schema (as part of the timelines work; see mozilla#783), we need more knowledge than is currently available in memory. Mentat's definition of a schema attribute currently does not allow us to determine whether a particular schema attribute is actually present in the datoms, or whether it's a derived default value.

E.g., if we did not assert [:db/add 100 :db/index true], then the index field in the Attribute struct for entid=100 will be false. From the point of view of schema retraction, the meaning of that false value is not the same as if we had actually asserted [:db/add 100 :db/index false].

  • Schema retraction rule 1) If :db/ident is being retracted, and that entity is a schema attribute - that is, it also has :db/valueType and :db/cardinality - these datoms (and any optional ones) must also be retracted. (Why leave around dangling schema attributes?)
  • Schema retraction rule 2) If either of the required schema attributes (:db/valueType, :db/cardinality) is being retracted, then all of the required schema attributes must be retracted, as well as the corresponding :db/ident.

However, currently we can't tell by just inspecting an AttributeMap that a given entity has these attributes. And so to enforce rule 1, we must read datoms from disk.

Implementation of schema retraction introduced in mozilla#783 punts on rule 1, and only enforces rule 2.

Original by @grigoryk

Investigate methods of reducing the on-disk size of Mentat databases

This document highlights many of the issues and concerns: https://docs.google.com/document/d/14ywV4PlBAdOsJrxd7QhMcxbo7pw8cdc1kLAb7I4QhFY/edit?ts=5b7b7f87.

Some work to measure the size of the database when storing history is in mozilla/application-services#191. For my places.sqlite (100k places, 150k visits), it gets around 200MB larger.

It's worth noting that 'disk usage' was one of the primary concerns reported by user research for Fenix (although it's not clear if this size increase (relative to places) is the kind of thing that would make a dent relative to stuff like caches and the like -- a very informal poll of some friends of mine found that Fennec typically uses around 500MB of space (app + data); another 200MB isn't a trivial increase, but doesn't substantially change where we are in terms of app size).

Some bugs which may help (suggested by @rnewman):

  • mozilla#29: Pack the various flag columns into one TINYINT -- appears to give a 1%-2% benefit, which is probably not substantial enough to justify the effort IMO.
  • mozilla#69: implement something like Places' url_hash automatically. If this could keep string values out of the aevt/eavt indices it could have a huge benefit.
  • mozilla#32: Interning keywords. Our test schema didn't use these and I'm not sure what the actual use case for them is over :db.type/ref to a { :db/ident :my/keyword }, so this doesn't seem like a high priority.
  • mozilla#33: Store the data canonically in a SQL table instead of in datoms. This is interesting but seems like a lot of work.

I think something like SQLite's zipvfs extension would likely help (as the databases compress well), but have not tried it. Implementing it ourselves is likely beyond the scope of this effort (I took a look at the effort required and it wasn't exactly trivial). Additionally, whatever we do would need to somehow integrate with SQLCipher (I also took a look at bolting compression into SQLCipher before the encryption, but the fact that this makes the block output a variable size seemed to make this problematic).

Other notes:

  1. Storing strings as fulltext and using the compress/uncompress options of FTS4 did not help, since the strings in each column are relatively small. Additionally, the performance overhead here was substantial even for a very fast compressor (LZ4).
  2. Most string data seems to be duplicated ~4 times, in datoms, timelined_transactions, and in the indices idx_datoms_eavt, idx_datoms_aevt.
  3. During RustConf, @rnewman suggested that ultimately Mentat will likely not want to use SQLite, and will instead want to read chunks of datoms directly out of something like RKV. These chunks could be compressed more easily. This seems out of scope, as it would be a massive change to Mentat, but is worth writing down.

Additional concerns exist around the fact that this problem may be exacerbated by materialized views (perhaps #33 will help or prevent this?)

Original by @thomcc

No way to bind aggregate to variable in query

The Datomic docs have an example query of the form:

[:find ?year (median ?namelen) (avg ?namelen) (stddev ?namelen)
 :with ?track
 :where [?track :track/name ?name]
        [(count ?name) ?namelen]
        [?medium :medium/tracks ?track]
        [?release :release/media ?medium]
        [?release :release/year ?year]]

Our query syntax doesn't allow the [(count ?name) ?namelen] clause.

Between this and mozilla#647 I'm not sure we can implement a top-sites-like query directly. (That might not be true, and could just be a reflection of my ignorance, though!)

Original by @thomcc

lookup-ref syntax is different between Mentat and Datomic

This doesn't seem to be documented anywhere (or maybe I've misread the Datomic docs?), but it is probably worth mentioning somewhere.

According to https://docs.datomic.com/on-prem/identity.html and http://blog.datomic.com/2014/02/datomic-lookup-refs.html Datomic does them as [:namespace/attr value], whereas we do it as (lookup-ref :namespace/attr value).

I'm filing an issue rather than just updating https://github.com/mozilla/mentat/wiki/Differences-from-Datomic because I'm actually not 100% certain it's not me misreading the Datomic docs, or failing to interpret what we do.

Original by @thomcc

Double retractions in the transaction log

Some fun transactor behaviour:

  • (given cardinality one, unique/identity :test/ident)
  • first transact: {:test/ident "One"} - say, it gets assigned e=1
  • then transact:
[[:db/add 1 :test/ident "Two"][:db/retract 1 :test/ident "One"]]
  • then observe transaction log:
[[1 :test/ident "Two" 1]
[1 :test/ident "One" 0]
[1 :test/ident "One" 0]]

This is likely happening due to the way retractions are processed out of temp_search_results and into the log. We check for a change in values and synthesize a retraction datom, but we also have an explicit retraction datom, so we end up with a duplicate retraction when there should have been just one.

Original by @grigoryk
