mindbase's Introduction

Mindbase

A system for convergent intersubjectivity – To store knowledge, and make sense of the world.

Stability: work in progress

Background

In most existing knowledge graph databases, data is stored as triples or quads: Subject, Predicate, Object, {Context}

Given the example "Alice jumped into the lake," such a system would represent this as something like: Alice, jumped into, the lake, {bag_of_arbitrary_kv_pairs}

Depending on what you're trying to do with this information, there can be some serious problems with this.

For starters, we have to agree on an ontology for each of the terms. We also have to record all of the other contextual information about the statement in a schemaless format, which is generally just a bag of kv pairs. This context includes the recording user, the time the data was recorded, and scoping information about the statement (which lake? who told you this? etc.)
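
For concreteness, here is a minimal sketch of such a quad (the struct and field names are illustrative, not drawn from any particular triple store):

    // Hypothetical sketch of a subject-predicate-object quad with a
    // schemaless context bag, as described above. Not Mindbase code.
    use std::collections::HashMap;

    struct Quad {
        subject: String,
        predicate: String,
        object: String,
        context: HashMap<String, String>, // the "bag of arbitrary kv pairs"
    }

    fn main() {
        let mut context = HashMap::new();
        context.insert("recorded_by".to_string(), "bob".to_string());
        context.insert("recorded_at".to_string(), "2020-05-13T16:25:33Z".to_string());

        let quad = Quad {
            subject: "Alice".to_string(),
            predicate: "jumped into".to_string(),
            object: "the lake".to_string(),
            context,
        };
        println!("{} {} {} ({} context entries)",
                 quad.subject, quad.predicate, quad.object, quad.context.len());
    }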

Ontologies – Betcha can't eat just one

Controversial statement: Any Ontology which is fixed is automatically wrong.

Another way to say this is that any ontology declared by parties other than those who are using it is rife with misalignments – a kind of Ontology-related moral hazard.

Ontologies morph and evolve through cultural processes. This is necessary, because our reality is constantly expanding. Science explains and discovers new phenomena. Cultural memes and tropes shift. New consumer products are introduced. It's not just our stuff that's changing, it's also our ontological system.

So what sort of a narcissistic wingnut would dare plant their flag, declaring that this ontology is correct?

Just imagine a hardcore prescriptivist attempting to bonk you on the head with a first-edition, 1884 Oxford English Dictionary. The English-speaking world speaks a very different language today versus 1884. Sure, the OED has been updated since then, but it'll never be able to keep up with the Urban Dictionary for expressions like "on fleek", "bromance", or "the rona".

As is the case with OED and the Urban Dictionary, most people use multiple ontological systems. That is to say that their Meta-ontological system spans these sources.

So how on earth could so many people get ontologies so wrong? How can it be that bioinformaticians, data scientists, computer scientists, machine learning specialists, militaries, governments, and industries have created such a gnarly cornucopia of competing ontologies?

Two reasons:

  1. Because most humans and organizations make the huge mistake of believing they are the global oracle of objectivity; or at best that their local scope of objectivity is sufficient for their purposes.
  2. Because we don't have a good system of merging multiple, coevolutionary ontologies.

We can't fix #1 directly, but we can make inroads on #2. Mind you, this isn't simply a matter of rooting ontologies in a common "root" ontology like BFO. Root ontologies are fine, but the main problem isn't that we disagree about the nature of continuants vs occurrents. The problem is that the substrate we use to define ontologies is non-interactive. We don't have a good computerized system that mirrors the collaborative, coevolutionary processes that occur in the real world.

We define new terms all the time. Both those with fleeting meaning and those which we intend to perpetuate must be connected to the ontological systems of both the publisher and the consumer of each datum. How do we achieve this?

A world of Analogies

Douglas Hofstadter famously explained that Analogy is the core of Cognition. Simply put: analogy is necessary to build a connection between any two ideas. There can be no thought, or language, or understanding without it. So what does this have to do with ontologies? In much the same way that translating between two languages requires analogy, so does translating between ontologies.

Some translations are easier than others: Mi Casa, My House, and Chez Moi are all fairly cleanly analogous.

Hygge (a Danish word) is kind of like Coziness but it's also a Lifestyle, and Insulated from risk, and Fun all rolled into one.

Similarly, Saudade and Tiáo 条 do not have any direct English translation.

So which language is "correct"? Obviously this is a ridiculous question, but it's exactly the sort of silliness that stumps our researchers who are trying to exchange data. There are two key forces in play here.

  1. The fragmentary nature of our storage and collaboration systems is reflected in our Ontologies. (See Conway's Law)
  2. Even if we had frictionlessly collaborative substrates, our data formats are too fragile. They naively strive for objectivity, but a vastly more potent target would be convergent intersubjectivity.

This is the goal of Mindbase – To serve as a powerful substrate for convergent intersubjectivity. The combination of fragmentary storage and collaboration systems and the naivete of objectivity serves as a critical barrier which we hope to surmount. With it, we may strive to build better personal and professional informatics systems. We may reduce the barriers between open and industrial datasets. We may parse and correlate academic papers. We may even make inroads into explainable AI, and AGI. Of course these are ambitious goals for any system, but we are at least confident that they cannot be achieved with the old techniques.

The "Concept Problem"

TODO: Explain Prof Barry Smith's qualms with Concepts, and discuss why we are/aren't subject to them due to the Artifact / Allegation dichotomy.

As in the case of Tree / Apple, an "Apple" is not a "Tree", but rather a "Fruit" which is related to the "Tree". The key to making this work is that "Tree" is not one thing. Sure, there exists exactly one Artifact for the text string "Tree", but there are many many many possible Allegations which refer to that artifact in different capacities.
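
A minimal sketch of that Artifact / Allegation dichotomy (type and field names here are hypothetical, not the actual Mindbase types):

    // Illustrative sketch only: one Artifact, many Allegations referring
    // to it in different capacities. Names are hypothetical.
    #[derive(Clone, Copy)]
    struct ArtifactId(u64);

    struct Artifact {
        id:   ArtifactId,
        text: String, // e.g. the text string "Tree"
    }

    struct Allegation {
        by_agent: String,     // the Agent making the claim
        artifact: ArtifactId, // the Artifact referred to...
        capacity: String,     // ...and the capacity in which it is referred to
    }

    fn main() {
        let tree = Artifact { id: ArtifactId(1), text: "Tree".into() };
        // Exactly one Artifact for the text "Tree", but many Allegations about it:
        let allegations = vec![
            Allegation { by_agent: "alice".into(), artifact: tree.id, capacity: "a woody plant".into() },
            Allegation { by_agent: "bob".into(), artifact: tree.id, capacity: "that which an Apple is related to".into() },
        ];
        println!("{} allegations refer to Artifact '{}'", allegations.len(), tree.text);
    }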

Graph representation

Essentially what we're dealing with here is a meta-graph in which the nodes and edges (Symbols) within the meta-graph are each comprised of a set of nodes and edges (Members) within the lower-level graph. Those Symbols are constructed as a set of constituent allegations (Members) documenting ideas of similarity/relatedness by Agents within the system. This allows the expression of relationships between imprecise logical entities (or rather, precise-but-nonconverged logical entities) which describe entities within the real world.

It is strictly intentional that this not be achieved by deferring to canonical or "objective" representations of symbols within the real world. The real world fundamentally lacks such objectivity, and any such logical representation would thus create a fundamental impedance mismatch with the system – curable only through out-of-band charismatic initiatives and significant effort (i.e. convincing the whole world that your ontology is the "right" one).
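
Sketched under the same caveat (illustrative names only, not the actual types), the two-level structure might look like this, with Symbols overlapping by sharing Members rather than by deferring to a canonical node:

    // Hedged sketch of the two-level structure described above: Symbols
    // (meta-graph nodes/edges) are sets of Members, each Member pointing
    // at an allegation in the lower-level graph.
    use std::collections::HashSet;

    #[derive(Clone, PartialEq, Eq, Hash)]
    struct AllegationId(u64);

    // A Symbol is not one canonical node; it is a set of allegations of
    // similarity/relatedness made by Agents.
    struct Symbol {
        members: HashSet<AllegationId>,
    }

    impl Symbol {
        fn new(members: impl IntoIterator<Item = u64>) -> Self {
            Symbol { members: members.into_iter().map(AllegationId).collect() }
        }
    }

    fn main() {
        // Two Agents' "Tree" symbols may overlap without being identical:
        let tree_a = Symbol::new([1, 2, 3]);
        let tree_b = Symbol::new([2, 3, 4]);
        let shared = tree_a.members.intersection(&tree_b.members).count();
        println!("the two symbols share {} members", shared);
    }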


mindbase's Issues

Research Notes - Analogy comparison

Following research notes #1

The goal is to perform symbolic inference over analogies recorded within the system.
Presently, associative analogies are the focus, though this also applies to other forms of analogies, like Categorical analogies. More on that later.

Those associative analogies essentially contain two Symbols, corresponding to left and right. Each of these two Symbols is not one identifier, but rather a set containing potentially many Members. Members reference ClaimIDs made previously within the Mindbase system.

For a given Associative Analogy, there is no pairwise association between the Members of these two sets – only a single association between the two sets as a whole. It has been clear for some time that fuzzy sets / fuzzy logic would be necessary as part of the Mindbase system.
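
As a rough sketch of that shape (hypothetical names, not the real structs):

    // An associative Analogy relates two Symbols as whole sets, with no
    // pairwise links between their Members. Names are illustrative.
    struct ClaimId(String);

    struct Member {
        claim: ClaimId, // references a Claim made previously within Mindbase
    }

    struct Symbol {
        members: Vec<Member>, // potentially many Members, not one identifier
    }

    struct AssociativeAnalogy {
        left:  Symbol, // associated with `right` only as a whole set,
        right: Symbol, // never Member-to-Member
    }

    fn main() {
        let analogy = AssociativeAnalogy {
            left:  Symbol { members: vec![Member { claim: ClaimId("Hot1".into()) }] },
            right: Symbol { members: vec![Member { claim: ClaimId("Cold1".into()) }] },
        };
        println!("{} <-> {}", analogy.left.members.len(), analogy.right.members.len());
    }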

One way to think about this fuzziness is in terms of confidence of "sameness" or "correctness", but this is problematic insofar as it assumes that there exists some god's-eye view, or approximation thereof, which can be used to adjudicate relative sameness in an absolute sense. This perspective is fraught with problems.

One of the core ideas of this project is that all things which can be observed within the universe are similar (at least pairwise) along some axes, and dissimilar along others – whether those dimensions are spatial or ontological, and whether observed from the world or made from whole cloth (noting that even fictions are observations first).

With this in mind, another way to think about fuzziness is in terms of metric proximity in some dimension within an abstract metric space.

A Symbol for Hygge is proximate to a Symbol for Cozy within this metric space, but they are not the same. We must derive this proximity through measurement of analogies and anti-analogies between those symbols, but also by omission, presumably in the form of a relative-strength analysis with other symbols.

For instance, there may be some categorical analogies made between the Symbols for Caliente and Picante, but those should necessarily be fewer and/or weaker than those between Hot and Scalding.
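
The notes above do not specify a formula, so purely as an illustration (every name and number here is invented), proximity-by-relative-strength might be operationalized like this:

    // One naive way to operationalize "metric proximity": count analogies
    // and anti-analogies between two symbols, and normalize against the
    // strongest association either symbol has elsewhere. Purely
    // illustrative; this is not the project's actual scoring.
    fn proximity(analogies: f64, anti_analogies: f64, strongest_elsewhere: f64) -> f64 {
        let net = (analogies - anti_analogies).max(0.0);
        net / net.max(strongest_elsewhere) // relative-strength normalization
    }

    fn main() {
        // Hygge vs Cozy: many supporting analogies, few anti-analogies
        println!("hygge~cozy:       {:.2}", proximity(9.0, 1.0, 10.0)); // 0.80
        // Caliente vs Picante: fewer/weaker categorical analogies
        println!("caliente~picante: {:.2}", proximity(3.0, 1.0, 10.0)); // 0.20
    }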

The outcome of yesterday's experimentation (#1) strongly implies that this scoring isn't just something which happens at query time. Indeed, such symbols are constructed as the product of queries, but are themselves used to construct new Analogies. As such, the partial matching of one side of the associative Analogy should result in ALL of the Members from the other side being retained for their bound symbol variables, albeit with a lesser score than if the opposing side were a perfect match.

Some questions which follow from this:

  • How should this score be calculated within a single candidate Analogy?
    • This scoring can't just be from Left to Right. We have to do the inverse as well. How does bidirectional scoring work?
    • How do we reconcile left-side matching = right-side scoring with right-side matching = left-side scoring? Hypothesis: Narrowing of the set of Members is done by same-side matching. Scoring of the remaining Members is done by opposite-side matching.
  • How should we compose the bound symbol variable Members across multiple Analogies which are under consideration?
    • What about identical ClaimIDs which originate from different Analogies with different scores?
    • Should these be averaged, or otherwise be combined via some weighted sigmoid function? (Both options are sketched below.)
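
To make that last question concrete, here is a toy comparison of the two candidate combination rules, with plain f64 scores standing in for Member degrees; the saturating combiner is just one possible reading of "weighted sigmoid":

    // Two candidate rules for combining scores of the same ClaimID seen in
    // several candidate Analogies: a straight average vs. a saturating,
    // sigmoid-like combiner where repeated corroboration accumulates
    // toward 1.0 instead of being diluted by the mean. Illustrative only.
    fn average(scores: &[f64]) -> f64 {
        scores.iter().sum::<f64>() / scores.len() as f64
    }

    fn sigmoid_combine(scores: &[f64], steepness: f64) -> f64 {
        let total: f64 = scores.iter().sum(); // pool the evidence...
        1.0 - (-steepness * total).exp()      // ...then squash into 0..1
    }

    fn main() {
        // The same ClaimID matched by three different candidate Analogies:
        let scores = [0.3, 0.5, 0.4];
        println!("average: {:.2}", average(&scores));              // 0.40
        println!("sigmoid: {:.2}", sigmoid_combine(&scores, 1.0)); // ~0.70
    }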

Clean up mindbase repository and documentation

Code quality and testing at present are not satisfactory for publishing to crates.io, even after the other blocking issues are resolved. This issue intends to track the process of cleaning up and documenting the codebase such that it is of a vaguely usable quality, sufficient for publishing as a public alpha.

Discussion of the FuzzySet Signal-To-Noise-Ratio Problem

From #7, we discussed the FuzzySet signal-to-noise-ratio problem:

One thing within the experimental code which is almost certainly wrong is the way unions are being performed across the output of each candidate Analogy interrogation. We must explore a more appropriate means of composing these candidate Analogy interrogation outputs in a weighted fashion, rather than simply taking the maximum degree of each discrete matching member into the final output FuzzySet.

This is screwy, because we likely don't want Members from a small subset of candidate Analogies with a high degree of matching to compete on equal footing with a corpus of thousands with a low matching degree, as a simple maximum-degree-of-membership union might provide (the current code does this). However, we also don't want to attenuate the signal of such a well-matching subset of candidate Analogies as a simple weighted score would suggest either.

Presumably there is some middle ground which must be found, whereby these considerations are balanced: not a simple weighted score, and not a maximum degree of FuzzySet membership either.

Let's imagine we interrogate three candidate Analogies and we are left with the following Symbols, which we are constructing manually here, but which would typically be analogy interrogation outputs created by the query tree.

    let io1 = sym![Hot1~0.3, Sticky1~0.2];
    let io2 = sym![Hot1~0.5, Muggy1~0.9];
    let io3 = sym![Hot1~1, Sticky1~1];

    // union the interrogation outputs together
    let mut u = Symbol::null(); // must be mut: union mutates u in place
    u.union(io1);
    u.union(io2);
    u.union(io3);

What do we want to have in the end, and why?

Should it include the max of each degree?

    u is [Hot1~1, Muggy1~0.9, Sticky1~1]

This doesn't seem very good. We want small signals to be boosted, but this might be a bit too much.

Hot1 is present in all input Symbols. Should we average them?

    u is [Hot1~0.6,..]

What about Muggy1, and Sticky1 - which are only present in some of the inputs?
Should we treat the sets which lack them as degree 0, and include those in the average?

    u is [Hot1~0.6, Muggy1~0.3, Sticky1~0.4]

Or should we average them individually based on their non-null set membership?

    u is [Hot1~0.6, Muggy1~0.9, Sticky1~0.6]
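
To make the three options concrete, here is a small self-contained sketch that reproduces the numbers above, using plain maps of degrees in place of the real Symbol type:

    // Computing the three candidate union rules for the io1/io2/io3
    // example: max of each degree, average treating absence as 0, and
    // average over only the sets where the member is present.
    use std::collections::BTreeMap;

    type Sym = BTreeMap<&'static str, f64>;

    fn main() {
        let inputs: Vec<Sym> = vec![
            BTreeMap::from([("Hot1", 0.3), ("Sticky1", 0.2)]),
            BTreeMap::from([("Hot1", 0.5), ("Muggy1", 0.9)]),
            BTreeMap::from([("Hot1", 1.0), ("Sticky1", 1.0)]),
        ];

        for key in ["Hot1", "Muggy1", "Sticky1"] {
            // degrees from only the input Symbols that contain this member
            let present: Vec<f64> = inputs.iter().filter_map(|s| s.get(key).copied()).collect();
            let max = present.iter().cloned().fold(0.0, f64::max);
            let avg_with_zeros = present.iter().sum::<f64>() / inputs.len() as f64;
            let avg_present = present.iter().sum::<f64>() / present.len() as f64;
            println!("{key}: max={max:.1} avg-with-zeros={avg_with_zeros:.1} avg-present={avg_present:.1}");
        }
        // Hot1:    max=1.0 avg-with-zeros=0.6 avg-present=0.6
        // Muggy1:  max=0.9 avg-with-zeros=0.3 avg-present=0.9
        // Sticky1: max=1.0 avg-with-zeros=0.4 avg-present=0.6
    }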

Let's take a step back. What do each of these input symbols represent?

Each symbol represents one side of an analogy which a trusted (ground) Agent previously Claimed, each member of which had its degree determined by some prior query (presumably by that Agent) wherein a partial match of claims occurred.

This could come about a number of different ways, but the simplest construction of events is:

    let a1hot: Symbol = agent1.query("Hot");
    // TODO - construct a full chain of events (including genesis Claims) by which
    // Symbol members of a degree <1 are constructed, and then Claimed as new Analogies.
    // From there we can determine the most prudent implementation of union, such that
    // we optimize the signal-to-noise ratio.

Finish symbolvar binding

As it turns out, binding symbol variables has opened an interesting can of worms regarding the matching of previously-defined analogies, and the representation of matched elements as a FuzzySet within the resultant symbol. See #1 and #3 for details.

Research Notes - Analogy comparison

In order to serve its purpose, Mindbase must perform symbolic inference over analogies recorded within the system.

In MBQL this would look like:

$x = Bind("Hot")
$y = Ground($x : "Cold")

This means we wish to construct a Symbol $x containing all Members which are referenced by previously-defined associative analogies opposite the text Artifact "Cold".

Crucially, note that the text Artifact itself is not what is being referenced, but rather an instantiation or Claim of the Artifact. Artifacts themselves are just buckets of bits, and are devoid of any kind of Internal Meaning.

Symbol $y captures the set of those Alleged Analogies which contributed to the match – in the above case representing the relationship between the "Hot" and "Cold" symbols.

from mindbase/experiments/analogy_compare/main.rs:

    // In this experiment, we are approximating the following MBQL
    // $x = Bind("Hot")
    // $y = Ground($x : "Cold")

    let mut x = Symbol::null();
    let mut y = Symbol::null();

    // For simplicity, let's say these are all the analogies in the system
    let candidates = [Analogy::new("a1", sym!["Hot1"], sym!["Cold1"]),
                      Analogy::new("a2", sym!["Hot2"], sym!["Cold2"]),
                      Analogy::new("a3", sym!["Cold3"], sym!["Hot3"])];

    // NOTE - this should have an unassigned Spin, because it's a match pair
    let search_pair = FuzzySet::from_left_right("Hot", "Cold");
    println!("Searching for {}", search_pair.diag_lr());

    for candidate in &candidates {
        let v = candidate.intersect(&search_pair).expect("All of the above should match");
        x.members.extend(v.left());

        y.members.insert(Member {
            id:   candidate.id.clone(),
            spin: Spin::Up, /* This is WRONG for a3. It should be Down, because the order
                             * of the association is reversed. How do we fix this? */
            side: Side::Middle,
        });
    }

    println!("symbol x is: [{}]", x.members.diag());
    println!("symbol y is: [{}]", y.members.diag());

Which renders:

[screenshot of console output, 2020-05-13]

This is wrong, because we get an x value of [Cold3˱↓,Hot1˱↑,Hot2˱↑] when we should get [Hot1˱↑,Hot2˱↑,Hot3˲↓]

y is also wrong. We get [a1ᐧ↑,a2ᐧ↑,a3ᐧ↑] when we should get [a1ᐧ↑,a2ᐧ↑,a3ᐧ↓] indicating that a3 is included, but with reversed polarity, because the order of the associations is the opposite of the search criteria.

See this code in the Rust playground.

Questions and notes for next time:

  • How do we fix Analogy::intersect to render the right results for x and y?
  • How does this compose across multiple levels of Associative analogy? E.g. ("Smile":"Mouth"):("Wink":"Eye")
  • An associative Analogy involves two FuzzySets of Members (Left and Right) which are NOT associable pairwise as Members, only as whole sets. This is because each FuzzySet represents a symbol in its own right.
    • Given a perfect match of left-handed Members, all right-handed Members can be safely inferred.
    • Given a PARTIAL match of left-handed Members, the entirety of the right-handed set is inferred, but with a lesser degree of confidence, commensurate with the number of left-handed matches. The inverse is true of right-handed matches as well.
    • With what data structure do we represent this partial matching, and how does it compose in the aforementioned situation? Presumably via some recursive confidence score; a speculative sketch follows this list.
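
As promised above, here is a speculative sketch of one shape such a recursive confidence score could take (entirely an assumption on this open question, not settled design):

    // Speculative: analogies that nest, e.g. ("Smile":"Mouth"):("Wink":"Eye"),
    // form a tree in which each level carries its own match degree, and
    // degrees multiply as inference recurses downward.
    enum Node {
        Leaf { degree: f64 },                                     // a matched Member
        Pair { degree: f64, left: Box<Node>, right: Box<Node> },  // an Analogy
    }

    fn confidence(node: &Node) -> f64 {
        match node {
            Node::Leaf { degree } => *degree,
            // The pair's own match degree attenuates whatever confidence
            // the nested sides carry:
            Node::Pair { degree, left, right } => degree * confidence(left).min(confidence(right)),
        }
    }

    fn main() {
        // ("Smile":"Mouth") matched at 0.8, ("Wink":"Eye") at 0.9,
        // and the outer association between them at 0.75:
        let tree = Node::Pair {
            degree: 0.75,
            left:  Box::new(Node::Leaf { degree: 0.8 }),
            right: Box::new(Node::Leaf { degree: 0.9 }),
        };
        println!("composite confidence: {:.2}", confidence(&tree)); // 0.60
    }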

Research Notes - Analogy comparison experiment

Following on Notes #1 and #3 ...

Within the analogy_compare experiment, I have managed to get Analogy querying working, such that a candidate analogy may be tested against an AnalogyQuery struct.

This now successfully provides the correct results:

fn experiment() {
    // In this experiment, we are approximating the following MBQL
    // $x = Bind("Hot")
    // $y = Ground($x : "Cold")

    let mut x = Symbol::null();
    let mut y = FuzzySet::new();

    // For simplicity, let's say these are all the analogies in the system
    let candidates = [//
                      Analogy::from_left_right("a1", sym!["Hot1", "Hot2", "Heated1"], sym!["Mild1", "Mild2", "Cold3"]),
                      Analogy::from_left_right("a2", sym!["Hot3"], sym!["Cold1", "Cold2"]),
                      Analogy::from_left_right("a3", sym!["Cold3"], sym!["Hot3"])];

    // Imagine we looked up all ClaimIDs for all Claims related to Artifacts "Hot" and "Cold"
    // query is an AnalogyQuery struct, which contains a FuzzySet<AnalogyMember>. 
    // from_left_right constructs an AnalogyQuery with one AnalogyMember per each ClaimID "Hot1", "Hot2", etc 
    let query = AnalogyQuery::from_left_right(sym!["Hot1", "Hot2", "Hot3"], sym!["Cold1", "Cold2", "Cold3"]);
    println!("Query is: {}", query);

    for candidate in &candidates {
        let v: FuzzySet<analogy::AnalogyMember> = candidate.interrogate(&query).expect("All of the above should match");
        println!("v is {}", v);
        x.set.union(v.left());
        y.union(v);
    }

    println!("symbol x is: {}", x);
    println!("symbol y is: {}", y);
}

Which renders the following:

Query is: [Hot1~1.00, Hot2~1.00, Hot3~1.00 <-> Cold1~1.00, Cold2~1.00, Cold3~1.00]
v is [Hot1~0.33, Hot2~0.33 <-> Cold3~0.67]
v is [Hot3~0.67 <-> Cold1~0.33, Cold2~0.33]
v is [Hot3~0.33 <-> Cold3~0.33]
symbol x is: {Hot1~0.33, Hot2~0.33, Hot3~0.67}
symbol y is: [Hot1~0.33, Hot2~0.33, Hot3~0.67 <-> Cold1~0.33, Cold2~0.33, Cold3~0.67]

When both sides of the candidate analogy match the query to at least some degree, we include the matching terms from each side, but we have to scale each term's degree within the output set based on the degree of the match on the opposite side.

With that in mind, we can see that the first candidate Analogy experiences a 2/3 match to the members of the left side of the AnalogyQuery and a 1/3 match to the right side. That means we want to scale those matching members of the right by the degree of the match on the left, and vice versa. Thus yielding [Hot1~0.33, Hot2~0.33 <-> Cold3~0.67]

(Of course, if either side of the AnalogyQuery has a zero degree of matching, then the candidate Analogy is fully rejected, and there is no resultant FuzzySet for that candidate.)

So, why do we want to do this scaling of output members within the set by the degree of the opposing side? This is because an Analogy creates an associative relationship between the Left Symbol and the right Symbol (both of which are themselves FuzzySets, at least before we convert them into a "sided" fuzzyset within the Analogy).

We expressly lack pairwise relationships between members of these two FuzzySets, because each one represents some abstract concept. We are inferring the applicability of the right side based on the matching of the left side, and vice versa. It therefore stands to reason that such inference be scaled by the strength of the match of the opposing symbol.
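
For illustration only, here is a stripped-down sketch of that cross-scaling rule, using plain string sets in place of FuzzySets (this is not the actual interrogate() implementation); it reproduces the a1 candidate's output above:

    // Cross-scaling sketch: the matched fraction on each side of the
    // query scales the degrees of the matches on the OPPOSITE side.
    use std::collections::BTreeSet;

    fn interrogate(
        cand_left: &[&str], cand_right: &[&str],
        query_left: &BTreeSet<&str>, query_right: &BTreeSet<&str>,
    ) -> Option<(Vec<(String, f64)>, Vec<(String, f64)>)> {
        let left_hits: Vec<&str> = cand_left.iter().copied().filter(|m| query_left.contains(m)).collect();
        let right_hits: Vec<&str> = cand_right.iter().copied().filter(|m| query_right.contains(m)).collect();
        if left_hits.is_empty() || right_hits.is_empty() {
            return None; // zero match on either side fully rejects the candidate
        }
        let left_frac = left_hits.len() as f64 / query_left.len() as f64;
        let right_frac = right_hits.len() as f64 / query_right.len() as f64;
        // Left matches are scaled by the right-side fraction, and vice versa:
        let left = left_hits.iter().map(|m| (m.to_string(), right_frac)).collect();
        let right = right_hits.iter().map(|m| (m.to_string(), left_frac)).collect();
        Some((left, right))
    }

    fn main() {
        let ql: BTreeSet<_> = ["Hot1", "Hot2", "Hot3"].into();
        let qr: BTreeSet<_> = ["Cold1", "Cold2", "Cold3"].into();
        // Candidate a1 from the experiment above:
        let (l, r) = interrogate(&["Hot1", "Hot2", "Heated1"], &["Mild1", "Mild2", "Cold3"], &ql, &qr).unwrap();
        println!("{l:?} <-> {r:?}"); // Hot1~0.33, Hot2~0.33 <-> Cold3~0.67
    }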

A brief digression on Symbols:

A Symbol represents / abstracts some concept by constraining degrees of freedom within a high-dimensional semantic space via its Members and their respective degrees.

When I conjure the abstract notion of a "Dog" within my mind, there is some ephemeral meaning which is constrained in its degrees of freedom based on an elaborate unspoken context. Maybe I was thinking of the fur, plus a specific dog from childhood, plus the abstract idea of kinship with another creature, plus several other dimensions which I would have difficulty articulating.

Even if we imagine that a "Perfect" brain-computer interface magically existed, the concept which I conjured would still occupy a more or less fuzzy region within semantic space which has been constrained to some degree. This is not a question of perfect rendering. While the region within said semantic space may be more sharply or dully partitioned, said boundary cannot ever be fully crisp. That we clamp that degree between 0.0 and 1.0 is merely an implementation detail rather than a representation of actually-perfect confidence or non-confidence.

Weighted union of candidate Analogies

One thing within the experimental code which is almost certainly wrong is the way unions are being performed across the output of each candidate Analogy interrogation.

We must explore a more appropriate means of composing these candidate Analogy interrogation outputs in a weighted fashion, rather than simply taking the maximum degree of each discrete matching member into the final output FuzzySet.

This is screwy, because we likely don't want Members from a small subset of candidate Analogies with a high degree of matching to compete on equal footing with a corpus of thousands with a low matching degree, as a simple maximum-degree-of-membership union might provide (the current code does this).

However, we also don't want to attenuate the signal of such a well-matching subset of candidate Analogies as a simple weighted score would suggest either. Presumably there is some middle ground which must be found, whereby these considerations are balanced. Not a simple weighted score, and not a maximum-degree of FuzzySet membership either.
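
One possible middle ground, offered purely as an assumption rather than a settled answer, is a generalized (power) mean: at p = 1 it is the plain average, and as p grows it approaches the maximum-degree union, so a small well-matching subset gets boosted without a single outlier dominating:

    // A power mean interpolates between average (p = 1) and max (p -> inf).
    // Purely exploratory; not the project's answer to this problem.
    fn power_mean(degrees: &[f64], p: f64) -> f64 {
        let n = degrees.len() as f64;
        (degrees.iter().map(|d| d.powf(p)).sum::<f64>() / n).powf(1.0 / p)
    }

    fn main() {
        // A well-matching small subset vs. a large weakly-matching corpus:
        let mut degrees = vec![0.9, 0.9];                         // strong signal
        degrees.extend(std::iter::repeat(0.1).take(100));         // weak corpus
        for p in [1.0, 4.0, 16.0] {
            println!("p={p:>4}: {:.2}", power_mean(&degrees, p));
        }
        // p=   1: 0.12  (signal drowned out)
        // p=   4: 0.34
        // p=  16: 0.70  (signal boosted, but still short of max = 0.9)
    }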

For the time being, I will call this the Fuzzyset-union signal-to-noise ratio problem.
