
das-poc's Introduction

Distributed Atom Space (DAS)

Description:

This repo aims to develop a new design for storing MeTTa expressions in a database accessed through an API. Our first approach uses MongoDB (expressions) + Couchbase (indexes).

Examples:

As a simple example, we have the following expressions:

(: Evaluation Type)
(: Predicate Type)
(: Reactome Type)
(: Concept Type)
(: "Predicate:has_name" Predicate)
(: "Reactome:R-HSA-164843" Reactome)
(: "Concept:2-LTR circle formation" Concept)
(
	Evaluation 
	"Predicate:has_name" 
	(
	    Evaluation 
	    "Predicate:has_name" 
	    {"Reactome:R-HSA-164843" "Concept:2-LTR circle formation"}
	)
)

MongoDB:

The _id must be built by hashing (sha256) the document's fields to avoid duplication. For simplicity, we use plain integers in this example.

NodeTypes: [
    { _id: 1, type: null, name: "Unknown" },
    { _id: 2, type: null, name: "Type" },
    { _id: 3, type: 2, name: "Evaluation" },
    { _id: 4, type: 2, name: "Predicate" },
    { _id: 5, type: 2, name: "Reactome" },
    { _id: 6, type: 2, name: "Concept" },
]

Nodes: [
    { _id: 7, type: 4, name: "Predicate:has_name" },
    { _id: 8, type: 5, name: "Reactome:R-HSA-164843" },
    { _id: 9, type: 6, name: "Concept:2-LTR circle formation" },
]

Links_1: []  (there are no arity-1 links in this example)

Links_2: [
    {
	    _id: 10,
	    set_from: 1,
	    is_root: false,
	    type: {Reactome, Concept},
	    key1: 8,
	    key2: 9,
    },
]

Links_3: [
    {
	    _id: 11,
	    set_from: null,
	    is_root: false,
	    type: [Type, Predicate, {Reactome, Concept}],
	    key1: 3,
	    key2: 7,
	    key3: 10,
    },
    {
	    _id: 12,
	    set_from: null,
	    is_root: true,
	    type: [Type, Predicate, [Type, Predicate, {Reactome, Concept}]],
	    key1: 3,
	    key2: 7,
	    key3: 11,
    },
]
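
To make the collection layout concrete, here is a minimal pymongo sketch of how these documents might be looked up (the connection URI, database name, and query are assumptions for illustration, not taken from the repo):

    # Minimal sketch, assuming a local MongoDB instance and the collections shown above.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed URI
    db = client["das"]                                 # assumed database name

    # Fetch a node by name, then find the arity-2 links that reference it.
    node = db["Nodes"].find_one({"name": "Reactome:R-HSA-164843"})
    links = db["Links_2"].find({"$or": [{"key1": node["_id"]}, {"key2": node["_id"]}]})
    for link in links:
        print(link["_id"], link["type"])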

As an example of how sha256 will be used here:

    _id: XX ->  sha256(sha256(type), key1, key2, ...)   (each keyN is itself an _id hash)
    _id: 10 ->  sha256(sha256(set_salt, 5, 6), 8, 9)
    _id: 11 ->  sha256(sha256(2, 4, sha256(set_salt, 5, 6)), 3, 7, 10)
    _id: 12 ->  sha256(sha256(2, 4, sha256(2, 4, sha256(set_salt, 5, 6))), 3, 7, 11)
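
A minimal Python sketch of this recursive hashing scheme (how the parts are joined and the actual set_salt value are assumptions; the real implementation may differ):

    # Minimal sketch of the recursive _id hashing (joining scheme and salt are assumptions).
    import hashlib

    def h(*parts) -> str:
        m = hashlib.sha256()
        for p in parts:
            m.update(str(p).encode("utf-8"))
        return m.hexdigest()

    SET_SALT = "set_salt"  # placeholder; the real salt value is an implementation detail

    # _id 10: the set {Reactome Concept} link over nodes 8 and 9
    id_10 = h(h(SET_SALT, 5, 6), 8, 9)
    # _id 11: (Evaluation "Predicate:has_name" <set>); its type hash wraps the set's type hash
    id_11 = h(h(2, 4, h(SET_SALT, 5, 6)), 3, 7, 10)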

Notes:

  • The field named is_root is NOT used for hashing.
  • Each document that represents an expression has a field named set_from. This field indicates:
    • when null, that the keys in the document are not ordered in any way;
    • when 1, that the keys in the document are ordered alphabetically starting from the first key;
    • when 2, that the keys in the document are ordered alphabetically starting from the second key.
  • The set_from field is different from null when:
    • the expression represents a set ({ ... }); then set_from is set to 1.
    • the first key in the expression points to a Similarity node type; then set_from is set to 2.
  • The set_from field IS used for hashing.
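
A minimal sketch of how set_from could drive the key ordering described above (reconstructed from these notes; the 1-based indexing and string sorting are assumptions):

    # Minimal sketch: order an expression's keys according to set_from.
    def order_keys(keys, set_from):
        if set_from is None:
            return keys                        # keep the order from the expression
        i = set_from - 1
        return keys[:i] + sorted(keys[i:])     # sort alphabetically from that position onward

    order_keys(["b", "a"], 1)                  # a set: all keys sorted -> ["a", "b"]
    order_keys(["Similarity", "b", "a"], 2)    # -> ["Similarity", "a", "b"]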

Couchbase:

IncomingSet:
{
    8: [10],
    9: [10],
    3: [11, 12],
    7: [11, 12],
    10: [11],
    11: [12]
}

RecursiveIncomingSet:
{
     8: [10, 11, 12],
     9: [10, 11, 12],
     3: [11, 12],
     7: [11, 12],
    10: [11, 12],
    11: [12]
}

OutgoingSet:
{
    10: [8, 9],
    11: [3, 7, 10],
    12: [3, 7, 11]
}

RecursiveOutgoingSet:
{
    10: [8, 9],
    11: [3, 7, 10, 8, 9],
    12: [3, 7, 11, 10, 8, 9]
}
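
The recursive variants are just the transitive closure of the plain sets. A minimal sketch, with plain Python dicts standing in for the Couchbase collections:

    # Minimal sketch: a recursive set is the transitive closure of the plain one.
    def recursive_set(table, key):
        seen, queue = [], list(table.get(key, []))
        while queue:
            k = queue.pop(0)
            if k not in seen:
                seen.append(k)
                queue.extend(table.get(k, []))
        return seen

    outgoing = {10: [8, 9], 11: [3, 7, 10], 12: [3, 7, 11]}
    recursive_set(outgoing, 12)  # -> [3, 7, 11, 10, 8, 9]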

At this point, we hit a size limitation for values in Couchbase collections: not rarely, some keys in the IncomingSet collection accumulate values beyond the 20 MB limit defined by Couchbase. To work around this limitation, we implemented a rule that splits the values into sub-keys. (For simplicity, the example below uses a limit of one value per key; the real implementation uses 500,000 as the maximum number of values.) The rule works as follows: once a main key holds more values than the limit, it is split into two sub-keys, and whenever the most recently created sub-key reaches the limit, a new sub-key is created. The integer stored under the main key is the number of sub-keys that exist for it. Each sub-key's identifier is the main key plus a counter, separated by an underscore (_); the counter starts at zero and ends at the main key's integer minus one.

IncomingSet:
{
    8: [10],
    9: [10],
    3: 2,
    3_0: [11],
    3_1: [12],
    7: 2,
    7_0: [11],
    7_1: [12],
    10: [11],
    11: [12]
}
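
A minimal sketch of this split/append rule, using a plain dict in place of the Couchbase collection and the example's limit of one value per key:

    # Minimal sketch of the split rule; a dict stands in for the Couchbase collection
    # and MAX_VALUES is 1 as in the example (the real implementation uses 500,000).
    MAX_VALUES = 1

    def add_value(store, key, value):
        entry = store.get(key)
        if isinstance(entry, int):                    # key already split into sub-keys
            last = f"{key}_{entry - 1}"
            if len(store[last]) >= MAX_VALUES:        # last sub-key is full: open a new one
                store[f"{key}_{entry}"] = [value]
                store[key] = entry + 1
            else:
                store[last].append(value)
        elif entry is not None and len(entry) >= MAX_VALUES:  # plain key hit the limit: split
            store[f"{key}_0"] = entry
            store[f"{key}_1"] = [value]
            store[key] = 2
        else:
            store.setdefault(key, []).append(value)

    store = {}
    add_value(store, "3", 11)
    add_value(store, "3", 12)
    # store == {"3": 2, "3_0": [11], "3_1": [12]}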

Here is another simple example to show how we create a graph from an expression:

(
    Evaluation
        "Predicate:P1"
        (
            (Evaluation "Predicate:P2" {"Gene:G1" "Gene:G2"})
            ("Concept:CN1" "Concept:CN2")
        )
)

(Figure: Example_2 graph)

Datasets:

You can find all the Atomese (.scm) files from gene-level-dataset_2020-10-20 already translated to MeTTa (.metta) in the data/bio_atomsapace directory.

The translation script used is in scripts/atomese2metta.

Getting started:

Go to the das/ directory for information on how to set up the necessary environment.

das-poc's People

Contributors

jonatasleon, andre-senna, arturgontijo, vsbogd, ricardoyabe


das-poc's Issues

[Redis] Design

Once we get MongoDB set up, we must create the key-value Redis database (keyed by _id hashes) for quick lookups.
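
A rough sketch of what such a lookup could look like with redis-py (the key layout and JSON encoding are assumptions; the design is still open):

    # Rough sketch with redis-py, assuming link documents are stored as JSON
    # under their _id hash (the actual key layout is still to be designed).
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("<_id hash>", json.dumps({"type": ["Reactome", "Concept"], "keys": [8, 9]}))
    doc = json.loads(r.get("<_id hash>"))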

[Arch] Cloud + Local Design

We must design how we will expose both Databases for users to query them.

The following stack should be tested:

  • [AWS] Load Balancer
  • [AWS] ECS (or EKS) with containers/pods running:
    • [AWS] MongoDB (or DynamoDB)
    • [AWS] Redis (or ElastiCache)

[Datasets] Storage Workflow

We must have a script to upload/download dataset files to the cloud (AWS S3), as storing them on GitHub is not practical.

[Hashing] Design

We must choose a hash function and validate that it can handle large datasets with very few (ideally zero) collisions.

A first implementation would use SHA256.
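
As a rough sanity check (a standard birthday-bound estimate, not from the repo): with an n-bit hash, the probability of any collision among N items is about N^2 / 2^(n+1), which is negligible for SHA256 even at extreme scales:

    # Birthday-bound estimate: P(collision) ~ N^2 / 2^(n+1) for N items and an n-bit hash.
    N = 10**12                # one trillion atoms, far beyond current datasets
    n = 256                   # SHA256 output size in bits
    print(N**2 / 2**(n + 1))  # ~4.3e-54: collisions are effectively impossible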

Map "joined" tables suggested by Michael in order to reduce the amount of FlyBase data needed to be loaded

See the discussion in the #bio-ai Slack channel.

The numbers reference specific data dumps available for download with links here: https://wiki.flybase.org/wiki/FlyBase:Downloads_Overview
Minimal FlyBase tables:

3.2.11 genes, transcripts, proteins
3.2.5 physical interactions
3.2.14 sequence ontology
3.2.15 gene map
3.2.20 non coding RNAs
3.3.1 gene ontology
3.2.2 genetic interactions
3.5.4-3.5.6 alleles/mutations
3.7.1, 3.7.2 human disease orthologs

[YACC] MeTTa

Create a syntactic analyzer for the MeTTa language.

[Setup] Single Command

In scripts/README.md we describe the steps to set up the DAS project. It would be practical to turn some of these steps into a single command.

docker system not starting correctly

when i follow the readme to set up the docker containers, i get this complaint

...
INFO: Waiting for Couchbase...
ERROR: password - The password must be at least 6 characters long.
ERROR: username - Username must not be empty
INFO: Couchbase is still being set up...
ERROR: Couchbase failed to be set up.

the containers are running

~/snet/das$ docker ps
CONTAINER ID   IMAGE         COMMAND                  CREATED        STATUS        PORTS                                                                                                                                                                          NAMES
181919666a5b   das_service   "python3 service/ser…"   11 hours ago   Up 11 hours                                                                                                                                                                                  das_das_service_1
717937ed397c   couchbase     "/entrypoint.sh couc…"   11 hours ago   Up 11 hours   8096-8097/tcp, 9123/tcp, 0.0.0.0:8091-8095->8091-8095/tcp, :::8091-8095->8091-8095/tcp, 11207/tcp, 11280/tcp, 0.0.0.0:11210->11210/tcp, :::11210->11210/tcp, 18091-18097/tcp   das_couchbase_1
18a518687a1f   mongo         "docker-entrypoint.s…"   11 hours ago   Up 11 hours   0.0.0.0:27017->27017/tcp, :::27017->27017/tcp                                                                                                                                  das_mongo_1

but i get an error trying to initialize an atomspace:

~/snet/das$ ./scripts/das-cli.sh create --new-das-name my_knowledge_base
Setting new Couchbase bucket to DAS 'my_knowledge_base'
ERROR: Cluster is not initialized, use cluster-init to initialize the cluster
Traceback (most recent call last):
  File "/opt/singnet/das/service/client.py", line 163, in <module>
    main()
  File "/opt/singnet/das/service/client.py", line 83, in main
    response = _check(stub.create(pb2.BindingRequest(name=das_name)))
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: <RC=0xD2[LCB_ERR_BUCKET_NOT_FOUND (210)], There was a problem while trying to send/receive your request over the network. This may be a result of a bad network or a misconfigured client or server, C Source=(src/bucket.c,1229)>"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:7025 {created_time:"2022-12-12T15:37:46.217781257+00:00", grpc_status:2, grpc_message:"Exception calling application: <RC=0xD2[LCB_ERR_BUCKET_NOT_FOUND (210)], There was a problem while trying to send/receive your request over the network. This may be a result of a bad network or a misconfigured client or server, C Source=(src/bucket.c,1229)>"}"

am i missing something?
