
grip's Introduction

Build Status License: MIT Godoc Gitter

GRIP

https://bmeg.github.io/grip/

GRIP stands for GRaph Integration Platform. It provides a graph interface on top of a variety of existing database technologies including: MongoDB, Elasticsearch, PostgreSQL, MySQL, MariaDB, Badger, and LevelDB.

Properties of a GRIP graph:

  • Both vertices and edges in a graph can have any number of properties associated with them.
  • There are many types of vertices and edges in a graph; two vertices may therefore be connected by many different types of edges, reflecting many different kinds of relationships.
  • Edges in the graph are directed, meaning they have a source and destination.

GRIP also provides a query API for traversing, analyzing, and manipulating your graphs. Its syntax is inspired by Apache TinkerPop. Learn more here.

Pathway Commons

To load Pathway Commons into a local instance of GRIP, first download the Pathway Commons source file:

curl -O https://www.pathwaycommons.org/archives/PC2/v12/PathwayCommons12.All.BIOPAX.owl.gz

Start the GRIP server (using the Pebble driver):

grip server --driver=pebble

In another terminal, create the graph:

grip create pc12

And load the file using the RDF loader:

grip rdf --gzip pc12 PathwayCommons12.All.BIOPAX.owl.gz -m "http://www.biopax.org/release/biopax-level3.owl#=" -m "http://pathwaycommons.org/pc12/#=pc12:"

Once the graph has been loaded into the database, you can view all of the different vertex and edge types in the graph:

grip list labels pc12

Or run an example query, such as counting all of the pathways:

grip query pc12 'V().hasLabel("Pathway").count()'

grip's People

Contributors

adamstruck, bwalsh, dependabot[bot], jordan2lee, kellrott, matthewpeterkort, pagreene


grip's Issues

Concurrency control

Concurrent writes are a fact of life. Most databases deal with this in some way. Arachne shouldn't be an exception.

Consider exploring the style of MVCC that Elasticsearch (and many others) employ, where the database will compare a version string before committing the update.
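The compare-before-commit style described above can be sketched as follows. This is a hypothetical illustration of optimistic concurrency control, not part of the GRIP/Arachne API; all names here are invented.

```python
class VersionConflict(Exception):
    """Raised when a write's expected version does not match the stored one."""


class VersionedStore:
    """Dict-backed stand-in for a database with MVCC-style version checks."""

    def __init__(self):
        self._docs = {}  # id -> (version, value)

    def get(self, doc_id):
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, value, expected_version):
        current_version, _ = self._docs.get(doc_id, (0, None))
        if current_version != expected_version:
            # A concurrent writer got there first; the caller must re-read and retry.
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, value)
        return current_version + 1
```

A writer reads the current version, applies its change, and submits both; the store rejects the write if another update landed in between.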

Readonly Option

Add config option to run read-only server, disabling writing API.

mongo: not saving nested fields

I'm struggling to track down why my task writes are not being saved correctly. My best guess is that PackVertex is not correct, and protoutil.AsMap does not correctly convert to a nested map.

db/mongo: memory leak

Mongo has a memory leak somewhere. No clue where, just have observed that a series of queries results in climbing memory, and once the queries stop, the memory level remains. Restarting the server drops the memory back down.

aql/python: query builder is incorrect

import aql

conn = aql.Connection("http://10.50.50.123:8000")

O = conn.graph("mortar")
files = O.query().V().hasLabel("Mortar.File")

print list(O.query().V().count().execute())
print list(O.query().E().count().execute())

print list(O.query().V().hasLabel("TES.Task").count().execute())
print list(O.query().V().hasLabel("TES.Task.Tag").count().execute())
print list(files.count().execute())

for f in files.execute():
    print f

this prints

python test.py
[{u'int_value': 14681}]
[{u'int_value': 26094}]
[{u'int_value': 5236}]
[{u'int_value': 107}]
[{u'int_value': 9294}]
{u'int_value': 9294}

but if I comment out the print list(files.count().execute()) line, it prints out the file vertices as I expected.

'Exponential growth' batch size for mongo and elastic search drivers

The query methods in the mongo and ES drivers (like GetVertexChannel) grab batches to reduce latency. In some tested cases, downstream methods like .limit() only need a few elements. This causes latency because the upstream stage still grabs full batches (i.e., fetching 1000 rows when it only needs one).
The request is to start the BatchSize variable at something small and increase it every request cycle until it hits its max.

Make aql.py installable

I'd like to have super easy access to client libraries and utils. Perhaps pip install aql?

Or, alternatively, wait until the sync with ophion and rely on that client instead?

"where + and" query performance changes depending on representation

  1. list(O.query().V().where(aql.eq("_label", "CNASegment")).where(aql.eq("referenceName", "8")).count())

  2. list(O.query().V().where(aql.and_(aql.eq("_label", "CNASegment"), aql.eq("referenceName", "8"))).count())

The first one returns in 1 minute. The second returns in 14 minutes. The resulting count is the same.

server: panic on no reachable servers

arachne server --port 5756 --rpc 5757 --mongo 127.0.0.1:27017
2018/02/04 14:10:39 Starting Server
2018/02/04 14:10:51 no reachable servers
2018/02/04 14:10:51 TCP+RPC server listening on 5757
2018/02/04 14:10:51 HTTP proxy connecting to localhost:5757
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17bfda1]

goroutine 73 [running]:
gopkg.in/mgo%2ev2.(*Collection).Find(0xc4203a0240, 0x0, 0x0, 0xc4202cc6a0)
	/Users/buchanae/src/gopkg.in/mgo.v2/session.go:2115 +0x31
github.com/bmeg/arachne/mongo.(*Graph).GetVertexList.func1(0xc4202c6c60, 0xc4203aa130, 0x1ff9000, 0xc4203a0450)
	/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:157 +0x82
created by github.com/bmeg/arachne/mongo.(*Graph).GetVertexList
	/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:169 +0x7f

CLI doesn't fail on unknown task

RJHB552 smc-het-graph-logs
arachne start -h
2018/01/03 09:56:49 Adding goja JS engine
2018/01/03 09:56:49 Adding otto JS engine

Changes to Proto API broke YAML graph load and example loader

# ./bin/arachne example
2018/05/20 17:02:22 Loading example graph data into example
2018/05/20 17:02:22 Loading example graphql schema into example-schema
Error: failed to unmarshal graph schema: json: cannot unmarshal string into Go struct field Struct.fields of type structpb.Value

variants: consider indexing more fields

This query takes far too long:
list(O.query().V().where(aql.eq("_label", "Variant")).where(aql.eq("chromosome", "1")).where(aql.eq("start", 27100988)).limit(10))

kellrott [12:05 PM]
700,000 variants on chromosome 1, and neither the chromosome nor the start field is indexed, so it's scanning them all

It's probably worth indexing chromosome, start, end, referenceBases, and alternateBases.

query: distinct on non-existent field does not fail

q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").distinct(["$.individualz.gid"]).limit(1).execute()[0].to_dict()

In the query above, the distinct field is accidentally wrong, but I still get results.

I guess this touches on the commonly recurring topic of schema vs. no schema: how can you know the field doesn't exist without a schema? I'll say that, as a user, not having the server tell me I'm wrong makes it harder to learn the query language and easier to make mistakes.
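The fail-loudly behavior asked for above could look like the following sketch: a resolver for "$.a.b"-style field paths that raises on a missing segment instead of silently matching nothing. This is an illustration, not GRIP's implementation.

```python
def resolve_path(doc, path):
    """Resolve a '$.a.b' style field path, failing loudly on missing segments."""
    parts = path.lstrip("$.").split(".")
    current = doc
    for part in parts:
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"field path {path!r} failed at segment {part!r}")
        current = current[part]
    return current
```

With this behavior, the typo "$.individualz.gid" would surface as an error instead of returning results.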

Bundle a small example graph

New users could get started with queries very quickly if a small example graph was embedded in arachne. The website docs could use this example graph to demonstrate all queries and functionality.

Lookup by _alias_

For example:

A genomic feature SNP has the following equivalent 'tags'

"synonyms": [
"NC_000009.11:g.133750356A>G",
"NG_012034.1:g.166089A>G",
"CM000671.1:g.133750356A>G",
"CM000671.2:g.130874969A>G",
"NC_000009.10:g.132740177A>G",
"chr9:g.133750356A>G",
"COSM12604",
"chr9:g.130874969A>G",
"NC_000009.12:g.130874969A>G"
],

Proteins have the same issue:

https://www.uniprot.org/help/protein_names

cmd/list: panic (from empty args?)

arachne list
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x143b817]

goroutine 22 [running]:
github.com/bmeg/arachne/aql.Client.GetGraphs.func1(0xc42020ce40, 0x1ffb020, 0xc420086388, 0x1ffe020, 0xc420086390)
	/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:57 +0x157
created by github.com/bmeg/arachne/aql.Client.GetGraphs
	/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:59 +0x89

Make IDs unique per label?

I'm finding that it's easy to create conflicting IDs in my application code. To manage that, I started prefixing the IDs by the vertex/edge type (label). That gives me peace of mind, but now I'm getting areas in my code where the ID hasn't been properly prefixed, so the vertex is (silently) not found.

I think it would be great if the database handled all this for me. Unique IDs per table/document type seems like a normal concept, so this feels like a reasonable feature.
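The manual prefixing workaround described above can be centralized in a small helper so the prefix is applied exactly once. This is a hypothetical client-side helper, not part of the GRIP API.

```python
def make_gid(label, local_id, sep=":"):
    """Build a globally unique ID by prefixing local_id with its label."""
    if local_id.startswith(label + sep):
        # Already prefixed; avoid double-prefixing.
        return local_id
    return f"{label}{sep}{local_id}"
```

Routing every ID through one function avoids the silent-miss problem where some code paths forget the prefix.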

arachne load vs arachne mongoload

Ran into a performance issue. Unclear if the difference in performance was expected:

  • arachne load g2p --vertex /data/mc3.Variant.Vertex.json

    • approx 125 documents / sec
    • default badger
  • arachne mongoload g2p --vertex /data/mc3.Variant.Vertex.json --host mongodb:27017

    • approx 8000 documents / sec

Decide/document how to update a vertex

Currently, updating an existing vertex requires first calling Get, then calling Add.

That's perfectly fine; for example, Google Datastore has the same behavior. It should at least be documented.

If you want, you could also provide an upsert option.
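The get-then-add update pattern can be sketched against a dict-backed stand-in for the vertex store; the real client would make Get/Add calls instead. The helper name and shape are illustrative only.

```python
def upsert_vertex(store, gid, updates):
    """Merge `updates` into an existing vertex's data, or create the vertex."""
    vertex = store.get(gid, {"gid": gid, "data": {}})
    vertex["data"].update(updates)
    store[gid] = vertex  # "Add" overwrites the stored vertex
    return vertex
```

An upsert option would fold these two steps into one server-side operation, avoiding the read round trip and the race between Get and Add.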

Consider merging Query and Edit services

Is there a reason for separating the Query and Edit services? It does have an effect on client code, adding extra setup effort. If there's not a particular reason, I'd argue it's better to simplify client creation by merging the services into one.

GraphQL schema reload

Rebuild the graph schema when the graphql graph changes. Right now, the GraphQL schema is only loaded when the server starts up.

Simple web UI

For easier development and debugging, let's add a really simple web UI that basically just dumps vertices and edges in table form.

aql/go: automatic (un)marshaling

Most databases provide helpers for (un)marshaling caller-defined Go struct types. In my experiments, I'm forced to marshal my struct to a string, then unmarshal into the protobuf Struct type, in order to fit the GraphElement type.

Arachne is using an OK approach (similar to what Google Datastore does); it's just missing the nice layer on top that makes it easy to work with.

server: change default ports

We should coordinate the ports used by our projects so they don't conflict. Funnel is already using 8000 and 9090.

aql.py: robust handling of server address

conn = aql.Connection("localhost:8000")

Results in:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    G.addVertex("test-1", "test-1-label")
  File "/Users/buchanae/src/scratch/smc-het-graph-logs/aql.py", line 56, in addVertex
    response = urllib2.urlopen(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 454, in _open
    'unknown_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: localhost>

because the address is missing the "http://" scheme.
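A robust client could normalize the address before use, defaulting to http:// when no scheme is given. This is a sketch of one possible approach, not code from aql.py.

```python
from urllib.parse import urlparse


def normalize_url(address):
    """Default bare host:port addresses to http://, rejecting odd schemes."""
    if "://" not in address:
        address = "http://" + address
    parsed = urlparse(address)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme}")
    return address
```

With this, aql.Connection("localhost:8000") and aql.Connection("http://localhost:8000") would behave identically.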

Per graph backend config

Currently, graph config is global, with a single backend supporting all graphs. With per-graph config, a single endpoint could hold graphs backed by a number of different databases.

aql: unwrapping "vertex" and "data"

When doing queries, I get rows like this:

 {u'vertex': {u'data': {u'chromosome': u'1',
    u'description': u'hes family bHLH transcription factor 4 [Source:HGNC Symbol%3BAcc:HGNC:24149]',
    u'end': 1000172,
    u'id': u'ENSG00000188290',
    u'seqId': u'1',
    u'start': 998962,
    u'strand': u'-',
    u'symbol': u'HES4'},
   u'gid': u'gene:ENSG00000188290',
   u'label': u'Gene'}}

Basically, all the data I want is always two levels deep. I constantly unwrap row["vertex"]["data"].
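The unwrapping could be centralized in a small client-side helper that flattens the envelope into one dict. This is a hypothetical helper, not something in aql.py; the underscore-prefixed keys are an invented convention to keep gid/label from colliding with data fields.

```python
def unwrap(row):
    """Flatten a {"vertex": {"data": ..., "gid": ..., "label": ...}} row."""
    element = row.get("vertex") or row.get("edge") or {}
    flat = dict(element.get("data", {}))
    flat["_gid"] = element.get("gid")
    flat["_label"] = element.get("label")
    return flat
```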

Move more of the graph database interface to streaming?

The GraphDataBaseInterface (GDBI) module defines how the Arachne query engine interfaces with a graph database ( https://github.com/bmeg/arachne/blob/master/gdbi/interface.go#L65 ). The method GetVertexListByID was added to allow query methods to create a bi-directional stream of IDs to elements. This really sped up the Mongo driver, because it could take batches of incoming IDs, query Mongo for all of them, and then return a batch of results, rather than having a transaction for every single request. You can see how this is taken advantage of in the Out function in the PipeEngine at https://github.com/bmeg/arachne/blob/master/gdbi/pipe_query_engine.go#L378

Should more of the gdbi.GraphDB interface be translated to use streaming?
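The buffering pattern that GetVertexListByID enables can be sketched as follows: accumulate incoming IDs, issue one lookup per batch, and stream results back out. The dict lookup stands in for a single Mongo $in query; names and the batch size are illustrative.

```python
def batched_lookup(ids, store, batch_size=100):
    """Stream elements for a stream of IDs, one store lookup per batch."""
    batch = []
    for gid in ids:
        batch.append(gid)
        if len(batch) >= batch_size:
            # One round trip covers the whole batch.
            yield from (store[g] for g in batch if g in store)
            batch = []
    if batch:
        yield from (store[g] for g in batch if g in store)
```

One round trip per batch, rather than per element, is where the speedup comes from.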

aql.py: automatically convert to list

My most common use case is running queries in an ipython/notebook setting; I'm often exploring the data, running lots of different types of queries. Often I'm not even interested in the full set of results; I just need a few. Also, if I do want the full set of results, it's often small enough that I don't need streaming.

AQL and Arachne are built to stream data. Currently the python client returns a generator/iterator from a query execution. When doing the type of work I described above, I constantly need to wrap every query in list(). It gets pretty annoying.

I'm not sure what the right API is. graph.query().V().list()?

Since I'm often just exploring, maybe graph.query().V().head() would be really nice too, so I'm less likely to pull down a large amount of data. Same for graph.query().V().first(). These should use a server side result limit, or some other way of efficiently limiting the result stream.
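The proposed conveniences can be sketched on top of the existing generator-returning execute(). The method names (list/head/first) are the ones suggested above, not an existing API; ideally head/first would also push a limit() to the server rather than truncating client-side, as the issue notes.

```python
from itertools import islice


def to_list(results):
    """Drain a result generator into a list."""
    return list(results)


def head(results, n=10):
    """Take the first n results without draining the stream."""
    return list(islice(results, n))


def first(results):
    """Take the first result, or None if the stream is empty."""
    return next(iter(results), None)
```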

db/bolt: panic

arachne server --bolt arachne.db
2018/02/27 19:39:25 Starting Server
2018/02/27 19:39:25 Starting BOLTDB
2018/02/27 19:39:25 TCP+RPC server listening on 8202
2018/02/27 19:39:25 HTTP proxy connecting to localhost:8202
2018/02/27 19:39:25 Fields: graphql.Fields{}
2018/02/27 19:39:25 GraphQL Schema: {Query <nil> <nil> [] []}
2018/02/27 19:39:25 HTTP API listening on port: 8201
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa7dbc1]

goroutine 109 [running]:
github.com/bmeg/arachne/timestamp.(*Timestamp).Touch(0x0, 0xc420464230, 0x6)
	/home/ubuntu/src/github.com/bmeg/arachne/timestamp/timestamp.go:20 +0x111
github.com/bmeg/arachne/kvgraph.(*KVGraph).AddGraph(0xc420288940, 0xc420464230, 0x6, 0xc4204664d0, 0xc42046a540)
	/home/ubuntu/src/github.com/bmeg/arachne/kvgraph/kvgraph.go:24 +0x4a
github.com/bmeg/arachne/graphserver.(*GraphEngine).AddGraph(0xc4202539c0, 0xc420464230, 0x6, 0xf5c900, 0xc420466401)
	/home/ubuntu/src/github.com/bmeg/arachne/graphserver/graph_engine.go:31 +0x47
github.com/bmeg/arachne/graphserver.(*ArachneServer).AddGraph(0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc42046a560, 0xc4202539c0, 0x0, 0xc420449a70)
	/home/ubuntu/src/github.com/bmeg/arachne/graphserver/server.go:136 +0x47
github.com/bmeg/arachne/aql._Edit_AddGraph_Handler(0xe0d240, 0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc4204664d0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/bmeg/arachne/aql/aql.pb.go:2086 +0x241
google.golang.org/grpc.(*Server).processUnaryRPC(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0xc420275710, 0x1504248, 0x0, 0x0, 0x0)
	/home/ubuntu/src/google.golang.org/grpc/server.go:920 +0x848
google.golang.org/grpc.(*Server).handleStream(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0x0)
	/home/ubuntu/src/google.golang.org/grpc/server.go:1142 +0x1318
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4202ac600, 0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40)
	/home/ubuntu/src/google.golang.org/grpc/server.go:637 +0x9f
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/home/ubuntu/src/google.golang.org/grpc/server.go:635 +0xa1

Tried to follow README, got error. Also, how to turn on server?

This line

python src/github.com/bmeg/arachne/test/test_amazon_load.py amazon-meta.txt.gz  http://localhost:8000

I believe the second argument needs to be something about the output file, not a host. At least, that is how I got it to work.

Also, I could not complete the test, as the line

Turn on local arachne server

does not explain how this is done. I assume a go run of one of these files?

500 error from query

In [36]: q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-36-79aa868e8224> in <module>()
----> 1 q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()

/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in execute(self, stream)
    263         else:
    264             output = []
--> 265             for r in self.__stream():
    266                 output.append(r)
    267             return output

/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in __stream(self)
    224                                  json={"query": self.query},
    225                                  stream=True)
--> 226         response.raise_for_status()
    227         for result in response.iter_lines():
    228             try:

/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/requests/models.pyc in raise_for_status(self)
    933
    934         if http_error_msg:
--> 935             raise HTTPError(http_error_msg, response=self)
    936
    937     def close(self):

HTTPError: 500 Server Error: Internal Server Error for url: http://arachne.compbio.ohsu.edu/v1/graph/bmeg/query

Panic from query

2017/12/23 19:59:56 http: panic serving [::1]:53740: reflect: call of reflect.Value.Interface on zero Value
goroutine 27 [running]:
net/http.(*conn).serve.func1(0xc4268fc000)
	/usr/local/go/src/net/http/server.go:1721 +0xd0
panic(0x18febc0, 0xc4262e0f00)
	/usr/local/go/src/runtime/panic.go:489 +0x2cf
reflect.valueInterface(0x0, 0x0, 0x0, 0x1, 0x19cebc0, 0x0)
	/usr/local/go/src/reflect/value.go:930 +0x1fa
reflect.Value.Interface(0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:925 +0x44
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).marshalNonProtoField(0xc4201567b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9, 0x193)
	/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:80 +0x5c4
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).Marshal(0xc4201567b0, 0x0, 0x0, 0x6, 0x2088100, 0xc4262ccf30, 0x1a1743f, 0x5)
	/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:29 +0x174
github.com/bmeg/arachne/graphserver.(*MarshalClean).Marshal(0xc420192230, 0x1908ce0, 0xc4262ccf90, 0xc4262ccf30, 0xc4262ccf90, 0xc4262ccf30, 0x0, 0x0)
	/Users/buchanae/src/github.com/bmeg/arachne/graphserver/webserver.go:39 +0x99
github.com/grpc-ecosystem/grpc-gateway/runtime.handleForwardResponseStreamError(0x1ed5301, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0x1ecbf00, 0xc4262ccf30)
	/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:151 +0xae
github.com/grpc-ecosystem/grpc-gateway/runtime.ForwardResponseStream(0x3162000, 0xc4201fc2d0, 0xc425592050, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc4201308a0, 0x2086e10, ...)
	/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:55 +0x91d
github.com/bmeg/arachne/aql.RegisterQueryHandler.func1(0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc420019f20)
	/Users/buchanae/src/github.com/bmeg/arachne/aql/aql.pb.gw.go:561 +0x3a1
github.com/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP(0xc425592050, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
	/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/mux.go:198 +0x10f5
github.com/gorilla/mux.(*Router).ServeHTTP(0xc4201906c0, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
	/Users/buchanae/src/github.com/gorilla/mux/mux.go:150 +0x101
net/http.serverHandler.ServeHTTP(0xc420089600, 0x1ed8380, 0xc42018e1c0, 0xc4255ae200)
	/usr/local/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4268fc000, 0x1ed91c0, 0xc4201b9480)
	/usr/local/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2668 +0x2ce
