bmeg / grip
Graph Integration Platform
Home Page: https://bmeg.github.io/grip
License: MIT License
I noticed the aql API is inconsistent: in some places a Vertex is used, in others a *Vertex.
arachne list
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x143b817]
goroutine 22 [running]:
github.com/bmeg/arachne/aql.Client.GetGraphs.func1(0xc42020ce40, 0x1ffb020, 0xc420086388, 0x1ffe020, 0xc420086390)
/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:57 +0x157
created by github.com/bmeg/arachne/aql.Client.GetGraphs
/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:59 +0x89
Currently, updating an existing vertex requires first calling Get, then calling Add.
That's perfectly fine; for example, Google Datastore has this behavior. But it should at least be documented.
If you want, you could also provide an upsert option.
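The get-then-add pattern and an upsert shortcut can be sketched against an in-memory store; the store class and the upsert method here are illustrative assumptions, not arachne API:

```python
# Sketch of get-then-update vs. upsert semantics, using a plain dict
# as a stand-in for the vertex store. Nothing here is arachne code.

class VertexStore:
    def __init__(self):
        self._vertices = {}

    def get(self, gid):
        return self._vertices.get(gid)

    def add(self, gid, data):
        # Add overwrites the whole vertex, as described above.
        self._vertices[gid] = dict(data)

    def upsert(self, gid, data):
        # Hypothetical upsert: merge into the existing vertex if present.
        existing = dict(self._vertices.get(gid, {}))
        existing.update(data)
        self._vertices[gid] = existing

store = VertexStore()
store.add("gene:1", {"symbol": "HES4"})

# Updating via get + add requires re-sending all fields:
v = dict(store.get("gene:1"))
v["chromosome"] = "1"
store.add("gene:1", v)

# An upsert merges without the caller doing the read:
store.upsert("gene:1", {"start": 998962})
```

The upsert variant saves a round trip and removes the read-modify-write race, which is why it would pair well with the documented get-then-add behavior.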
The timestamp method should show when the underlying database has changed. Currently the mongo driver only watches changes that it makes itself, but if two arachne servers point at the same mongo db, they won't recognize when the db has changed. Need to use something like https://docs.mongodb.com/manual/changeStreams/ to update the timestamp.
arachne start -h
2018/01/03 09:56:49 Adding goja JS engine
2018/01/03 09:56:49 Adding otto JS engine
Mongo has a memory leak somewhere. No clue where, just have observed that a series of queries results in climbing memory, and once the queries stop, the memory level remains. Restarting the server drops the memory back down.
The client API is probably the most important part of a database's documentation: https://godoc.org/github.com/bmeg/arachne/aql
For AQL there's a bunch of unnecessary junk, such as Register* methods generated by protobuf.
Having commands like this makes development a lot easier, and allows users to quickly experiment.
@adamstruck I regret not going with this in funnel. What do you think? Should arachne stick with arachne vertex get or arachne get vertex?
My most common use case is based around running queries in an ipython/notebook setting; I'm often exploring the data, running lots of different types of queries. Often I'm not even interested in the full set of results, I just need a few. Also, if I do want the full set of results, it's often small enough that I don't need streaming results.
AQL and Arachne are built to stream data. Currently the python client returns a generator/iterator from a query execution. When doing the type of work I described above, I constantly need to wrap every query in list(). It gets pretty annoying.
I'm not sure what the right API is. graph.query().V().list()? Since I'm often just exploring, maybe graph.query().V().head() would be really nice too, so I'm less likely to pull down a large amount of data. Same for graph.query().V().first(). These should use a server-side result limit, or some other way of efficiently limiting the result stream.
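The convenience methods described above can be sketched on top of a streaming query; the Query class here is a self-contained stand-in for the python client, and only the method shapes are the point:

```python
# Sketch of list()/head()/first() conveniences over a streaming query.
# The Query class simulates the client; a real implementation would
# serialize the limit into the request so the server caps the stream.

class Query:
    def __init__(self, results, limit=None):
        self._results = results
        self._limit = limit

    def limit(self, n):
        # Server-side limit, simulated here by capping the source.
        return Query(self._results, limit=n)

    def execute(self):
        results = self._results
        if self._limit is not None:
            results = results[:self._limit]
        for r in results:  # generator, as the python client returns today
            yield r

    def to_list(self):
        # The list() wrapper, built in.
        return list(self.execute())

    def head(self, n=10):
        # Small peek at the data without pulling the full stream.
        return self.limit(n).to_list()

    def first(self):
        out = self.limit(1).to_list()
        return out[0] if out else None

q = Query(range(1000))
```

Because head() and first() go through limit(), the simulated server never materializes more than n results, which is the efficiency property asked for.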
q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").distinct(["$.individualz.gid"]).limit(1).execute()[0].to_dict()
In the query above, the distinct field is accidentally wrong, but I still get results.
I guess this touches on the commonly recurring topic of schema vs. no schema: how can you know the field doesn't exist without a schema? I'll say that, as a user, not having the server tell me I'm wrong makes it harder to learn the query language and easier to make mistakes.
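Even a minimal schema would let the server reject the bad selector above instead of silently returning results. A sketch, with illustrative label and field names and an assumed "$.mark.field" selector syntax:

```python
# Sketch: with a minimal schema (label -> known fields), a selector
# like "$.individualz.gid" can be rejected up front. The schema and
# selector syntax are assumptions for illustration.

SCHEMA = {
    "Individual": {"gid", "source"},
    "Sample": {"gid", "disease_code"},
}

def check_selector(selector, marks):
    # marks maps a mark name (e.g. "individual") to its vertex label
    mark, field = selector.lstrip("$.").split(".", 1)
    if mark not in marks:
        raise ValueError("unknown mark: %s" % mark)
    if field not in SCHEMA[marks[mark]]:
        raise ValueError("unknown field: %s" % field)
    return True
```

With this check in the query planner, the typo "individualz" would come back as an error instead of a silently empty distinct.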
arachne server --port 5756 --rpc 5757 --mongo 127.0.0.1:27017
2018/02/04 14:10:39 Starting Server
2018/02/04 14:10:51 no reachable servers
2018/02/04 14:10:51 TCP+RPC server listening on 5757
2018/02/04 14:10:51 HTTP proxy connecting to localhost:5757
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17bfda1]
goroutine 73 [running]:
gopkg.in/mgo%2ev2.(*Collection).Find(0xc4203a0240, 0x0, 0x0, 0xc4202cc6a0)
/Users/buchanae/src/gopkg.in/mgo.v2/session.go:2115 +0x31
github.com/bmeg/arachne/mongo.(*Graph).GetVertexList.func1(0xc4202c6c60, 0xc4203aa130, 0x1ff9000, 0xc4203a0450)
/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:157 +0x82
created by github.com/bmeg/arachne/mongo.(*Graph).GetVertexList
/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:169 +0x7f
I'd like to have super easy access to client libraries and utils. Perhaps pip install aql? Or, alternatively, wait until the sync with ophion and rely on that client instead?
# ./bin/arachne example
2018/05/20 17:02:22 Loading example graph data into example
2018/05/20 17:02:22 Loading example graphql schema into example-schema
Error: failed to unmarshal graph schema: json: cannot unmarshal string into Go struct field Struct.fields of type structpb.Value
This line
python src/github.com/bmeg/arachne/test/test_amazon_load.py amazon-meta.txt.gz http://localhost:8000
I believe the second argument needs to be something about the output file, not a host. At least, that is how I got it to work.
Also, I could not complete the test, as the line "Turn on local arachne server" does not elaborate as to how this is done. I assume a go run of one of these files?
I'm finding that it's easy to create conflicting IDs in my application code. To manage that, I started prefixing the IDs by the vertex/edge type (label). That gives me peace of mind, but now I'm getting areas in my code where the ID hasn't been properly prefixed, so the vertex is (silently) not found.
I think it would be great if the database handled all this for me. Unique IDs per table/document type seems like a normal concept, so this feels like a reasonable feature.
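Until the database handles this, the prefixing convention can at least be pushed into one helper so application code can't forget it; names here are illustrative, not arachne API:

```python
# Sketch of centralizing label-prefixed IDs on the client side, so a
# missing prefix fails loudly instead of as a silent not-found.

def gid(label, local_id):
    # Build the canonical "Label:local_id" form in exactly one place.
    return "%s:%s" % (label, local_id)

def check_gid(label, g):
    # Validate before a lookup, instead of getting an empty result.
    prefix = label + ":"
    if not g.startswith(prefix):
        raise ValueError("gid %r is not prefixed with label %r" % (g, label))
    return g
```

This is only a client-side workaround; per-label ID namespaces enforced by the server would still be the better fix.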
Ran into a performance issue. Unclear if the difference in performance was expected:
arachne load g2p --vertex /data/mc3.Variant.Vertex.json
arachne mongoload g2p --vertex /data/mc3.Variant.Vertex.json --host mongodb:27017
I'm struggling to track down why my task writes are not being saved correctly. My best guess is that PackVertex is not correct, and protoutil.AsMap does not correctly convert to a nested map.
This query takes far too long:
list(O.query().V().where(aql.eq("_label", "Variant")).where(aql.eq("chromosome", "1")).where(aql.eq("start", 27100988)).limit(10))
kellrott [12:05 PM]
700,000 variants on chromosome 1, and neither the chromosome nor the start fields are indexed, so it's scanning them all.
It's probably worth indexing chromosome, start, end, referenceBases, and alternateBases.
See #92 for conformance test
Is there a reason for separating the Query and Edit services? It does have an effect on client code, adding extra effort required to set up a client. If there's not a particular reason, I'd argue that it's better to simplify client creation by merging the services into one.
Catch-all place for developer support:
What is the equivalent of this ES query in arachne?
(features.chromosome:8 AND features.start:(128748317 OR 128748316 OR 128748315 OR 128748314 OR 128748313))
arachne server --bolt arachne.db
2018/02/27 19:39:25 Starting Server
2018/02/27 19:39:25 Starting BOLTDB
2018/02/27 19:39:25 TCP+RPC server listening on 8202
2018/02/27 19:39:25 HTTP proxy connecting to localhost:8202
2018/02/27 19:39:25 Fields: graphql.Fields{}
2018/02/27 19:39:25 GraphQL Schema: {Query <nil> <nil> [] []}
2018/02/27 19:39:25 HTTP API listening on port: 8201
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa7dbc1]
goroutine 109 [running]:
github.com/bmeg/arachne/timestamp.(*Timestamp).Touch(0x0, 0xc420464230, 0x6)
/home/ubuntu/src/github.com/bmeg/arachne/timestamp/timestamp.go:20 +0x111
github.com/bmeg/arachne/kvgraph.(*KVGraph).AddGraph(0xc420288940, 0xc420464230, 0x6, 0xc4204664d0, 0xc42046a540)
/home/ubuntu/src/github.com/bmeg/arachne/kvgraph/kvgraph.go:24 +0x4a
github.com/bmeg/arachne/graphserver.(*GraphEngine).AddGraph(0xc4202539c0, 0xc420464230, 0x6, 0xf5c900, 0xc420466401)
/home/ubuntu/src/github.com/bmeg/arachne/graphserver/graph_engine.go:31 +0x47
github.com/bmeg/arachne/graphserver.(*ArachneServer).AddGraph(0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc42046a560, 0xc4202539c0, 0x0, 0xc420449a70)
/home/ubuntu/src/github.com/bmeg/arachne/graphserver/server.go:136 +0x47
github.com/bmeg/arachne/aql._Edit_AddGraph_Handler(0xe0d240, 0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc4204664d0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/ubuntu/src/github.com/bmeg/arachne/aql/aql.pb.go:2086 +0x241
google.golang.org/grpc.(*Server).processUnaryRPC(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0xc420275710, 0x1504248, 0x0, 0x0, 0x0)
/home/ubuntu/src/google.golang.org/grpc/server.go:920 +0x848
google.golang.org/grpc.(*Server).handleStream(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0x0)
/home/ubuntu/src/google.golang.org/grpc/server.go:1142 +0x1318
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4202ac600, 0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40)
/home/ubuntu/src/google.golang.org/grpc/server.go:637 +0x9f
created by google.golang.org/grpc.(*Server).serveStreams.func1
/home/ubuntu/src/google.golang.org/grpc/server.go:635 +0xa1
The current falcor endpoint is just a stub to collect debugging information. See https://netflix.github.io/falcor/documentation/router.html on how to respond to Falcor json path requests
@kellrott I'm not very familiar with the full set of testing here (e.g. python conformance tests). Can you ensure make test includes everything?
The query methods in the mongo and ES drivers (like GetVertexChannel) grab batches to reduce latency. In some tested cases downstream methods, like .limit(), only need a few elements. This causes latency because the upstream element still grabs full batches (i.e. grabbing 1000 rows when it only needs one). The request would be to start the BatchSize variable at something small and increase it every request cycle until it hits its max.
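The ramp-up can be sketched with a generator; the start, factor, and max values below are illustrative, not proposed constants:

```python
# Sketch of the proposed BatchSize ramp-up: start small and grow each
# fetch cycle until a max, so a downstream limit() only pays for a
# small first batch. Slicing a list stands in for a backend fetch.

def batched(source, start=10, factor=10, max_size=1000):
    size = start
    i = 0
    while i < len(source):
        # Each yield corresponds to one backend round trip.
        yield source[i:i + size]
        i += size
        size = min(size * factor, max_size)
```

A consumer that stops after the first element triggers only the first small batch, while a full scan quickly reaches the max batch size and keeps the per-row round-trip cost low.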
When doing queries, I get rows like this:
{u'vertex': {u'data': {u'chromosome': u'1',
u'description': u'hes family bHLH transcription factor 4 [Source:HGNC Symbol%3BAcc:HGNC:24149]',
u'end': 1000172,
u'id': u'ENSG00000188290',
u'seqId': u'1',
u'start': 998962,
u'strand': u'-',
u'symbol': u'HES4'},
u'gid': u'gene:ENSG00000188290',
u'label': u'Gene'}}
Basically, all the data I want is always two levels deep. I constantly unwrap row["vertex"]["data"].
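A small client-side helper could do the unwrapping once; the "_gid"/"_label" key names are an assumption about how to keep the metadata around:

```python
# Sketch of unwrapping row["vertex"]["data"] in one place. The row
# shape matches the example above; the underscore keys are invented
# here to carry gid/label alongside the data fields.

def unwrap(row):
    for kind in ("vertex", "edge"):
        if kind in row:
            elem = row[kind]
            data = dict(elem.get("data", {}))
            data["_gid"] = elem.get("gid")
            data["_label"] = elem.get("label")
            return data
    return row  # pass through rows that aren't elements (e.g. counts)
```

Arguably the client library itself should return this flattened shape, or offer it as an option.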
Add config option to run read-only server, disabling writing API.
conn = aql.Connection("localhost:8000")
Results in:
Traceback (most recent call last):
File "test.py", line 6, in <module>
G.addVertex("test-1", "test-1-label")
File "/Users/buchanae/src/scratch/smc-het-graph-logs/aql.py", line 56, in addVertex
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 454, in _open
'unknown_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1265, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: localhost>
because it's missing "http".
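The client could normalize the URL instead of failing; a sketch (Python 3's urllib.parse here, versus the urllib2 in the traceback, and defaulting to http is an assumption about desired behavior):

```python
# Sketch of normalizing the connection URL so "localhost:8000" works.
from urllib.parse import urlparse

def normalize_url(url, default_scheme="http"):
    parsed = urlparse(url)
    # "localhost:8000" parses with scheme "localhost", not a netloc,
    # so check for a real scheme explicitly.
    if parsed.scheme in ("http", "https"):
        return url
    return "%s://%s" % (default_scheme, url)
```

Alternatively, the client could raise a clear error ("connection URL must start with http:// or https://") at construction time rather than a confusing URLError at first request.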
import aql
conn = aql.Connection("http://10.50.50.123:8000")
O = conn.graph("mortar")
files = O.query().V().hasLabel("Mortar.File")
print list(O.query().V().count().execute())
print list(O.query().E().count().execute())
print list(O.query().V().hasLabel("TES.Task").count().execute())
print list(O.query().V().hasLabel("TES.Task.Tag").count().execute())
print list(files.count().execute())
for f in files.execute():
    print f
this prints
python test.py
[{u'int_value': 14681}]
[{u'int_value': 26094}]
[{u'int_value': 5236}]
[{u'int_value': 107}]
[{u'int_value': 9294}]
{u'int_value': 9294}
but if I comment out the print list(files.count().execute()) line, it prints out the file vertices as I expected.
Rebuild graph schema when graphql graph changes. Right now, GraphQL schema is only loaded when server starts up.
2017/12/23 19:59:56 http: panic serving [::1]:53740: reflect: call of reflect.Value.Interface on zero Value
goroutine 27 [running]:
net/http.(*conn).serve.func1(0xc4268fc000)
/usr/local/go/src/net/http/server.go:1721 +0xd0
panic(0x18febc0, 0xc4262e0f00)
/usr/local/go/src/runtime/panic.go:489 +0x2cf
reflect.valueInterface(0x0, 0x0, 0x0, 0x1, 0x19cebc0, 0x0)
/usr/local/go/src/reflect/value.go:930 +0x1fa
reflect.Value.Interface(0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/reflect/value.go:925 +0x44
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).marshalNonProtoField(0xc4201567b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9, 0x193)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:80 +0x5c4
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).Marshal(0xc4201567b0, 0x0, 0x0, 0x6, 0x2088100, 0xc4262ccf30, 0x1a1743f, 0x5)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:29 +0x174
github.com/bmeg/arachne/graphserver.(*MarshalClean).Marshal(0xc420192230, 0x1908ce0, 0xc4262ccf90, 0xc4262ccf30, 0xc4262ccf90, 0xc4262ccf30, 0x0, 0x0)
/Users/buchanae/src/github.com/bmeg/arachne/graphserver/webserver.go:39 +0x99
github.com/grpc-ecosystem/grpc-gateway/runtime.handleForwardResponseStreamError(0x1ed5301, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0x1ecbf00, 0xc4262ccf30)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:151 +0xae
github.com/grpc-ecosystem/grpc-gateway/runtime.ForwardResponseStream(0x3162000, 0xc4201fc2d0, 0xc425592050, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc4201308a0, 0x2086e10, ...)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:55 +0x91d
github.com/bmeg/arachne/aql.RegisterQueryHandler.func1(0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc420019f20)
/Users/buchanae/src/github.com/bmeg/arachne/aql/aql.pb.gw.go:561 +0x3a1
github.com/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP(0xc425592050, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/mux.go:198 +0x10f5
github.com/gorilla/mux.(*Router).ServeHTTP(0xc4201906c0, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
/Users/buchanae/src/github.com/gorilla/mux/mux.go:150 +0x101
net/http.serverHandler.ServeHTTP(0xc420089600, 0x1ed8380, 0xc42018e1c0, 0xc4255ae200)
/usr/local/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4268fc000, 0x1ed91c0, 0xc4201b9480)
/usr/local/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2668 +0x2ce
The GraphDataBaseInterface (GDBI) module defines how the Arachne query engine interfaces with a graph database ( https://github.com/bmeg/arachne/blob/master/gdbi/interface.go#L65 ). The method GetVertexListByID was added to allow query methods to create a bi-directional stream of ids to elements. This really sped up the Mongo driver, because it could take batches of incoming ids, query Mongo for all of them at once, and then return a batch of results, rather than having a transaction for every single request. You can see how this is taken advantage of in the Out function in the PipeEngine at https://github.com/bmeg/arachne/blob/master/gdbi/pipe_query_engine.go#L378
Should more of the gdbi.GraphDB interface be translated to use streaming?
New users could get started with queries very quickly if a small example graph was embedded in arachne. The website docs could use this example graph to demonstrate all queries and functionality.
Right now the GraphQL projection has little in the way of error checking and no documentation.
Currently graph config is global, with a single backend supporting all the graphs. With per graph config, a single endpoint would hold graphs backed by a number of different databases.
Concurrent writes are a fact of life. Most databases deal with this in some way. Arachne shouldn't be an exception.
Consider exploring the style of MVCC that Elasticsearch (and many others) employ, where the database will compare a version string before committing the update.
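The compare-version-before-commit idea can be sketched with an in-memory store; this is a stand-in for the concept, not arachne or Elasticsearch code:

```python
# Sketch of optimistic concurrency control: every element carries a
# version, and an update commits only if the caller's expected
# version matches what is stored.

class ConflictError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._data = {}  # gid -> (version, value)

    def get(self, gid):
        return self._data[gid]  # (version, value)

    def put(self, gid, value, expect_version=None):
        current = self._data.get(gid, (0, None))[0]
        if expect_version is not None and expect_version != current:
            # Another writer got there first; the caller must re-read.
            raise ConflictError("stale version for %s" % gid)
        self._data[gid] = (current + 1, value)
        return current + 1
```

A concurrent writer that read version 1 and tries to commit after someone else already moved the element to version 2 gets a conflict error instead of silently clobbering the other write.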
Most databases provide helpers for (un)marshaling Go struct types defined by the caller. In my experiments, I'm forced to marshal my struct to a string, then unmarshal to the protobuf Struct type, in order to fit the GraphElement type.
Arachne is using an ok approach (similar to what Google Datastore does), it's just missing the nice layer on top that makes it easy to work with.
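The missing convenience layer could look something like this; Gene and the helper names are illustrative, and a real version would sit between caller types and the protobuf Struct:

```python
# Sketch of (un)marshaling a caller-defined type to/from the nested
# dict shape that feeds a protobuf Struct, without the string round
# trip described above. Dataclasses stand in for caller structs.
from dataclasses import dataclass, asdict, fields

@dataclass
class Gene:
    symbol: str
    chromosome: str
    start: int

def to_data(obj):
    # dataclass -> plain nested dict, ready for Struct conversion
    return asdict(obj)

def from_data(cls, data):
    # dict -> dataclass, ignoring fields the type doesn't declare
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in names})
```

In Go, the equivalent would likely use struct tags and reflection, the way Google Datastore's client does; the point is that the database client, not every caller, should own this layer.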
list(O.query().V().where(aql.eq("_label", "CNASegment")).where(aql.eq("referenceName", "8")).count())
list(O.query().V().where(aql.and_(aql.eq("_label", "CNASegment"), aql.eq("referenceName", "8"))).count())
The first one returns in 1 minute. The second returns in 14 minutes. The "count" stays the same.
In [36]: q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-36-79aa868e8224> in <module>()
----> 1 q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in execute(self, stream)
263 else:
264 output = []
--> 265 for r in self.__stream():
266 output.append(r)
267 return output
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in __stream(self)
224 json={"query": self.query},
225 stream=True)
--> 226 response.raise_for_status()
227 for result in response.iter_lines():
228 try:
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/requests/models.pyc in raise_for_status(self)
933
934 if http_error_msg:
--> 935 raise HTTPError(http_error_msg, response=self)
936
937 def close(self):
HTTPError: 500 Server Error: Internal Server Error for url: http://arachne.compbio.ohsu.edu/v1/graph/bmeg/query
We should coordinate the ports used by our projects so they don't conflict. Funnel is already using 8000 and 9090.
https://github.com/bmeg/arachne/blob/master/aql/aql.proto#L14
Match uses GraphQuerySet as its message type, but GraphQuerySet -> GraphQuery -> graph allows a match query to change graphs.
From the docs: "Run takes current query and executes it, ignoring the results."
Why is that useful?
For easier development and debugging, let's add a really simple web UI that basically just dumps vertices and edges in table form.
For example:
A genomic feature (SNP) has the following equivalent 'tags':
"synonyms": [
"NC_000009.11:g.133750356A>G",
"NG_012034.1:g.166089A>G",
"CM000671.1:g.133750356A>G",
"CM000671.2:g.130874969A>G",
"NC_000009.10:g.132740177A>G",
"chr9:g.133750356A>G",
"COSM12604",
"chr9:g.130874969A>G",
"NC_000009.12:g.130874969A>G"
],
Proteins have the same issue:
https://github.com/bmeg/arachne/blob/73cab57bd0205828120f3743fe44d3ac80672bb1/aql.py#L311
first() is hiding a potentially large and expensive query + response + marshal/unmarshal. Use limit() to actually get only one element?
When a vertex/edge/etc. isn't found, this should be communicated to the caller with a special error type, NotFound.