bmeg / grip
Graph Integration Platform
Home Page: https://bmeg.github.io/grip
License: MIT License
I noticed the aql API is inconsistent: in some places a Vertex is used, in others a *Vertex.
arachne list
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x143b817]
goroutine 22 [running]:
github.com/bmeg/arachne/aql.Client.GetGraphs.func1(0xc42020ce40, 0x1ffb020, 0xc420086388, 0x1ffe020, 0xc420086390)
/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:57 +0x157
created by github.com/bmeg/arachne/aql.Client.GetGraphs
/Users/buchanae/src/github.com/bmeg/arachne/aql/util.go:59 +0x89
Currently, updating an existing vertex requires first calling Get, then calling Add.
That's perfectly fine; for example, Google Datastore has this behavior. But it should at least be documented.
If you want, you could also provide an upsert option.
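The get-then-add pattern and an upsert shortcut can be sketched against an in-memory store; the store class and the upsert method here are illustrative assumptions, not arachne API:

```python
# Sketch of get-then-update vs. upsert semantics, using a plain dict
# as a stand-in for the vertex store. Nothing here is arachne code.

class VertexStore:
    def __init__(self):
        self._vertices = {}

    def get(self, gid):
        return self._vertices.get(gid)

    def add(self, gid, data):
        # Add overwrites the whole vertex, as described above.
        self._vertices[gid] = dict(data)

    def upsert(self, gid, data):
        # Hypothetical upsert: merge into the existing vertex if present.
        existing = dict(self._vertices.get(gid, {}))
        existing.update(data)
        self._vertices[gid] = existing

store = VertexStore()
store.add("gene:1", {"symbol": "HES4"})

# Updating via get + add requires re-sending all fields:
v = dict(store.get("gene:1"))
v["chromosome"] = "1"
store.add("gene:1", v)

# An upsert merges without the caller doing the read:
store.upsert("gene:1", {"start": 998962})
```

The upsert variant saves a round trip and removes the read-modify-write race, which is why it would pair well with the documented get-then-add behavior.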
The timestamp method should show when the underlying database has changed. Currently the mongo driver only watches changes that it makes itself, but if two arachne servers point at the same mongo db, they won't recognize when the db has changed. Need to use something like https://docs.mongodb.com/manual/changeStreams/ to update the timestamp.
arachne start -h
2018/01/03 09:56:49 Adding goja JS engine
2018/01/03 09:56:49 Adding otto JS engine
Mongo has a memory leak somewhere. No clue where, just have observed that a series of queries results in climbing memory, and once the queries stop, the memory level remains. Restarting the server drops the memory back down.
The client API is probably the most important part of a database's documentation: https://godoc.org/github.com/bmeg/arachne/aql
For AQL there's a bunch of unnecessary junk, such as Register* methods generated by protobuf.
Having commands like this makes development a lot easier, and allows users to quickly experiment.
@adamstruck I regret not going with this in funnel. What do you think? Should arachne stick with arachne vertex get or arachne get vertex?
My most common use case is based around running queries in an ipython/notebook setting; I'm often exploring the data, running lots of different types of queries. Often I'm not even interested in the full set of results, I just need a few. Also, if I do want the full set of results, it's often small enough that I don't need streaming results.
AQL and Arachne are built to stream data. Currently the python client returns a generator/iterator from a query execution. When doing the type of work I described above, I constantly need to wrap every query in list(). It gets pretty annoying.
I'm not sure what the right API is. graph.query().V().list()? Since I'm often just exploring, maybe graph.query().V().head() would be really nice too, so I'm less likely to pull down a large amount of data. Same for graph.query().V().first(). These should use a server-side result limit, or some other way of efficiently limiting the result stream.
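The convenience methods described above can be sketched on top of a streaming query; the Query class here is a self-contained stand-in for the python client, and only the method shapes are the point:

```python
# Sketch of list()/head()/first() conveniences over a streaming query.
# The Query class simulates the client; a real implementation would
# serialize the limit into the request so the server caps the stream.

class Query:
    def __init__(self, results, limit=None):
        self._results = results
        self._limit = limit

    def limit(self, n):
        # Server-side limit, simulated here by capping the source.
        return Query(self._results, limit=n)

    def execute(self):
        results = self._results
        if self._limit is not None:
            results = results[:self._limit]
        for r in results:  # generator, as the python client returns today
            yield r

    def to_list(self):
        # The list() wrapper, built in.
        return list(self.execute())

    def head(self, n=10):
        # Small peek at the data without pulling the full stream.
        return self.limit(n).to_list()

    def first(self):
        out = self.limit(1).to_list()
        return out[0] if out else None

q = Query(range(1000))
```

Because head() and first() go through limit(), the simulated server never materializes more than n results, which is the efficiency property asked for.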
q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").distinct(["$.individualz.gid"]).limit(1).execute()[0].to_dict()
In the query above, the distinct field is accidentally wrong, but I still get results.
I guess this touches on the commonly recurring topic of schema vs. no schema: how can you know the field doesn't exist without a schema? I'll say that, as a user, not having the server tell me I'm wrong makes it harder to learn the query language and easier to make mistakes.
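Even a minimal schema would let the server reject the bad selector above instead of silently returning results. A sketch, with illustrative label and field names and an assumed "$.mark.field" selector syntax:

```python
# Sketch: with a minimal schema (label -> known fields), a selector
# like "$.individualz.gid" can be rejected up front. The schema and
# selector syntax are assumptions for illustration.

SCHEMA = {
    "Individual": {"gid", "source"},
    "Sample": {"gid", "disease_code"},
}

def check_selector(selector, marks):
    # marks maps a mark name (e.g. "individual") to its vertex label
    mark, field = selector.lstrip("$.").split(".", 1)
    if mark not in marks:
        raise ValueError("unknown mark: %s" % mark)
    if field not in SCHEMA[marks[mark]]:
        raise ValueError("unknown field: %s" % field)
    return True
```

With this check in the query planner, the typo "individualz" would come back as an error instead of a silently empty distinct.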
arachne server --port 5756 --rpc 5757 --mongo 127.0.0.1:27017
2018/02/04 14:10:39 Starting Server
2018/02/04 14:10:51 no reachable servers
2018/02/04 14:10:51 TCP+RPC server listening on 5757
2018/02/04 14:10:51 HTTP proxy connecting to localhost:5757
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17bfda1]
goroutine 73 [running]:
gopkg.in/mgo%2ev2.(*Collection).Find(0xc4203a0240, 0x0, 0x0, 0xc4202cc6a0)
/Users/buchanae/src/gopkg.in/mgo.v2/session.go:2115 +0x31
github.com/bmeg/arachne/mongo.(*Graph).GetVertexList.func1(0xc4202c6c60, 0xc4203aa130, 0x1ff9000, 0xc4203a0450)
/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:157 +0x82
created by github.com/bmeg/arachne/mongo.(*Graph).GetVertexList
/Users/buchanae/src/github.com/bmeg/arachne/mongo/mongo_store.go:169 +0x7f
I'd like to have super easy access to client libraries and utils. Perhaps pip install aql? Or, alternatively, wait until the sync with ophion and rely on that client instead?
# ./bin/arachne example
2018/05/20 17:02:22 Loading example graph data into example
2018/05/20 17:02:22 Loading example graphql schema into example-schema
Error: failed to unmarshal graph schema: json: cannot unmarshal string into Go struct field Struct.fields of type structpb.Value
This line
python src/github.com/bmeg/arachne/test/test_amazon_load.py amazon-meta.txt.gz http://localhost:8000
I believe the second argument needs to be something about the output file, not a host. At least, that is how I got it to work.
Also, I could not complete the test, as the line "Turn on local arachne server" does not elaborate as to how this is done. I assume a go run of one of these files?
I'm finding that it's easy to create conflicting IDs in my application code. To manage that, I started prefixing the IDs by the vertex/edge type (label). That gives me peace of mind, but now I'm getting areas in my code where the ID hasn't been properly prefixed, so the vertex is (silently) not found.
I think it would be great if the database handled all this for me. Unique IDs per table/document type seems like a normal concept, so this feels like a reasonable feature.
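Until the database handles this, the prefixing convention can at least be pushed into one helper so application code can't forget it; names here are illustrative, not arachne API:

```python
# Sketch of centralizing label-prefixed IDs on the client side, so a
# missing prefix fails loudly instead of as a silent not-found.

def gid(label, local_id):
    # Build the canonical "Label:local_id" form in exactly one place.
    return "%s:%s" % (label, local_id)

def check_gid(label, g):
    # Validate before a lookup, instead of getting an empty result.
    prefix = label + ":"
    if not g.startswith(prefix):
        raise ValueError("gid %r is not prefixed with label %r" % (g, label))
    return g
```

This is only a client-side workaround; per-label ID namespaces enforced by the server would still be the better fix.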
Ran into a performance issue. Unclear if the difference in performance was expected:
arachne load g2p --vertex /data/mc3.Variant.Vertex.json
arachne mongoload g2p --vertex /data/mc3.Variant.Vertex.json --host mongodb:27017
I'm struggling to track down why my task writes are not being saved correctly. My best guess is that PackVertex is not correct, and protoutil.AsMap does not correctly convert to a nested map.
This query takes far too long:
list(O.query().V().where(aql.eq("_label", "Variant")).where(aql.eq("chromosome", "1")).where(aql.eq("start", 27100988)).limit(10))
kellrott [12:05 PM]
700,000 variants on chromosome 1, and neither the chromosome nor the start fields are indexed, so it's scanning them all.
It's probably worth indexing chromosome, start, end, referenceBases, and alternateBases.
See #92 for conformance test
Is there a reason for separating the Query and Edit services? It does have an effect on client code, adding extra effort required to set up a client. If there's not a particular reason, I'd argue that it's better to simplify client creation by merging the services into one.
Catch-all place for developer support:
What is the equivalent of this ES query in arachne?
(features.chromosome:8 AND features.start:(128748317 OR 128748316 OR 128748315 OR 128748314 OR 128748313))
arachne server --bolt arachne.db
2018/02/27 19:39:25 Starting Server
2018/02/27 19:39:25 Starting BOLTDB
2018/02/27 19:39:25 TCP+RPC server listening on 8202
2018/02/27 19:39:25 HTTP proxy connecting to localhost:8202
2018/02/27 19:39:25 Fields: graphql.Fields{}
2018/02/27 19:39:25 GraphQL Schema: {Query <nil> <nil> [] []}
2018/02/27 19:39:25 HTTP API listening on port: 8201
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa7dbc1]
goroutine 109 [running]:
github.com/bmeg/arachne/timestamp.(*Timestamp).Touch(0x0, 0xc420464230, 0x6)
/home/ubuntu/src/github.com/bmeg/arachne/timestamp/timestamp.go:20 +0x111
github.com/bmeg/arachne/kvgraph.(*KVGraph).AddGraph(0xc420288940, 0xc420464230, 0x6, 0xc4204664d0, 0xc42046a540)
/home/ubuntu/src/github.com/bmeg/arachne/kvgraph/kvgraph.go:24 +0x4a
github.com/bmeg/arachne/graphserver.(*GraphEngine).AddGraph(0xc4202539c0, 0xc420464230, 0x6, 0xf5c900, 0xc420466401)
/home/ubuntu/src/github.com/bmeg/arachne/graphserver/graph_engine.go:31 +0x47
github.com/bmeg/arachne/graphserver.(*ArachneServer).AddGraph(0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc42046a560, 0xc4202539c0, 0x0, 0xc420449a70)
/home/ubuntu/src/github.com/bmeg/arachne/graphserver/server.go:136 +0x47
github.com/bmeg/arachne/aql._Edit_AddGraph_Handler(0xe0d240, 0xc4202539c0, 0xf565e0, 0xc4202df230, 0xc4204664d0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/ubuntu/src/github.com/bmeg/arachne/aql/aql.pb.go:2086 +0x241
google.golang.org/grpc.(*Server).processUnaryRPC(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0xc420275710, 0x1504248, 0x0, 0x0, 0x0)
/home/ubuntu/src/google.golang.org/grpc/server.go:920 +0x848
google.golang.org/grpc.(*Server).handleStream(0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40, 0x0)
/home/ubuntu/src/google.golang.org/grpc/server.go:1142 +0x1318
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4202ac600, 0xc4200db080, 0xf5b5e0, 0xc420484600, 0xc420283a40)
/home/ubuntu/src/google.golang.org/grpc/server.go:637 +0x9f
created by google.golang.org/grpc.(*Server).serveStreams.func1
/home/ubuntu/src/google.golang.org/grpc/server.go:635 +0xa1
The current falcor endpoint is just a stub to collect debugging information. See https://netflix.github.io/falcor/documentation/router.html on how to respond to Falcor json path requests
@kellrott I'm not very familiar with the full set of testing here (e.g. python conformance tests). Can you ensure make test includes everything?
The query methods in the mongo and ES drivers (like GetVertexChannel) grab batches to reduce latency. In some tested cases downstream methods, like .limit(), only need a few elements. This causes latency because the upstream element still grabs full batches (i.e. grabbing 1000 rows when it only needs one). The request would be to start the BatchSize variable at something small and increase it every request cycle until it hits its max.
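The ramp-up can be sketched with a generator; the start, factor, and max values below are illustrative, not proposed constants:

```python
# Sketch of the proposed BatchSize ramp-up: start small and grow each
# fetch cycle until a max, so a downstream limit() only pays for a
# small first batch. Slicing a list stands in for a backend fetch.

def batched(source, start=10, factor=10, max_size=1000):
    size = start
    i = 0
    while i < len(source):
        # Each yield corresponds to one backend round trip.
        yield source[i:i + size]
        i += size
        size = min(size * factor, max_size)
```

A consumer that stops after the first element triggers only the first small batch, while a full scan quickly reaches the max batch size and keeps the per-row round-trip cost low.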
When doing queries, I get rows like this:
{u'vertex': {u'data': {u'chromosome': u'1',
u'description': u'hes family bHLH transcription factor 4 [Source:HGNC Symbol%3BAcc:HGNC:24149]',
u'end': 1000172,
u'id': u'ENSG00000188290',
u'seqId': u'1',
u'start': 998962,
u'strand': u'-',
u'symbol': u'HES4'},
u'gid': u'gene:ENSG00000188290',
u'label': u'Gene'}}
Basically, all the data I want is always two levels deep. I constantly unwrap row["vertex"]["data"].
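A small client-side helper could do the unwrapping once; the "_gid"/"_label" key names are an assumption about how to keep the metadata around:

```python
# Sketch of unwrapping row["vertex"]["data"] in one place. The row
# shape matches the example above; the underscore keys are invented
# here to carry gid/label alongside the data fields.

def unwrap(row):
    for kind in ("vertex", "edge"):
        if kind in row:
            elem = row[kind]
            data = dict(elem.get("data", {}))
            data["_gid"] = elem.get("gid")
            data["_label"] = elem.get("label")
            return data
    return row  # pass through rows that aren't elements (e.g. counts)
```

Arguably the client library itself should return this flattened shape, or offer it as an option.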
Add config option to run read-only server, disabling writing API.
conn = aql.Connection("localhost:8000")
Results in:
Traceback (most recent call last):
File "test.py", line 6, in <module>
G.addVertex("test-1", "test-1-label")
File "/Users/buchanae/src/scratch/smc-het-graph-logs/aql.py", line 56, in addVertex
response = urllib2.urlopen(request)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 454, in _open
'unknown_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1265, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: localhost>
because it's missing "http".
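The client could normalize the URL instead of failing; a sketch (Python 3's urllib.parse here, versus the urllib2 in the traceback, and defaulting to http is an assumption about desired behavior):

```python
# Sketch of normalizing the connection URL so "localhost:8000" works.
from urllib.parse import urlparse

def normalize_url(url, default_scheme="http"):
    parsed = urlparse(url)
    # "localhost:8000" parses with scheme "localhost", not a netloc,
    # so check for a real scheme explicitly.
    if parsed.scheme in ("http", "https"):
        return url
    return "%s://%s" % (default_scheme, url)
```

Alternatively, the client could raise a clear error ("connection URL must start with http:// or https://") at construction time rather than a confusing URLError at first request.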
import aql
conn = aql.Connection("http://10.50.50.123:8000")
O = conn.graph("mortar")
files = O.query().V().hasLabel("Mortar.File")
print list(O.query().V().count().execute())
print list(O.query().E().count().execute())
print list(O.query().V().hasLabel("TES.Task").count().execute())
print list(O.query().V().hasLabel("TES.Task.Tag").count().execute())
print list(files.count().execute())
for f in files.execute():
    print f
this prints
python test.py
[{u'int_value': 14681}]
[{u'int_value': 26094}]
[{u'int_value': 5236}]
[{u'int_value': 107}]
[{u'int_value': 9294}]
{u'int_value': 9294}
but if I comment out the print list(files.count().execute()) line, it prints out the file vertices as I expected.
Rebuild graph schema when graphql graph changes. Right now, GraphQL schema is only loaded when server starts up.
2017/12/23 19:59:56 http: panic serving [::1]:53740: reflect: call of reflect.Value.Interface on zero Value
goroutine 27 [running]:
net/http.(*conn).serve.func1(0xc4268fc000)
/usr/local/go/src/net/http/server.go:1721 +0xd0
panic(0x18febc0, 0xc4262e0f00)
/usr/local/go/src/runtime/panic.go:489 +0x2cf
reflect.valueInterface(0x0, 0x0, 0x0, 0x1, 0x19cebc0, 0x0)
/usr/local/go/src/reflect/value.go:930 +0x1fa
reflect.Value.Interface(0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/reflect/value.go:925 +0x44
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).marshalNonProtoField(0xc4201567b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9, 0x193)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:80 +0x5c4
github.com/grpc-ecosystem/grpc-gateway/runtime.(*JSONPb).Marshal(0xc4201567b0, 0x0, 0x0, 0x6, 0x2088100, 0xc4262ccf30, 0x1a1743f, 0x5)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/marshal_jsonpb.go:29 +0x174
github.com/bmeg/arachne/graphserver.(*MarshalClean).Marshal(0xc420192230, 0x1908ce0, 0xc4262ccf90, 0xc4262ccf30, 0xc4262ccf90, 0xc4262ccf30, 0x0, 0x0)
/Users/buchanae/src/github.com/bmeg/arachne/graphserver/webserver.go:39 +0x99
github.com/grpc-ecosystem/grpc-gateway/runtime.handleForwardResponseStreamError(0x1ed5301, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0x1ecbf00, 0xc4262ccf30)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:151 +0xae
github.com/grpc-ecosystem/grpc-gateway/runtime.ForwardResponseStream(0x3162000, 0xc4201fc2d0, 0xc425592050, 0x1edae80, 0xc420192230, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc4201308a0, 0x2086e10, ...)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/handler.go:55 +0x91d
github.com/bmeg/arachne/aql.RegisterQueryHandler.func1(0x1ed8380, 0xc42018e1c0, 0xc4255ae400, 0xc420019f20)
/Users/buchanae/src/github.com/bmeg/arachne/aql/aql.pb.gw.go:561 +0x3a1
github.com/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP(0xc425592050, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
/Users/buchanae/src/github.com/grpc-ecosystem/grpc-gateway/runtime/mux.go:198 +0x10f5
github.com/gorilla/mux.(*Router).ServeHTTP(0xc4201906c0, 0x1ed8380, 0xc42018e1c0, 0xc4255ae400)
/Users/buchanae/src/github.com/gorilla/mux/mux.go:150 +0x101
net/http.serverHandler.ServeHTTP(0xc420089600, 0x1ed8380, 0xc42018e1c0, 0xc4255ae200)
/usr/local/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc4268fc000, 0x1ed91c0, 0xc4201b9480)
/usr/local/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2668 +0x2ce
The GraphDataBaseInterface (GDBI) module defines how the Arachne query engine interfaces with a graph database ( https://github.com/bmeg/arachne/blob/master/gdbi/interface.go#L65 ). The method GetVertexListByID was added to allow query methods to create a bi-directional stream of ids to elements. This really sped up the Mongo driver, because it could take batches of incoming ids, query Mongo for all of them at once, and then return a batch of results, rather than having a transaction for every single request. You can see how this is taken advantage of in the Out function in the PipeEngine at https://github.com/bmeg/arachne/blob/master/gdbi/pipe_query_engine.go#L378
Should more of the gdbi.GraphDB interface be translated to use streaming?
New users could get started with queries very quickly if a small example graph was embedded in arachne. The website docs could use this example graph to demonstrate all queries and functionality.
Right now the GraphQL projection has little in the way of error checking and no documentation.
Currently graph config is global, with a single backend supporting all the graphs. With per graph config, a single endpoint would hold graphs backed by a number of different databases.
Concurrent writes are a fact of life. Most databases deal with this in some way. Arachne shouldn't be an exception.
Consider exploring the style of MVCC that Elasticsearch (and many others) employ, where the database will compare a version string before committing the update.
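The compare-version-before-commit idea can be sketched with an in-memory store; this is a stand-in for the concept, not arachne or Elasticsearch code:

```python
# Sketch of optimistic concurrency control: every element carries a
# version, and an update commits only if the caller's expected
# version matches what is stored.

class ConflictError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._data = {}  # gid -> (version, value)

    def get(self, gid):
        return self._data[gid]  # (version, value)

    def put(self, gid, value, expect_version=None):
        current = self._data.get(gid, (0, None))[0]
        if expect_version is not None and expect_version != current:
            # Another writer got there first; the caller must re-read.
            raise ConflictError("stale version for %s" % gid)
        self._data[gid] = (current + 1, value)
        return current + 1
```

A concurrent writer that read version 1 and tries to commit after someone else already moved the element to version 2 gets a conflict error instead of silently clobbering the other write.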
Most databases provide helpers for (un)marshaling Go struct types defined by the caller. In my experiments, I'm forced to marshal my struct to a string, then unmarshal to the protobuf Struct type, in order to fit the GraphElement type.
Arachne is using an ok approach (similar to what Google Datastore does), it's just missing the nice layer on top that makes it easy to work with.
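The missing convenience layer could look something like this; Gene and the helper names are illustrative, and a real version would sit between caller types and the protobuf Struct:

```python
# Sketch of (un)marshaling a caller-defined type to/from the nested
# dict shape that feeds a protobuf Struct, without the string round
# trip described above. Dataclasses stand in for caller structs.
from dataclasses import dataclass, asdict, fields

@dataclass
class Gene:
    symbol: str
    chromosome: str
    start: int

def to_data(obj):
    # dataclass -> plain nested dict, ready for Struct conversion
    return asdict(obj)

def from_data(cls, data):
    # dict -> dataclass, ignoring fields the type doesn't declare
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in names})
```

In Go, the equivalent would likely use struct tags and reflection, the way Google Datastore's client does; the point is that the database client, not every caller, should own this layer.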
list(O.query().V().where(aql.eq("_label", "CNASegment")).where(aql.eq("referenceName", "8")).count())
list(O.query().V().where(aql.and_(aql.eq("_label", "CNASegment"), aql.eq("referenceName", "8"))).count())
The first one returns in 1 minute. The second returns in 14 minutes. The "count" stays the same.
In [36]: q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-36-79aa868e8224> in <module>()
----> 1 q.where(aql.eq("_label", "Individual")).where(aql.eq("source", "tcga")).mark("individual").in_("sampleOf").where(aql.eq("disease_code", "BRCA")).mark("sample").select(["individual", "sample"]).distinct("$.individual.gid").limit(1).execute()[0].to_dict()
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in execute(self, stream)
263 else:
264 output = []
--> 265 for r in self.__stream():
266 output.append(r)
267 return output
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/aql/query.pyc in __stream(self)
224 json={"query": self.query},
225 stream=True)
--> 226 response.raise_for_status()
227 for result in response.iter_lines():
228 try:
/mnt/smmart/projects/explore-bmeg-tcga/venv/local/lib/python2.7/site-packages/requests/models.pyc in raise_for_status(self)
933
934 if http_error_msg:
--> 935 raise HTTPError(http_error_msg, response=self)
936
937 def close(self):
HTTPError: 500 Server Error: Internal Server Error for url: http://arachne.compbio.ohsu.edu/v1/graph/bmeg/query
We should coordinate the ports used by our projects so they don't conflict. Funnel is already using 8000 and 9090.
https://github.com/bmeg/arachne/blob/master/aql/aql.proto#L14
Match uses GraphQuerySet as its message type, but GraphQuerySet -> GraphQuery -> graph allows a match query to change graphs.
From the docs: "Run takes current query and executes it, ignoring the results."
Why is that useful?
For easier development and debugging, let's add a really simple web UI that basically just dumps vertices and edges in table form.
For example:
A genomic feature (SNP) has the following equivalent 'tags':
"synonyms": [
"NC_000009.11:g.133750356A>G",
"NG_012034.1:g.166089A>G",
"CM000671.1:g.133750356A>G",
"CM000671.2:g.130874969A>G",
"NC_000009.10:g.132740177A>G",
"chr9:g.133750356A>G",
"COSM12604",
"chr9:g.130874969A>G",
"NC_000009.12:g.130874969A>G"
],
Proteins have the same issue:
https://github.com/bmeg/arachne/blob/73cab57bd0205828120f3743fe44d3ac80672bb1/aql.py#L311
first() is hiding a potentially large and expensive query + response + marshal/unmarshal. Use limit() to actually get only one element?
When a vertex/edge/etc. isn't found, this should be communicated to the caller with a special error type, NotFound.