kakao / s2graph
This code base is retained for historical interest only; please visit the Apache Incubator repo for the latest version.
Home Page: https://github.com/apache/incubator-s2graph
License: Other
EdgeQualifier stores the edge operation byte in the qualifier. The EdgeWithIndexInverted class needs to store the operation byte, but EdgeWithIndex's operation byte is never actually used (the only possible operations are insert and increment, and there is no need to distinguish the two).
So I suggest removing opBytes from EdgeQualifier's serialization and using insert as the operation when deserializing.
At Daum Kakao, many services use s2graph as a recommendation engine.
For recommendation use, it is important to redirect traffic to multiple logics and then provide a way for users to distinguish each logic, so they can measure the click-through rate of each logic (and decide which is better).
In the long term, I think it is better to separate the A/B test ability (specifically, redirecting traffic to multiple logics) into another project, but for now it is embedded in the s2graph rest project.
It would be better to generalize the A/B test ability further and provide it as a separate project, so that others who don't want to use s2graph but are interested in traffic redirection can use it.
To get how many edges exist for any (vertex, label) pair, the current implementation needs to fetch all edges.
It would be helpful to update the total number of edges when inserting/deleting edges, so queries can return the total without retrieving all edges.
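The bookkeeping this proposes could look like the sketch below (a hypothetical Python model, not the project's Scala code; class and method names are illustrative): maintain a counter per (vertex, label) alongside each mutation so count queries become O(1).

```python
from collections import defaultdict

class EdgeCountIndex:
    """Hypothetical sketch: keep a running edge count per (vertex, label)
    so count queries need not fetch every edge."""
    def __init__(self):
        self.counts = defaultdict(int)

    def on_insert(self, src, label):
        self.counts[(src, label)] += 1

    def on_delete(self, src, label):
        self.counts[(src, label)] -= 1

    def edge_count(self, src, label):
        # O(1) lookup instead of scanning all edges for the pair
        return self.counts[(src, label)]

idx = EdgeCountIndex()
idx.on_insert(1, "liked")
idx.on_insert(1, "liked")
idx.on_delete(1, "liked")
print(idx.edge_count(1, "liked"))  # → 1
```

In practice the counter would live in the storage layer (e.g. an HBase increment column) so it stays consistent with the edge mutations themselves.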
In the s2loader project, the TransferToHFile class uses HFileOutputFormat2.configureIncrementalLoad.
It currently uses job.setMapOutputValueClass(classOf[Cell]), which fails to figure out how to split the table (https://github.com/cloudera/hbase/blob/cdh5.3.6-release/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L389-L391).
Changing the MapOutputValueClass from Cell to KeyValue is necessary for bulk load to work properly.
Some programming languages, such as JavaScript, cannot represent integers wider than 53 bits.
https://dev.twitter.com/overview/api/twitter-ids-json-and-snowflake
In JavaScript, parsing a Long-typed property loses precision:
> 94329429622284704
94329429622284700 // 704 becomes 700
> 10765432100123456789
10765432100123458000 // 789 becomes 000
Provide a workaround for the Long data type:
{"long_type_column": 10765432100123456789}
↓
{"long_type_column": 10765432100123456789, "long_type_column_str": "10765432100123456789"}
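The workaround amounts to emitting a string companion field for any value outside JavaScript's 53-bit safe range. A minimal Python sketch of the serialization-side logic (field and function names are illustrative, not s2graph's API):

```python
import json

def add_string_shadow(doc, long_fields):
    """For each listed field holding an integer wider than JavaScript's
    53-bit safe range (2**53 - 1), add a companion "<field>_str" field
    carrying the exact value as a string."""
    out = dict(doc)
    for f in long_fields:
        v = doc.get(f)
        if isinstance(v, int) and abs(v) > 2**53 - 1:
            out[f + "_str"] = str(v)
    return out

doc = {"long_type_column": 10765432100123456789}
print(json.dumps(add_string_shadow(doc, ["long_type_column"]), sort_keys=True))
# → {"long_type_column": 10765432100123456789, "long_type_column_str": "10765432100123456789"}
```

JavaScript clients would then read the `_str` field whenever exact ids matter.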
Currently s2graph uses the LZ4 codec for HBase table compression, and this cannot be changed. This property should be configurable.
Currently, for an edge query, the user must know the source vertex and which label the query should traverse.
In some cases, the user wants to know which labels a vertex currently participates in for edge traversal.
For example, currently we can traverse the graph like this:
current: src vertex 1 -> liked, traverse the liked label starting from vertex 1
proposal: src vertex 1 -> * (all labels that src vertex 1 participates in), traverse all labels starting from vertex 1
When a client sends too many requests in a short period (s2graph fires RPCs directly into HBase without throttling), read response times spike because HBase becomes busy processing the write requests.
It would be better to provide a way to throttle write requests to HBase within s2graph, for stability.
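One common way to implement such write throttling is a token bucket in front of the HBase client. A hedged Python sketch of the idea (names and parameters are illustrative, not s2graph's API):

```python
import time

class WriteThrottle:
    """Hypothetical token-bucket throttle for writes to HBase: allow at
    most `rate` mutations per second, with bursts up to `burst`."""
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = float(burst), clock()

    def try_acquire(self):
        now = self.clock()
        # refill tokens in proportion to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should buffer or reject the write

t = 0.0  # frozen fake clock so the demo is deterministic
throttle = WriteThrottle(rate=2, burst=2, clock=lambda: t)
print([throttle.try_acquire() for _ in range(3)])  # → [True, True, False]
```

Rejected writes could be buffered in a queue and retried, so reads keep their latency budget while writes are smoothed out.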
The getService and getLabel APIs do not return all of the information stored in the DB.
Currently long/int/string (up to 255 bytes) are supported. It would be helpful to store floating-point numbers or a text data type on vertices and edges. It seems hbase-common has OrderedBytes, which can preserve natural ordering in bytes.
Related to #30.
If we have the following edges with aggregated counts, it would be helpful to group by and normalize them:
item1 -> likedCount -> month:2015-10 -> 1
item1 -> likedCount -> month:2015-09 -> 2
item1 -> likedCount -> day: 2015-10-01 -> 1
item1 -> likedCount -> day: 2015-09-30 -> 1
item1 -> likedCount -> day: 2015-09-29 -> 1
Users may only care about the group-by(from) result, like the following (with or without time decay):
item1 -> likedCount -> item1(1 + 2 + 1 + 1 + 1)
We need to provide more flexibility over which edge fields can be grouped by on step and queryParam.
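The group-by described above can be sketched as follows (an illustrative Python model of the semantics, not the project's implementation; the optional `decay` hook stands in for the time-decay variant):

```python
from collections import defaultdict

def group_by_from(edges, decay=None):
    """Sketch: collapse (from, label, bucket, count) edges into one score
    per (from, label), optionally weighting each bucket by decay(bucket)."""
    totals = defaultdict(float)
    for frm, label, bucket, count in edges:
        w = decay(bucket) if decay else 1.0
        totals[(frm, label)] += count * w
    return dict(totals)

edges = [
    ("item1", "likedCount", "month:2015-10", 1),
    ("item1", "likedCount", "month:2015-09", 2),
    ("item1", "likedCount", "day:2015-10-01", 1),
    ("item1", "likedCount", "day:2015-09-30", 1),
    ("item1", "likedCount", "day:2015-09-29", 1),
]
print(group_by_from(edges))  # → {('item1', 'likedCount'): 6.0}
```

This matches the item1 -> likedCount -> item1(1 + 2 + 1 + 1 + 1) example above.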
s2graph supports multiple services at Daum Kakao.
If it is OK to share the use cases at Daum Kakao, it would help others decide whether s2graph is sufficient for their needs.
Currently s2graph uses the CDH HBase and Hadoop packages. The Apache packages are more common than CDH,
and the package versions should be configurable.
questions
Note: if you only need the vertex id, you don't need to create the vertex explicitly. When you create a label,
s2graph creates the vertex with empty properties according to the label schema.
requests
https://github.com/daumkakao/s2graph#rest-api-glossary
I can't browse the routes file at the link above; it redirects to this link instead.
Did the link change from the old one to https://github.com/daumkakao/s2graph/blob/master/conf/routes?
In regard to #93, the general workflow of updating an existing label data via bulk upload is expected to be as follows:
Steps 1 and 2 will be achieved by running a spark job (in a127798) and calling getEdges API respectively.
So all we need is an API to take care of step 3.
This example is wrong.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "user_article_liked",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph_news",
"tgtColumnName": "article_id",
"tgtColumnType": "string",
"indexProps": {},
"props": {},
"serviceName": "s2graph_news"
}
'
Running it produces an error:
[error] application - java.lang.RuntimeException: target service s2graph_news is not created.
java.lang.RuntimeException: target service s2graph_news is not created.
There is no s2graph_news service in the test script, so it is not a good example.
Change the example like this:
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "user_article_liked",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph",
"tgtColumnName": "article_id",
"tgtColumnType": "string",
"indexProps": {},
"props": {}
}
'
Result
{"message":"user_article_liked is created"}
Currently, to increment a property's value on an edge, s2graph performs the following steps.
It would be better to simplify the inner logic for simple count values.
The current use case is storing aggregated count values per vertex per time unit.
ex) actual example events
2015-10-01-00:00:01 -> SteamShon -> liked -> item1
2015-09-30-00:00:01 -> SteamShon -> liked -> item1
2015-09-29-00:00:01 -> SteamShon -> liked -> item1
Sometimes users want item1's count list aggregated by a time unit such as (month, week, day, hour).
If we create a new incrementCount operation on edges that expects ts, operation, from, to, label, indexProps, and countValue, then users can issue the operations below to build these edges in s2graph.
2015-10-01-00:00:01 incrementCount item1 -> likedCount -> 1443625200000 {"timeUnit": "month", "count": 1}
2015-09-30-00:00:01 incrementCount item1 -> likedCount -> 1441033200000 {"timeUnit": "month", "count": 1}
The expected edges would then be as below.
item1 -> likedCount -> month:2015-10 -> 1
item1 -> likedCount -> month:2015-09 -> 2
item1 -> likedCount -> day: 2015-10-01 -> 1
item1 -> likedCount -> day: 2015-09-30 -> 1
item1 -> likedCount -> day: 2015-09-29 -> 1
Queries can now traverse item1's aggregated count per time-unit value as edges, so all query DSL functionality applies naturally.
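The core of incrementCount is mapping an event timestamp to its aggregation bucket, so repeated events land on one edge per (timeUnit, bucket). A Python sketch of that mapping (function name and formats are illustrative; this uses UTC, while the document's examples appear to use local time):

```python
from datetime import datetime, timezone

def bucket_key(ts_millis, time_unit):
    """Sketch: map an event timestamp (milliseconds) to its aggregation
    bucket, so incrementCount targets one edge per (timeUnit, bucket)."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    fmt = {"month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%d %H"}[time_unit]
    return f"{time_unit}:{dt.strftime(fmt)}"

print(bucket_key(1443625200000, "day"))   # a 2015-09-30 UTC timestamp
print(bucket_key(1443625200000, "month"))
```

All events falling in the same bucket then increment the same edge's count, yielding exactly the aggregated edges listed above.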
With a multi-step query, there is currently no way to bound the number of edges returned as the final result.
So we need a global-scope limit to constrain the number of edges in the final result.
In the example below, the top-level limit clause should cap the number of edges at 10.
{
"limit" : 10,
"srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
"steps": [
{
"step": [ {"label": "user_click_item", "direction":"out", "limit": 10, "scoring": {"score": 1}} ]
},
{
"step": [ {"label": "user_click_item", "direction":"in", "limit": 10, "scoring": {"score": 1}} ]
},
{
"step": [ {"label": "user_click_item", "direction":"out", "limit": 10, "scoring": {"score": 1}} ]
}
]
}
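The semantics of the proposed top-level limit can be sketched in a few lines (illustrative Python, assuming per-step limits have already been applied and only the last step's edges form the final result):

```python
def apply_global_limit(step_results, limit):
    """Sketch: after per-step limits are applied, truncate the final
    step's edges so the whole query returns at most `limit` edges."""
    final = step_results[-1]
    return step_results[:-1] + [final[:limit]]

# three steps, each step already capped at its own per-step limit;
# the last step fanned out to 25 edges before the global limit
steps = [list(range(10)), list(range(10)), list(range(25))]
out = apply_global_limit(steps, 10)
print(len(out[-1]))  # → 10
```

Applying the cut only to the final step preserves the intermediate fan-out that later steps need.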
Add operators to the where clause.
Currently (=, !=, between) are supported; the comparison operators (>, >=, <, <=) should be added.
Example:
# in the query DSL ..
"where": "time > 3 or time <= 1"
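The proposed operators could be evaluated as in this Python sketch (the real WhereParser is a Scala parser combinator; the names here are illustrative and only the comparison semantics are shown):

```python
import operator

# Map of operator tokens to comparison functions, including the proposed
# (>, >=, <, <=) alongside the existing (=, !=).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq, "!=": operator.ne}

def eval_clause(props, field, op, value):
    """Evaluate one comparison clause against an edge's properties."""
    return OPS[op](props[field], value)

# "time > 3 or time <= 1" evaluated on an edge with time = 3
props = {"time": 3}
result = eval_clause(props, "time", ">", 3) or eval_clause(props, "time", "<=", 1)
print(result)  # → False
```

The parser side would only need new tokens for the four operators; evaluation reuses the same machinery as `=` and `!=`.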
Think of an S2AB bucket defined as below:
{
"srcVertices": [
{
"serviceName": "some_service",
"columnName": "article_id",
"id": [[doc_id]]
}
],
"steps": [
{
"step": [
{
"label": "similar_articles",
"direction": "out",
"offset": 0,
"limit": 10,
"where": "is_blacklisted_article=false and article_id != [[doc_id]]"
}
]
}
]
}
According to the current spec, the client will make the following call:
curl -XPOST http://graph-query.iwilab.com:9000/graphs/experiment/{app key}/{experiment name}/{uuid} -H 'Content-Type: Application/json' -d '
{
"[[doc_id]]": "some-string-id"
}
'
And this will return an error due to the nested double quotes in the "where" field:
"where": "is_blacklisted_article=false and article_id != "some-string-id""
Everyone would be happy if [[doc_id]] were replaced with some-string-id instead of "some-string-id" (without the quotation marks):
{
"srcVertices": [
{
"serviceName": "some_service",
"columnName": "article_id",
"id": "[[doc_id]]" <=== quotation marks added!
}
],
"steps": [
{
"step": [
{
"label": "similar_articles",
"direction": "out",
"offset": 0,
"limit": 10,
"where": "is_blacklisted_article=false and article_id != [[doc_id]]"
}
]
}
]
}
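The fix amounts to type-aware template binding: substitute the raw JSON value when the placeholder is itself an entire JSON string value, and splice in the bare value when the placeholder sits inside a longer string such as a where clause. A hedged Python sketch (function name is illustrative, not the S2AB implementation):

```python
import json

def bind_template(template_str, variables):
    """Sketch: replace [[var]] placeholders type-aware. A quoted,
    whole-value placeholder ("id": "[[doc_id]]") becomes a JSON value;
    a placeholder embedded in a longer string (a where clause) is
    spliced in without surrounding quotes."""
    for name, value in variables.items():
        # whole-value occurrence: swap the quoted placeholder for the JSON value
        template_str = template_str.replace(f'"[[{name}]]"', json.dumps(value))
        # embedded occurrence: splice in the bare value, no extra quotes
        template_str = template_str.replace(f"[[{name}]]", str(value))
    return template_str

tmpl = '{"id": "[[doc_id]]", "where": "article_id != [[doc_id]]"}'
print(bind_template(tmpl, {"doc_id": "some-string-id"}))
# → {"id": "some-string-id", "where": "article_id != some-string-id"}
```

Doing the quoted replacement first guarantees the embedded case never gains nested double quotes, which is exactly the reported error.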
With the current implementation, concurrent updates on the same edge, i.e. the same (source, label, dir, target), lead to broken states in the snapshot edge.
For example, if the following requests arrive concurrently, the result is not deterministic:
1434380239199 delete e 16 1016 s2graph_label_test
1434380239200 update e 16 1016 s2graph_label_test {"time": 10, "weight": 20}
1434380239198 update e 16 1016 s2graph_label_test {"time": 10}
This is because Edge updates the snapshot edge without taking a lock when it builds the update from what it read.
For example, suppose there was no edge between 16 and 1016, and the requests above then arrive concurrently:
each request's fetched snapshot edge is none, so each builds its own update and mutates. All of these operations build a wrong update, because what they read is no longer a valid state by the time they build the update and mutate.
To resolve broken states in the snapshot edge, we first need to validate that each edge's update is still valid after the read, using HBase's checkAndSet operation.
Since we have already read the snapshot edge, instead of firing a put, use checkAndSet and check the return value. If it returns false, another writer has already updated the same edge and contention occurred, so a re-read is required; otherwise there was no contention and we are safe to mutate the indexedEdge as well.
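The read-build-checkAndSet-retry loop can be sketched as follows (a toy single-threaded Python model of the semantics; HBase's checkAndPut performs the compare-and-set atomically on the server, which this `Cell` class only mimics):

```python
class Cell:
    """Toy single-cell store with an atomic check-and-set, mimicking
    HBase checkAndPut semantics."""
    def __init__(self, value=None):
        self.value = value
    def get(self):
        return self.value
    def check_and_set(self, expected, new):
        # write only if the stored value still equals what we read
        if self.value == expected:
            self.value = new
            return True
        return False

def update_snapshot(cell, build_update, max_retries=5):
    """Read the snapshot edge, build the update from what was read, then
    commit with check-and-set; on contention (False), re-read and retry."""
    for _ in range(max_retries):
        snapshot = cell.get()
        if cell.check_and_set(snapshot, build_update(snapshot)):
            return True  # snapshot committed; safe to mutate indexed edges
    return False  # persistent contention; give up and surface an error

cell = Cell({"time": 10})
ok = update_snapshot(cell, lambda s: {**(s or {}), "weight": 20})
print(ok, cell.get())  # → True {'time': 10, 'weight': 20}
```

On a False return from the real checkAndSet, the whole loop body runs again against the fresh snapshot, so every committed update was built from a state that was still valid at commit time.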
I tried running script/test.sh and found errors in the vertex queries.
Here are the failing queries from test.sh:
curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/user_id -H 'Content-Type: Application/json' -d '
[
{"id":1,"props":{"is_active":true}, "timestamp":1417616431},
{"id":2,"props":{},"timestamp":1417616431}
]
'
curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d '
[
{"serviceName": "s2graph", "columnName": "user_id", "ids": [1, 2, 3]}
]
'
example)
{
...
"indexProps": [
{
"indexName": "pk",
"indexDirection": "both",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexDirection": "in",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
]
...
}
The current implementation only considers the _to field value when filtering out (https://github.com/kakao/s2graph/blob/develop/app/controllers/PostProcess.scala#L73).
It may be more intuitive to filter the queryResult based on (from, label, dir, to), which is an edge's reference.
Tests fail when running:
sbt test
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/plugins/project
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/staging/bac26239dae466e87fa4/ensime-sbt-cmd/project
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/plugins
[info] Loading project definition from /Users/blueiur/code/s2graph/project
[info] Set current project to s2graph (in build file:/Users/blueiur/code/s2graph/)
[info] Compiling 5 Scala sources to /Users/blueiur/code/s2graph/target/scala-2.10/test-classes...
[error] /Users/blueiur/code/s2graph/test/controllers/GraphSpec.scala:76: not found: value toQuery
[error] val query = toQuery(Json.parse(queryEdges))
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/GraphSpec.scala:81: not found: value toEdges
[error] val jsonEdges = toEdges(jsons, "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:453: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.init()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:455: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:482: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.shutdown()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:12: object wordnik is not a member of package com
[error] import com.wordnik.swagger.annotations.Api
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:15: object RequestParser is not a member of package controllers
[error] Note: trait RequestParser exists, but it has no companion object.
[error] import controllers.RequestParser._
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:252: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.init()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:254: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:266: not found: value toEdges
[error] val inserts = toEdges(Json.parse(jsArrStr), "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:272: not found: value toEdges
[error] val inserts2nd = toEdges(Json.parse(jsArrStr2nd), "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:281: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:286: not found: value toEdges
[error] val deletes = toEdges(Json.parse(jsArrStr), "delete")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:291: not found: value toEdges
[error] val deletes2nd = toEdges(Json.parse(jsArrStr2nd), "delete")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:299: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.shutdown()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/RequestParserSpec.scala:34: not found: value WhereParser
[error] val whereOpt = WhereParser(label).parse(sql)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/RequestParserSpec.scala:46: value must is not a member of Nothing
[error] checked must beEqualTo(expected)
[error] ^
[error] 17 errors found
[error] (root/test:compile) Compilation failed
[error] Total time: 7 s, completed 2015. 5. 13 오전 10:36:15
Setting up S2Graph on your local machine is still quite a hassle. (Although much has improved since a couple of months ago..)
Vagrant seems like a nice approach to this problem.
"Vagrant will isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you're used to working with (editors, browsers, debuggers, etc.)."
I tested with HBase 0.98.12-hadoop2 and ran the following command from the README:
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431},
{"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616431},
{"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616431},
{"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616431}
]
'
I received the success message "1 insert success", but I could not find any rows in the HBase table.
I investigated and found that the timestamp in the example is in the wrong format:
the value in ["timestamp":1417616431] is not a Java (millisecond) timestamp. After changing it to a Java timestamp, s2graph stored the rows in the HBase table.
The README file should be changed, or validation logic for the timestamp should be added.
It works with Java 7 and Protobuf 2.5.0, but errors occur with Java 8 or Protobuf 2.6.1.
I can't find the prerequisite versions (Java 7, Protobuf 2.5.0) in README.md.
Error messages:
../../asynchbase/target/generated-sources/protobuf/java/org/hbase/async/generated/ClientPB.java:[139,30] error: cannot find symbol
[ERROR] symbol: class ProtocolStringList
Currently, multiple indexes on a label can only be created one by one via the addIndex API.
It would be better to create multiple indexes when the user creates the label, with user-provided index names.
An example would be the following:
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "graph_test",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph",
"tgtColumnName": "item_id",
"tgtColumnType": "string",
"serviceName": "s2graph",
"indexProps": [
{
"indexName": "pk",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
],
"props": [
{"name": "doc_created_at", "defaultValue": 0, "dataType": "long"}
]
}
'
Then it would be possible for a query to specify the index name in the query DSL. Currently the "scoring" field has two purposes: selecting which index to traverse, and providing the actual weights for scoring. The scoring field needs to be separated from index selection.
{
"select": [],
"srcVertices": [
{
"serviceName": "s2graph_test",
"columnName": "user_id",
"id": 5
}
],
"steps": [
{
"step": [
{
"label": "graph_test",
"direction": "out",
"offset": 0,
"limit": 10,
"index": "pk"
}
]
}
]
}
Insert data as follows.
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"timestamp": 1, "from": 101, "to": "a", "label": "graph_test"},
{"timestamp": 2, "from": 101, "to": "b", "label": "graph_test"},
{"timestamp": 3, "from": 101, "to": "c", "label": "graph_test"}
]
'
For the following query, the expected result would be the single edge whose target is "a".
{
"select": [],
"srcVertices": [
{
"serviceName": "s2graph_test",
"columnName": "user_id",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "graph_test",
"direction": "out",
"offset": 0,
"limit": 10,
"_to": "a"
}
]
}
]
}
But currently the result is empty.
I compiled asynchbase according to the README file, but I got the following "make" error and mvn error messages.
[~/s2graph]# cd asynchbase; make; mvn install
Makefile:29: third_party/include.mk: No such file or directory
make: *** No rule to make target `third_party/include.mk'. Stop.
.....
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[94,23] error: cannot find symbol
[ERROR] symbol: class KeyValue
location: class TestNSREs
/Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[95,23] error: cannot find symbol
[ERROR] symbol: class RegionInfo
location: class TestNSREs
/Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[96,23] error: cannot find symbol
When creating a Label, the labelMeta, labelIndex, and serviceColumn are created if they do not exist.
The problem is that these operations should be atomic within a transaction, so that any failure can be reverted.
The current implementation does not use a transaction, so a partial failure corrupts the DB.
Currently, the result JSON only shows edge properties that have been modified. It would be better to include all edge properties in the result JSON, even when they are unmodified.
Provide a union query:
run multiple queries given as an array.
[
{
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 0
}
],
"steps": [
[
{
"label": "s2graph_label_test",
"direction": "out",
"offset": 0
}
]
]
}
,
{
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 0
}
],
"steps": [
[
{
"label": "s2graph_label_test",
"direction": "in",
"offset": 0
}
]
]
}
]
The results are aggregated per query in an array:
[
{
"size": 1,
"degrees": [
{
"from": 0,
"label": "s2graph_label_test",
"direction": "in",
"_degree": 1
}
],
"results": [
{
"cacheRemain": -7,
"timestamp": 3003,
"score": 1,
"label": "s2graph_label_test",
"direction": "in",
"to": 2,
"_timestamp": 3003,
"from": 0,
"props": {
"weight": 30,
"is_blocked": false,
"_count": -1,
"_timestamp": 3003,
"is_hidden": false,
"time": 0
}
}
],
"impressionId": 764860958
},
{
"size": 2,
"degrees": [
{
"from": 0,
"label": "s2graph_label_test",
"direction": "out",
"_degree": 2
}
],
"results": [
{
"cacheRemain": -16,
"timestamp": 2002,
"score": 1,
"label": "s2graph_label_test",
"direction": "out",
"to": 2,
"_timestamp": 2002,
"from": 0,
"props": {
"weight": 20,
"is_blocked": false,
"_count": -1,
"_timestamp": 2002,
"is_hidden": false,
"time": 0
}
},
{
"cacheRemain": -16,
"timestamp": 1001,
"score": 1,
"label": "s2graph_label_test",
"direction": "out",
"to": 1,
"_timestamp": 1001,
"from": 0,
"props": {
"weight": 10,
"is_blocked": false,
"_count": -1,
"_timestamp": 1001,
"is_hidden": true,
"time": 0
}
}
],
"impressionId": -1650835965
}
]
This issue is from @hsleep.
It can be useful to use different indexProps for each of a label's directions.
example)
{
"indexProps": [
{
"indexName": "pk",
"indexDirection": "both",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexDirection": "in",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
]
}
out: out direction index only
in: in direction index only
both: out/in direction index
There is a dependency on the MySQL connector in build.sbt. I don't know exactly what the MySQL connector's license is, but it is not compatible with the Apache license, so we should check it.
I think using Derby for the default meta store is more common, and this would avoid the license problem.
Problem:
I've noticed that S2Graph clients quite often shuffle query results before serving them to users, to add some randomness to the user experience.
An option to randomly sample a set of queried edges would result in much simpler client code.
For example, let's say a client is running an A/B test on S2Graph items with A) a sorted bucket and B) a random bucket.
As is, she will have to identify the random bucket id B and mix up the results herself.
With the suggested feature, both buckets can be handled uniformly.
Idea:
Right now, I'm considering a step-level integer parameter "sample" that tells S2Graph to randomly sample N edges from the result set of the corresponding step.
Any guidance is welcome!
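The step-level "sample" parameter would boil down to uniform sampling without replacement over a step's result set. A Python sketch of the semantics (the function name and seeding are illustrative; a server would not normally seed):

```python
import random

def sample_step(edges, n, seed=None):
    """Sketch of a step-level "sample" parameter: return n edges drawn
    uniformly without replacement (all edges if there are fewer than n)."""
    rng = random.Random(seed)
    if len(edges) <= n:
        return list(edges)
    return rng.sample(edges, n)

edges = [f"edge{i}" for i in range(100)]
picked = sample_step(edges, 10, seed=42)
print(len(picked))  # → 10
```

Applied after the step's limit/offset, this would let the random bucket and the sorted bucket share one query shape, differing only in the presence of "sample".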
The degree of a vertex is the number of edges incident to the vertex.
It should be possible to store a vertex's degree, and the labels the vertex participates in, while we mutate edges.
It would be very useful to expose a vertex's degree and which labels it has edges in.
The current README is not easy to understand. I think this is because all the information is on a single page; better organization by topic, with more examples and diagrams, would make things more understandable.
After profiling with VisualVM, filterEdges on Graph contains unnecessary checks that use many CPU cycles.
The points to improve in Graph.filterEdges are the following (develop branch).
Personally, I am not a fan of micro-optimization, but filterEdges goes through every fetched edge, so a little optimization of this method may be necessary.
Benchmarks show that a lot of CPU cycles are wasted in Graph.toHashKey and in simply checking exclude/include.
Currently, all APIs for mutating edges/vertices are non-blocking.
It would be good to provide blocking APIs as well, so users can choose according to their needs.
When a select column list is given in a query, it is not necessary to build the props map and the other result-JSON fields for every edge.
The current implementation first creates the full result JSON and then filters it down to the selected columns. Since the edgeToJson method runs on every edge (it is called a lot), a small improvement here would increase performance quite a bit.
A quick idea is to skip unnecessary JSON object creation whenever the given query does not need it.
There is no direction on a label, but edges/queries do have a direction. This makes Edge/Query complicated. It would be better to refactor these, or at least document them clearly.
The following is my test data.
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"label":"s2graph_label_test","from":-1,"to":1,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":-1,"to":2,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":-1,"to":3,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":1,"to":10,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":1,"to":11,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":2,"to":11,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":2,"to":12,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":3,"to":12,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":3,"to":13,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":10,"to":100,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":11,"to":200,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":12,"to":300,"props":{},"timestamp":1}
]
'
In the picture, the graph looks like this.
Currently, the result JSON does not contain the ancestors of each result. For example, it is impossible to know where edge (12 -> 300) came from; it comes from (2 and 3). It would be good to provide users a way to keep track of the ancestors of final-step edges.
Currently the where parser can only compare an edge's property variables (_from, _to, props) with an input value.
As is: just compare with an input value
where: "_from = 10" # lhs(`_from`) is the edge's property, rhs(`10`) is an input value
To be: can also compare with another variable
# Compare edge's prop(`_from`) with edge's prop(`age`)
where: "_from = age"
# Compare edge's prop(`category`) with edge's prop(`to`)
where: "category = to"
# Compare edge's prop(`category`) with parent edge's prop(`category`)
where: "category = _parent.category"
# Compare edge's prop(`gender`) with grandparent edge's prop(`gender`)
where: "_parent._parent.gender = gender"
There is currently storage overhead on snapshot edges.
ex) a single request edge with only _timestamp as indexProps:
{"timestamp": t1, "from": 1, "to": 100, "label": "liked", "props": {}, "direction": "out"}
From this single request, 2 logically identical edges need to be created:
(1 -> liked -> out -> 100)
(100 <- liked <- in <- 1)
Note that from and to are swapped and the direction is toggled.
If the label is undirected, then 4 identical edges need to be created:
(1 -> liked -> out -> 100), (100 -> liked -> out -> 1)
(100 <- liked <- in <- 1), (1 <- liked <- in <- 100)
In the Edge class, the relatedEdges method returns these edges.
The problem is that all of these related edges come from the same data, so they share the same snapshotEdge.
A snapshotEdge stores each edge's (from, labelId) as the row key and to as the qualifier.
All of the above relatedEdges carry the same data in their snapshotEdge, but the current implementation stores multiple snapshotEdges.
ex)
request edge: (1 -> liked -> out -> 100)
snapshot edge: RowKey(1, liked), Qualifier(100), Value(...)
request edge: (100 <- liked <- in <- 1)
snapshot edge: RowKey(100, liked), Qualifier(1), Value(...)
We can adopt a single rule for the snapshotEdge, such as RowKey(smaller vertexId, label), Qualifier(larger vertexId).
Then we only need to keep one snapshot edge, as long as we stick to the same rule when looking up the snapshotEdge.
This would reduce storage usage significantly.
The current cache only removes unnecessary I/O requests to the backend storage; even on a cache hit, we still need to run groupBy and filtering operations over the cached edges.
It would be better to provide a step-wise/queryParam-wise result cache so we can also skip unnecessary CPU-bound operations like groupBy and filtering. This would actually make the local cache more space-efficient too.
As in #85 and #86, a union is performed via an array of conventional queries, and the result is likewise an array of per-query results.
Some use cases need to combine the results. For example, a hybrid recommender system that combines multiple recommendations needs to aggregate the scores of the union query results.
For this, I'd like to propose a query like the following:
{
"queries": [
{
"srcVertices": [
{
"columnName": "user_id_test",
"id": 0,
"serviceName": "s2graph"
}
],
"steps": [
[
{
"direction": "out",
"label": "s2graph_label_test_0",
"offset": 0
}
]
]
},
{
"srcVertices": [
{
"columnName": "user_id_test",
"id": 0,
"serviceName": "s2graph"
}
],
"steps": [
[
{
"direction": "out",
"label": "s2graph_label_test_1",
"offset": 0
}
]
]
}
],
"weights": [
0.6,
0.4
],
"aggregateBy": ["to"]
}
where the weights 0.6 and 0.4 are used to aggregate the scores by weighted sum.
The results of the above query would carry a weightedSum, the weighted sum of the scores of results whose to (and any other keys in aggregateBy) are the same.
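The proposed aggregation can be sketched in Python (illustrative only; real results would carry the full edge JSON, and ties on multiple aggregateBy keys would use a tuple key):

```python
from collections import defaultdict

def weighted_union(results, weights, key="to"):
    """Sketch: combine per-query result lists into one map, summing
    weight * score for rows that share the same aggregateBy key."""
    agg = defaultdict(float)
    for rows, w in zip(results, weights):
        for row in rows:
            agg[row[key]] += w * row["score"]
    return dict(agg)

q0 = [{"to": 1, "score": 1.0}, {"to": 2, "score": 0.5}]  # weight 0.6
q1 = [{"to": 2, "score": 1.0}]                           # weight 0.4
print(weighted_union([q0, q1], [0.6, 0.4]))
# → {1: 0.6, 2: 0.7}
```

Here "to" 2 appears in both queries, so its weightedSum is 0.6 * 0.5 + 0.4 * 1.0 = 0.7, matching the proposal's intent.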