kakao / s2graph
This code base is retained for historical interest only; please visit the Apache Incubator repo for the latest version.
Home Page: https://github.com/apache/incubator-s2graph
License: Other
EdgeQualifier stores the edge operation byte in the qualifier. The EdgeWithIndexInverted class needs to store the operation byte, but EdgeWithIndex's operation byte is never actually used (the only possible operations are insert and increment, and there is no need to distinguish the two).
So I suggest removing opBytes from EdgeQualifier's serialization and using insert as the operation when deserializing.
At Daum Kakao, many services use s2graph as a recommendation engine.
For recommendation use, it is important to redirect traffic to multiple logics and then provide a way for users to distinguish each logic, so they can measure the click-through rate of each logic (and decide which is better).
In the long term, I think it is better to separate the A/B test ability (specifically, redirecting traffic to multiple logics) into another project, but for now it is embedded in the s2graph rest project.
It would be better to generalize the A/B test ability further and provide it as a separate project, so that others who don't want to use s2graph but are interested in traffic redirection can use it.
To get how many edges exist for any (vertex, label) pair, the current implementation needs to fetch all edges.
It would be helpful to update the total number of edges when inserting/deleting edges, so queries can return the total without retrieving all edges.
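The bookkeeping this proposes could look like the sketch below (a hypothetical Python model, not the project's Scala code; class and method names are illustrative): maintain a counter per (vertex, label) alongside each mutation so count queries become O(1).

```python
from collections import defaultdict

class EdgeCountIndex:
    """Hypothetical sketch: keep a running edge count per (vertex, label)
    so count queries need not fetch every edge."""
    def __init__(self):
        self.counts = defaultdict(int)

    def on_insert(self, src, label):
        self.counts[(src, label)] += 1

    def on_delete(self, src, label):
        self.counts[(src, label)] -= 1

    def edge_count(self, src, label):
        # O(1) lookup instead of scanning all edges for the pair
        return self.counts[(src, label)]

idx = EdgeCountIndex()
idx.on_insert(1, "liked")
idx.on_insert(1, "liked")
idx.on_delete(1, "liked")
print(idx.edge_count(1, "liked"))  # → 1
```

In practice the counter would live in the storage layer (e.g. an HBase increment column) so it stays consistent with the edge mutations themselves.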
In the s2loader project, the TransferToHFile class uses HFileOutputFormat2.configureIncrementalLoad.
It currently uses job.setMapOutputValueClass(classOf[Cell]), which fails to figure out how to split the table (https://github.com/cloudera/hbase/blob/cdh5.3.6-release/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L389-L391).
Changing the MapOutputValueClass from Cell to KeyValue is necessary for bulk load to work properly.
Some programming languages, such as JavaScript, cannot represent integers wider than 53 bits.
https://dev.twitter.com/overview/api/twitter-ids-json-and-snowflake
In JavaScript, parsing a Long-typed property loses precision:
> 94329429622284704
94329429622284700 // 704 becomes 700
> 10765432100123456789
10765432100123458000 // 789 becomes 000
Provide a workaround for the Long data type:
{"long_type_column": 10765432100123456789}
↓
{"long_type_column": 10765432100123456789, "long_type_column_str": "10765432100123456789"}
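The workaround amounts to emitting a string companion field for any value outside JavaScript's 53-bit safe range. A minimal Python sketch of the serialization-side logic (field and function names are illustrative, not s2graph's API):

```python
import json

def add_string_shadow(doc, long_fields):
    """For each listed field holding an integer wider than JavaScript's
    53-bit safe range (2**53 - 1), add a companion "<field>_str" field
    carrying the exact value as a string."""
    out = dict(doc)
    for f in long_fields:
        v = doc.get(f)
        if isinstance(v, int) and abs(v) > 2**53 - 1:
            out[f + "_str"] = str(v)
    return out

doc = {"long_type_column": 10765432100123456789}
print(json.dumps(add_string_shadow(doc, ["long_type_column"]), sort_keys=True))
# → {"long_type_column": 10765432100123456789, "long_type_column_str": "10765432100123456789"}
```

JavaScript clients would then read the `_str` field whenever exact ids matter.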
Currently s2graph uses the LZ4 codec for HBase table compression, and this cannot be changed. This property should be configurable.
Currently, for an edge query, the user must know the source vertex and which label the query should traverse.
In some cases, the user wants to know which labels a vertex currently participates in for edge traversal.
For example, currently we can traverse the graph like this:
current: src vertex 1 -> liked, traverse the liked label starting from vertex 1
proposal: src vertex 1 -> * (all labels that src vertex 1 participates in), traverse all labels starting from vertex 1
When a client sends too many requests in a short period (s2graph fires RPCs directly into HBase without throttling), read response times spike because HBase becomes busy processing the write requests.
It would be better to provide a way to throttle write requests to HBase within s2graph, for stability.
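One common way to implement such write throttling is a token bucket in front of the HBase client. A hedged Python sketch of the idea (names and parameters are illustrative, not s2graph's API):

```python
import time

class WriteThrottle:
    """Hypothetical token-bucket throttle for writes to HBase: allow at
    most `rate` mutations per second, with bursts up to `burst`."""
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = float(burst), clock()

    def try_acquire(self):
        now = self.clock()
        # refill tokens in proportion to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should buffer or reject the write

t = 0.0  # frozen fake clock so the demo is deterministic
throttle = WriteThrottle(rate=2, burst=2, clock=lambda: t)
print([throttle.try_acquire() for _ in range(3)])  # → [True, True, False]
```

Rejected writes could be buffered in a queue and retried, so reads keep their latency budget while writes are smoothed out.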
The getService and getLabel APIs do not return all of the information stored in the DB.
Currently long/int/string (up to 255 bytes) are supported. It would be helpful to store floating-point numbers or a text data type on vertices and edges. It seems hbase-common has OrderedBytes, which can preserve natural ordering in bytes.
Related to #30.
If we have the following edges with aggregated counts, it would be helpful to group by and normalize them:
item1 -> likedCount -> month:2015-10 -> 1
item1 -> likedCount -> month:2015-09 -> 2
item1 -> likedCount -> day: 2015-10-01 -> 1
item1 -> likedCount -> day: 2015-09-30 -> 1
item1 -> likedCount -> day: 2015-09-29 -> 1
Users may only care about the group-by(from) result, like the following (with or without time decay):
item1 -> likedCount -> item1(1 + 2 + 1 + 1 + 1)
We need to provide more flexibility over which edge fields can be grouped by on step and queryParam.
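The group-by described above can be sketched as follows (an illustrative Python model of the semantics, not the project's implementation; the optional `decay` hook stands in for the time-decay variant):

```python
from collections import defaultdict

def group_by_from(edges, decay=None):
    """Sketch: collapse (from, label, bucket, count) edges into one score
    per (from, label), optionally weighting each bucket by decay(bucket)."""
    totals = defaultdict(float)
    for frm, label, bucket, count in edges:
        w = decay(bucket) if decay else 1.0
        totals[(frm, label)] += count * w
    return dict(totals)

edges = [
    ("item1", "likedCount", "month:2015-10", 1),
    ("item1", "likedCount", "month:2015-09", 2),
    ("item1", "likedCount", "day:2015-10-01", 1),
    ("item1", "likedCount", "day:2015-09-30", 1),
    ("item1", "likedCount", "day:2015-09-29", 1),
]
print(group_by_from(edges))  # → {('item1', 'likedCount'): 6.0}
```

This matches the item1 -> likedCount -> item1(1 + 2 + 1 + 1 + 1) example above.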
s2graph supports multiple services at Daum Kakao.
If it is OK to share the use cases at Daum Kakao, it would help others decide whether s2graph is sufficient for their needs.
Currently s2graph uses the CDH HBase and Hadoop packages. The Apache packages are more common than CDH,
and the package versions should be configurable.
questions
Note: if you only need the vertex id, you don't need to create the vertex explicitly. When you create a label,
s2graph creates the vertex with empty properties according to the label schema.
requests
https://github.com/daumkakao/s2graph#rest-api-glossary
I can't browse the routes file at the link above; it redirects to this link instead.
Did the link change from the old one to https://github.com/daumkakao/s2graph/blob/master/conf/routes?
In regard to #93, the general workflow of updating an existing label data via bulk upload is expected to be as follows:
Steps 1 and 2 will be achieved by running a spark job (in a127798) and calling getEdges API respectively.
So all we need is an API to take care of step 3.
This example is wrong.
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "user_article_liked",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph_news",
"tgtColumnName": "article_id",
"tgtColumnType": "string",
"indexProps": {},
"props": {},
"serviceName": "s2graph_news"
}
'
Running it produces an error:
[error] application - java.lang.RuntimeException: target service s2graph_news is not created.
java.lang.RuntimeException: target service s2graph_news is not created.
There is no s2graph_news service in the test script, so it is not a good example.
Change the example like this:
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "user_article_liked",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph",
"tgtColumnName": "article_id",
"tgtColumnType": "string",
"indexProps": {},
"props": {}
}
'
Result
{"message":"user_article_liked is created"}
Currently, to increment a property's value on an edge, s2graph performs the following steps.
It would be better to simplify the inner logic for simple count values.
The current use case is storing aggregated count values per vertex per time unit.
ex) actual example events
2015-10-01-00:00:01 -> SteamShon -> liked -> item1
2015-09-30-00:00:01 -> SteamShon -> liked -> item1
2015-09-29-00:00:01 -> SteamShon -> liked -> item1
Sometimes users want item1's count list aggregated by a time unit such as (month, week, day, hour).
If we create a new incrementCount operation on edges that expects ts, operation, from, to, label, indexProps, and countValue, then users can issue the operations below to build these edges in s2graph.
2015-10-01-00:00:01 incrementCount item1 -> likedCount -> 1443625200000 {"timeUnit": "month", "count": 1}
2015-09-30-00:00:01 incrementCount item1 -> likedCount -> 1441033200000 {"timeUnit": "month", "count": 1}
The expected edges would then be as below.
item1 -> likedCount -> month:2015-10 -> 1
item1 -> likedCount -> month:2015-09 -> 2
item1 -> likedCount -> day: 2015-10-01 -> 1
item1 -> likedCount -> day: 2015-09-30 -> 1
item1 -> likedCount -> day: 2015-09-29 -> 1
Queries can now traverse item1's aggregated count per time-unit value as edges, so all query DSL functionality applies naturally.
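The core of incrementCount is mapping an event timestamp to its aggregation bucket, so repeated events land on one edge per (timeUnit, bucket). A Python sketch of that mapping (function name and formats are illustrative; this uses UTC, while the document's examples appear to use local time):

```python
from datetime import datetime, timezone

def bucket_key(ts_millis, time_unit):
    """Sketch: map an event timestamp (milliseconds) to its aggregation
    bucket, so incrementCount targets one edge per (timeUnit, bucket)."""
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    fmt = {"month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%d %H"}[time_unit]
    return f"{time_unit}:{dt.strftime(fmt)}"

print(bucket_key(1443625200000, "day"))   # a 2015-09-30 UTC timestamp
print(bucket_key(1443625200000, "month"))
```

All events falling in the same bucket then increment the same edge's count, yielding exactly the aggregated edges listed above.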
With a multi-step query, there is currently no way to bound the number of edges returned as the final result.
So we need a global-scope limit to constrain the number of edges in the final result.
In the example below, the top-level limit clause should cap the number of edges at 10.
{
"limit" : 10,
"srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
"steps": [
{
"step": [ {"label": "user_click_item", "direction":"out", "limit": 10, "scoring": {"score": 1}} ]
},
{
"step": [ {"label": "user_click_item", "direction":"in", "limit": 10, "scoring": {"score": 1}} ]
},
{
"step": [ {"label": "user_click_item", "direction":"out", "limit": 10, "scoring": {"score": 1}} ]
}
]
}
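The semantics of the proposed top-level limit can be sketched in a few lines (illustrative Python, assuming per-step limits have already been applied and only the last step's edges form the final result):

```python
def apply_global_limit(step_results, limit):
    """Sketch: after per-step limits are applied, truncate the final
    step's edges so the whole query returns at most `limit` edges."""
    final = step_results[-1]
    return step_results[:-1] + [final[:limit]]

# three steps, each step already capped at its own per-step limit;
# the last step fanned out to 25 edges before the global limit
steps = [list(range(10)), list(range(10)), list(range(25))]
out = apply_global_limit(steps, 10)
print(len(out[-1]))  # → 10
```

Applying the cut only to the final step preserves the intermediate fan-out that later steps need.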
Add operators to the where clause.
Currently (=, !=, between) are supported; the comparison operators (>, >=, <, <=) should be added.
Example:
# in the query DSL ..
"where": "time > 3 or time <= 1"
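The proposed operators could be evaluated as in this Python sketch (the real WhereParser is a Scala parser combinator; the names here are illustrative and only the comparison semantics are shown):

```python
import operator

# Map of operator tokens to comparison functions, including the proposed
# (>, >=, <, <=) alongside the existing (=, !=).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq, "!=": operator.ne}

def eval_clause(props, field, op, value):
    """Evaluate one comparison clause against an edge's properties."""
    return OPS[op](props[field], value)

# "time > 3 or time <= 1" evaluated on an edge with time = 3
props = {"time": 3}
result = eval_clause(props, "time", ">", 3) or eval_clause(props, "time", "<=", 1)
print(result)  # → False
```

The parser side would only need new tokens for the four operators; evaluation reuses the same machinery as `=` and `!=`.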
Think of an S2AB bucket defined as below:
{
"srcVertices": [
{
"serviceName": "some_service",
"columnName": "article_id",
"id": [[doc_id]]
}
],
"steps": [
{
"step": [
{
"label": "similar_articles",
"direction": "out",
"offset": 0,
"limit": 10,
"where": "is_blacklisted_article=false and article_id != [[doc_id]]"
}
]
}
]
}
According to the current spec, the client will make the following call:
curl -XPOST http://graph-query.iwilab.com:9000/graphs/experiment/{app key}/{experiment name}/{uuid} -H 'Content-Type: Application/json' -d '
{
"[[doc_id]]": "some-string-id"
}
'
And this will return an error due to the nested double quotes in the "where" field:
"where": "is_blacklisted_article=false and article_id != "some-string-id""
Everyone would be happy if [[doc_id]] were replaced with some-string-id instead of "some-string-id" (without the quotation marks):
{
"srcVertices": [
{
"serviceName": "some_service",
"columnName": "article_id",
"id": "[[doc_id]]" <=== quotation marks added!
}
],
"steps": [
{
"step": [
{
"label": "similar_articles",
"direction": "out",
"offset": 0,
"limit": 10,
"where": "is_blacklisted_article=false and article_id != [[doc_id]]"
}
]
}
]
}
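The fix amounts to type-aware template binding: substitute the raw JSON value when the placeholder is itself an entire JSON string value, and splice in the bare value when the placeholder sits inside a longer string such as a where clause. A hedged Python sketch (function name is illustrative, not the S2AB implementation):

```python
import json

def bind_template(template_str, variables):
    """Sketch: replace [[var]] placeholders type-aware. A quoted,
    whole-value placeholder ("id": "[[doc_id]]") becomes a JSON value;
    a placeholder embedded in a longer string (a where clause) is
    spliced in without surrounding quotes."""
    for name, value in variables.items():
        # whole-value occurrence: swap the quoted placeholder for the JSON value
        template_str = template_str.replace(f'"[[{name}]]"', json.dumps(value))
        # embedded occurrence: splice in the bare value, no extra quotes
        template_str = template_str.replace(f"[[{name}]]", str(value))
    return template_str

tmpl = '{"id": "[[doc_id]]", "where": "article_id != [[doc_id]]"}'
print(bind_template(tmpl, {"doc_id": "some-string-id"}))
# → {"id": "some-string-id", "where": "article_id != some-string-id"}
```

Doing the quoted replacement first guarantees the embedded case never gains nested double quotes, which is exactly the reported error.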
With the current implementation, concurrent updates on the same edge, i.e. the same (source, label, dir, target), lead to broken states in the snapshot edge.
For example, if the following requests arrive concurrently, the result is not deterministic:
1434380239199 delete e 16 1016 s2graph_label_test
1434380239200 update e 16 1016 s2graph_label_test {"time": 10, "weight": 20}
1434380239198 update e 16 1016 s2graph_label_test {"time": 10}
This is because Edge updates the snapshot edge without taking a lock when it builds the update from what it read.
For example, suppose there was no edge between 16 and 1016, and the requests above then arrive concurrently:
each request's fetched snapshot edge is none, so each builds its own update and mutates. All of these operations build a wrong update, because what they read is no longer a valid state by the time they build the update and mutate.
To resolve broken states in the snapshot edge, we first need to validate that each edge's update is still valid after the read, using HBase's checkAndSet operation.
Since we have already read the snapshot edge, instead of firing a put, use checkAndSet and check the return value. If it returns false, another writer has already updated the same edge and contention occurred, so a re-read is required; otherwise there was no contention and we are safe to mutate the indexedEdge as well.
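The read-build-checkAndSet-retry loop can be sketched as follows (a toy single-threaded Python model of the semantics; HBase's checkAndPut performs the compare-and-set atomically on the server, which this `Cell` class only mimics):

```python
class Cell:
    """Toy single-cell store with an atomic check-and-set, mimicking
    HBase checkAndPut semantics."""
    def __init__(self, value=None):
        self.value = value
    def get(self):
        return self.value
    def check_and_set(self, expected, new):
        # write only if the stored value still equals what we read
        if self.value == expected:
            self.value = new
            return True
        return False

def update_snapshot(cell, build_update, max_retries=5):
    """Read the snapshot edge, build the update from what was read, then
    commit with check-and-set; on contention (False), re-read and retry."""
    for _ in range(max_retries):
        snapshot = cell.get()
        if cell.check_and_set(snapshot, build_update(snapshot)):
            return True  # snapshot committed; safe to mutate indexed edges
    return False  # persistent contention; give up and surface an error

cell = Cell({"time": 10})
ok = update_snapshot(cell, lambda s: {**(s or {}), "weight": 20})
print(ok, cell.get())  # → True {'time': 10, 'weight': 20}
```

On a False return from the real checkAndSet, the whole loop body runs again against the fresh snapshot, so every committed update was built from a state that was still valid at commit time.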
I tried running script/test.sh and found errors in the vertex queries.
Here are the failing queries from test.sh:
curl -XPOST localhost:9000/graphs/vertices/insert/s2graph/user_id -H 'Content-Type: Application/json' -d '
[
{"id":1,"props":{"is_active":true}, "timestamp":1417616431},
{"id":2,"props":{},"timestamp":1417616431}
]
'
curl -XPOST localhost:9000/graphs/getVertices -H 'Content-Type: Application/json' -d '
[
{"serviceName": "s2graph", "columnName": "user_id", "ids": [1, 2, 3]}
]
'
example)
{
...
"indexProps": [
{
"indexName": "pk",
"indexDirection": "both",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexDirection": "in",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
]
...
}
The current implementation only considers the _to field value when filtering out (https://github.com/kakao/s2graph/blob/develop/app/controllers/PostProcess.scala#L73).
It may be more intuitive to filter the queryResult based on (from, label, dir, to), which is an edge's reference.
Tests fail when running:
sbt test
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/plugins/project
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/staging/bac26239dae466e87fa4/ensime-sbt-cmd/project
[info] Loading global plugins from /Users/blueiur/.sbt/0.13/plugins
[info] Loading project definition from /Users/blueiur/code/s2graph/project
[info] Set current project to s2graph (in build file:/Users/blueiur/code/s2graph/)
[info] Compiling 5 Scala sources to /Users/blueiur/code/s2graph/target/scala-2.10/test-classes...
[error] /Users/blueiur/code/s2graph/test/controllers/GraphSpec.scala:76: not found: value toQuery
[error] val query = toQuery(Json.parse(queryEdges))
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/GraphSpec.scala:81: not found: value toEdges
[error] val jsonEdges = toEdges(jsons, "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:453: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.init()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:455: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/IntegritySpec.scala:482: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.shutdown()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:12: object wordnik is not a member of package com
[error] import com.wordnik.swagger.annotations.Api
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:15: object RequestParser is not a member of package controllers
[error] Note: trait RequestParser exists, but it has no companion object.
[error] import controllers.RequestParser._
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:252: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.init()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:254: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:266: not found: value toEdges
[error] val inserts = toEdges(Json.parse(jsArrStr), "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:272: not found: value toEdges
[error] val inserts2nd = toEdges(Json.parse(jsArrStr2nd), "insert")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:281: type mismatch;
[error] found : play.api.Configuration
[error] required: com.typesafe.config.Config
[error] Graph(Config.conf)(ExecutionContext.Implicits.global)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:286: not found: value toEdges
[error] val deletes = toEdges(Json.parse(jsArrStr), "delete")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:291: not found: value toEdges
[error] val deletes2nd = toEdges(Json.parse(jsArrStr2nd), "delete")
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/QuerySpec.scala:299: not found: value GraphAggregatorActor
[error] GraphAggregatorActor.shutdown()
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/RequestParserSpec.scala:34: not found: value WhereParser
[error] val whereOpt = WhereParser(label).parse(sql)
[error] ^
[error] /Users/blueiur/code/s2graph/test/controllers/RequestParserSpec.scala:46: value must is not a member of Nothing
[error] checked must beEqualTo(expected)
[error] ^
[error] 17 errors found
[error] (root/test:compile) Compilation failed
[error] Total time: 7 s, completed 2015. 5. 13 오전 10:36:15
Setting up S2Graph on your local machine is still quite a hassle. (Although much has improved since a couple of months ago..)
Vagrant seems like a nice approach to this problem.
"Vagrant will isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you're used to working with (editors, browsers, debuggers, etc.)."
I tested with HBase 0.98.12-hadoop2 and ran the following command from the README:
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"from":1,"to":101,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp":1417616431},
{"from":1,"to":102,"label":"graph_test","props":{"time":0, "weight":11},"timestamp":1417616431},
{"from":1,"to":103,"label":"graph_test","props":{"time":1, "weight":12},"timestamp":1417616431},
{"from":1,"to":104,"label":"graph_test","props":{"time":-2, "weight":1},"timestamp":1417616431}
]
'
I received the success message "1 insert success", but I could not find any rows in the HBase table.
I investigated and found that the timestamp in the example is in the wrong format:
the value in ["timestamp":1417616431] is not a Java (millisecond) timestamp. After changing it to a Java timestamp, s2graph stored the rows in the HBase table.
The README file should be changed, or validation logic for the timestamp should be added.
It works with Java 7 and Protobuf 2.5.0, but errors occur with Java 8 or Protobuf 2.6.1.
I can't find the prerequisite versions (Java 7, Protobuf 2.5.0) in README.md.
Error messages:
../../asynchbase/target/generated-sources/protobuf/java/org/hbase/async/generated/ClientPB.java:[139,30] error: cannot find symbol
[ERROR] symbol: class ProtocolStringList
Currently, multiple indexes on a label can only be created one by one via the addIndex API.
It would be better to create multiple indexes when the user creates the label, with user-provided index names.
An example would be the following:
curl -XPOST localhost:9000/graphs/createLabel -H 'Content-Type: Application/json' -d '
{
"label": "graph_test",
"srcServiceName": "s2graph",
"srcColumnName": "user_id",
"srcColumnType": "long",
"tgtServiceName": "s2graph",
"tgtColumnName": "item_id",
"tgtColumnType": "string",
"serviceName": "s2graph",
"indexProps": [
{
"indexName": "pk",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
],
"props": [
{"name": "doc_created_at", "defaultValue": 0, "dataType": "long"}
]
}
'
Then it would be possible for a query to specify the index name in the query DSL. Currently the "scoring" field has two purposes: selecting which index to traverse, and providing the actual weights for scoring. The scoring field needs to be separated from index selection.
{
"select": [],
"srcVertices": [
{
"serviceName": "s2graph_test",
"columnName": "user_id",
"id": 5
}
],
"steps": [
{
"step": [
{
"label": "graph_test",
"direction": "out",
"offset": 0,
"limit": 10,
"index": "pk"
}
]
}
]
}
Insert data as follows.
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"timestamp": 1, "from": 101, "to": "a", "label": "graph_test"},
{"timestamp": 2, "from": 101, "to": "b", "label": "graph_test"},
{"timestamp": 3, "from": 101, "to": "c", "label": "graph_test"}
]
'
For the following query, the expected result would be the single edge whose target is "a".
{
"select": [],
"srcVertices": [
{
"serviceName": "s2graph_test",
"columnName": "user_id",
"id": 101
}
],
"steps": [
{
"step": [
{
"label": "graph_test",
"direction": "out",
"offset": 0,
"limit": 10,
"_to": "a"
}
]
}
]
}
But currently the result is empty.
I compiled asynchbase according to the README file, but I got the following "make" error and mvn error messages.
[~/s2graph]# cd asynchbase; make; mvn install
Makefile:29: third_party/include.mk: No such file or directory
make: *** No rule to make target `third_party/include.mk'. Stop.
.....
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[94,23] error: cannot find symbol
[ERROR] symbol: class KeyValue
location: class TestNSREs
/Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[95,23] error: cannot find symbol
[ERROR] symbol: class RegionInfo
location: class TestNSREs
/Users/babokim/work/workspace/s2graph/asynchbase/test/TestNSREs.java:[96,23] error: cannot find symbol
When creating a Label, the labelMeta, labelIndex, and serviceColumn are created if they do not exist.
The problem is that these operations should be atomic within a transaction, so that any failure can be reverted.
The current implementation does not use a transaction, so a partial failure corrupts the DB.
Currently, the result JSON only shows edge properties that have been modified. It would be better to include all edge properties in the result JSON, even when they are unmodified.
Provide a union query:
run multiple queries given as an array.
[
{
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 0
}
],
"steps": [
[
{
"label": "s2graph_label_test",
"direction": "out",
"offset": 0
}
]
]
}
,
{
"srcVertices": [
{
"serviceName": "s2graph",
"columnName": "user_id_test",
"id": 0
}
],
"steps": [
[
{
"label": "s2graph_label_test",
"direction": "in",
"offset": 0
}
]
]
}
]
The results are aggregated per query in an array:
[
{
"size": 1,
"degrees": [
{
"from": 0,
"label": "s2graph_label_test",
"direction": "in",
"_degree": 1
}
],
"results": [
{
"cacheRemain": -7,
"timestamp": 3003,
"score": 1,
"label": "s2graph_label_test",
"direction": "in",
"to": 2,
"_timestamp": 3003,
"from": 0,
"props": {
"weight": 30,
"is_blocked": false,
"_count": -1,
"_timestamp": 3003,
"is_hidden": false,
"time": 0
}
}
],
"impressionId": 764860958
},
{
"size": 2,
"degrees": [
{
"from": 0,
"label": "s2graph_label_test",
"direction": "out",
"_degree": 2
}
],
"results": [
{
"cacheRemain": -16,
"timestamp": 2002,
"score": 1,
"label": "s2graph_label_test",
"direction": "out",
"to": 2,
"_timestamp": 2002,
"from": 0,
"props": {
"weight": 20,
"is_blocked": false,
"_count": -1,
"_timestamp": 2002,
"is_hidden": false,
"time": 0
}
},
{
"cacheRemain": -16,
"timestamp": 1001,
"score": 1,
"label": "s2graph_label_test",
"direction": "out",
"to": 1,
"_timestamp": 1001,
"from": 0,
"props": {
"weight": 10,
"is_blocked": false,
"_count": -1,
"_timestamp": 1001,
"is_hidden": true,
"time": 0
}
}
],
"impressionId": -1650835965
}
]
This issue is from @hsleep.
It can be useful to use different indexProps for each of a label's directions.
example)
{
"indexProps": [
{
"indexName": "pk",
"indexDirection": "both",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" },
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"}
]
},
{
"indexName": "idx_2",
"indexDirection": "in",
"indexProps": [
{"name": "user_type", "defaultValue": "-", "dataType": "string" },
{"name": "_timestamp", "defaultValue": 0, "dataType": "long"},
{"name": "doc_type", "defaultValue": "-", "dataType": "string" },
{"name": "action_type", "defaultValue": "v", "dataType": "string" }
]
}
]
}
out: out direction index only
in: in direction index only
both: out/in direction index
There is a dependency on the MySQL connector in build.sbt. I don't know exactly what the MySQL connector's license is, but it is not compatible with the Apache license, so we should check it.
I think using Derby for the default meta store is more common, and this would avoid the license problem.
Problem:
I've noticed that S2Graph clients quite often shuffle query results before serving them to users, to add some randomness to the user experience.
An option to randomly sample a set of queried edges would result in much simpler client code.
For example, let's say a client is running an A/B test on S2Graph items with A) a sorted bucket and B) a random bucket.
As is, she will have to identify the random bucket id B and mix up the results herself.
With the suggested feature, both buckets can be handled uniformly.
Idea:
Right now, I'm considering a step-level integer parameter "sample" that tells S2Graph to randomly sample N edges from the result set of the corresponding step.
Any guidance is welcome!
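The step-level "sample" parameter would boil down to uniform sampling without replacement over a step's result set. A Python sketch of the semantics (the function name and seeding are illustrative; a server would not normally seed):

```python
import random

def sample_step(edges, n, seed=None):
    """Sketch of a step-level "sample" parameter: return n edges drawn
    uniformly without replacement (all edges if there are fewer than n)."""
    rng = random.Random(seed)
    if len(edges) <= n:
        return list(edges)
    return rng.sample(edges, n)

edges = [f"edge{i}" for i in range(100)]
picked = sample_step(edges, 10, seed=42)
print(len(picked))  # → 10
```

Applied after the step's limit/offset, this would let the random bucket and the sorted bucket share one query shape, differing only in the presence of "sample".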
The degree of a vertex is the number of edges incident to the vertex.
It should be possible to store a vertex's degree, and the labels the vertex participates in, while we mutate edges.
It would be very useful to expose a vertex's degree and which labels it has edges in.
The current README is not easy to understand. I think this is because all the information is on a single page; better organization by topic, with more examples and diagrams, would make things more understandable.
After profiling with VisualVM, filterEdges on Graph contains unnecessary checks that use many CPU cycles.
The points to improve in Graph.filterEdges are the following (develop branch).
Personally, I am not a fan of micro-optimization, but filterEdges goes through every fetched edge, so a little optimization of this method may be necessary.
Benchmarks show that a lot of CPU cycles are wasted in Graph.toHashKey and in simply checking exclude/include.
Currently, all APIs for mutating edges/vertices are non-blocking.
It would be good to provide blocking APIs as well, so users can choose according to their needs.
When a select column list is given in a query, it is not necessary to build the props map and the other result-JSON fields for every edge.
The current implementation first creates the full result JSON and then filters it down to the selected columns. Since the edgeToJson method runs on every edge (it is called a lot), a small improvement here would increase performance quite a bit.
A quick idea is to skip unnecessary JSON object creation whenever the given query does not need it.
There is no direction on a label, but edges/queries do have a direction. This makes Edge/Query complicated. It would be better to refactor these, or at least document them clearly.
The following is my test data.
curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d '
[
{"label":"s2graph_label_test","from":-1,"to":1,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":-1,"to":2,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":-1,"to":3,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":1,"to":10,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":1,"to":11,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":2,"to":11,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":2,"to":12,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":3,"to":12,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":3,"to":13,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":10,"to":100,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":11,"to":200,"props":{},"timestamp":1},
{"label":"s2graph_label_test","from":12,"to":300,"props":{},"timestamp":1}
]
'
In the picture, the graph looks like this.
Currently, the result JSON does not contain the ancestors of each result. For example, it is impossible to know where edge (12 -> 300) came from; it comes from (2 and 3). It would be good to provide users a way to keep track of the ancestors of final-step edges.
Currently the where parser can only compare an edge's property variables (_from, _to, props) with an input value.
As is: just compare with an input value
where: "_from = 10" # lhs(`_from`) is the edge's property, rhs(`10`) is an input value
To be: can also compare with another variable
# Compare edge's prop(`_from`) with edge's prop(`age`)
where: "_from = age"
# Compare edge's prop(`category`) with edge's prop(`to`)
where: "category = to"
# Compare edge's prop(`category`) with parent edge's prop(`category`)
where: "category = _parent.category"
# Compare edge's prop(`gender`) with grandparent edge's prop(`gender`)
where: "_parent._parent.gender = gender"
There is currently storage overhead on snapshot edges.
ex) a single request edge with only _timestamp as indexProps:
{"timestamp": t1, "from": 1, "to": 100, "label": "liked", "props": {}, "direction": "out"}
From this single request, 2 logically identical edges need to be created:
(1 -> liked -> out -> 100)
(100 <- liked <- in <- 1)
Note that from and to are swapped and the direction is toggled.
If the label is undirected, then 4 identical edges need to be created:
(1 -> liked -> out -> 100), (100 -> liked -> out -> 1)
(100 <- liked <- in <- 1), (1 <- liked <- in <- 100)
In the Edge class, the relatedEdges method returns these edges.
The problem is that all of these related edges come from the same data, so they share the same snapshotEdge.
A snapshotEdge stores each edge's (from, labelId) as the row key and to as the qualifier.
All of the above relatedEdges carry the same data in their snapshotEdge, but the current implementation stores multiple snapshotEdges.
ex)
request edge: (1 -> liked -> out -> 100)
snapshot edge: RowKey(1, liked), Qualifier(100), Value(...)
request edge: (100 <- liked <- in <- 1)
snapshot edge: RowKey(100, liked), Qualifier(1), Value(...)
We can adopt a single rule for the snapshotEdge, such as RowKey(smaller vertexId, label), Qualifier(larger vertexId).
Then we only need to keep one snapshot edge, as long as we stick to the same rule when looking up the snapshotEdge.
This would reduce storage usage significantly.
The current cache only removes unnecessary I/O requests to the backend storage; even on a cache hit, we still need to run groupBy and filtering operations over the cached edges.
It would be better to provide a step-wise/queryParam-wise result cache so we can also skip unnecessary CPU-bound operations like groupBy and filtering. This would actually make the local cache more space-efficient too.
As in #85 and #86, a union is performed via an array of conventional queries, and the result is likewise an array of per-query results.
Some use cases need to combine the results. For example, a hybrid recommender system that combines multiple recommendations needs to aggregate the scores of the union query results.
For this, I'd like to propose a query like the following:
{
"queries": [
{
"srcVertices": [
{
"columnName": "user_id_test",
"id": 0,
"serviceName": "s2graph"
}
],
"steps": [
[
{
"direction": "out",
"label": "s2graph_label_test_0",
"offset": 0
}
]
]
},
{
"srcVertices": [
{
"columnName": "user_id_test",
"id": 0,
"serviceName": "s2graph"
}
],
"steps": [
[
{
"direction": "out",
"label": "s2graph_label_test_1",
"offset": 0
}
]
]
}
],
"weights": [
0.6,
0.4
],
"aggregateBy": ["to"]
}
where the weights 0.6 and 0.4 are used to aggregate the scores by weighted sum.
The results of the above query would carry a weightedSum, the weighted sum of the scores of results whose to (and any other keys in aggregateBy) are the same.
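The proposed aggregation can be sketched in Python (illustrative only; real results would carry the full edge JSON, and ties on multiple aggregateBy keys would use a tuple key):

```python
from collections import defaultdict

def weighted_union(results, weights, key="to"):
    """Sketch: combine per-query result lists into one map, summing
    weight * score for rows that share the same aggregateBy key."""
    agg = defaultdict(float)
    for rows, w in zip(results, weights):
        for row in rows:
            agg[row[key]] += w * row["score"]
    return dict(agg)

q0 = [{"to": 1, "score": 1.0}, {"to": 2, "score": 0.5}]  # weight 0.6
q1 = [{"to": 2, "score": 1.0}]                           # weight 0.4
print(weighted_union([q0, q1], [0.6, 0.4]))
# → {1: 0.6, 2: 0.7}
```

Here "to" 2 appears in both queries, so its weightedSum is 0.6 * 0.5 + 0.4 * 1.0 = 0.7, matching the proposal's intent.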