dream-lab / goffish_v3
Latest version of the GoFFish distributed graph processing platform
Support both source and sink vertex IDs/pointers being accessible from IEdge, rather than just the sink. This may help with samples like spanning tree.
Remove the non-FullInfo readers, as they are deprecated and have been replaced by the FullInfo readers with the full Hadoop pipeline.
Add Iterable<S> getRemoteSubgraphs() to ISubgraph to allow access to neighboring subgraph IDs directly, rather than iterating through all remote vertices; e.g., this helps in connected components.
Optionally, also allow access to all subgraph IDs in the metagraph.
Hama's bsp.sync actually transmits the messages. Capturing memory usage before/after sync lets us estimate the memory used by the output message buffer, and capturing timestamps before/after sync gives the communication duration as well.
The BFS implementation can use a faster FIFO queue rather than a priority queue, since unweighted BFS does not need priority ordering.
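As an illustration (not the GoFFish sample itself), a minimal unweighted BFS on an adjacency list using java.util.ArrayDeque, whose O(1) offer/poll beats a PriorityQueue's O(log n) operations:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BfsQueue {
    // Unweighted BFS needs only FIFO order, so an ArrayDeque suffices;
    // returns the hop distance from src to every vertex (-1 if unreachable).
    public static int[] distances(List<List<Integer>> adj, int src) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        Deque<Integer> queue = new ArrayDeque<>();
        dist[src] = 0;
        queue.add(src);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : adj.get(u)) {
                if (dist[v] == -1) {        // first visit fixes the BFS distance
                    dist[v] = dist[u] + 1;
                    queue.add(v);
                }
            }
        }
        return dist;
    }
}
```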
Format: PERF.SG.SEND_MSG_COUNT : sgid, superstep_num, sg_msg_count, broadcast_msg_count, total : messages sent per superstep.
*sendToAll increases broadcast_msg_count by 1. So use: total msg count = sg_msg_count + total_sg * broadcast_msg_count.
Issue: the sum of all messages sent in the ith superstep does not equal the sum of all messages received in the (i+1)th superstep. Typically, send_msg_count << recv_msg_count.
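The accounting above can be written out as a tiny helper (the MsgCount class name is hypothetical; the formula is the one stated in the log format):

```java
public class MsgCount {
    // Per the PERF.SG.SEND_MSG_COUNT format: each sendToAll bumps
    // broadcast_msg_count by 1, and every broadcast is delivered to
    // every subgraph, so:
    //   total = sg_msg_count + total_sg * broadcast_msg_count
    public static long totalMessages(long sgMsgCount, long broadcastMsgCount,
                                     long totalSubgraphs) {
        return sgMsgCount + totalSubgraphs * broadcastMsgCount;
    }
}
```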
Add helper methods long getRemoteVertexCount(), long getLocalEdgeCount(), long getBoundaryEdgeCount(), and long getBoundaryVertexCount(). Useful for pre-allocating collections.
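With such counts available, user code could pre-size collections to avoid rehashing; a sketch (the sizedMap helper is hypothetical, not GoFFish API) that sizes a HashMap for Java's default 0.75 load factor:

```java
import java.util.HashMap;

public class Prealloc {
    // Returns a HashMap whose initial capacity lets `count` entries fit
    // without triggering a resize under the default 0.75 load factor.
    public static <K, V> HashMap<K, V> sizedMap(long count) {
        int capacity = (int) Math.ceil(count / 0.75) + 1;
        return new HashMap<>(capacity);
    }
}
```

A subgraph algorithm would then call, e.g., `sizedMap(subgraph.getRemoteVertexCount())` before filling per-remote-vertex state.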
Add methods Iterable<IVertex> getBoundaryVertices() and Iterable<IEdge> getBoundaryEdges() to identify local vertices that have an out-edge to a remote vertex, and out-edges whose sink is a remote vertex.
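A self-contained sketch of the scan such a getBoundaryVertices() could perform, with plain Longs and collections standing in for GoFFish's vertex and edge types (all names here are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class Boundary {
    // A boundary vertex is a local vertex with at least one out-edge whose
    // sink lies outside the local vertex set (i.e., in another subgraph).
    public static Set<Long> boundaryVertices(Map<Long, List<Long>> outEdges,
                                             Set<Long> localVertices) {
        Set<Long> boundary = new TreeSet<>();
        for (Map.Entry<Long, List<Long>> e : outEdges.entrySet()) {
            for (long sink : e.getValue()) {
                if (!localVertices.contains(sink)) {
                    boundary.add(e.getKey());   // one remote sink is enough
                    break;
                }
            }
        }
        return boundary;
    }
}
```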
For Hama, the readers should be in a separate package, etc.
The current implementation scans through all remote vertices to find neighboring subgraphs. We need to maintain an internal adjacency list of subgraph IDs, or add an Iterator<S> getRemoteSubgraphs() API to ISubgraph, to make this efficient.
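The internal adjacency list could be built once when the partition is loaded, so later queries avoid rescanning every remote vertex; a sketch with a plain Map standing in for the remote-vertex table (names are illustrative, not GoFFish API):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RemoteSubgraphs {
    // Given the mapping of each remote vertex to its owning subgraph ID,
    // collect the distinct neighboring subgraph IDs once; the result can
    // be cached on the subgraph and served by getRemoteSubgraphs().
    public static Set<Long> neighborSubgraphs(Map<Long, Long> remoteVertexToSubgraph) {
        return new TreeSet<>(remoteVertexToSubgraph.values());
    }
}
```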
It is not clear where V getLocalState() is used in IRemoteVertex. Remove it to save pointer space?
Add support for Iterator<IEdge> getRemoteInEdges() on ISubgraph and Iterator<IEdge> getInEdges() on IVertex. This allows backward traversals, e.g., in GoDB.
The subgraph object can only be accessed after the compute class is instantiated, so a call to getSubgraph() inside the constructor of the compute class throws a NullPointerException.
Relevant code snippets:
Example where subgraph object might be used inside a constructor (full code):
/**
* Input has <num_clusters>,<max_edge_cuts>,<max_iters>
*
* @param initMsg
*/
public KMeans(String initMsg) {
String[] inp = initMsg.split(",");
SubgraphValue value = new SubgraphValue();
value.k = Integer.parseInt(inp[0]);
value.maxEdgeCrossing = Long.parseLong(inp[1]);
value.maxIterations = Integer.parseInt(inp[2]);
// NOTE: The subgraph is not yet set here, so this line would throw a NullPointerException:
// getSubgraph().setSubgraphValue(value);
}
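Until the framework sets the subgraph before construction, one workaround is to stash the constructor arguments and apply them on the first compute() call, when the subgraph is guaranteed to be available. A stand-alone sketch of the pattern (class and field names are illustrative, not GoFFish API; an int field stands in for setSubgraphValue):

```java
public class DeferredInit {
    // Parsed in the constructor, applied lazily.
    private final int k;
    private boolean initialized = false;
    private int appliedK = -1;   // stands in for getSubgraph().setSubgraphValue(...)

    public DeferredInit(String initMsg) {
        // Safe: only parse arguments here; touch no framework state.
        this.k = Integer.parseInt(initMsg.split(",")[0]);
    }

    public void compute() {
        if (!initialized) {      // first superstep: subgraph is now set
            appliedK = k;
            initialized = true;
        }
    }

    public int appliedK() { return appliedK; }
}
```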
Framework code snippet that creates the SubgraphCompute object and assigns the subgraph object to it (full code):
/*
* Creating SubgraphCompute objects
*/
for (ISubgraph<S, V, E, I, J, K> subgraph : partition.getSubgraphs()) {
Class<? extends AbstractSubgraphComputation<S, V, E, M, I, J, K>> subgraphComputeClass;
subgraphComputeClass = (Class<? extends AbstractSubgraphComputation<S, V, E, M, I, J, K>>) conf
.getClass(GraphJob.SUBGRAPH_COMPUTE_CLASS_ATTR, null);
if (subgraphComputeClass == null)
throw new RuntimeException("Could not load subgraph compute class");
AbstractSubgraphComputation<S, V, E, M, I, J, K> abstractSubgraphComputeRunner;
if (initialValue != null) {
Object[] params = { initialValue };
abstractSubgraphComputeRunner = ReflectionUtils.newInstance(subgraphComputeClass, params);
} else {
abstractSubgraphComputeRunner = ReflectionUtils.newInstance(subgraphComputeClass);
}
// FIXME: Subgraph value is not available to user in the subgraph-compute's constructor,
// since it is added only after the object is created (using setSubgraph).
SubgraphCompute<S, V, E, M, I, J, K> subgraphComputeRunner = new SubgraphCompute<S, V, E, M, I, J, K>();
subgraphComputeRunner.setAbstractSubgraphCompute(abstractSubgraphComputeRunner);
abstractSubgraphComputeRunner.setSubgraphPlatformCompute(subgraphComputeRunner);
subgraphComputeRunner.setSubgraph(subgraph);
subgraphComputeRunner.init(this);
subgraphs.add(subgraphComputeRunner);
}
The JSON input produced by HAMA fastgen is of the format
[srcid,0,[[sinkid1,edgevalue1],[sinkid2,edgevalue2],...]]
Example: [99,0,[[32,1995],[17,1809],[2,969],[50,1278],[25,321],[28,390],...]]
while the format required by LongTextJSONReader is of the form
[srcid,partitionid,srcvalue,[[sinkid1,edgeid1,edgevalue1],[sinkid2,edgeid2,edgevalue2],...]]
What should be done to convert between the two?
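One possible bridge is a small line-by-line converter that fills in defaults for the missing fields and generates sequential edge IDs. This is a hypothetical sketch, not part of GoFFish: the FastGenConverter class, the choice of partition ID 0, source value 0, and per-line sequential edge IDs are all assumptions.

```java
public class FastGenConverter {
    // Converts one HAMA fastgen line:  [srcid,0,[[sink,edgeval],...]]
    // to LongTextJSONReader format:    [srcid,partitionid,srcvalue,[[sink,edgeid,edgeval],...]]
    // Assumed defaults: partition ID 0, source value 0, and edge IDs
    // assigned sequentially starting at firstEdgeId.
    public static String convert(String line, long firstEdgeId) {
        String s = line.trim();
        s = s.substring(1, s.length() - 1);                 // strip outer [ ]
        int listStart = s.indexOf("[[");
        String srcId = s.substring(0, listStart).split(",")[0].trim();
        String edges = s.substring(listStart + 2, s.length() - 2); // inner pairs
        StringBuilder out = new StringBuilder();
        out.append('[').append(srcId).append(",0,0,[");     // default partition + value
        long edgeId = firstEdgeId;
        String[] pairs = edges.split("\\],\\[");
        for (int i = 0; i < pairs.length; i++) {
            String[] p = pairs[i].split(",");
            if (i > 0) out.append(',');
            out.append('[').append(p[0].trim()).append(',')
               .append(edgeId++).append(',').append(p[1].trim()).append(']');
        }
        out.append("]]");
        return out.toString();
    }
}
```

For globally unique edge IDs, the caller would carry the counter across lines instead of restarting it per vertex.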
Platform: Windows 10
Java: Oracle JDK 8
Maven: 3.6
I followed the instructions from https://github.com/dream-lab/goffish_v3/tree/master/giraph
and I'm getting the error below:
[ERROR] Failed to execute goal org.sonatype.plugins:munge-maven-plugin:1.0:munge (munge) on project goffish-giraph: Execution munge of goal org.sonatype.plugins:munge-maven-plugin:1.0:munge failed: basedir D:\work\GoFFish\giraph\goffish-giraph\src\test\java does not exist -> [Help 1]
Add API support for Iterable<IEdge> getLocalOutEdges() and Iterable<IEdge> getRemoteOutEdges() to ISubgraph, similar to the methods for vertices.
goffish_v3/hama/v3.1/src/main/java/in/dream_lab/goffish/hama/GraphJobRunner.java
Use a single synchronized section around the eventual
private void sendMessage(String peerName, Message<K, M> message)
rather than synchronizing each sendMessage call.
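A self-contained sketch of the coarse-grained locking suggested: one synchronized method guards the shared outgoing buffer, instead of a synchronized block at every call site. Plain Strings stand in for the peer and Message<K, M> types:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MessageSender {
    // Shared buffer of outgoing messages, keyed by destination peer.
    private final Map<String, List<String>> outgoing = new HashMap<>();

    // The single synchronized entry point: callers need no locking of
    // their own, and the buffer is mutated under one monitor.
    public synchronized void sendMessage(String peerName, String message) {
        outgoing.computeIfAbsent(peerName, p -> new ArrayList<>()).add(message);
    }

    public synchronized int queuedFor(String peerName) {
        return outgoing.getOrDefault(peerName, Collections.emptyList()).size();
    }
}
```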
Many files have incorrect license terms. The code is licensed under the Apache License, Version 2.0, and copyright is held by DREAM:Lab, IISc. It is NOT "Licensed to the Apache Software Foundation (ASF)".
this.subgraphID = (K) new LongWritable(); // FIXME: This won't work for non-long subgraph IDs!
Support pointer-based access to the sink vertex in IEdge via a method IVertex getSinkVertex(). This will avoid an additional lookup when traversing through the local subgraph.
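A sketch of what such a pointer would enable: the edge holds a direct reference to its sink, so a traversal follows references instead of doing an ID-to-vertex map lookup per hop. The Vertex/Edge stand-ins and the walk helper are illustrative, not GoFFish API:

```java
import java.util.Map;

public class EdgePointer {
    static class Vertex {
        final long id;
        Vertex(long id) { this.id = id; }
    }

    static class Edge {
        final Vertex sink;                 // pointer to the sink, not just its ID
        Edge(Vertex sink) { this.sink = sink; }
        Vertex getSinkVertex() { return sink; }
    }

    // Follow the first out-edge for `hops` steps; each step dereferences
    // the sink pointer directly instead of looking the ID up in a map.
    public static long walk(Vertex start, Map<Long, Edge> firstOutEdge, int hops) {
        Vertex v = start;
        for (int i = 0; i < hops; i++) {
            Edge e = firstOutEdge.get(v.id);
            if (e == null) break;
            v = e.getSinkVertex();
        }
        return v.id;
    }
}
```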
Add support for lambda expressions in some API functions (e.g., getSubgraph(), getVertices()).
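What lambda-friendly helpers might look like, layered over the existing vertex iteration; the method names forEachVertex and countVertices are hypothetical, and plain Longs stand in for vertex objects:

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

public class LambdaApi {
    // Applies `action` to every vertex; callers pass a lambda instead of
    // writing the iteration loop themselves.
    public static void forEachVertex(Iterable<Long> vertices, Consumer<Long> action) {
        for (long v : vertices) action.accept(v);
    }

    // Counts the vertices matching a predicate, e.g. boundary tests.
    public static long countVertices(Iterable<Long> vertices, Predicate<Long> test) {
        long n = 0;
        for (long v : vertices) if (test.test(v)) n++;
        return n;
    }
}
```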
mvn –Phadoop_yarn –Dhadoop.version=2.7.2 -DskipTests clean package -s settings.xml
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Apache Giraph Parent
[INFO] Apache Giraph Core
[INFO] Apache Giraph Blocks Framework
[INFO] Apache Giraph Examples
[INFO] Apache Giraph Accumulo I/O
[INFO] Apache Giraph HCatalog I/O
[INFO] Apache Giraph Gora I/O
[INFO] Apache Giraph Distribution
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Giraph Parent 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Giraph Parent ............................... FAILURE [ 0.003 s]
[INFO] Apache Giraph Core ................................. SKIPPED
[INFO] Apache Giraph Blocks Framework ..................... SKIPPED
[INFO] Apache Giraph Examples ............................. SKIPPED
[INFO] Apache Giraph Accumulo I/O ......................... SKIPPED
[INFO] Apache Giraph HCatalog I/O ......................... SKIPPED
[INFO] Apache Giraph Gora I/O ............................. SKIPPED
[INFO] Apache Giraph Distribution ......................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.576 s
[INFO] Finished at: 2018-04-13T16:41:32+05:30
[INFO] Final Memory: 9M/109M
[INFO] ------------------------------------------------------------------------
[ERROR] Unknown lifecycle phase "–Phadoop_yarn". You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>.
Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/LifecyclePhaseNotFoundException
Note: the "–" in "–Phadoop_yarn" and "–Dhadoop.version" is an en-dash rather than the ASCII hyphen "-", so Maven does not parse "-P" as the profile flag and instead treats "–Phadoop_yarn" as a lifecycle phase. Retyping the command with plain hyphens fixes this error.