TinkerPop / TINKERPOP-1309

Memory output in HadoopGraph is too strongly tied to MapReduce and should be generalized.


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0-incubating
    • Fix Version/s: None
    • Component/s: hadoop, process
    • Labels:

      Description

The Memory object is not being written to disk in SparkGraphComputer unless it is being updated within a MapReduce job. That is no bueno. We should really have the computed Memory written as such:

      hdfs.ls("output")
      ==>~g
      ==>~memory
      

      Moreover, ~g should be ~graph but that is a different story...

      Then:

      hdfs.ls("output/~memory")
      ==>gremlin.traversalVertexProgram.haltedTraversals
      ==>a
      ==>x
      

Note that every GraphComputer job yields a ComputerResult, which is basically a Pair<Graph,Memory>. The Graph reference denotes the adjacency list of vertices; if there are HALTED_TRAVERSERS, they reside on those vertices. This is a distributed representation. Next, the Memory reference denotes data that is no longer "attached to the graph" – maps, counts, sums, etc. In general, reduction barriers. This data is not tied to any one vertex anymore and thus exists at the "master traversal" via Memory. Thus, "graph is distributed/workers" and "memory is local/master." We need to make sure that the Memory data is serialized to disk appropriately for HadoopGraph-based implementations...
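The proposed layout could be realized by persisting each Memory key under output/~memory/<key>. A minimal sketch in plain Java (the MemoryWriter class and its methods are hypothetical, not TinkerPop API; a real implementation would go through Hadoop's FileSystem abstraction and TinkerPop's registered serializers rather than java.nio and Java serialization):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical sketch: persist each Memory key/value pair as its own
// file under <output>/~memory, mirroring the layout proposed above.
public class MemoryWriter {

    public static void write(Path output, Map<String, Object> memory) throws IOException {
        Path dir = output.resolve("~memory");
        Files.createDirectories(dir);
        for (Map.Entry<String, Object> e : memory.entrySet()) {
            // one file per memory key (e.g. "a", "x",
            // "gremlin.traversalVertexProgram.haltedTraversals")
            try (ObjectOutputStream out = new ObjectOutputStream(
                    Files.newOutputStream(dir.resolve(e.getKey())))) {
                out.writeObject(e.getValue()); // values must be Serializable
            }
        }
    }

    public static Object read(Path output, String key) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(output.resolve("~memory").resolve(key)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("output");
        write(out, Map.of("a", 1L, "x", "done"));
        System.out.println(read(out, "a")); // prints 1
        System.out.println(read(out, "x")); // prints done
    }
}
```

With this layout, listing the output directory would show ~g alongside ~memory, and listing ~memory would show one entry per memory key, matching the hdfs.ls output sketched above.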


              People

• Assignee:
  Unassigned
• Reporter:
  okram Marko A. Rodriguez
• Votes:
  0
• Watchers:
  1

                Dates

• Created:
• Updated: