TinkerPop / TINKERPOP-1309

Memory output in HadoopGraph is too strongly tied to MapReduce and should be generalized.


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0-incubating
    • Fix Version/s: None
    • Component/s: hadoop, process
    • Labels:

      Description

The Memory object is not being written to disk in SparkGraphComputer unless it is being updated within a MapReduce job. That is no bueno. We should really have the computed Memory written as such:

      hdfs.ls("output")
      ==>~g
      ==>~memory
      

      Moreover, ~g should be ~graph but that is a different story...

      Then:

      hdfs.ls("output/~memory")
      ==>gremlin.traversalVertexProgram.haltedTraversals
      ==>a
      ==>x
      

Note that every GraphComputer job yields a ComputerResult, which is basically a Pair<Graph,Memory>. The Graph reference denotes the adjacency list of vertices; if there are HALTED_TRAVERSERS, they reside on those vertices. This is a distributed representation. Next, the Memory reference denotes data that is no longer "attached to the graph" – maps, counts, sums, etc. In general, reduction barriers. This data is not tied to any one vertex anymore and thus exists at the "master traversal" via Memory. Thus, "graph is distributed/workers" and "memory is local/master." We need to make sure that the Memory data is serialized to disk appropriately for HadoopGraph-based implementations...
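The proposed layout could be realized by persisting each Memory key under output/~memory/<key>. A minimal sketch in plain Java (the MemoryWriter class and its methods are hypothetical, not TinkerPop API; a real implementation would go through Hadoop's FileSystem abstraction and TinkerPop's registered serializers rather than java.nio and Java serialization):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical sketch: persist each Memory key/value pair as its own
// file under <output>/~memory, mirroring the layout proposed above.
public class MemoryWriter {

    public static void write(Path output, Map<String, Object> memory) throws IOException {
        Path dir = output.resolve("~memory");
        Files.createDirectories(dir);
        for (Map.Entry<String, Object> e : memory.entrySet()) {
            // one file per memory key (e.g. "a", "x",
            // "gremlin.traversalVertexProgram.haltedTraversals")
            try (ObjectOutputStream out = new ObjectOutputStream(
                    Files.newOutputStream(dir.resolve(e.getKey())))) {
                out.writeObject(e.getValue()); // values must be Serializable
            }
        }
    }

    public static Object read(Path output, String key) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(output.resolve("~memory").resolve(key)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("output");
        write(out, Map.of("a", 1L, "x", "done"));
        System.out.println(read(out, "a")); // prints 1
        System.out.println(read(out, "x")); // prints done
    }
}
```

With this layout, listing the output directory would show ~g alongside ~memory, and listing ~memory would show one entry per memory key, matching the hdfs.ls output sketched above.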


              People

• Assignee:
  Unassigned
• Reporter:
  okram Marko A. Rodriguez
• Votes:
  0
• Watchers:
  1

                Dates

• Created:
• Updated: