Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Done
Description
The map stage takes a Vertex object as its argument, but the execution framework of the MapReduce job does not know which data from that Vertex is needed during the map phase. For execution frameworks that already hold all of the data in memory (e.g. Giraph, TinkerGraph), this makes no difference, but for execution frameworks that must pull, load, or read the data from somewhere (e.g. Hadoop, Fulgora), it can lead to a lot of wasted I/O and time.
For instance, consider a MapReduce job that follows a PageRank vertex program in order to compute some aggregate of the computed PageRank values. In that case, no edges or other vertex properties are needed, just the single PageRank property, yet a Hadoop-based implementation would still have to read the entire graph from HDFS instead of just that one value (see the sketch below).
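A minimal sketch of such an aggregate, assuming current Apache TinkerPop package names; the class name and the "pageRank" property key are placeholders for whatever the preceding vertex program actually wrote:

{code:java}
import java.util.Iterator;

import org.apache.tinkerpop.gremlin.process.computer.KeyValue;
import org.apache.tinkerpop.gremlin.process.computer.util.StaticMapReduce;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Sums a previously computed PageRank value over all vertices.
public final class PageRankSumMapReduce
        extends StaticMapReduce<Boolean, Double, Boolean, Double, Double> {

    @Override
    public boolean doStage(final Stage stage) {
        return !stage.equals(Stage.COMBINE); // MAP and REDUCE only
    }

    @Override
    public void map(final Vertex vertex, final MapEmitter<Boolean, Double> emitter) {
        // Only this one property is touched, yet the framework has no way
        // of knowing that and must materialize the full vertex.
        emitter.emit(true, vertex.<Double>value("pageRank"));
    }

    @Override
    public void reduce(final Boolean key, final Iterator<Double> values,
                       final ReduceEmitter<Boolean, Double> emitter) {
        double sum = 0d;
        while (values.hasNext())
            sum += values.next();
        emitter.emit(key, sum);
    }

    @Override
    public Double generateFinalResult(final Iterator<KeyValue<Boolean, Double>> keyValues) {
        return keyValues.hasNext() ? keyValues.next().getValue() : 0d;
    }

    @Override
    public String getMemoryKey() {
        return "pageRankSum";
    }
}
{code}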
Similar to VertexProgram, MapReduce should have a method that returns incident traversals specifying the data needed by the map stage; a sketch of what that could look like follows.
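One possible shape for the proposal; the interface name, method name, and return type below are illustrative assumptions, not part of the actual MapReduce API, and the comparison to MessageScope.Local's incident traversals follows the VertexProgram precedent mentioned above:

{code:java}
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;

import org.apache.tinkerpop.gremlin.process.computer.MapReduce;
import org.apache.tinkerpop.gremlin.process.traversal.Traversal;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Hypothetical extension of the MapReduce contract.
public interface DataAwareMapReduce<MK, MV, RK, RV, R>
        extends MapReduce<MK, MV, RK, RV, R> {

    // Traversals rooted at the vertex describing the data the map stage
    // will access, mirroring the incident traversals a VertexProgram
    // declares via MessageScope.Local. Returning an empty set keeps
    // today's behavior of loading the full vertex.
    public default Set<Supplier<Traversal<Vertex, ?>>> getMapRequirements() {
        return Collections.emptySet();
    }
}
{code}

With something like this in place, the PageRank aggregate above could declare a single traversal such as __.values("pageRank"), allowing a Hadoop or Fulgora based implementation to read only that one property instead of the entire vertex with all of its edges.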