Details
-
New Feature
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
Sprint 42, Sprint 43, Sprint 44, Sprint 45, Sprint 46
Description
Right now cube building is by layer of spanning trees. The algorithm results a total shuffle size around [Avg Cardinality] * [Total Cube Size]. This is the current biggest bottleneck of cube building in eBay deployment.
Propose a different algorithm:
1. Each mapper builds a cube segment independent, and output.
2. One round of shuffle merge sorts the segments.
3. Reducer outputs the final merged cube.
This could achieve 1 * [Total Cube Size] shuffling when there's a mandatory dimension and each mapper takes a different piece on the dimension. E.g. month is mandatory and each mapper is assign a different month data.
This algorithm is also more friendly to streaming.