Details
Improvement
Status: Closed
Major
Resolution: Fixed
0.5.0
None
None
Reviewed
Description
Tez inherits the Writable framework from MapReduce.
This is very flexible, but not particularly memory-efficient for small data types.
When deserializing, each key and value has to be allocated afresh for every small chunk of data (new IntWritable instead of .set() on an existing instance).
BytesWritable serialization always has to write a 4-byte length prefix for every key and value, because it is read back through a streamed .readFields() rather than a custom setter/getter implementation.
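To make the prefix cost concrete, here is a minimal sketch using only the Java standard library. It does not call any Hadoop or Tez code: the class and method names are hypothetical, and `writePrefixed` only imitates the length-then-bytes framing described above. For a 4-byte payload, the length prefix doubles the serialized size.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Hadoop or Tez.
final class LengthPrefixOverhead {

    // Length-prefixed framing: a 4-byte int length followed by the payload,
    // so a streaming reader knows how many bytes to consume.
    public static byte[] writePrefixed(byte[] payload) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Fixed-width encoding: the reader already knows an int is 4 bytes,
    // so no length prefix is needed.
    public static byte[] writeFixedInt(int value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fourBytes = {0, 0, 0, 42};
        // prints "prefixed: 8 bytes"
        System.out.println("prefixed: " + writePrefixed(fourBytes).length + " bytes");
        // prints "fixed: 4 bytes"
        System.out.println("fixed: " + writeFixedInt(42).length + " bytes");
    }
}
```

The relative overhead shrinks as payloads grow, which is why it matters mostly for the small data types mentioned above.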
Implement a faster serialization mechanism for the inner loop of sort, spill, and merge that does not trigger GC and avoids adding avoidable per-record overhead to the IFile format.
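As an illustration of the allocation-free style being asked for, the sketch below reads a stream of fixed-width ints into a single reused holder instead of creating a new object per record. It is not the mechanism implemented for this issue: `IntHolder` is a hypothetical stand-in for a Writable such as IntWritable, and the encoding is plain `DataOutputStream` output rather than IFile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Hadoop or Tez.
final class ReusableKeyDemo {

    // Mutable holder standing in for a Writable such as IntWritable.
    static final class IntHolder {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    // Encodes each int as 4 big-endian bytes with no per-record prefix.
    public static byte[] encode(int... values) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Sums every record, refilling one holder via set() instead of
    // allocating a new holder inside the loop.
    public static long sumWithReuse(byte[] data) {
        IntHolder key = new IntHolder(); // allocated once, outside the loop
        long sum = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int records = data.length / Integer.BYTES;
            for (int i = 0; i < records; i++) {
                key.set(in.readInt());
                sum += key.get();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        // prints "10"
        System.out.println(sumWithReuse(encode(1, 2, 3, 4)));
    }
}
```

The loop allocates nothing per record, so its garbage production does not grow with the number of keys, which is the property the inner loop of sort, spill, and merge needs.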