Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Currently, the bucketing techniques are fairly expensive: the bucketing keys
have to be the same as the reduction keys, and bucketization requires
a full-blown map-reduce job.
It should be possible to perform map-side bucketization. The high-level idea is
to shard the data based on the number of buckets and create a sub-directory for each
bucket. The data from all the mappers that lands in the same sub-directory can then be merged.
So, instead of having one file per bucket, this would lead to one directory per bucket.
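
The proposal above could be sketched roughly as follows. This is only an illustration in plain Python, not Hive code: the function names (`bucket_for`, `map_side_bucketize`, `read_bucket`), the `bucket_NNNNN` / `mapper_NNNNN.tsv` naming, the tab-separated row format, and the toy hash are all assumptions made for the example. Each simulated mapper shards its own rows by bucket and writes one file into that bucket's sub-directory, with no reduce phase; a reader then treats the whole sub-directory as the bucket.

```python
import os
import tempfile
from collections import defaultdict


def bucket_for(key, num_buckets):
    # Toy deterministic hash (assumption); a real engine would use its own.
    return sum(key.encode("utf-8")) % num_buckets


def map_side_bucketize(mapper_id, rows, num_buckets, out_dir):
    """One mapper: shard (key, value) rows into per-bucket sub-directories."""
    by_bucket = defaultdict(list)
    for key, value in rows:
        by_bucket[bucket_for(key, num_buckets)].append((key, value))
    for bucket, bucket_rows in by_bucket.items():
        # Hypothetical layout: one sub-directory per bucket, one file per mapper.
        bucket_dir = os.path.join(out_dir, f"bucket_{bucket:05d}")
        os.makedirs(bucket_dir, exist_ok=True)
        path = os.path.join(bucket_dir, f"mapper_{mapper_id:05d}.tsv")
        with open(path, "w", encoding="utf-8") as f:
            for k, v in bucket_rows:
                f.write(f"{k}\t{v}\n")


def read_bucket(out_dir, bucket):
    """Read a bucket by concatenating every mapper file in its sub-directory."""
    bucket_dir = os.path.join(out_dir, f"bucket_{bucket:05d}")
    rows = []
    if not os.path.isdir(bucket_dir):
        return rows
    for name in sorted(os.listdir(bucket_dir)):
        with open(os.path.join(bucket_dir, name), encoding="utf-8") as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t")
                rows.append((k, v))
    return rows


if __name__ == "__main__":
    out = tempfile.mkdtemp()
    # Two independent "mappers" write into the same bucket sub-directories.
    map_side_bucketize(0, [("a", "1"), ("b", "2")], 2, out)
    map_side_bucketize(1, [("a", "3"), ("c", "4")], 2, out)
    for b in range(2):
        print(b, read_bucket(out, b))
```

In this sketch the "merge" is simply reading all files in a bucket's sub-directory together; physically compacting them into one file would be an optional later step.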