Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
HiveReduceFunction is backed by Hive's ExecReducer, whose reduce function takes input in the form of <key, value list>. However, HiveReduceFunction's input is an iterator over a set of <key, value> pairs. To reuse Hive's ExecReducer, we need to "stage and cluster" the input rows by key, and then feed each <key, value list> to ExecReducer's reduce method. There are several problems with the current approach:
1. Unbounded memory usage.
2. Memory inefficiency: the input has to be cached until all of it is consumed.
3. This functionality seems generic enough to belong in Spark itself.
Thus, we'd like to check:
1. Whether Spark can provide a different version of PairFlatMapFunction, where the input to the call method is an iterator over tuples of <key, iterator<value>>. Something like this:
public Iterable<Tuple2<BytesWritable, BytesWritable>> call(Iterator<Tuple2<BytesWritable, Iterator<BytesWritable>>> it);
2. If the above effort fails, we need to enhance our row clustering mechanism so that it has bounded memory usage and is able to spill to disk if needed.
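For illustration, the clustering step above can be sketched as a streaming group-by over a key-sorted iterator: only one key's values are buffered at a time, so memory is bounded by the largest single group rather than the whole input. This is a hypothetical sketch (class and method names are illustrative, not Spark or Hive API), assuming the input iterator is already sorted by key, as it would be after a shuffle:

```java
import java.util.*;

// Hypothetical sketch: groups consecutive entries of a key-sorted
// iterator of <key, value> pairs into <key, value list> groups.
// Only the current group's values are held in memory at once.
public class GroupByKeyIterator {
    public static <K, V> Iterator<Map.Entry<K, List<V>>> groupSorted(
            Iterator<Map.Entry<K, V>> sorted) {
        return new Iterator<Map.Entry<K, List<V>>>() {
            // First entry of the next group, carried over between calls.
            private Map.Entry<K, V> pending = sorted.hasNext() ? sorted.next() : null;

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Map.Entry<K, List<V>> next() {
                K key = pending.getKey();
                List<V> values = new ArrayList<>();
                values.add(pending.getValue());
                pending = null;
                // Consume entries until the key changes or input ends.
                while (sorted.hasNext()) {
                    Map.Entry<K, V> e = sorted.next();
                    if (e.getKey().equals(key)) {
                        values.add(e.getValue());
                    } else {
                        pending = e; // first entry of the next group
                        break;
                    }
                }
                return new AbstractMap.SimpleEntry<>(key, values);
            }
        };
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = Arrays.asList(
            new AbstractMap.SimpleEntry<>("a", 1),
            new AbstractMap.SimpleEntry<>("a", 2),
            new AbstractMap.SimpleEntry<>("b", 3));
        Iterator<Map.Entry<String, List<Integer>>> groups = groupSorted(input.iterator());
        while (groups.hasNext()) {
            Map.Entry<String, List<Integer>> g = groups.next();
            System.out.println(g.getKey() + " -> " + g.getValue());
        }
        // prints:
        // a -> [1, 2]
        // b -> [3]
    }
}
```

Note this still materializes one key's full value list, so a very large single group could exhaust memory; that is exactly the case where the spilling enhancement in point 2 would be needed.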
Issue Links
- relates to HIVE-7526 Research to use groupby transformation to replace Hive existing partitionByKey and SparkCollector combination (Resolved)