Details
-
Sub-task
-
Status: Open
-
Major
-
Resolution: Unresolved
-
spark-branch
-
None
Description
Spark Partitioner#getPartition does not take 'value' as an argument.
In practice, most MR Partitioner#getPartition implementations will only use the key and ignore the value. But not all.
In a Spark Partitioner, if the user wants to use the value, then value can made a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value extracted from the key in getPartition.
One option is to add 2 extra transformations when custom partitioners are used for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle step (and extract value from key inside getPartition) and then transform it back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate value (values tend to be large) for all cases, regardless of whether value is used in getPartition. We could address this by only doing this if some configuration is set (enabled by default, since null as a value is a legitimate case which the Partitioner may be handling).