[PIG-4575] Pass value to MR Partitioners in Spark engine - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: spark-branch
Fix Version/s: spark-branch
Component/s: spark
Labels:
None

Description

Spark Partitioner#getPartition does not take 'value' as an argument.

In practice, most MR Partitioner#getPartition implementations will only use the key and ignore the value. But not all.

In a Spark Partitioner, if the user wants to use the value, then value can made a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value extracted from the key in getPartition.

One option is to add 2 extra transformations when custom partitioners are used for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle step (and extract value from key inside getPartition) and then transform it back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate value (values tend to be large) for all cases, regardless of whether value is used in getPartition. We could address this by only doing this if some configuration is set (enabled by default, since null as a value is a legitimate case which the Partitioner may be handling).

Attachments

Activity

People

Assignee:: Mohit Sabharwal

Reporter:: Mohit Sabharwal

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 26/May/15 23:53

Updated:: 26/May/15 23:53