Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4575

Pass value to MR Partitioners in Spark engine

Add voteVotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments


    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: spark-branch
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels:


      Spark Partitioner#getPartition does not take 'value' as an argument.

      In practice, most MR Partitioner#getPartition implementations will only use the key and ignore the value. But not all.

      In a Spark Partitioner, if the user wants to use the value, then value can made a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value extracted from the key in getPartition.

      One option is to add 2 extra transformations when custom partitioners are used for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle step (and extract value from key inside getPartition) and then transform it back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate value (values tend to be large) for all cases, regardless of whether value is used in getPartition. We could address this by only doing this if some configuration is set (enabled by default, since null as a value is a legitimate case which the Partitioner may be handling).




            • Assignee:
              mohitsabharwal Mohit Sabharwal
              mohitsabharwal Mohit Sabharwal


              • Created:

                Issue deployment