Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4575

Pass value to MR Partitioners in Spark engine

    XMLWordPrintableJSON

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: spark-branch
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels:
      None

      Description

      Spark Partitioner#getPartition does not take 'value' as an argument.

      In practice, most MR Partitioner#getPartition implementations will only use the key and ignore the value. But not all.

      In a Spark Partitioner, if the user wants to use the value, then value can made a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value extracted from the key in getPartition.

      One option is to add 2 extra transformations when custom partitioners are used for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle step (and extract value from key inside getPartition) and then transform it back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate value (values tend to be large) for all cases, regardless of whether value is used in getPartition. We could address this by only doing this if some configuration is set (enabled by default, since null as a value is a legitimate case which the Partitioner may be handling).

        Attachments

          Activity

            People

            • Assignee:
              mohitsabharwal Mohit Sabharwal
              Reporter:
              mohitsabharwal Mohit Sabharwal
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: