Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4575

Pass value to MR Partitioners in Spark engine

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • spark-branch
    • spark-branch
    • spark
    • None

    Description

      Spark Partitioner#getPartition does not take 'value' as an argument.

      In practice, most MR Partitioner#getPartition implementations will only use the key and ignore the value. But not all.

      In a Spark Partitioner, if the user wants to use the value, then value can made a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value extracted from the key in getPartition.

      One option is to add 2 extra transformations when custom partitioners are used for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle step (and extract value from key inside getPartition) and then transform it back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate value (values tend to be large) for all cases, regardless of whether value is used in getPartition. We could address this by only doing this if some configuration is set (enabled by default, since null as a value is a legitimate case which the Partitioner may be handling).

      Attachments

        Activity

          People

            mohitsabharwal Mohit Sabharwal
            mohitsabharwal Mohit Sabharwal
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: