Spark / SPARK-2978

Provide an MR-style shuffle transformation

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: Spark Core
    • Labels: None

    Description

      For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that

      • groups by key: provides (Key, Iterator[Value])
      • within each partition, provides keys in sorted order
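
The two properties above can be sketched outside Spark with plain Python collections. This is only an illustration of the semantics, not Spark's implementation: the function name `mr_style_shuffle`, the hash-modulo partitioning, and the use of lists in place of iterators are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def mr_style_shuffle(pairs, num_partitions):
    """Hash-partition (key, value) pairs, then sort and group each partition by key."""
    # Route every pair to a partition by key hash (assumed partitioning scheme).
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))

    result = []
    for part in partitions:
        # Sort by key only; the sort is stable, so values keep their arrival order.
        part.sort(key=itemgetter(0))
        # Adjacent equal keys collapse into one (key, values) group.
        grouped = [(k, [v for _, v in grp]) for k, grp in groupby(part, key=itemgetter(0))]
        result.append(grouped)
    return result


pairs = [(3, "c"), (1, "a"), (4, "d"), (1, "b"), (2, "e")]
print(mr_style_shuffle(pairs, 2))
# [[(2, ['e']), (4, ['d'])], [(1, ['a', 'b']), (3, ['c'])]]
```

Integer keys are used so that `hash` is deterministic across runs. Each inner list is one partition: every key appears once with all of its values, and keys are sorted within the partition but not across partitions.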

      A couple of ways it could make sense to expose this:

      • Add a new operator. "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
      • Allow groupByKey to take an ordering param for keys within a partition
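
As a rough sketch of the second option, the function below groups by key and accepts an optional ordering for keys within each partition. It is plain Python, not Spark code: the name `group_by_key`, the `key_ordering` parameter (modelled here as a sort-key function), and the hash-modulo partitioning are hypothetical and chosen only to show the shape of such an API.

```python
from collections import defaultdict


def group_by_key(pairs, num_partitions, key_ordering=None):
    """Group values by key per partition; optionally order keys within each partition."""
    # One key -> values map per partition (assumed hash-modulo partitioning).
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)

    result = []
    for groups in partitions:
        keys = list(groups)  # first-seen order; no ordering is promised here
        if key_ordering is not None:
            # Hypothetical ordering parameter: sort keys within this partition.
            keys.sort(key=key_ordering)
        result.append([(k, groups[k]) for k in keys])
    return result


pairs = [(3, "c"), (1, "a"), (4, "d"), (1, "b"), (2, "e")]
print(group_by_key(pairs, 2))
# [[(4, ['d']), (2, ['e'])], [(3, ['c']), (1, ['a', 'b'])]]
print(group_by_key(pairs, 2, key_ordering=lambda k: k))
# [[(2, ['e']), (4, ['d'])], [(1, ['a', 'b']), (3, ['c'])]]
```

Making the ordering optional would leave existing callers unaffected, while a separate operator (the first option) would make the sorted-within-partition guarantee explicit in the name.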


            People

              Assignee: Sandy Ryza (sandyr)
              Reporter: Sandy Ryza (sandyr)
              Votes: 0
              Watchers: 13
