Spark / SPARK-12394

Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

  In many cases users know ahead of time the columns that they will be joining or aggregating on. Ideally, they should be able to leverage this information and pre-shuffle the data so that subsequent queries do not require a shuffle. Hive supports this functionality by allowing the user to define buckets, which hash-partition the data based on some key.

      • Allow the user to specify a set of columns when caching or writing out data
      • Allow the user to specify the parallelism
      • Shuffle the data when writing/caching so that it is distributed by these columns
      • When planning/executing a query, use this distribution to avoid another shuffle when reading, assuming the join or aggregation is compatible with the specified columns
      • Should work with existing save modes: append, overwrite, etc.
      • Should work at least with all Hadoop FS data sources
      • Should work with any data source when caching
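The workflow described above can be sketched with the `DataFrameWriter` bucketing API that shipped in Spark 2.0 (`bucketBy`/`sortBy`/`saveAsTable`); the table and column names here are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object BucketedWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport() // bucketed tables are persisted via the metastore
      .getOrCreate()

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")

    // Pre-shuffle on "key" into 8 buckets at write time; a later join or
    // aggregation on "key" can then avoid a shuffle on the read side.
    df.write
      .bucketBy(8, "key")
      .sortBy("key")
      .mode("overwrite")          // works with the existing save modes
      .saveAsTable("bucketed_events")

    spark.stop()
  }
}
```

Note that bucketed output must go through `saveAsTable` rather than a plain `save`, since the bucketing metadata lives in the table catalog.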

          People

            Assignee: Nong Li (nongli)
            Reporter: Reynold Xin (rxin)
            Votes: 1
            Watchers: 21
