Hive / HIVE-7292 Hive on Spark / HIVE-7334

Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing [Spark Branch]

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: None
    • Labels: None

      Description

      Please refer to the design spec.

        Activity

        Xuefu Zhang created issue -
        Xuefu Zhang made changes -
        Assignee: Rui Li [ lirui ]
        Rui Li made changes -
        Attachment HIVE-7334.patch [ 12658562 ]
        Rui Li added a comment -

        Just some initial groundwork. Submitted for review.

        Xuefu Zhang added a comment -

        Rui Li, thanks for the patch. I took a brief look and found that you might need to rebase your patch against the latest branch. At the top level, here is the plan for sortBy, groupBy, and HiveReduceFunction. Also, please note that there is some overlap between your work and [~robustchao]'s HIVE-7526. I'd like to make this clear so that we don't step on each other's toes.

        1. We will use groupBy unless sorting is required. For this, we need to change the HiveReduceFunction API. (Chao)
        2. Since sortBy and groupBy generate data sets of different types, we will need to cluster the rows coming from sortBy so that they match the input of HiveReduceFunction. We will create a subclass of SparkTran for row clustering. The clustering should be simpler than the existing one in HiveReduceFunction because we can assume that the keys are ordered. Thus, we simply accumulate consecutive rows with the same key. (Chao)
        3. We have ShuffleTran for shuffling. Currently it only uses partitionByKey(). We will change it to groupBy. (Chao)
        4. We will add logic in SparkCompiler/SparkPlanGenerator to determine which shuffle to use: either groupBy + ReduceTran or sortBy + RowClusteringTran + ReduceTran. (Rui)
        5. Make sure Hive's order by, sort by, distribute by, and cluster by work. (Rui)
        6. It seems that we don't need partitionByKey.
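
The row clustering described in item 2 can be illustrated with a small, language-neutral sketch. This is not the Hive patch itself (which is Java): the function name `cluster_sorted_rows` and the use of plain `(key, value)` tuples are assumptions made only for illustration. It shows why ordered keys make clustering simple: a new group starts exactly when the key changes, so only one group needs to be held at a time.

```python
from typing import Any, Hashable, Iterable, Iterator, List, Tuple


def cluster_sorted_rows(
    rows: Iterable[Tuple[Hashable, Any]]
) -> Iterator[Tuple[Hashable, List[Any]]]:
    """Turn key-ordered (key, value) pairs into (key, [values]) groups.

    Hypothetical helper: assumes the input is already ordered by key,
    as the output of a sort-based shuffle would be.
    """
    started = False
    current_key = None
    bucket: List[Any] = []
    for key, value in rows:
        if started and key != current_key:
            # Key changed: the previous group is complete.
            yield current_key, bucket
            bucket = []  # fresh list, so the yielded group is not mutated
        current_key = key
        started = True
        bucket.append(value)
    if started:
        # Emit the final group once the input is exhausted.
        yield current_key, bucket
```

The output has the same shape as a group-by result (one key with all of its values), which is what lets a single reduce function consume either shuffle.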

        Please work together with Chao to move this forward.

        In addition, I'd like you to find out what it takes to support the shuffling required for Hive's reduce-side join. If anything is missing in Spark, please create the corresponding JIRAs.

        Let me know if you have any questions.

        Rui Li added a comment -

        Thanks Xuefu Zhang, this is much clearer.

        Reynold Xin added a comment -

        BTW definitely look at https://github.com/apache/spark/pull/1499

        Xuefu Zhang added a comment -

        Rui Li, please feel free to create smaller JIRAs to enable sorting in Hive on Spark. Here are some ideas:

        1. Complete SortByShuffler
        2. Add logic in SparkCompiler to generate SparkEdgeProperty with the right sorting property.
        3. Add logic in SparkPlanGenerator to generate a plan with the right shuffle type.
        4. Test Hive's sorting-related queries to make sure they work. File JIRAs for any problems found.

        Also, please take a look at the link Reynold Xin pointed out above to see if we can benefit in any way.
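
The plan-generation decision discussed in this thread (groupBy + ReduceTran unless sorting is required, otherwise sortBy + RowClusteringTran + ReduceTran) can be sketched as follows. This is only an illustration under stated assumptions: the function `choose_shuffle_pipeline` and the boolean `requires_sort` are hypothetical stand-ins for whatever sorting property the edge carries, and the strings merely name the stages mentioned in the comments; none of this is the actual SparkPlanGenerator code.

```python
from typing import List


def choose_shuffle_pipeline(requires_sort: bool) -> List[str]:
    """Return the ordered stage names for one shuffle edge.

    Hypothetical sketch: `requires_sort` stands in for the sorting
    property that the compiler would record on the edge.
    """
    if requires_sort:
        # Sorted output is a flat, key-ordered stream, so rows must be
        # clustered by key before they reach the reduce stage.
        return ["sortBy", "RowClusteringTran", "ReduceTran"]
    # No ordering needed: grouping already yields one key with its values.
    return ["groupBy", "ReduceTran"]
```

Keeping the choice in one place means the reduce stage sees the same input shape regardless of which shuffle was picked.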

        Brock Noland made changes -
        Summary changed from "Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing" to "Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing [Spark Branch]"
        Brock Noland added a comment -

        Rui Li, as per your comments, I am resolving this since HIVE-7528 is resolved. If anyone disagrees, please re-open.

        Brock Noland made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Fix Version/s: spark-branch [ 12327352 ]
        Resolution: Fixed [ 1 ]
        Transition: Open → Resolved
        Time In Source Status: 47d 8h 25m
        Execution Times: 1
        Last Executer: Brock Noland
        Last Execution Date: 19/Aug/14 04:28

          People

          • Assignee: Rui Li
          • Reporter: Xuefu Zhang
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:
              Resolved:
