Details

      Description

      Spark, an open-source data analytics cluster computing framework, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone, yet to take advantage of Hive they still need either MapReduce or Tez on their cluster. This initiative will give those users a new alternative so that they can consolidate their backend.

      Second, providing such an alternative further increases Hive's adoption by exposing Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

      Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience much as Tez does.
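      To illustrate the consolidation argument, Hive already lets users pick an execution engine per session via the hive.execution.engine property (mr and tez today); the sketch below assumes spark becomes a third accepted value once this work lands, and the query and table name are purely illustrative:

      ```sql
      -- Existing behavior: choose MapReduce or Tez for the current session.
      set hive.execution.engine=tez;

      -- Assumed addition from this initiative: run the same query on Spark
      -- without changing the SQL itself.
      set hive.execution.engine=spark;
      SELECT dept, COUNT(*) FROM employees GROUP BY dept;  -- hypothetical table
      ```

      Because the engine choice is a session-level setting rather than a query rewrite, users who standardize on Spark can switch their backend without touching existing HiveQL.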

      This is an umbrella JIRA that will cover many upcoming subtasks. A design document will be attached here shortly and will also be published on the wiki. Feedback from the community is greatly appreciated!


          Issue Links

          1. Refactoring: make Hive reduce side data processing reusable [Spark Branch] (Sub-task, Reopened, Xuefu Zhang)
          2. StarterProject: Move configuration from SparkClient to HiveConf [Spark Branch] (Sub-task, Open, Unassigned)
          3. Research Hive dependency on MR distributed cache [Spark Branch] (Sub-task, Open, Unassigned)
          4. UT: add TestSparkMinimrCliDriver to run UTs that use HDFS (Sub-task, Open, Thomas Friedrich)
          5. UT: fix bucket_num_reducers test (Sub-task, Open, Chinna Rao Lalam)
          6. UTs: create missing output files for some tests under clientpositive/spark (Sub-task, Open, Thomas Friedrich)
          7. UT: fix hook_context_cs test case (Sub-task, Open, Unassigned)
          8. Downgrade guava version to be consistent with Hive and the rest of Hadoop [Spark Branch] (Sub-task, Open, Unassigned)
          9. Clean up temp files of RSC [Spark Branch] (Sub-task, Open, Unassigned)
          10. Choosing right preference between map join and bucket map join [Spark Branch] (Sub-task, Open, Unassigned)
          11. Error when cleaning up in spark.log [Spark Branch] (Sub-task, Open, Unassigned)
          12. Support backup task for join related optimization [Spark Branch] (Sub-task, Patch Available, Chao Sun)
          13. UT: set hive.support.concurrency to true for spark UTs (Sub-task, Open, Unassigned)
          14. Cleanup code for getting spark job progress and metrics (Sub-task, Open, Rui Li)
          15. Improve replication factor of small table file given big table partitions [Spark Branch] (Sub-task, Open, Jimmy Xiang)
          16. thrift.transport.TTransportException [Spark Branch] (Sub-task, Open, Chao Sun)
          17. Hive reported exception because Hive's derby version conflicts with Spark's derby version [Spark Branch] (Sub-task, Patch Available, Pierre Yin)
          18. Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test [Spark Branch] (Sub-task, Open, Unassigned)
          19. SparkSessionImpl calculates wrong cores number in TestSparkCliDriver [Spark Branch] (Sub-task, Open, Unassigned)
          20. HiveInputFormat implementations' getSplits may lead to memory leak [Spark Branch] (Sub-task, Open, Unassigned)
          21. Provide more informative stage description in Spark Web UI [Spark Branch] (Sub-task, Open, Unassigned)
          22. Improve common join performance [Spark Branch] (Sub-task, Open, Unassigned)
          23. Implement Hybrid Hybrid Grace Hash Join for Spark Branch [Spark Branch] (Sub-task, Open, Unassigned)
          24. Fix test failures after last merge from trunk [Spark Branch] (Sub-task, Open, Unassigned)
          25. Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch] (Sub-task, Open, GaoLun)
          26. Make HIVE-10001 work with Spark [Spark Branch] (Sub-task, Open, Unassigned)
          27. Combine equivalent leaf works in SparkWork [Spark Branch] (Sub-task, Open, Chengxiang Li)
          28. ClassNotFoundException occurs during query case with group by and UDF defined [Spark Branch] (Sub-task, Open, Chengxiang Li)
          29. NullPointerException thrown by Executors causes job that can't be finished (Sub-task, Open, Unassigned)


              People

              • Assignee: Xuefu Zhang (xuefuz)
              • Reporter: Xuefu Zhang (xuefuz)
              • Votes: 49
              • Watchers: 227
