Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone, yet to take advantage of Hive they still need either MapReduce or Tez on their cluster. This initiative will provide users a new alternative, allowing them to consolidate their backend.

      Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

      Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience as Tez does.

      This is an umbrella JIRA that will cover many coming subtasks. The design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated!
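      For context, Hive already selects its execution backend through the existing hive.execution.engine property (which today accepts mr and tez); a sketch of how the Spark alternative described above would presumably be enabled per-session:

```sql
-- Sketch: select Spark as Hive's execution backend for this session.
-- hive.execution.engine is Hive's existing engine switch (mr, tez);
-- this work adds "spark" as a third accepted value.
set hive.execution.engine=spark;
```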

      1. Hive-on-Spark.pdf (290 kB, attached by Xuefu Zhang)

        Issue Links

        1. Refactoring: make Hive reduce side data processing reusable [Spark Branch] Sub-task Reopened Xuefu Zhang
        2. StarterProject: Move configuration from SparkClient to HiveConf [Spark Branch] Sub-task Open Unassigned
        3. Research Hive dependency on MR distributed cache [Spark Branch] Sub-task Open Unassigned
        4. UT: add TestSparkMinimrCliDriver to run UTs that use HDFS Sub-task Open Thomas Friedrich
        5. UT: fix bucket_num_reducers test Sub-task Open Chinna Rao Lalam
        6. UTs: create missing output files for some tests under clientpositive/spark Sub-task Open Thomas Friedrich
        7. UT: fix udf_context_aware Sub-task Open Aihua Xu
        8. UT: fix hook_context_cs test case Sub-task Open Unassigned
        9. Downgrade guava version to be consistent with Hive and the rest of Hadoop [Spark Branch] Sub-task Open Unassigned
        10. Clean up temp files of RSC [Spark Branch] Sub-task Open Unassigned
        11. Choosing right preference between map join and bucket map join [Spark Branch] Sub-task Open Unassigned
        12. Error when cleaning up in spark.log [Spark Branch] Sub-task Open Unassigned
        13. Support backup task for join related optimization [Spark Branch] Sub-task Patch Available Chao Sun
        14. UT: set to true for spark UTs Sub-task Open Unassigned
        15. Cleanup code for getting spark job progress and metrics Sub-task Open Rui Li
        16. Improve replication factor of small table file given big table partitions [Spark branch] Sub-task Open Jimmy Xiang
        17. thrift.transport.TTransportException [Spark Branch] Sub-task Open Chao Sun
        18. Hive reported exception because that hive's derby version conflict with spark's derby version [Spark Branch] Sub-task Patch Available Pierre Yin
        19. Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test. [Spark Branch] Sub-task Open Unassigned
        20. SparkSessionImpl calculates wrong cores number in TestSparkCliDriver [Spark Branch] Sub-task Open Unassigned
        21. Print yarn application id to console [Spark Branch] Sub-task Open Chinna Rao Lalam
        22. HiveInputFormat implementations' getSplits may lead to a memory leak [Spark Branch] Sub-task Open Unassigned
        23. Log the information of cached RDD [Spark Branch] Sub-task Patch Available Chinna Rao Lalam
        24. Provide more informative stage description in Spark Web UI [Spark Branch] Sub-task Open Unassigned
        25. Improve common join performance [Spark Branch] Sub-task Open Unassigned
        26. Implement Hybrid Hybrid Grace Hash Join for Spark Branch [Spark Branch] Sub-task Open Unassigned
        27. Fix test failures after last merge from trunk [Spark Branch] Sub-task Open Unassigned
        28. Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch] Sub-task Open GaoLun
        29. Make HIVE-10001 work with Spark [Spark Branch] Sub-task Open Unassigned
        30. Support hive.explain.user for Spark [Spark Branch] Sub-task Open Unassigned
        31. Combine equivalent leaf works in SparkWork [Spark Branch] Sub-task Open Chengxiang Li
        32. Merge master into spark 11/17/2015 [Spark Branch] Sub-task Patch Available Xuefu Zhang



            • Assignee:
              Xuefu Zhang
            • Votes:
              38
            • Watchers:
              196