Details

      Description

      Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, however, they still need either MapReduce or Tez on their cluster. This initiative provides users a new alternative so that they can consolidate their backend.

      Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

      Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience much as Tez does.

      This is an umbrella JIRA that will cover many coming subtasks. A design doc will be attached here shortly and will be on the wiki as well. Feedback from the community is greatly appreciated!
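
      As a rough sketch of what this alternative looks like from the user side (the property names hive.execution.engine and spark.master are real; the master value and the sample query are illustrative assumptions, not part of this proposal), switching a session to the Spark backend is just a configuration change:

      # Assumes Hive is built with Spark support and a Spark cluster is reachable;
      # yarn-cluster and the sample table are placeholders.
      hive -e "set hive.execution.engine=spark;
               set spark.master=yarn-cluster;
               select count(*) from src;"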

      1. Hive-on-Spark.pdf (290 kB) - Xuefu Zhang

        Issue Links

        1. Refactoring: make Hive reduce side data processing reusable [Spark Branch] Sub-task Reopened Xuefu Zhang
         
        2. StarterProject: Move configuration from SparkClient to HiveConf [Spark Branch] Sub-task Open Unassigned
         
        3. Research Hive dependency on MR distributed cache [Spark Branch] Sub-task Open Unassigned
         
        4. UT: add TestSparkMinimrCliDriver to run UTs that use HDFS Sub-task Open Thomas Friedrich
         
        5. UT: fix bucket_num_reducers test Sub-task Open Chinna Rao Lalam
         
        6. UTs: create missing output files for some tests under clientpositive/spark Sub-task Open Thomas Friedrich
         
        7. UT: add test flag in hive-site.xml for spark tests Sub-task Open Thomas Friedrich
         
        8. UT: fix udf_context_aware Sub-task Open Aihua Xu
         
        9. UT: fix hook_context_cs test case Sub-task Open Unassigned
         
        10. Hive/Spark/Yarn integration [Spark Branch] Sub-task Open Chengxiang Li
         
        11. Downgrade guava version to be consistent with Hive and the rest of Hadoop [Spark Branch] Sub-task Open Unassigned
         
        12. Clean up temp files of RSC [Spark Branch] Sub-task Open Unassigned
         
        13. Choosing right preference between map join and bucket map join [Spark Branch] Sub-task Open Unassigned
         
        14. Error when cleaning up in spark.log [Spark Branch] Sub-task Open Unassigned
         
        15. Support backup task for join related optimization [Spark Branch] Sub-task Patch Available Chao Sun
         
        16. Clean up GenSparkProcContext.clonedReduceSinks and related code [Spark Branch] Sub-task Patch Available Chao Sun
         
        17. UT: set hive.support.concurrency to true for spark UTs Sub-task Open Bing Li
         
        18. Cleanup code for getting spark job progress and metrics Sub-task Open Rui Li
         
        19. Improve replication factor of small table file given big table partitions [Spark branch] Sub-task Open Jimmy Xiang
         
        20. thrift.transport.TTransportException [Spark Branch] Sub-task Open Chao Sun
         
        21. Hive reported exception because that hive's derby version conflict with spark's derby version [Spark Branch] Sub-task Patch Available Pierre Yin
         
        22. Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test. [Spark Branch] Sub-task Open Unassigned
         
        23. SparkSessionImpl calculates wrong cores number in TestSparkCliDriver [Spark Branch] Sub-task Open Unassigned
         
        24. Print yarn application id to console [Spark Branch] Sub-task Open Chinna Rao Lalam
         
        25. Querying parquet tables fails with IllegalStateException [Spark Branch] Sub-task Open Unassigned
         
        26. HiveInputFormat implementations' getSplits may lead to memory leak [Spark Branch] Sub-task Open Unassigned
         
        27. Log the information of cached RDD [Spark Branch] Sub-task Patch Available Chinna Rao Lalam
         
        28. Provide more informative stage description in Spark Web UI [Spark Branch] Sub-task Open Unassigned
         
        29. Improve common join performance [Spark Branch] Sub-task Patch Available Unassigned
         
        30. Merge Spark branch to trunk 3/31/2015 Sub-task Open Unassigned
         
        31. Implement Hybrid Hybrid Grace Hash Join for Spark Branch [Spark Branch] Sub-task Open Unassigned
         
        32. Fix test failures after last merge from trunk [Spark Branch] Sub-task Open Unassigned
         
        33. Enable parallel order by for spark [Spark Branch] Sub-task Patch Available Rui Li
         
        34. Dynamic RDD caching optimization for HoS [Spark Branch] Sub-task Open Unassigned
         

          Activity

          wangmeng added a comment -

          This is a very valuable project!

          Szehon Ho added a comment -

          Adding a short Getting Started Guide.

          Paulo Motta added a comment -

          Is the branch already usable in production?

          Xuefu Zhang added a comment -

          Paulo Motta, thanks for your interest. I think the branch is ready for prospective users to try out, but for production I'd recommend waiting for a formal release.

          Kiran Lonikar added a comment -

          Sorry, I have not looked at the code, but I want to know how the RDD is structured. Is it columnar? I am specifically interested in how you preserve the columnar structure of ORC, RC, and Parquet files, since an RDD is by nature row-wise, and the SchemaRDD more specifically so.

          The Spark SQL component uses SchemaRDD, which is row-wise. Just to be clear, I am not reporting any problems with this JIRA; I am interested to know about the implementation.

          I think a columnar structure has its advantages, and that's what Hive vectorization did (https://issues.apache.org/jira/browse/HIVE-4160). The earlier SQL implementation, Shark, also had some kind of columnar structure. I am not sure whether Hive on Spark preserves it.

          yuemeng added a comment -

          I am very interested in Hive on Spark and tried to use it. When I built it (downloaded from https://github.com/apache/hive.git, spark branch) using Maven with the command mvn package -DskipTests -Phadoop-2 -Pdist, it gave me errors like:
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:[22,24] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          [ERROR] location: package org.apache.spark
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:[33,10] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          [ERROR] location: interface org.apache.hadoop.hive.ql.exec.spark.status.SparkJobStatus
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobMonitor.java:[31,24] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          Can you tell me why?

          Xuefu Zhang added a comment -

          yuemeng, you can try removing the org/apache/spark folder in your local Maven repo to see if that fixes it.
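
          As a minimal sketch of that suggestion (assuming the default local repository under ~/.m2; a custom settings.xml may place it elsewhere), the cached Spark artifacts can be dropped and the build re-run with the same command from the report above:

          # Remove possibly stale or corrupt cached Spark artifacts so Maven
          # re-downloads them, then rebuild the spark branch.
          rm -rf ~/.m2/repository/org/apache/spark
          mvn package -DskipTests -Phadoop-2 -Pdist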

          Xuefu Zhang added a comment -

          Bing Li, I assume you assigned this JIRA to yourself by mistake. However, let me know if you plan to work on this. Thanks.

          Peter Lin added a comment -

          Would love to use this in production. Is it going to be released in Hive 0.15?

          Xuefu Zhang added a comment - edited

          What was formerly 0.15 is now 1.1 and is going to be released soon. A release candidate is out.

          Peter Lin added a comment -

          Thanks Xuefu for the quick reply. I will give it a try next week.

          leftylev added a comment - edited

          Although this issue is still marked Unresolved, the Spark branch has been merged to trunk and is Resolved for the 1.1.0 release (HIVE-9257 and HIVE-9352). (Edit: Also HIVE-9448.)

          Lefty Leverenz added a comment -

          Doc note: See comments on HIVE-9257 and HIVE-9448 for documentation issues (the HIVE-9257 commit comment with doc notes, the HIVE-9448 doc comments, and the list of configuration parameters where documented).

          Ruslan Dautkhanov added a comment -

          Exciting. Hopefully it will be released some time soon.

          Xin Hao added a comment -

          OK, I see. Thanks for your info.


            People

            • Assignee: Xuefu Zhang
            • Reporter: Xuefu Zhang
            • Votes: 25
            • Watchers: 173
