Details

      Description

      Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will give those users a new alternative so that they can consolidate their backend.

      Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

      Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, thus improving the user experience, as Tez does.

      This is an umbrella JIRA that will cover many upcoming subtasks. The design doc will be attached here shortly and will be on the wiki as well. Feedback from the community is greatly appreciated!

      1. Hive-on-Spark.pdf
        290 kB
        Xuefu Zhang

        Issue Links

        1. Refactoring: make Hive reduce side data processing reusable [Spark Branch] Sub-task Reopened Xuefu Zhang
        2. Refactoring: make Hive map side data processing reusable [Spark Branch] Sub-task Resolved Xuefu Zhang
        3. Create SparkWork [Spark Branch] Sub-task Resolved Xuefu Zhang
        4. Create SparkTask [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        5. Create SparkCompiler [Spark Branch] Sub-task Resolved Xuefu Zhang
        6. Create SparkClient, interface to Spark cluster [Spark Branch] Sub-task Resolved Chengxiang Li
        7. Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch] Sub-task Resolved Rui Li
        8. Create SparkShuffler, shuffling data between map-side data processing and reduce-side processing [Spark Branch] Sub-task Resolved Rui Li
        9. Create SparkPlan, DAG representation of a Spark job [Spark Branch] Sub-task Resolved Xuefu Zhang
        10. Create MapFunction [Spark Branch] Sub-task Resolved Xuefu Zhang
        11. Create ReduceFunction [Spark Branch] Sub-task Resolved Xuefu Zhang
        12. Create SparkPlanGenerator [Spark Branch] Sub-task Resolved Xuefu Zhang
        13. Create a MiniSparkCluster and set up a testing framework [Spark Branch] Sub-task Resolved Rui Li
        14. Research into reduce-side join [Spark Branch] Sub-task Resolved Szehon Ho
        15. Spark 1.0.1 is released, stop using SNAPSHOT [Spark Branch] Sub-task Resolved Brock Noland
        16. Exclude hadoop 1 from spark dep [Spark Branch] Sub-task Resolved Brock Noland
        17. Load Spark configuration into Hive driver [Spark Branch] Sub-task Resolved Chengxiang Li
        18. Counters, statistics, and metrics [Spark Branch] Sub-task Resolved Chengxiang Li
        19. Spark job monitoring and error reporting [Spark Branch] Sub-task Resolved Chengxiang Li
        20. Implement pre-commit testing [Spark Branch] Sub-task Resolved Brock Noland
        21. Enhance SparkCollector [Spark Branch] Sub-task Resolved Venki Korukanti
        22. Enhance HiveReduceFunction's row clustering [Spark Branch] Sub-task Resolved Chao Sun
        23. Support Hive's multi-table insert query with Spark [Spark Branch] Sub-task Resolved Chao Sun
        24. Support order by and sort by on Spark [Spark Branch] Sub-task Resolved Rui Li
        25. Support cluster by and distributed by [Spark Branch] Sub-task Resolved Rui Li
        26. Support union all on Spark [Spark Branch] Sub-task Resolved Na Yang
        27. StarterProject: Move configuration from SparkClient to HiveConf [Spark Branch] Sub-task Open Unassigned
        28. StarterProject: Fix exception handling in POC code [Spark Branch] Sub-task Resolved Chao Sun
        29. StarterProject: Move from assert to Guava Preconditions.* in Hive on Spark [Spark Branch] Sub-task Resolved Chao Sun
        30. Make sure multi-MR queries work [Spark Branch] Sub-task Resolved Chao Sun
        31. Support dynamic partitioning [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        32. Instantiate SparkClient per user session [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        33. Support analyze table [Spark Branch] Sub-task Resolved Chengxiang Li
        34. Find solution for closures containing writables [Spark Branch] Sub-task Resolved Unassigned
        35. Support Hive TABLESAMPLE [Spark Branch] Sub-task Resolved Chengxiang Li
        36. Create TestSparkCliDriver to run test in spark local mode [Spark Branch] Sub-task Resolved Szehon Ho
        37. Update to Spark 1.2 [Spark Branch] Sub-task Resolved Brock Noland
        38. Implement native HiveMapFunction [Spark Branch] Sub-task Resolved Chengxiang Li
        39. Implement native HiveReduceFunction [Spark Branch] Sub-task Resolved Chengxiang Li
        40. Start running .q file tests on spark [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        41. Fix qtest-spark pom.xml reference to test properties [Spark Branch] Sub-task Resolved Brock Noland
        42. Create SparkReporter [Spark Branch] Sub-task Resolved Chengxiang Li
        43. Incorrect result returned when a map work has multiple downstream reduce works [Spark Branch] Sub-task Resolved Chao Sun
        44. TestSparkCliDriver should not use includeQueryFiles [Spark Branch] Sub-task Resolved Brock Noland
        45. Add .q tests coverage for "union all" [Spark Branch] Sub-task Resolved Na Yang
        46. Enable q-tests for TABLESAMPLE feature [Spark Branch] Sub-task Resolved Chengxiang Li
        47. Research to find out if it's possible to submit Spark jobs concurrently using shared SparkContext [Spark Branch] Sub-task Resolved Chao Sun
        48. Enable q-tests for ANALYZE TABLE feature [Spark Branch] Sub-task Resolved Na Yang
        49. Add qfile_regex to qtest-spark pom [Spark Branch] Sub-task Resolved Brock Noland
        50. Enable timestamp.* tests [Spark Branch] Sub-task Resolved Brock Noland
        51. Enable avro* tests [Spark Branch] Sub-task Resolved Brock Noland
        52. PTest2 separates test files with spaces while QTestGen uses commas [Spark Branch] Sub-task Resolved Brock Noland
        53. Cleanup Reduce operator code [Spark Branch] Sub-task Resolved Rui Li
        54. hive.optimize.union.remove does not work properly [Spark Branch] Sub-task Resolved Na Yang
        55. Integrate with Spark executor scaling [Spark Branch] Sub-task Resolved Chengxiang Li
        56. Research optimization of auto convert join to map join [Spark branch] Sub-task Resolved Suhas Satish
        57. Support windowing and analytic functions [Spark Branch] Sub-task Resolved Chengxiang Li
        58. Enable windowing and analytic function qtests [Spark Branch] Sub-task Resolved Chengxiang Li
        59. Union all query finished with errors [Spark Branch] Sub-task Resolved Rui Li
        60. Enable tests on Spark branch (1) [Sparch Branch] Sub-task Resolved Brock Noland
        61. Enable tests on Spark branch (2) [Sparch Branch] Sub-task Resolved Venki Korukanti
        62. Enable tests on Spark branch (3) [Sparch Branch] Sub-task Resolved Chengxiang Li
        63. Enable tests on Spark branch (4) [Sparch Branch] Sub-task Resolved Chinna Rao Lalam
        64. Enable map-join tests which Tez executes [Spark Branch] Sub-task Resolved Rui Li
        65. CounterStatsAggregator throws a class cast exception Sub-task Resolved Brock Noland
        66. union_null.q is not deterministic Sub-task Closed Brock Noland
        67. StarterProject: enable groupby4.q [Spark Branch] Sub-task Resolved Suhas Satish
        68. Research commented out unset in Utiltities [Spark Branch] Sub-task Resolved Unassigned
        69. Update union_null results now that it's deterministic [Spark Branch] Sub-task Resolved Brock Noland
        70. Refresh SparkContext when spark configuration changes [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        71. Enable reduce-side join tests (1) [Spark Branch] Sub-task Resolved Szehon Ho
        72. Merge from trunk (1) [Spark Branch] Sub-task Resolved Brock Noland
        73. Re-order spark.query.files in sorted order [Spark Branch] Sub-task Resolved Brock Noland
        74. Build long running HS2 test framework Sub-task Closed Suhas Satish
        75. Insert overwrite table query has strange behavior when set hive.optimize.union.remove=true [Spark Branch] Sub-task Resolved Na Yang
        76. Re-enable lazy HiveBaseFunctionResultList [Spark Branch] Sub-task Resolved Jimmy Xiang
        77. Enable qtest load_dyn_part1.q [Spark Branch] Sub-task Resolved Venki Korukanti
        78. orc_analyze.q fails due to random mapred.task.id in FileSinkOperator [Spark Branch] Sub-task Resolved Venki Korukanti
        79. optimize_nullscan.q fails due to differences in explain plan [Spark Branch] Sub-task Resolved Venki Korukanti
        80. Support multiple concurrent users Sub-task Resolved Chengxiang Li
        81. Support subquery [Spark Branch] Sub-task Resolved Xuefu Zhang
        82. enable Qtest scriptfile1.q [Spark Branch] Sub-task Resolved Chengxiang Li
        83. enable sample8.q.[Spark Branch] Sub-task Resolved Chengxiang Li
        84. enable sample10.q.[Spark Branch] Sub-task Resolved Chengxiang Li
        85. Insert overwrite table query does not generate correct task plan [Spark Branch] Sub-task Resolved Na Yang
        86. Research Hive dependency on MR distributed cache[Spark Branch] Sub-task Open Unassigned
        87. Merge from trunk (2) [Spark Branch] Sub-task Resolved Brock Noland
        88. Investigate query failures (1) Sub-task Resolved Thomas Friedrich
        89. Investigate query failures (2) Sub-task Resolved Thomas Friedrich
        90. Investigate query failures (3) Sub-task Resolved Thomas Friedrich
        91. Investigate query failures (4) Sub-task Resolved Thomas Friedrich
        92. Merge from trunk (3) [Spark Branch] Sub-task Resolved Brock Noland
        93. Use HiveKey instead of BytesWritable as key type of the pair RDD [Spark Branch] Sub-task Resolved Rui Li
        94. Fix TestSparkCliDriver => optimize_nullscan.q Sub-task Resolved Brock Noland
        95. Merge trunk into spark 9/12/2014 Sub-task Resolved Brock Noland
        96. Enable vectorization for spark [spark branch] Sub-task Resolved Chinna Rao Lalam
        97. Code cleanup after HIVE-8054 [Spark Branch] Sub-task Resolved Na Yang
        98. Disable hive.optimize.union.remove when hive.execution.engine=spark [Spark Branch] Sub-task Resolved Na Yang
        99. Remove obsolete code from SparkWork [Spark Branch] Sub-task Resolved Chao Sun
        100. Refactor the GraphTran code by moving union handling logic to UnionTran [Spark Branch] Sub-task Resolved Na Yang
        101. Support SMB Join for Hive on Spark [Spark Branch] Sub-task Resolved Szehon Ho
        102. Merge from trunk to spark 9/20/14 Sub-task Resolved Brock Noland
        103. clone SparkWork for join optimization Sub-task Resolved Unassigned
        104. GroupByShuffler.java missing apache license header [Spark Branch] Sub-task Resolved Chao Sun
        105. Merge from trunk to spark 9/29/14 Sub-task Resolved Xuefu Zhang
        106. Enable windowing.q for spark [Spark Branch] Sub-task Resolved Jimmy Xiang
        107. Merge trunk into spark 10/4/2015 [Spark Branch] Sub-task Resolved Brock Noland
        108. Fix fs_default_name2.q on spark [Spark Branch] Sub-task Resolved Brock Noland
        109. Investigate flaky test parallel.q Sub-task Resolved Jimmy Xiang
        110. TPCDS query #7 fails with IndexOutOfBoundsException [Spark Branch] Sub-task Resolved Jimmy Xiang
        111. Research Bucket Map Join [Spark Branch] Sub-task Resolved Na Yang
        112. Research on skewed join [Spark Branch] Sub-task Resolved Rui Li
        113. Make reduce side join work for all join queries [Spark Branch] Sub-task Resolved Xuefu Zhang
        114. Turn on all join .q tests [Spark Branch] Sub-task Resolved Chao Sun
        115. Print Spark job progress format info on the console[Spark Branch] Sub-task Resolved Chengxiang Li
        116. Support Hive Counter to collect spark job metric[Spark Branch] Sub-task Resolved Chengxiang Li
        117. Update timestamp in status console [Spark Branch] Sub-task Resolved Brock Noland
        118. TPC-DS Query 96 parallelism is not set correcly Sub-task Resolved Chao Sun
        119. Merge trunk into spark 10/17/14 [Spark Branch] Sub-task Resolved Brock Noland
        120. UT: add TestSparkMinimrCliDriver to run UTs that use HDFS Sub-task Open Thomas Friedrich
        121. UT: fix bucket_num_reducers test Sub-task Open Chinna Rao Lalam
        122. UTs: create missing output files for some tests under clientpositive/spark Sub-task Open Thomas Friedrich
        123. UT: add test flag in hive-site.xml for spark tests Sub-task Open Thomas Friedrich
        124. UT: fix rcfile_bigdata test [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        125. UT: fix bucketsort_insert tests - related to SMBMapJoinOperator Sub-task Resolved Chinna Rao Lalam
        126. UT: fix list_bucket_dml_2 test [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        127. Update async action in SparkClient as Spark add new Java action API[Spark Branch] Sub-task Resolved Chengxiang Li
        128. Add remote Spark client to Hive [Spark Branch] Sub-task Resolved Marcelo Vanzin
        129. Enable collect table statistics based on SparkCounter[Spark Branch] Sub-task Resolved Chengxiang Li
        130. HivePairFlatMapFunction.java missing license header [Spark Branch] Sub-task Resolved Chao Sun
        131. Add InterfaceAudience annotations to spark-client [Spark Branch] Sub-task Resolved Marcelo Vanzin
        132. convert joinOp to MapJoinOp and generate MapWorks only [Spark Branch] Sub-task Resolved Suhas Satish
        133. Implement bucket map join optimization [Spark Branch] Sub-task Resolved Jimmy Xiang
        134. Convert SMBJoin to MapJoin [Spark Branch] Sub-task Resolved Szehon Ho
        135. Support hints of SMBJoin [Spark Branch] Sub-task Resolved Szehon Ho
        136. Reduce Side Join with single reducer [Spark Branch] Sub-task Resolved Szehon Ho
        137. Enable parallelism in Reduce Side Join [Spark Branch] Sub-task Resolved Szehon Ho
        138. Increase level of parallelism in reduce phase [Spark Branch] Sub-task Resolved Jimmy Xiang
        139. Combine Hive Operator statistic and Spark Metric to an uniformed query statistic.[Spark Branch] Sub-task Resolved Chengxiang Li
        140. Result differences after merge [Spark Branch] Sub-task Resolved Brock Noland
        141. Fix tests after merge [Spark Branch] Sub-task Resolved Brock Noland
        142. Enable table statistic collection on counter for CTAS query[Spark Branch] Sub-task Resolved Chengxiang Li
        143. spark-client build failed sometimes.[Spark Branch] Sub-task Resolved Chengxiang Li
        144. Collect Spark TaskMetrics and build job statistic[Spark Branch] Sub-task Resolved Chengxiang Li
        145. Null Pointer Exception when counter is used for stats during inserting overwrite partitioned tables [Spark Branch] Sub-task Resolved Na Yang
        146. numRows and rawDataSize are not collected by the Spark stats [Spark Branch] Sub-task Resolved Na Yang
        147. Investigate test failures related to HIVE-8545 [Spark Branch] Sub-task Resolved Jimmy Xiang
        148. Fix hadoop-1 build [Spark Branch] Sub-task Resolved Jimmy Xiang
        149. Merge from trunk 11/6/14 [SPARK BRANCH] Sub-task Resolved Brock Noland
        150. Should only register used counters in SparkCounters[Spark Branch] Sub-task Resolved Chengxiang Li
        151. insert1.q and ppd_join4.q hangs with hadoop-1 [Spark Branch] Sub-task Resolved Chengxiang Li
        152. Create some tests that use Spark counter for stats collection [Spark Branch] Sub-task Resolved Chengxiang Li
        153. UT: update hive-site.xml for spark UTs to add hive_admin_user to admin role Sub-task Resolved Thomas Friedrich
        154. UT: fix partition test case [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        155. UT: fix udf_context_aware Sub-task Open Aihua Xu
        156. UT: fix hook_context_cs test case Sub-task Open Unassigned
        157. Switch precommit test from local to local-cluster [Spark Branch] Sub-task Resolved Szehon Ho
        158. Print prettier Spark work graph after HIVE-8793 [Spark Branch] Sub-task Resolved Jimmy Xiang
        159. Release RDD cache when Hive query is done [Spark Branch] Sub-task Resolved Jimmy Xiang
        160. Choose a persisent policy for RDD caching [Spark Branch] Sub-task Resolved Jimmy Xiang
        161. Hive/Spark/Yarn integration [Spark Branch] Sub-task Open Chengxiang Li
        162. Update new spark progress API for local submitted job monitoring [Spark Branch] Sub-task Resolved Rui Li
        163. Visualize generated Spark plan [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        164. Downgrade guava version to be consistent with Hive and the rest of Hadoop [Spark Branch] Sub-task Open Unassigned
        165. Fix test TestHiveKVResultCache [Spark Branch] Sub-task Resolved Jimmy Xiang
        166. Use MEMORY_AND_DISK for RDD caching [Spark Branch] Sub-task Resolved Jimmy Xiang
        167. Merge from trunk to spark [Spark Branch] Sub-task Resolved Brock Noland
        168. downgrade guava version for spark branch from 14.0.1 to 11.0.2.[Spark Branch] Sub-task Resolved Chengxiang Li
        169. Servlet classes signer information does not match [Spark branch] Sub-task Resolved Chengxiang Li
        170. IOContext problem with multiple MapWorks cloned for multi-insert [Spark Branch] Sub-task Resolved Xuefu Zhang
        171. Remove unnecessary dependency collection task [Spark Branch] Sub-task Resolved Rui Li
        172. Make sure Spark + HS2 work [Spark Branch] Sub-task Resolved Chengxiang Li
        173. Merge from trunk Nov 28 2014 Sub-task Resolved Brock Noland
        174. Find thread leak in RSC Tests [Spark Branch] Sub-task Resolved Rui Li
        175. Logging is not configured in spark-submit sub-process Sub-task Resolved Brock Noland
        176. SparkCounter display name is not set correctly[Spark Branch] Sub-task Resolved Chengxiang Li
        177. Clean up temp files of RSC [Spark Branch] Sub-task Open Unassigned
        178. Avoid using SPARK_JAVA_OPTS [Spark Branch] Sub-task Resolved Rui Li
        179. Re-enable remaining tests after HIVE-8970 [Spark Branch] Sub-task Resolved Chao Sun
        180. Enable ppd_join4 [Spark Branch] Sub-task Resolved Chao Sun
        181. Replace akka for remote spark client RPC [Spark Branch] Sub-task Resolved Marcelo Vanzin
        182. Spark Memory can be formatted string [Spark Branch] Sub-task Resolved Jimmy Xiang
        183. Support multiple mapjoin operators in one work [Spark Branch] Sub-task Resolved Jimmy Xiang
        184. HiveException: Conflict on row inspector for {table} Sub-task Resolved Jimmy Xiang
        185. Choosing right preference between map join and bucket map join [Spark Branch] Sub-task Open Unassigned
        186. Add additional logging to SetSparkReducerParallelism [Spark Branch] Sub-task Resolved Brock Noland
        187. Remove wrappers for SparkJobInfo and SparkStageInfo [Spark Branch] Sub-task Resolved Chengxiang Li
        188. NPE in RemoteSparkJobStatus.getSparkStatistics [Spark Branch] Sub-task Resolved Rui Li
        189. Generate better plan for queries containing both union and multi-insert [Spark Branch] Sub-task Resolved Chao Sun
        190. Allow RPC Configuration [Spark Branch] Sub-task Resolved Unassigned
        191. Hive should not submit second SparkTask while previous one has failed.[Spark Branch] Sub-task Resolved Chengxiang Li
        192. Hive hangs while failed to get executorCount[Spark Branch] Sub-task Resolved Chengxiang Li
        193. Skip child tasks if parent task failed [Spark Branch] Sub-task Resolved Unassigned
        194. Bucket mapjoin should use the new alias in posToAliasMap [Spark Branch] Sub-task Resolved Jimmy Xiang
        195. Investigate IOContext object initialization problem [Spark Branch] Sub-task Resolved Xuefu Zhang
        196. Spark Client RPC should have larger default max message size [Spark Branch] Sub-task Resolved Brock Noland
        197. Spark counter serialization error in spark.log [Spark Branch] Sub-task Resolved Chengxiang Li
        198. Error when cleaning up in spark.log [Spark Branch] Sub-task Open Unassigned
        199. TimeoutException when trying get executor count from RSC [Spark Branch] Sub-task Resolved Chengxiang Li
        200. Check cross product for conditional task [Spark Branch] Sub-task Resolved Rui Li
        201. infer_bucket_sort_convert_join.q and mapjoin_hook.q failed.[Spark Branch] Sub-task Resolved Xuefu Zhang
        202. bucket_map_join_spark4.q failed due to NPE.[Spark Branch] Sub-task Resolved Jimmy Xiang
        203. Support backup task for join related optimization [Spark Branch] Sub-task Patch Available Chao Sun
        204. windowing.q failed when mapred.reduce.tasks is set to larger than one Sub-task Resolved Chao Sun
        205. Add unit test for multi sessions.[Spark Branch] Sub-task Resolved Chengxiang Li
        206. Enable beeline query progress information for Spark job[Spark Branch] Sub-task Resolved Chengxiang Li
        207. RSC stdout is logged twice [Spark Branch] Sub-task Resolved Jimmy Xiang
        208. Clean up GenSparkProcContext.clonedReduceSinks and related code [Spark Branch] Sub-task Patch Available Chao Sun
        209. authorization_admin_almighty1.q fails with result diff [Spark Branch] Sub-task Resolved Unassigned
        210. Merge from trunk to spark 12/26/2014 [Spark Branch] Sub-task Resolved Brock Noland
        211. UT: set hive.support.concurrency to true for spark UTs Sub-task Open Bing Li
        212. UT: udf_in_file fails with filenotfoundexception [Spark Branch] Sub-task Resolved Chinna Rao Lalam
        213. Create a separate API for remote Spark Context RPC other than job submission [Spark Branch] Sub-task Resolved Marcelo Vanzin
        214. Add listeners on JobHandle so job status change can be notified to the client [Spark Branch] Sub-task Resolved Marcelo Vanzin
        215. TimeOutException when using RSC with beeline [Spark Branch] Sub-task Resolved Unassigned
        216. One-pass SMB Optimizations [Spark Branch] Sub-task Resolved Szehon Ho
        217. Choose Kryo as the serializer for pTest [Spark Branch] Sub-task Resolved Xuefu Zhang
        218. Test windowing.q is failing [Spark Branch] Sub-task Resolved Unassigned
        219. Add more log information for debug RSC[Spark Branch] Sub-task Resolved Chengxiang Li
        220. Spark branch compile failed on hadoop-1[Spark Branch] Sub-task Resolved Chengxiang Li
        221. Research on build mini HoS cluster on YARN for unit test[Spark Branch] Sub-task Resolved Chengxiang Li
        222. Remove authorization_admin_almighty1 from spark tests [Spark Branch] Sub-task Resolved Xuefu Zhang
        223. Investigate differences for auto join tests in explain after merge from trunk [Spark Branch] Sub-task Resolved Chao Sun
        224. Followup for HIVE-9125, update ppd_join4.q.out for Spark [Spark Branch] Sub-task Resolved Xuefu Zhang
        225. Remove tabs from spark code [Spark Branch] Sub-task Resolved Brock Noland
        226. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] Sub-task Resolved Rui Li
        227. Merge trunk to spark 1/5/2015 [Spark Branch] Sub-task Resolved Szehon Ho
        228. Merge from spark to trunk January 2015 Sub-task Resolved Szehon Ho
        229. Explain query should share the same Spark application with regular queries [Spark Branch] Sub-task Resolved Jimmy Xiang
        230. Ensure custom UDF works with Spark [Spark Branch] Sub-task Resolved Xuefu Zhang
        231. Code cleanup [Spark Branch] Sub-task Resolved Szehon Ho
        232. TODO cleanup task1.[Spark Branch] Sub-task Resolved Chengxiang Li
        233. Cleanup code for getting spark job progress and metrics Sub-task Open Rui Li
        234. Improve replication factor of small table file given big table partitions [Spark branch] Sub-task Open Jimmy Xiang
        235. Set default miniClusterType back to none in QTestUtil.[Spark branch] Sub-task Resolved Chengxiang Li
        236. Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] Sub-task Resolved Xuefu Zhang
        237. thrift.transport.TTransportException [Spark Branch] Sub-task Open Chao Sun
        238. Cleanup Modified Files [Spark Branch] Sub-task Resolved Szehon Ho
        239. Merge from trunk to spark 1/8/2015 Sub-task Resolved Szehon Ho
        240. BaseProtocol.Error failed to deserialization due to NPE.[Spark Branch] Sub-task Resolved Chengxiang Li
        241. Address review items on HIVE-9257 [Spark Branch] Sub-task Resolved Brock Noland
        242. Optimize split grouping for CombineHiveInputFormat [Spark Branch] Sub-task Resolved Jimmy Xiang
        243. Address review of HIVE-9257 (ii) [Spark Branch] Sub-task Resolved Szehon Ho
        244. Fix windowing.q for Spark on trunk Sub-task Resolved Rui Li
        245. Merge from spark to trunk (follow-up of HIVE-9257) Sub-task Resolved Szehon Ho
        246. SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch] Sub-task Resolved Chengxiang Li
        247. Fix tests with some versions of Spark + Snappy [Spark Branch] Sub-task Resolved Brock Noland
        248. add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode [Spark Branch] Sub-task Resolved Pierre Yin
        249. Shutting down cli takes quite some time [Spark Branch] Sub-task Resolved Rui Li
        250. Make WAIT_SUBMISSION_TIMEOUT configuable and check timeout in SparkJobMonitor level.[Spark Branch] Sub-task Resolved Chengxiang Li
        251.
        Avoid ser/de loggers as logging framework can be incompatible on driver and workers Sub-task Resolved Rui Li
         
        252.
        ClassNotFoundException occurs during hive query case execution with UDF defined [Spark Branch] Sub-task Resolved Chengxiang Li
         
        253.
        Add jar/file doesn't work with yarn-cluster mode [Spark Branch] Sub-task Resolved Rui Li
         
        254.
        Merge trunk to spark 1/21/2015 Sub-task Resolved Szehon Ho
         
        255.
        Move more hive.spark.* configurations to HiveConf [Spark Branch] Sub-task Resolved Szehon Ho
         
        256.
        LocalSparkJobStatus may return failed job as successful [Spark Branch] Sub-task Resolved Rui Li
         
        257.
        Push YARN configuration to Spark while deply Spark on YARN[Spark Branch] Sub-task Resolved Chengxiang Li
         
        258.
        MapJoin task shouldn't start if HashTableSink task failed [Spark Branch] Sub-task Resolved Unassigned
         
        259.
        No error thrown when global limit optimization failed to find enough number of rows [Spark Branch] Sub-task Resolved Rui Li
         
        260.
        Make Remote Spark Context secure [Spark Branch] Sub-task Resolved Marcelo Vanzin
         
        261.
        Failed job may not throw exceptions [Spark Branch] Sub-task Resolved Rui Li
         
        262.
        Enable CBO related tests [Spark Branch] Sub-task Closed Chinna Rao Lalam
         
        263.
        UNION ALL query failed with ArrayIndexOutOfBoundsException [Spark Branch] Sub-task Resolved Chao Sun
         
        264. Hive reported exception because that hive's derby version conflict with spark's derby version [Spark Branch] Sub-task Patch Available Pierre Yin
         
        265. Enable infer_bucket_sort_dyn_part.q for TestMiniSparkOnYarnCliDriver test. [Spark Branch] Sub-task Open Unassigned
         
        266. SparkSessionImpl calculates wrong number of cores in TestSparkCliDriver [Spark Branch] Sub-task Open Unassigned
         
        267.
        Merge trunk to Spark branch 2/2/2015 [Spark Branch] Sub-task Resolved Xuefu Zhang
         
        268.
        SHUFFLE_SORT should only be used for order by query [Spark Branch] Sub-task Closed Rui Li
         
        269.
        Revert changes in two test configuration files accidentally brought in by HIVE-9552 [Spark Branch] Sub-task Resolved Xuefu Zhang
         
        270.
        Enable more unit tests for UNION ALL [Spark Branch] Sub-task Closed Chao Sun
         
        271.
        Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch] Sub-task Resolved Jimmy Xiang
         
        272.
        'Error while trying to create table container' occurs during hive query case execution when hive.optimize.skewjoin set to 'true' [Spark Branch] Sub-task Closed Rui Li
         
        273.
        Improve some qtests Sub-task Closed Rui Li
         
        274.
        Support Impersonation [Spark Branch] Sub-task Closed Brock Noland
         
        275.
        Address RB comments for HIVE-9425 [Spark Branch] Sub-task Closed Unassigned
         
        276.
        Hive on Spark is not as aggressive as MR on map join [Spark Branch] Sub-task Resolved Unassigned
         
        277.
        Merge trunk to Spark branch 2/15/2015 [Spark Branch] Sub-task Closed Xuefu Zhang
         
        278.
        Upgrade to spark 1.3 [Spark Branch] Sub-task Closed Brock Noland
         
        279. Print yarn application id to console [Spark Branch] Sub-task Open Chinna Rao Lalam
         
        280.
        Utilize spark.kryo.classesToRegister [Spark Branch] Sub-task Closed Jimmy Xiang
         
        281.
        java.lang.NoSuchMethodError occurs during execution of a Hive query that contains an 'ADD FILE XXXX.jar' statement Sub-task Resolved Unassigned
         
        282.
        Merge trunk to Spark branch 02/27/2015 [Spark Branch] Sub-task Closed Xuefu Zhang
         
        283.
        Load spark-defaults.conf from classpath [Spark Branch] Sub-task Closed Brock Noland
         
        284. Querying parquet tables fails with IllegalStateException [Spark Branch] Sub-task Open Unassigned
         
        285.
        Print spark job id in history file [spark branch] Sub-task Closed Chinna Rao Lalam
         
        286.
        Add jar/file doesn't work with yarn-cluster mode [Spark Branch] Sub-task Closed Rui Li
         
        287.
        Merge trunk to Spark branch 3/6/2015 [Spark Branch] Sub-task Closed Xuefu Zhang
         
        288.
        New Beeline queries will hang if Beeline terminates improperly [Spark Branch] Sub-task Closed Jimmy Xiang
         
        289.
        Avoid Utilities.getMapRedWork for spark [Spark Branch] Sub-task Closed Rui Li
         
        290.
        RSC has memory leak while executing multiple queries [Spark Branch] Sub-task Closed Chengxiang Li
         
        291. getSplits in HiveInputFormat implementations may lead to memory leak [Spark Branch] Sub-task Open Unassigned
         
        292. Log the information of cached RDD [Spark Branch] Sub-task Patch Available Chinna Rao Lalam
         
        293. Provide more informative stage description in Spark Web UI [Spark Branch] Sub-task Open Unassigned
         
        294. Improve common join performance [Spark Branch] Sub-task Patch Available Unassigned
         
        295.
        Merge trunk to Spark branch 03/27/2015 [Spark Branch] Sub-task Closed Xuefu Zhang
         
        296.
        Fix test failures after HIVE-10130 [Spark Branch] Sub-task Closed Chao Sun
         
        297. Merge Spark branch to trunk 3/31/2015 Sub-task Open Unassigned
         
        298. Implement Hybrid Hybrid Grace Hash Join for Spark Branch [Spark Branch] Sub-task Open Unassigned
         
        299.
        Hive on Spark job configuration needs to be logged [Spark Branch] Sub-task Closed Szehon Ho
         
        300.
        ParseException issue (Failed to recognize predicate 'user') [Spark Branch] Sub-task Resolved Unassigned
         
        301.
        Merge trunk to spark 4/14/2015 [Spark Branch] Sub-task Resolved Szehon Ho
         
        302. Fix test failures after last merge from trunk [Spark Branch] Sub-task Open Unassigned
         
        303.
        Merge spark to trunk 4/15/2015 Sub-task Closed Szehon Ho
         
        304.
        Cancel connection when remote Spark driver process has failed [Spark Branch] Sub-task Resolved Chao Sun
         
        305.
        Enable parallel order by for spark [Spark Branch] Sub-task Resolved Rui Li
         
        306.
        Hive query should fail when it fails to initialize a session in SetSparkReducerParallelism [Spark Branch] Sub-task Resolved Chao Sun
         
        307.
        NPE in SparkUtilities::isDedicatedCluster [Spark Branch] Sub-task Resolved Rui Li
         
        308.
        Dynamic RDD caching optimization for HoS [Spark Branch] Sub-task Resolved Chengxiang Li
         
        309.
        Combine equivalent Works for HoS [Spark Branch] Sub-task Resolved Chengxiang Li
         
        310. Followup for HIVE-10550, check performance w.r.t. persistence level [Spark Branch] Sub-task Open Chengxiang Li
         
        311. Make HIVE-10001 work with Spark [Spark Branch] Sub-task Open Unassigned
         
        312.
        Make HIVE-10568 work with Spark [Spark Branch] Sub-task Resolved Rui Li
         
        313. Merge trunk to Spark branch 5/28/2015 [Spark Branch] Sub-task Patch Available Deepesh Khandelwal
         
        314.
        Merge master to Spark branch 6/7/2015 [Spark Branch] Sub-task Resolved Unassigned
         
        315.
        HoS can't control number of map tasks for runtime skew join [Spark Branch] Sub-task Resolved Rui Li
         
        316.
        Upgrade Spark dependency to 1.4 [Spark Branch] Sub-task Resolved Rui Li
         
        317.
        Hive not able to pass Hive's Kerberos credential to spark-submit process [Spark Branch] Sub-task Resolved Unassigned
         
        318.
        Enable more tests for grouping by skewed data [Spark Branch] Sub-task Resolved Mohit Sabharwal
         
        319. Add more tests for HIVE-10844 [Spark Branch] Sub-task Open GaoLun
         
        320.
        Remote Spark client doesn't use Kerberos keytab to authenticate [Spark Branch] Sub-task Resolved Xuefu Zhang
         
        321.
        Merge master to Spark branch 6/20/2015 [Spark Branch] Sub-task Resolved Xuefu Zhang
         
        322. Support multi edge between nodes in SparkPlan [Spark Branch] Sub-task Open Unassigned
         
        323. Investigate intermittent failure of join28.q for Spark Sub-task Open Mohit Sabharwal
         
        324.
        Add support for running negative q-tests [Spark Branch] Sub-task Resolved Mohit Sabharwal
         
        325.
        HashTableSinkOperator doesn't support vectorization [Spark Branch] Sub-task Resolved Rui Li
         
        326. Support hive.explain.user for Spark [Spark Branch] Sub-task Open Unassigned
         
        327.
        Query fails when there isn't a comparator for an operator [Spark Branch] Sub-task Resolved Rui Li
         

          Activity

          Xuefu Zhang added a comment -

          Please move the discussion to user@hive.apache.com. This JIRA tracks project progress and is not for troubleshooting. I will delete the posts above soon to clean up. Thanks.

          Xuefu Zhang added a comment - - edited

          Zhang Jingpeng, please check with your distribution provider about production readiness. If you're building your own bits, you will need to run your own tests to answer the question. As a dev on the project, I think it's ready, based on my personal judgement.

          Martin Wang added a comment -

          Hi Chinna Rao Lalam,
          I'm using CDH-5.3.0-1.cdh5.3.0.p0.280, a Cloudera CDH version that includes Hive on Spark.
          There are 61 tables in total. The query succeeds with 35 tables, using 126 map tasks to process the data.
          When there is no error, I can see the job in the Spark Web UI (from the YARN web UI -> Application Master web UI). When there is an error, the job is never even started in the Spark Web UI.
          The 61 tables hold about 80,000,000 rows in total, which is not very big. The data in the 61 tables totals 4 GB.
          I ran the command in the Hive CLI. When the error occurs, I see the message below:
          Query ID = root_20150701155757_61cd36ee-3c38-49a8-9c13-1029acffa0d3
          Total jobs = 1
          Launching Job 1 out of 1
          In order to change the average load for a reducer (in bytes):
          set hive.exec.reducers.bytes.per.reducer=<number>
          In order to limit the maximum number of reducers:
          set hive.exec.reducers.max=<number>
          In order to set a constant number of reducers:
          set mapreduce.job.reduces=<number>
          Status: Failed
          FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

          Chinna Rao Lalam added a comment -

          Hi, can you add some more details, such as the stack trace, which version you are using, and how many tables count as "big enough"?

          Martin Wang added a comment -

          Dear experts,
          I'm trying Hive on Spark. I ran into a problem when running a map-only query like
          create table xxx as
          select a,b from table1 union all
          select a,b from table2 union all
          select a,b from table3 union all
          ...

          When the number of tables is small, it works fine.
          When the number of tables is big enough, it fails with:
          FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

          Also, when the failure occurs, I can't see the job in the Spark Web UI.

          Can anyone help me solve this problem?
          Thank you!

          Martin
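
For anyone trying to reproduce the report above, the query can be generated for an arbitrary number of tables. This is only a sketch: the target name `xxx`, the columns `a,b`, and the `table1`, `table2`, ... naming follow the example in the comment, and the helper `build_union_all_ctas` is hypothetical, not part of Hive.

```python
def build_union_all_ctas(target, tables, columns=("a", "b")):
    """Return a CREATE TABLE ... AS statement that UNION ALLs the given tables."""
    if not tables:
        raise ValueError("at least one source table is required")
    col_list = ",".join(columns)
    # One SELECT per source table, all projecting the same columns.
    selects = [f"select {col_list} from {t}" for t in tables]
    return f"create table {target} as\n" + " union all\n".join(selects)


# Hypothetical table names table1..table61, matching the table count in the report.
statement = build_union_all_ctas("xxx", [f"table{i}" for i in range(1, 62)])
```

Varying the length of the table list (for example 35 versus 61) makes it easier to find the point at which the job stops being submitted.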

          Zhang Jingpeng added a comment -

          Is the branch already usable in production?

          Rui Li added a comment -

          Raj Sharma - Yes you can. You can follow this wiki to see what else you need to do to run Hive on Spark.

          Raj Sharma added a comment -

          Can I run the command below in Hive 1.1 or 1.2 to switch the engine from MapReduce to Spark?

          hive> set hive.execution.engine=spark;
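
The same setting can also be passed for a single non-interactive run through the Hive CLI's `--hiveconf` and `-e` options. The sketch below only builds the argument list. The helper name `hive_command` and the table `some_table` are hypothetical, and it assumes a `hive` binary on the PATH and a cluster already set up for Hive on Spark.

```python
import shlex


def hive_command(query, engine="spark"):
    """Build a Hive CLI argument list that sets the execution engine for one query."""
    return ["hive", "--hiveconf", f"hive.execution.engine={engine}", "-e", query]


# some_table is a placeholder; substitute a real table before running.
cmd = hive_command("select count(*) from some_table")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)`. Passing `engine="mr"` or `engine="tez"` selects one of the other engines for comparison.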

          Chao Sun added a comment -

          Hi Raj, as mentioned by Xuefu above, Hive on Spark is already available in Hive 1.1 and 1.2. Please check it out.

          Raj Sharma added a comment -

          When will Spark be shipped with Hive as an execution engine option alongside Tez and MapReduce?

          Xuefu Zhang added a comment -

          Yes, it's available in both 1.1 and 1.2.

          Xiaoyong Zhu added a comment -

          So is this available in Hive 1.2?

          Xin Hao added a comment -

          OK, I see. Thanks for your info.

          Ruslan Dautkhanov added a comment -

          Exciting. Hopefully it will be released some time soon.

          Lefty Leverenz added a comment -

          Doc note: See comments on HIVE-9257 and HIVE-9448 for documentation issues.
          HIVE-9257 commit comment with doc notes
          HIVE-9448 doc comments
          list of configuration parameters where documented

          leftylev added a comment - - edited

          Although this issue is still marked Unresolved, the Spark branch has been merged to trunk and is Resolved for the 1.1.0 release (HIVE-9257 and HIVE-9352). (Edit: Also HIVE-9448.)

          Peter Lin added a comment -

          Thanks Xuefu for the quick reply. I will give it a try next week.

          Xuefu Zhang added a comment - - edited

          Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out.

          Xuefu Zhang added a comment -

          Formerly 0.15, now 1.1 is going to be released soon. Release candidate is out.

          Peter Lin added a comment -

          Would love to use this in production. Is it going to be released in Hive 0.15?

          Xuefu Zhang added a comment -

          Bing Li, I assume you assigned this JIRA to yourself by mistake. However, let me know if you plan to work on this. Thanks.

          Xuefu Zhang added a comment -

          yuemeng, you can try removing the org/apache/spark folder in your local Maven repo to see if that fixes it.
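
A minimal sketch of that cleanup step, assuming the local Maven repository is at Maven's default location of `~/.m2/repository` unless another root is passed in. The helper name `remove_cached_spark_artifacts` is hypothetical. It only deletes the cached `org/apache/spark` directory so that the next build downloads the Spark artifacts again.

```python
import shutil
from pathlib import Path


def remove_cached_spark_artifacts(repo_root=None):
    """Delete org/apache/spark under a local Maven repository, if present.

    Returns True if the directory existed and was removed, False otherwise.
    """
    # Assumption: Maven's default local repository location is used
    # when no explicit root is given.
    root = Path(repo_root) if repo_root else Path.home() / ".m2" / "repository"
    target = root / "org" / "apache" / "spark"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `remove_cached_spark_artifacts()` and then re-running `mvn package -DskipTests -Phadoop-2 -Pdist` repeats the build with freshly resolved Spark dependencies.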

          yuemeng added a comment -

          I am very interested in Hive on Spark and am trying to use it. When I built it (downloaded from https://github.com/apache/hive.git, choosing the spark branch) with Maven using the command mvn package -DskipTests -Phadoop-2 -Pdist, it gave me errors like:
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:[22,24] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          [ERROR] location: package org.apache.spark
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:[33,10] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          [ERROR] location: interface org.apache.hadoop.hive.ql.exec.spark.status.SparkJobStatus
          [ERROR] /home/ym/hive-on-spark/hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobMonitor.java:[31,24] cannot find symbol
          [ERROR] symbol: class JobExecutionStatus
          Can you tell me why?

          Kiran Lonikar added a comment -

          Sorry, I have not looked at the code, but I want to know how the RDD is structured. Is it columnar? I am specifically interested in how you preserve the columnar structure of ORC, RC, and Parquet files. An RDD is by nature row-wise, and SchemaRDD more specifically so.

          The Spark SQL component uses SchemaRDD, which is row-wise. Just to be clear, I am not reporting any problems with this JIRA. I am interested in the implementation.

          I think columnar structure has its advantages, and that's what Hive vectorization did (https://issues.apache.org/jira/browse/HIVE-4160). The earlier SQL implementation, Shark, also had some kind of columnar structure. I am not sure Hive on Spark is preserving it.

          Xuefu Zhang added a comment -

          Paulo Motta, thanks for your interest. I think the branch is ready for prospective users to try out, but for production I'd recommend waiting for a formal release.

          Paulo Motta added a comment -

          Is the branch already usable in production?

          Szehon Ho added a comment -

          Adding a short Getting Started Guide.

          WangMeng added a comment -

          This is a very valuable project!


            People

            • Assignee:
              Xuefu Zhang
              Reporter:
              Xuefu Zhang
            • Votes:
              31
              Watchers:
              179

              Dates

              • Created:
                Updated:

                Development