Hive / HIVE-17532

Hive on Spark query compilation starts Spark session


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: HiveServer2
    • Labels: None

    Description

      Hive on Spark query compilation starts a new Spark session when some kind of aggregation is present:

      0: jdbc:hive2://localhost:10000/default> set hive.execution.engine=spark;
      No rows affected (0.013 seconds)
      0: jdbc:hive2://localhost:10000/default> explain select distinct label0 from iris;
      INFO : Compiling command(queryId=hive_20170912151212_914ee322-28dd-442a-9dd9-7ed00a6a8caf): explain select distinct label0 from iris
      INFO : Semantic Analysis Completed
      INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
      INFO : Completed compiling command(queryId=hive_20170912151212_914ee322-28dd-442a-9dd9-7ed00a6a8caf); Time taken: 40.594 seconds

      Once the Spark job has started, all subsequent explain statements are fast:

      0: jdbc:hive2://localhost:10000/default> explain select distinct a1 from iris;
      INFO : Compiling command(queryId=hive_20170912151414_faacda24-290e-48bb-9daf-3f301fc170c1): explain select distinct label0 from iris
      INFO : Semantic Analysis Completed
      INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
      INFO : Completed compiling command(queryId=hive_20170912151414_faacda24-290e-48bb-9daf-3f301fc170c1); Time taken: 0.275 seconds

      After killing the Spark job, the same query is still fast, and no new Spark job is started:

      0: jdbc:hive2://localhost:10000/default> explain select distinct a2 from iris;
      INFO : Compiling command(queryId=hive_20170912151616_a7ea83b6-03ce-4636-b3d4-be6feadcde35): explain select distinct label0 from iris
      INFO : Semantic Analysis Completed
      INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
      INFO : Completed compiling command(queryId=hive_20170912151616_a7ea83b6-03ce-4636-b3d4-be6feadcde35); Time taken: 0.213 seconds

      The code in question:
      SetSparkReducerParallelism.java:
      sparkSessionManager = SparkSessionManagerImpl.getInstance();
      sparkSession = SparkUtilities.getSparkSession(context.getConf(), sparkSessionManager);
      sparkMemoryAndCores = sparkSession.getMemoryAndCores();

      The Spark session created here is used only to obtain the number of cores and the amount of memory. These values could be determined from the configuration, without actually starting a session.
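As a rough illustration of the configuration-only approach, here is a minimal sketch. It is not Hive code: the class SparkResourceEstimate and its methods are hypothetical, the configuration is modelled as a plain Map rather than a HiveConf, and the fallback values for unset properties are assumptions that should be checked against the Spark version and cluster manager in use. It also ignores dynamic allocation, where the number of executors is not fixed by spark.executor.instances.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Hive): estimates Spark resources from configuration only. */
final class SparkResourceEstimate {

    // Assumed fallbacks for unset properties; verify against the actual deployment.
    static final String DEFAULT_EXECUTOR_MEMORY = "1g";
    static final int DEFAULT_EXECUTOR_CORES = 1;
    static final int DEFAULT_EXECUTOR_INSTANCES = 2;

    /** Parses sizes such as "512m", "4g" or "4gb" into bytes. A unit suffix is required in this sketch. */
    static long parseMemoryBytes(String value) {
        String s = value.trim().toLowerCase(Locale.ROOT);
        if (s.endsWith("b")) {
            s = s.substring(0, s.length() - 1); // accept "4gb" as well as "4g"
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("Empty size: " + value);
        }
        long multiplier;
        switch (s.charAt(s.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            case 't': multiplier = 1L << 40; break;
            default: throw new IllegalArgumentException("Missing size unit: " + value);
        }
        return Long.parseLong(s.substring(0, s.length() - 1)) * multiplier;
    }

    /** Memory per executor in bytes, read from spark.executor.memory. */
    static long executorMemoryBytes(Map<String, String> conf) {
        return parseMemoryBytes(conf.getOrDefault("spark.executor.memory", DEFAULT_EXECUTOR_MEMORY));
    }

    /** Total cores = executors * cores per executor (static allocation only). */
    static int totalCores(Map<String, String> conf) {
        int cores = Integer.parseInt(
            conf.getOrDefault("spark.executor.cores", String.valueOf(DEFAULT_EXECUTOR_CORES)));
        int instances = Integer.parseInt(
            conf.getOrDefault("spark.executor.instances", String.valueOf(DEFAULT_EXECUTOR_INSTANCES)));
        return cores * instances;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.executor.memory", "4g");
        conf.put("spark.executor.cores", "4");
        conf.put("spark.executor.instances", "10");
        System.out.println("executor memory (bytes): " + executorMemoryBytes(conf));
        System.out.println("total cores: " + totalCores(conf));
    }
}
```

Something along these lines would let compilation (including explain) set reducer parallelism without paying the session startup cost. The trade-off is that the numbers reflect what was requested in the configuration, not what the cluster actually granted.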

          People

            Assignee: Unassigned
            Reporter: Peter Csaszar (pcsaszar)