Spark / SPARK-18372

.Hive-staging folders created from Spark hiveContext are not getting cleaned up


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 1.5.2, 1.6.2, 1.6.3
    • Fix Version/s: 1.6.4
    • Component/s: SQL
    • Labels: None
    • Environment: Spark standalone and Spark on YARN

    Description

      Steps to reproduce:
      ================
      1. Launch spark-shell.
      2. Run the following Scala code in spark-shell:
      scala> val hivesampletabledf = sqlContext.table("hivesampletable")
      scala> import org.apache.spark.sql.DataFrameWriter
      scala> val dfw : DataFrameWriter = hivesampletabledf.write
      scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
      scala> dfw.insertInto("hivesampletablecopypy")
      scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
      scala> hivesampletablecopypydfdf.show
      3. In HDFS (in our case, WASB), the following folders appear:
      hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
      hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
      hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
      The issue is that these folders are never cleaned up, so they accumulate over time.
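Until the folders are cleaned up automatically, leftover directories can at least be identified by the timestamp embedded in their names. The following is a minimal Python sketch, not the fix for this issue. It assumes the name pattern seen in the listing above (inferred from those paths, not taken from Hive or Spark source), an arbitrary one-day age threshold, and that the embedded timestamp and `now` share a time zone. The helper names `split_staging` and `stale_staging_dirs` are hypothetical. It only selects candidates from a list of path strings; obtaining the listing and deleting anything is left to whatever filesystem tooling is in use.

```python
import re
from datetime import datetime, timedelta

# Name pattern inferred from the listing in this report (an assumption):
# .hive-staging_hive_<yyyy-MM-dd_HH-mm-ss>_<digits>_<digits>[-<digits>]
STAGING_RE = re.compile(
    r"^\.hive-staging_hive_(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})_\d+_\d+(?:-\d+)?$"
)


def split_staging(path):
    """Return (staging_root, created_at) for a path inside a staging dir, else None."""
    parts = path.strip("/").split("/")
    for i, part in enumerate(parts):
        match = STAGING_RE.match(part)
        if match:
            created = datetime.strptime(match.group(1), "%Y-%m-%d_%H-%M-%S")
            # Keep everything up to and including the staging directory itself.
            return "/".join(parts[: i + 1]), created
    return None


def stale_staging_dirs(paths, now, max_age=timedelta(days=1)):
    """Return the distinct staging roots whose embedded timestamp is older than max_age."""
    stale = set()
    for path in paths:
        found = split_staging(path)
        if found and now - found[1] > max_age:
            stale.add(found[0])
    return sorted(stale)
```

Passing `now` explicitly keeps the selection deterministic. Anything the function returns should be reviewed before deletion, since a recent staging directory may belong to a query that is still running.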
      =====
      With the customer, we tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml; it made no difference.
      The .hive-staging folders are still created under the <TableName> folder: hive/warehouse/hivesampletablecopypy/
      We also tried adding this property to hive-site.xml and restarting the components:
      <property>
      <name>hive.exec.stagingdir</name>
      <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
      </property>
      A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
      Moreover, if we run the same query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior,
      so this does not appear to be a Hive issue; it is Spark behavior.
      I checked in Ambari: spark.yarn.preserve.staging.files=false is already set in the Spark configuration.
      The issue also happens via spark-submit. The customer used the following command to reproduce it:
      spark-submit test-hive-staging-cleanup.py

          People

            merlin Mingjie Tang
            merlin Mingjie Tang
