HIVE-15367

CTAS with LOCATION should write temp data under location directory rather than database location


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels: None

    Description

      For regular CTAS queries, temp data from the SELECT query is written to a staging directory under the database location. The code that controls this is in SemanticAnalyzer.java:

                    // allocate a temporary output dir on the location of the table
                    String tableName = getUnescapedName((ASTNode) ast.getChild(0));
                    String[] names = Utilities.getDbTableName(tableName);
                    Path location;
                    try {
                      Warehouse wh = new Warehouse(conf);
                      //Use destination table's db location.
                      String destTableDb = qb.getTableDesc() != null? qb.getTableDesc().getDatabaseName(): null;
                      if (destTableDb == null) {
                        destTableDb = names[0];
                      }
                      location = wh.getDatabasePath(db.getDatabase(destTableDb));
                    } catch (MetaException e) {
                      throw new SemanticException(e);
                    }
      

      However, CTAS queries allow specifying a LOCATION for the new table. It's possible for this location to be on a different filesystem than the database location. If this happens, temp data will be written to the database filesystem and then copied to the table filesystem in MoveTask.
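
      To make the "different filesystem" case concrete, here is a minimal sketch that compares the scheme and authority of two location URIs. It uses only java.net.URI; the class name FsCompare, the method sameFileSystem, and the example URIs are hypothetical and are not Hive code.

```java
import java.net.URI;

public class FsCompare {

    // Two locations are treated as being on the same filesystem when both
    // their URI scheme and their authority match (hypothetical helper).
    static boolean sameFileSystem(URI a, URI b) {
        return equalsIgnoreCase(a.getScheme(), b.getScheme())
            && equalsIgnoreCase(a.getAuthority(), b.getAuthority());
    }

    // Null-safe, case-insensitive comparison; two nulls count as equal.
    private static boolean equalsIgnoreCase(String x, String y) {
        return x == null ? y == null : x.equalsIgnoreCase(y);
    }

    public static void main(String[] args) {
        // Made-up locations: a database on HDFS, a CTAS LOCATION on an object store.
        URI databaseLocation = URI.create("hdfs://namenode:8020/user/hive/warehouse/sales.db");
        URI tableLocation = URI.create("s3a://example-bucket/tables/orders_copy");

        // When this is false, staging under the database location forces a
        // cross-filesystem copy when the data is moved to the table location.
        System.out.println(sameFileSystem(databaseLocation, tableLocation));
    }
}
```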

      This extra copying of data can drastically affect performance. Rather than always using the database location as the staging dir for CTAS queries, Hive should first check whether an explicit LOCATION is specified in the CTAS query. If there is, staging data should be stored under the LOCATION directory.
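
      The proposed selection rule can be sketched as follows. This is an illustration only, not the attached patch: the class CtasStagingDir, the method stagingDir, and the staging directory name are assumptions made for the example, and plain java.net.URI stands in for Hive's path handling.

```java
import java.net.URI;

public class CtasStagingDir {

    // Placeholder name for the staging subdirectory (an assumption for this sketch).
    static final String STAGING_NAME = ".staging-example";

    // Prefer the explicit LOCATION from the CTAS statement; fall back to the
    // database location only when no LOCATION was given (explicitLocation == null).
    static URI stagingDir(URI explicitLocation, URI databaseLocation) {
        URI parent = (explicitLocation != null) ? explicitLocation : databaseLocation;
        String base = parent.toString();
        if (!base.endsWith("/")) {
            base += "/";
        }
        return URI.create(base + STAGING_NAME);
    }

    public static void main(String[] args) {
        URI databaseLocation = URI.create("hdfs://namenode:8020/user/hive/warehouse/sales.db");
        URI tableLocation = URI.create("s3a://example-bucket/tables/orders_copy");

        // CTAS with LOCATION: staging data lands on the table's filesystem.
        System.out.println(stagingDir(tableLocation, databaseLocation));
        // CTAS without LOCATION: behavior is unchanged, staging under the database.
        System.out.println(stagingDir(null, databaseLocation));
    }
}
```

      With the staging directory on the same filesystem as the final table location, the final move no longer has to copy data across filesystems.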

      Attachments

        1. HIVE-15367.1.patch
          3 kB
          Sahil Takiar
        2. HIVE-15367.2.patch
          4 kB
          Sahil Takiar
        3. HIVE-15367.3.patch
          43 kB
          Sahil Takiar
        4. HIVE-15367.4.patch
          45 kB
          Sahil Takiar


          People

            Assignee: Sahil Takiar (stakiar)
            Reporter: Sahil Takiar (stakiar)
            Votes: 0
            Watchers: 4
