Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: Metastore
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increases, the memory and load requirements on the namenode also increase. Partitions in bucketed tables are a particular problem because they consist of many files, one for each bucket.

      One way to drastically reduce the number of files is to use hadoop archives:
      http://hadoop.apache.org/common/docs/current/hadoop_archives.html

      This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> command that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would keep the same functionality while drastically decreasing the number of files. Typically, only seldom-accessed partitions would be archived.
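
      As a minimal sketch of the intended usage (the table name and partition value are made up for illustration, and hive.archive.enabled is an assumed property name for the HIVEARCHIVEENABLED flag discussed in the review comments below):

      -- assumed configuration flag that enables the feature
      SET hive.archive.enabled=true;

      -- archive a seldom-accessed partition of a hypothetical table
      ALTER TABLE page_views ARCHIVE PARTITION (ds='2010-05-01');

      -- convert the partition back to its original files
      ALTER TABLE page_views UNARCHIVE PARTITION (ds='2010-05-01');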

      Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:

      https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
      https://issues.apache.org/jira/browse/HADOOP-6645
      https://issues.apache.org/jira/browse/MAPREDUCE-1585

      Attachments

      1. HIVE-1332.6.patch
        99 kB
        Paul Yang
      2. HIVE-1332.5.patch
        97 kB
        Paul Yang
      3. HIVE-1332.4.patch
        96 kB
        Paul Yang
      4. HIVE-1332.3.patch
        83 kB
        Paul Yang
      5. HIVE-1332.2.patch
        79 kB
        Paul Yang
      6. HIVE-1332.1.patch
        65 kB
        Paul Yang

        Issue Links

          Activity

          Steven Wong added a comment -

          If an archive operation runs when a select query is already running, the select may fail, right?

          Paul Yang added a comment -

          Added archiving sections at:
          http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Alter_Table_.28Un.29Archive
          http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

          Paul Yang added a comment -

          No, not yet. I'll update once it is posted.

          John Sichi added a comment -

          Paul, is there wiki documentation available for this feature?

          Namit Jain added a comment -

          Committed. Thanks Paul

          Paul Yang added a comment -

          This should fix the test issues - I'm re-running the test suite now but let me know if you see anything.

          Namit Jain added a comment -

          Otherwise, it looks good

          Paul Yang added a comment -

          Well, we can't quite test using CHIF because users will need to have the patch for MAPREDUCE-1806 (when it comes out), or else we will get an error.

          Namit Jain added a comment -

          Also, since the tests only work with Hadoop 0.20 anyway, can we also test CombineHiveInputFormat for that scenario?

          Namit Jain added a comment -

          Some tests failed - send a mail offline to Paul.
          Also, it would be useful if you could change constants like METASTOREINTARCHIVED to METASTORE_INT_ARCHIVED, etc.

          Paul Yang added a comment -

          Updated to current trunk.

          Paul Yang added a comment -

          This doesn't incorporate the hadoop-version aware test framework, but here's another version that has some additional fixes/tests.

          • Handled table renaming
          • Added a check when creating partitions to catch reserved values
          • Additional tests for above

          Namit Jain added a comment -

          -- nitpick

          line 1113 of HiveMetastore.java:

          wh.deleteDir(MetaStoreUtils.getOriginalLocation(part), true);

          archiveParentDir already holds this value, so

          wh.deleteDir(archiveParentDir, true);

          should be fine.

          Paul Yang added a comment -

          Yes, that would be better. The check was from an earlier version of the patch.

          Namit Jain added a comment -

          if ("har".equalsIgnoreCase(dest_path.toUri().getScheme()))

          { 3257 throw new SemanticException(ErrorMsg.OVERWRITE_ARCHIVED_PART 3258 .getMsg()); 3259 }

          Do you think it is a better idea to check if the partition is ARCHIVED instead ?
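
          For context, the case this check guards against is an INSERT OVERWRITE that targets an archived partition. A hypothetical illustration, with made-up table and column names:

          -- fails with the OVERWRITE_ARCHIVED_PART error while the partition is archived
          INSERT OVERWRITE TABLE page_views PARTITION (ds='2010-05-01')
          SELECT userid, page_url FROM page_views_staging;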

          Paul Yang added a comment -
          • Moved some constants to thrift definition

          Paul Yang added a comment -

          Note that because the archiving process involves several filesystem operations, there can be errors if it's run concurrently on the same partition. Once HIVE-1293 is in, this feature will be updated to address this.

          Paul Yang added a comment -
          • Adds a negative test; changes the way tests are skipped if the Hadoop version is not supported
          • Changes the ordering of filesystem operations to prevent data loss / corruption, and allows recovery by re-running the command if interrupted

          Paul Yang added a comment -

          Checks that require access to the metadata should be done in DDLTask, not DDLSemanticAnalyzer, no?

          Paul Yang added a comment -

          Yeah, the way the patch is now, concurrent operations are not supported, as it was assumed these commands would be run via a single cron job. But that is probably not a good assumption to make. The priority was to order the filesystem operations so as to prevent data loss in case of any failures. The reason the (un)archive operation is tricky to do concurrently is that there is no way to lock a partition/table (HIVE-1293) and no way to atomically make a filesystem and metadata change together. But there are ways of addressing these concurrency issues while preserving data during failure scenarios:

          Archiving a partition using a conservative approach would involve something like (as discussed with Namit):

          1. Create a copy of the partition, call it ds=1.copy
          2. Alter metadata's location to point to ds=1.copy
          – At this point failures are okay as the copy is not touched
          3. Make the archive of the partition directory in a tmp directory
          4. Remove the directory ds=1
          5. Move the tmp directory to ds=1
          6. Alter metadata's location to point to har:/...ds=1
          7. Delete ds=1.copy

          This set of steps would ensure that no matter when a failure occurs, subsequent queries on the partition will continue to succeed. However, this approach incurs the overhead of having to make a copy of the partition, which can be significant. Another approach is to:

          1. Make the archive of the partition in a tmp directory
          2. Move the archive folder to ds=1.copy
          3. Move ds=1 to ds=1.old
          4. Move ds=1.copy to ds=1
          5. Alter the metadata to change the location to har:/...ds=1

          The drawback to this approach is that if a failure occurs, subsequent queries will not be able to properly access the data. However, the archive command can be run again to recover from the situation.

          Also, since FileSystem.rename() does not throw an error if the destination directory already exists, there is a small window for data duplication. However, this issue is already present in INSERT OVERWRITE... These issues will be addressed with lock support.

          Namit Jain added a comment -

          The same problem is present during unarchive. Once the existence of a har file has been checked, two processes running concurrently can create files, so we may duplicate the data.

          Namit Jain added a comment -

          Isn't there a race condition in archive() in DDLTask? If two archives for the same partition are running concurrently,
          you may end up creating a sub-directory inside the har directory.

          For example, you may end up with: /warehouse/T/P/Tbl.har/Tbl.har/....

          Paul Yang added a comment -

          Yes, that check occurs when we try to get the partition. If the user specifies a partial spec, then getPartition() will return null.

          Namit Jain added a comment -

          DDLSemanticAnalyzer.java

          622   private void analyzeAlterTableArchive(CommonTree ast, boolean isUnArchive)
          623       throws SemanticException {
          624
          625     if (!conf.getBoolVar(HiveConf.ConfVars.HIVEARCHIVEENABLED)) {
          626       throw new SemanticException("Archiving methods are currently disabled. " +
          627           "Please see the Hive wiki for more information about enabling archiving.");
          628
          629     }
          630     String tblName = unescapeIdentifier(ast.getChild(0).getText());
          631     // partition name to value
          632     List<Map<String, String>> partSpecs = getPartitionSpecs(ast);
          633     if (partSpecs.size() > 1) {
          634       throw new SemanticException(isUnArchive ? "UNARCHIVE" : "ARCHIVE" +
          635           " can only be run on a single partition");
          636     }
          637     if (partSpecs.size() == 0) {
          638       throw new SemanticException("ARCHIVE can only be run on partitions");

          Add the error messages to ErrorMsg.java, and add negative tests for all of them.

          DDLTask.java

          413     // Means user specified a table
          414     if (simpleDesc.getPartSpec() == null) {
          415       throw new HiveException("ARCHIVE is for partitions only");
          416     }

          Shouldn't this be checked in DDLSemanticAnalyzer instead?

          Same as above:

          421     if (tbl.getTableType() != TableType.MANAGED_TABLE) {
          422       throw new HiveException("ARCHIVE can only be performed on managed tables");
          423     }

          and:

          429     if (isArchived(p)) {
          430       throw new HiveException("Specified partition is already archived");
          431     }

          One check that seems to be missing: if we have multiple partition columns, say ds and hr, and the user tries to archive just by specifying ds, should that be allowed? I don't think it will work - are you checking that?
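
          As an illustration of the partial-spec case (hypothetical table and values; the behavior follows the reply above, where getPartition() returns null for a partial spec):

          -- full spec for a table partitioned by (ds, hr): identifies a single partition
          ALTER TABLE page_views ARCHIVE PARTITION (ds='2010-05-01', hr='12');

          -- partial spec: only ds is given; per the reply above, getPartition()
          -- returns null and the command fails rather than archiving many partitions
          ALTER TABLE page_views ARCHIVE PARTITION (ds='2010-05-01');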


            People

             • Assignee:
               Paul Yang
             • Reporter:
               Paul Yang
             • Votes:
               0
             • Watchers:
               4
