Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In our env, there are a lot of small files inside one partition/table. In order to reduce the namenode load, we have one dedicated housekeeping job running to merge these file. Right now the merge is an 'insert overwrite' in hive, and requires decompress the data and compress it. This jira is to add a command in Hive to do the merge without decompress and recompress the data.

      Something like "alter table tbl_name [partition ()] merge files". In this jira the new command will only support RCFile, since there need some new APIs to the fileformat.

      1. HIVE-1950.6.patch
        165 kB
        He Yongqiang
      2. HIVE-1950.5.patch
        165 kB
        He Yongqiang
      3. HIVE-1950.4.patch
        162 kB
        He Yongqiang
      4. HIVE-1950.3.patch
        159 kB
        He Yongqiang
      5. HIVE-1950.2.patch
        126 kB
        He Yongqiang
      6. HIVE-1950.1.patch
        113 kB
        He Yongqiang

        Issue Links

          Activity

          Hide
          He Yongqiang added a comment -

          A patch for review.

          The code now is kind of very clean. Comments about how to make it clean are welcome!

          Show
          He Yongqiang added a comment - A patch for review. The code now is kind of very clean. Comments about how to make it clean are welcome!
          Hide
          He Yongqiang added a comment -
          Show
          He Yongqiang added a comment - review board: https://reviews.apache.org/r/388/
          Hide
          Namit Jain added a comment -

          I will take a look - 1 minor comment.
          Can you add some negative tests:

          1. merge_files should fail if there is a index on the table/partition.
          2. merge_files should fail if the table is partitioned, but the user did not specify the partition.

          Going forward, we should support merge even if the partition is not fully specified.

          alter table srcpart partition (ds='1') merge_files;

          should merge ds=1/hr=1 and ds=1/hr=2 as a follow-up.
          But for now, they should throw an error

          Show
          Namit Jain added a comment - I will take a look - 1 minor comment. Can you add some negative tests: 1. merge_files should fail if there is a index on the table/partition. 2. merge_files should fail if the table is partitioned, but the user did not specify the partition. Going forward, we should support merge even if the partition is not fully specified. alter table srcpart partition (ds='1') merge_files; should merge ds=1/hr=1 and ds=1/hr=2 as a follow-up. But for now, they should throw an error
          Hide
          He Yongqiang added a comment -

          review comments from internal review:
          1) if the stats present, try to correct it
          2) jobClose of RCFileMergeMapper should share the code in FileSinkOperator
          3) move the original data to a dump loc first
          4) remove getRecordWriter() and RCFileBlockMergeOutputFormat
          5) ioCxt for input file changed
          6) disable merge for archived table/partition and bucketized table/partition
          7) comments
          8) negative tests for hiveinputformat

          Show
          He Yongqiang added a comment - review comments from internal review: 1) if the stats present, try to correct it 2) jobClose of RCFileMergeMapper should share the code in FileSinkOperator 3) move the original data to a dump loc first 4) remove getRecordWriter() and RCFileBlockMergeOutputFormat 5) ioCxt for input file changed 6) disable merge for archived table/partition and bucketized table/partition 7) comments 8) negative tests for hiveinputformat
          Hide
          He Yongqiang added a comment -

          A new patch addressed the review comments.

          Will put a few into followup including the stat update.

          Show
          He Yongqiang added a comment - A new patch addressed the review comments. Will put a few into followup including the stat update.
          Hide
          Ning Zhang added a comment -

          As discussed offline, this patch should be able to handle stats update (creating a StatsTask as child).

          Also please keep in mind that the design and implementation of the new MergeTask should be easy to be used in the merge process in INSERT OVERWRITE.

          Show
          Ning Zhang added a comment - As discussed offline, this patch should be able to handle stats update (creating a StatsTask as child). Also please keep in mind that the design and implementation of the new MergeTask should be easy to be used in the merge process in INSERT OVERWRITE.
          Hide
          Namit Jain added a comment -

          1. Can you change merge_files to concatenate ?
          alter table <T> concatenate;

          2. Move RCFile check to SemanticAnalyzer from runtime.

          3. More comments: DDLTask.java/mergeFiles
          RCFile: all the new functions etc.

          Show
          Namit Jain added a comment - 1. Can you change merge_files to concatenate ? alter table <T> concatenate; 2. Move RCFile check to SemanticAnalyzer from runtime. 3. More comments: DDLTask.java/mergeFiles RCFile: all the new functions etc.
          Hide
          He Yongqiang added a comment -

          >>2. Move RCFile check to SemanticAnalyzer from runtime.
          SemanticAnalyzer only throws SemanticException. we may should keep this semantic. Moving the check to SemanticAnalyzer will need it to handle a lot of HiveExceptions (thrown by getTable etc).

          Show
          He Yongqiang added a comment - >>2. Move RCFile check to SemanticAnalyzer from runtime. SemanticAnalyzer only throws SemanticException. we may should keep this semantic. Moving the check to SemanticAnalyzer will need it to handle a lot of HiveExceptions (thrown by getTable etc).
          Hide
          Ning Zhang added a comment -

          Yongqiang, does the review board have the latest patch?

          Show
          Ning Zhang added a comment - Yongqiang, does the review board have the latest patch?
          Hide
          Ning Zhang added a comment -

          Yongqiang, the patch doesn't compile. Below are some initial reviews from me:

          QTestUtil.java:
          334: you may want to add those index tables that you want to keep in srcTables. Otherewise indexes that are created inside a test will not be cleaned – side-effect.

          StatsTask:
          a StatsTask is added in DDLSemanticAnalyzer for the mege task but why set it to do nothing?

          ExecDriver:
          jobExecHelper is constructed in both the constructors and initialize(). Is there a reason?

          checkFatalError: why removed some code?

          Why remove METASTOREPWD?

          DDLTask:
          move semantics checking (index & archive checking etc.) to DDLSemanticAnalyzer. Execution time should only raise exception if there are runtime exceptions. In another word, explain plan of the query shoull throw an exception if there are indexes or table is archived.

          Show
          Ning Zhang added a comment - Yongqiang, the patch doesn't compile. Below are some initial reviews from me: QTestUtil.java: 334: you may want to add those index tables that you want to keep in srcTables. Otherewise indexes that are created inside a test will not be cleaned – side-effect. StatsTask: a StatsTask is added in DDLSemanticAnalyzer for the mege task but why set it to do nothing? ExecDriver: jobExecHelper is constructed in both the constructors and initialize(). Is there a reason? checkFatalError: why removed some code? Why remove METASTOREPWD? DDLTask: move semantics checking (index & archive checking etc.) to DDLSemanticAnalyzer. Execution time should only raise exception if there are runtime exceptions. In another word, explain plan of the query shoull throw an exception if there are indexes or table is archived.
          Hide
          He Yongqiang added a comment -

          a new patch for Ning's comments. Thanks.

          Show
          He Yongqiang added a comment - a new patch for Ning's comments. Thanks.
          Hide
          Ning Zhang added a comment -

          Yongqiang, I'm still reviewing the new patch (.4) but found some of my comments are not address (e.g., QTestUtil). Can you elaborate which comments have been addressed and which are not (and the reasons)?

          Show
          Ning Zhang added a comment - Yongqiang, I'm still reviewing the new patch (.4) but found some of my comments are not address (e.g., QTestUtil). Can you elaborate which comments have been addressed and which are not (and the reasons)?
          Hide
          He Yongqiang added a comment -

          QTestUtil.java is not related to this jira. should open a new one for it.

          >>jobExecHelper is constructed in both the constructors and initialize(). Is there a reason?
          This is because the existing code may use ExecDriver and may not call initialize() (like ExecDriver's main()).

          >>checkFatalError: why removed some code?
          No code is removed, just some code is moved to jobExecHelper.

          Show
          He Yongqiang added a comment - QTestUtil.java is not related to this jira. should open a new one for it. >>jobExecHelper is constructed in both the constructors and initialize(). Is there a reason? This is because the existing code may use ExecDriver and may not call initialize() (like ExecDriver's main()). >>checkFatalError: why removed some code? No code is removed, just some code is moved to jobExecHelper.
          Hide
          Ning Zhang added a comment -

          2nd Round:
          =========
          QTestUtil:
          same as my previous comment: revert the change if it belongs to another JIRA

          StatsTask:
          line 274, 310: you may not need the updateOnly variable in StatsWork. Instead you can just check HiveConf.ConfVars.HIVE_STATS_ATOMIC.

          CombineHiveKey.java:
          missing Apache liense header

          RCFileMergeMapper.java:
          the jobClose() functions should handle the exceptional case when abort is true (similar to what FileSinkOperator does) or an exception was thrown from the hadoopo layber but it failed to call close(abort=true).
          also in jobClose(), the partition's old directory is first moved to backup directory and then the intermediate directory is moved to the partition's destination directory. All this is done when the partition is online (other queries can read the partition's directory). You may want to create a follow-up JIRA to make this partition as offline during the move.

          Show
          Ning Zhang added a comment - 2nd Round: ========= QTestUtil: same as my previous comment: revert the change if it belongs to another JIRA StatsTask: line 274, 310: you may not need the updateOnly variable in StatsWork. Instead you can just check HiveConf.ConfVars.HIVE_STATS_ATOMIC. CombineHiveKey.java: missing Apache liense header RCFileMergeMapper.java: the jobClose() functions should handle the exceptional case when abort is true (similar to what FileSinkOperator does) or an exception was thrown from the hadoopo layber but it failed to call close(abort=true). also in jobClose(), the partition's old directory is first moved to backup directory and then the intermediate directory is moved to the partition's destination directory. All this is done when the partition is online (other queries can read the partition's directory). You may want to create a follow-up JIRA to make this partition as offline during the move.
          Hide
          He Yongqiang added a comment -

          will update a new patch after 1517.

          Show
          He Yongqiang added a comment - will update a new patch after 1517.
          Hide
          He Yongqiang added a comment -

          A new patch based on trunk and integrated Ning's last comments.

          Show
          He Yongqiang added a comment - A new patch based on trunk and integrated Ning's last comments.
          Hide
          He Yongqiang added a comment -

          fixed a typo, and added some comments to the changes of QTestUtils

          Show
          He Yongqiang added a comment - fixed a typo, and added some comments to the changes of QTestUtils
          Hide
          Ted Yu added a comment -

          This may confuse somebody:

          +      boolean automatic = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_STATS_ATOMIC);
          
          Show
          Ted Yu added a comment - This may confuse somebody: + boolean automatic = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_STATS_ATOMIC);
          Hide
          Ning Zhang added a comment -

          @Ted, thanks. Yongqiang has addressed the typo in his latest patch.

          +1 Will commit if tests pass.

          Show
          Ning Zhang added a comment - @Ted, thanks. Yongqiang has addressed the typo in his latest patch. +1 Will commit if tests pass.
          Hide
          He Yongqiang added a comment -

          it's a typo, and i fixed in the new patch.

          HIVE_STATS_ATOMIC is an existing conf for stats.

          Show
          He Yongqiang added a comment - it's a typo, and i fixed in the new patch. HIVE_STATS_ATOMIC is an existing conf for stats.
          Hide
          Ning Zhang added a comment -

          Committed. Thanks Yongqiang!

          Show
          Ning Zhang added a comment - Committed. Thanks Yongqiang!

            People

            • Assignee:
              He Yongqiang
              Reporter:
              He Yongqiang
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development