
HIVE-6309: Hive incorrectly removes TaskAttempt output files if MRAppMaster fails once

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None
    • Environment: hadoop 2.2

    Description

      We recently upgraded to Hadoop 2.2 and sometimes found that some tables had lost several data files after a midnight ETL process. The MapReduce jobs that produced these partial tables had one thing in common: their MRAppMaster had failed once, and each affected table was left with only a single data file, 000000_1000.

      The following lines in hive.log gave us some clues about what happened to the incorrectly deleted data files:

      $ grep 'hive_2014-01-24_12-33-18_507_6790415670781610350' hive.log
      2014-01-24 12:52:43,140 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000001_1000 with length 824627293. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
      2014-01-24 12:52:43,142 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000002_1000 with length 824681826. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
      2014-01-24 12:52:43,149 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000003_1000 with length 824830450. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
      2014-01-24 12:52:43,151 WARN  exec.Utilities (Utilities.java:removeTempOrDuplicateFiles(1535)) - Duplicate taskid file removed: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000004_1000 with length 824753882. Existing file: hdfs://hadoop00.lf.sankuai.com:9000/tmp/hive-scratch/hive-sankuai/hive_2014-01-24_12-33-18_507_6790415670781610350/_tmp.-ext-10000.intermediate/000000_1000 with length 824860643
      

      We found that this happens because nextAttemptNumber in Hadoop 2.2 can be 1000 or greater after an MRAppMaster failure, while Hive does not correctly extract the task id from such file names. See the following code in org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
      and ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:

      // org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.java
          // All the new TaskAttemptIDs are generated based on MR
          // ApplicationAttemptID so that attempts from previous lives don't
          // over-step the current one. This assumes that a task won't have more
          // than 1000 attempts in its single generation, which is very reasonable.
          nextAttemptNumber = (appAttemptId - 1) * 1000;
      
      // ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
         /**
          * The first group will contain the task id. The second group is the optional extension. The file
          * name looks like: "0_0" or "0_0.gz". There may be a leading prefix (tmp_). Since getTaskId() can
          * return an integer only - this should match a pure integer as well. {1,3} is used to limit
          * matching for attempts #'s 0-999.
          */
         private static final Pattern FILE_NAME_TO_TASK_ID_REGEX =
             Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");
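
      To make the numbering concrete, here is a small stand-alone sketch (the AttemptNumberDemo class is ours for illustration, not Hive or Hadoop code; the file-name format is assumed from the <taskId>_<attemptNumber> names seen in the log above) showing how a single MRAppMaster failure pushes every subsequent attempt number to 1000:

      public class AttemptNumberDemo {
          public static void main(String[] args) {
              // First AM attempt: appAttemptId = 1, so task attempts are numbered from 0
              // and produce files such as 000000_0.
              // After one MRAppMaster failure the job restarts under appAttemptId = 2,
              // and TaskImpl seeds the counter with (appAttemptId - 1) * 1000 = 1000.
              int appAttemptId = 2;
              int nextAttemptNumber = (appAttemptId - 1) * 1000;

              // Task 0 of the restarted job then writes 000000_1000, task 1 writes
              // 000001_1000, and so on, matching the names in the warnings above.
              for (int taskId = 0; taskId < 3; taskId++) {
                  System.out.println(String.format("%06d_%d", taskId, nextAttemptNumber));
              }
          }
      }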
      

      As the reproduction below shows, for attempt numbers of 1000 or more the pattern extracts the attempt number instead of the task id:

      >>> re.match("^.*?([0-9]+)(_[0-9])?(\\..*)?$", 'part-r-000000_2').group(1)
      '000000'
      >>> re.match("^.*?([0-9]+)(_[0-9])?(\\..*)?$", 'part-r-000000_1001').group(1)
      '1001'
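
      To double-check this against the exact pattern quoted from Utilities.java, here is a small stand-alone Java sketch (the TaskIdRegexDemo class, the taskId helper, and the WIDENED pattern are ours for illustration only; WIDENED merely sketches the idea of the fix and is not necessarily what the committed HIVE-6309.patch does):

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class TaskIdRegexDemo {
          // The pattern quoted above: the attempt suffix may be at most 3 digits long.
          private static final Pattern CURRENT =
              Pattern.compile("^.*?([0-9]+)(_[0-9]{1,3})?(\\..*)?$");

          // Hypothetical widened pattern: allow an attempt suffix of any length so
          // that "_1000" is never mistaken for the task id.
          private static final Pattern WIDENED =
              Pattern.compile("^.*?([0-9]+)(_[0-9]+)?(\\..*)?$");

          private static String taskId(Pattern p, String fileName) {
              Matcher m = p.matcher(fileName);
              return m.matches() ? m.group(1) : null;
          }

          public static void main(String[] args) {
              System.out.println(taskId(CURRENT, "000000_2"));     // 000000 (correct)
              System.out.println(taskId(CURRENT, "000000_1000"));  // 1000   (wrong: the attempt number)
              System.out.println(taskId(WIDENED, "000000_2"));     // 000000
              System.out.println(taskId(WIDENED, "000000_1000"));  // 000000
          }
      }

      With the 1-to-3 digit limit, every file produced after an MRAppMaster restart parses to task id "1000", so removeTempOrDuplicateFiles treats each task's output as a duplicate of the first one and deletes all but a single file, which is exactly the data loss described above.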
      

    Issue Links

    • This issue is related to HIVE-2309

    Activity

          Ashutosh Chauhan made changes -
          Status: Patch Available → Resolved
          Resolution: Fixed
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Chun!

          Ashutosh Chauhan added a comment -

          +1

          Chun Chen added a comment -

          I don't think the failed tests are related. Review https://reviews.apache.org/r/17377/

          Chun Chen made changes -
          Description edited
          Chun Chen made changes -
          Description edited
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12625277/HIVE-6309.patch

          ERROR: -1 due to 3 failed/errored test(s), 4958 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_import_exported_table
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_load_hdfs_file_with_space_in_the_name
          org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_file_with_header_footer_negative
          

          Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1034/testReport
          Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1034/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 3 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12625277

          Chun Chen made changes -
          Link: This issue is related to HIVE-2309
          Chun Chen made changes -
          Status: Open → Patch Available
          Chun Chen made changes -
          Description edited
          Chun Chen made changes -
          Attachment: HIVE-6309.patch [ 12625277 ]
          Chun Chen made changes -
          Description edited
          Chun Chen created issue -

    People

    • Assignee: Chun Chen
    • Reporter: Chun Chen
    • Votes: 0
    • Watchers: 3

    Dates

    • Created:
    • Updated:
    • Resolved:
