Hive / HIVE-22373

File Merge tasks fail when containers are reused

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.2
    • Fix Version/s: 4.0.0
    • Component/s: None
    • Labels: None

      Description

      Problems

      Setting tez.am.container.reuse.enabled=true allows a container to be reused across multiple tasks.
      When two File Merge tasks run in the same container, the later task fails while renaming its output file.

      Below is the error log of task 000001_0, which ran in container container_e87_1570604853053_11564_01_000003 after task 000004_0 had run there.
      It shows that the output file name of task 000001_0 was mistakenly derived from the previous task ID, 000004_0.

      2019-10-15 13:00:31,438 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: Hive Runtime Error while closing operators
      	at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:188)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:284)
      	at org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
      	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to close AbstractFileMergeOperator
      	at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:315)
      	at org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.closeOp(OrcFileMergeOperator.java:265)
      	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:733)
      	at org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.close(MergeFileRecordProcessor.java:180)
      	... 17 more
      Caused by: java.io.IOException: Unable to rename viewfs://<cluster_name>/user/<user_name>/.hive-staging_hive_2019-10-15_12-59-32_916_2461818728035733124-15476/_task_tmp.-ext-10000/_tmp.000004_0 to viewfs://<cluster_name>/user/<user_name>/.hive-staging_hive_2019-10-15_12-59-32_916_2461818728035733124-15476/_tmp.-ext-10000/000004_0
      	at org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.closeOp(AbstractFileMergeOperator.java:254)
      	... 20 more
      

      Causes

      When AbstractFileMergeOperator is initialized, taskId is assigned only the first time, while it is still null. On later initializations the existing value is kept.

      • AbstractFileMergeOperator.java
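        private void updatePaths(Path tp, Path ttp) {
          // taskId is assigned only while it is still null, so a previously set value is never refreshed.
          if (taskId == null) {
            taskId = Utilities.getTaskId(jc);
          }
          ...
        }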

      As a result, a task that runs in a reused container builds its output file name from the previous task's ID, which leads to the rename failure shown above.
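
      The effect of the guarded assignment can be illustrated with a minimal, self-contained sketch. It is not Hive code: MergeOperatorSketch, confFor, and the "task.id" key are hypothetical stand-ins, and the sketch assumes that the operator's state survives between two tasks in the same container.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleTaskIdDemo {

    /** Hypothetical stand-in for the merge operator; not Hive's real class. */
    public static class MergeOperatorSketch {
        // Assumption: this field survives when the container is reused.
        private String taskId;

        /** Mirrors the guarded assignment: taskId is set only while it is null. */
        public String initialize(Map<String, String> jobConf) {
            if (taskId == null) {
                taskId = jobConf.get("task.id"); // hypothetical configuration key
            }
            return "_tmp." + taskId;
        }
    }

    /** Builds a tiny stand-in for a per-task JobConf. */
    public static Map<String, String> confFor(String taskId) {
        Map<String, String> conf = new HashMap<>();
        conf.put("task.id", taskId);
        return conf;
    }

    public static void main(String[] args) {
        MergeOperatorSketch op = new MergeOperatorSketch();
        // First task in the container: the name matches its own id.
        System.out.println(op.initialize(confFor("000004_0"))); // prints "_tmp.000004_0"
        // Second task in the same container: the stale id is kept.
        System.out.println(op.initialize(confFor("000001_0"))); // prints "_tmp.000004_0"
    }
}
```

      Both calls yield the same file name, which matches the log above, where task 000001_0 tries to rename _tmp.000004_0.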

      Solutions

      Remove the null check, which was introduced in HIVE-14640, so that taskId is refreshed from the JobConf every time the operator is initialized.
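
      A minimal sketch of the proposed behavior follows. It does not reproduce the attached patch: MergeOperatorSketch, confFor, and the "task.id" key are hypothetical stand-ins, used only to show what an unconditional assignment changes when the same instance is initialized twice.

```java
import java.util.HashMap;
import java.util.Map;

public class RefreshTaskIdDemo {

    /** Hypothetical stand-in for the merge operator; not Hive's real class. */
    public static class MergeOperatorSketch {
        private String taskId;

        /** No null check: taskId is re-read from the configuration on every call. */
        public String initialize(Map<String, String> jobConf) {
            taskId = jobConf.get("task.id"); // hypothetical configuration key
            return "_tmp." + taskId;
        }
    }

    /** Builds a tiny stand-in for a per-task JobConf. */
    public static Map<String, String> confFor(String taskId) {
        Map<String, String> conf = new HashMap<>();
        conf.put("task.id", taskId);
        return conf;
    }

    public static void main(String[] args) {
        MergeOperatorSketch op = new MergeOperatorSketch();
        System.out.println(op.initialize(confFor("000004_0"))); // prints "_tmp.000004_0"
        // The second task now gets a name based on its own id.
        System.out.println(op.initialize(confFor("000001_0"))); // prints "_tmp.000001_0"
    }
}
```

      With the assignment made unconditional, each initialization produces a file name based on the current task's ID, so two tasks in one container no longer collide on the same output path.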

        Attachments

        1. HIVE-22373.patch
          0.7 kB
          Toshihiko Uchida

            People

            • Assignee: Toshihiko Uchida (touchida)
            • Reporter: Toshihiko Uchida (touchida)