  1. Hadoop HDFS
  2. HDFS-15798

EC: When a reconstruction task fails, XmitsInProgress of the DN can become negative

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.1, 3.4.0, 3.2.3
    • Component/s: None
    • Labels: None

      Description

      When an EC reconstruction task fails, the decrementXmitsInProgress call in processErasureCodingTasks is made with an abnormal value.
      As a result, the XmitsInProgress of the DN can become negative. This affects how the NN chooses pending tasks, because it does so based on the ratio between the lengths of the replication and erasure-coded block queues.
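
      The symptom can be illustrated with a minimal, self-contained sketch. It is not the DataNode code: the class XmitsAccountingSketch and its methods are hypothetical stand-ins, and the double-decrement path is an assumption chosen only to show how unbalanced accounting pushes a counter below zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the DataNode's xmitsInProgress accounting.
final class XmitsAccountingSketch {
    private static final AtomicInteger xmitsInProgress = new AtomicInteger();

    static void increment(int n) { xmitsInProgress.getAndAdd(n); }
    static void decrement(int n) { xmitsInProgress.getAndAdd(-n); }

    // Balanced case: one increment at submit time, one decrement at completion.
    static int balanced(int xmits) {
        xmitsInProgress.set(0);
        increment(xmits);   // task submitted
        decrement(xmits);   // task finished
        return xmitsInProgress.get();
    }

    // Unbalanced case (assumed for illustration): the same task is
    // decremented twice, once on completion and once on a failure path.
    static int doubleDecrement(int xmits) {
        xmitsInProgress.set(0);
        increment(xmits);   // task submitted
        decrement(xmits);   // task finished
        decrement(xmits);   // extra decrement on the failure path
        return xmitsInProgress.get();
    }

    public static void main(String[] args) {
        System.out.println("balanced:        " + balanced(6));
        System.out.println("doubleDecrement: " + doubleDecrement(6));
    }
}
```

      In the balanced case the counter returns to 0. In the unbalanced case it ends at -6, which is the kind of negative value described above.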

      // 1.ErasureCodingWorker.java
      
      public void processErasureCodingTasks(
          Collection<BlockECReconstructionInfo> ecTasks) {
        for (BlockECReconstructionInfo reconInfo : ecTasks) {
          int xmitsSubmitted = 0;
          try {
            ...
            // It may throw IllegalArgumentException from task#stripedReader
            // constructor.
            final StripedBlockReconstructor task =
                new StripedBlockReconstructor(this, stripedReconInfo);
            if (task.hasValidTargets()) {
              // See HDFS-12044. We increase xmitsInProgress even the task is only
              // enqueued, so that
              //   1) NN will not send more tasks than what DN can execute and
              //   2) DN will not throw away reconstruction tasks, and instead keeps
              //      an unbounded number of tasks in the executor's task queue.
              xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
              getDatanode().incrementXmitsInProcess(xmitsSubmitted); // incremented when the task is submitted
              stripedReconstructionPool.submit(task);
            } else {
              LOG.warn("No missing internal block. Skip reconstruction for task:{}",
                  reconInfo);
            }
          } catch (Throwable e) {
            // On failure, XmitsInProgress is decremented by the previously
            // set xmitsSubmitted value.
            getDatanode().decrementXmitsInProgress(xmitsSubmitted);
            LOG.warn("Failed to reconstruct striped block {}",
                reconInfo.getExtendedBlock().getLocalBlock(), e);
          }
        }
      }
      
      
      // 2.StripedBlockReconstructor.java
      public void run() {
        try {
          initDecoderIfNecessary();
          ...
        } catch (Throwable e) {
          LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
          getDatanode().getMetrics().incrECFailedReconstructionTasks();
        } finally {
          float xmitWeight = getErasureCodingWorker().getXmitWeight();
          // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
          // because if it set to zero, we cannot to measure the xmits submitted
          int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
          getDatanode().decrementXmitsInProgress(xmitsSubmitted); // decremented when the task completes
          ...
        }
      }
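
      Both excerpts compute the amount with the same expression, Math.max((int) (xmits * xmitWeight), 1). The sketch below works through that arithmetic in isolation. The class XmitsWeightSketch and the sample inputs are hypothetical and are not taken from the Hadoop source.

```java
// Hypothetical helper that isolates the xmitsSubmitted arithmetic
// used in the two excerpts above.
final class XmitsWeightSketch {

    // Scale the raw xmits by the weight, truncate to int, and never
    // return less than 1 so that every task is counted.
    static int xmitsSubmitted(int xmits, float xmitWeight) {
        return Math.max((int) (xmits * xmitWeight), 1);
    }

    public static void main(String[] args) {
        System.out.println(xmitsSubmitted(6, 0.5f)); // 6 * 0.5 = 3.0 -> 3
        System.out.println(xmitsSubmitted(9, 0.5f)); // 9 * 0.5 = 4.5 -> truncated to 4
        System.out.println(xmitsSubmitted(1, 0.5f)); // 0.5 -> truncated to 0 -> floored to 1
        System.out.println(xmitsSubmitted(6, 0.0f)); // 0.0 -> floored to 1
    }
}
```

      Because of the floor at 1, even a task with a zero weight contributes to the counter, so the counter stays balanced only if every increment is matched by exactly one decrement of the same amount.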

        Attachments

        1. HDFS-15798.003.patch
          2 kB
          Haiyang Hu
        2. HDFS-15798.002.patch
          3 kB
          Haiyang Hu
        3. HDFS-15798.001.patch
          2 kB
          Haiyang Hu

            People

            • Assignee: Haiyang Hu
            • Reporter: Haiyang Hu
            • Votes: 0
            • Watchers: 6
