Hadoop HDFS / HDFS-15588

Arbitrarily low values for `dfs.block.access.token.lifetime` aren't safe and can cause a healthy datanode to be excluded


    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs, hdfs-client, security
    • Labels:
      None

      Description

      Problem:
      Setting `dfs.block.access.token.lifetime` to an arbitrarily low value (such as 1) makes the lifetime of a block token very short; as a result, a healthy datanode can be wrongly excluded by the client due to an `InvalidBlockTokenException`.

      More specifically, in `nextBlockOutputStream` the client obtains an `accessToken` from the namenode and uses it to talk to a datanode. The lifetime of the `accessToken` can be set very low (e.g., 1 minute) via `dfs.block.access.token.lifetime`. Under some extreme conditions (a VM migration, a temporary network issue, or a stop-the-world GC pause), the `accessToken` can expire by the time the client tries to use it to talk to the datanode. If it has expired, `createBlockOutputStream` returns false (masking the `InvalidBlockTokenException`), so the client concludes the datanode is unhealthy, marks it as "excluded", and will never read from or write to it again.

      Related code in `nextBlockOutputStream`:

      // Connect to first DataNode in the list.
      success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
          0L, false);
      
      if (!success) {
        LOG.warn("Abandoning " + block);
        dfsClient.namenode.abandonBlock(block.getCurrentBlock(),
            stat.getFileId(), src, dfsClient.clientName);
        block.setCurrentBlock(null);
        final DatanodeInfo badNode = nodes[errorState.getBadNodeIndex()];
        LOG.warn("Excluding datanode " + badNode);
        excludedNodes.put(badNode, badNode);
      }
      


      Proposed solution:
      A simple retry on the same datanode after catching `InvalidBlockTokenException` solves this problem (assuming such extreme conditions do not happen often). Since `dfs.block.access.token.lifetime` currently even accepts values like 0, we could also prevent users from setting it too low (e.g., enforce a minimum of 5 minutes for this parameter).
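The minimum-value enforcement could look like the following sketch. This is illustrative only, not actual HDFS code: the class and method names are made up, and the 5-minute floor is the value suggested in this issue, not current behavior.

```java
// Illustrative sketch only -- not actual HDFS code.
// Clamp dfs.block.access.token.lifetime (in minutes) to a safe floor,
// so a misconfigured cluster cannot hand out tokens that expire almost
// immediately.
public class TokenLifetimeFloor {
  // 5 minutes is the floor suggested in HDFS-15588 (an assumption here).
  static final long MIN_TOKEN_LIFETIME_MINUTES = 5;

  // Returns the configured lifetime, clamped to the floor. Clamping (rather
  // than failing startup) keeps existing configurations working.
  static long sanitizeTokenLifetime(long configuredMinutes) {
    if (configuredMinutes < MIN_TOKEN_LIFETIME_MINUTES) {
      return MIN_TOKEN_LIFETIME_MINUTES;
    }
    return configuredMinutes;
  }

  public static void main(String[] args) {
    System.out.println(sanitizeTokenLifetime(1));   // clamped to 5
    System.out.println(sanitizeTokenLifetime(600)); // unchanged
  }
}
```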

      We have submitted a patch that retries after catching `InvalidBlockTokenException` in `nextBlockOutputStream`. We can also provide a patch enforcing a larger minimum value for `dfs.block.access.token.lifetime` if that is considered the better way to handle this.
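The retry idea can be sketched in a self-contained form as below. This is not the attached patch: `FlakyTokenClient` and `TokenExpiredException` are stand-ins for the real datanode connection and `InvalidBlockTokenException`, and the retry bound is an assumption. The point is only that one retry with a fresh token distinguishes an expired token from a genuinely bad datanode.

```java
import java.io.IOException;

// Minimal, self-contained sketch of the retry idea (not the actual patch).
public class RetryOnExpiredToken {
  // Stand-in for org.apache.hadoop.security.token.InvalidBlockTokenException.
  static class TokenExpiredException extends IOException {}

  // Stand-in for the datanode connection: the first attempt fails because the
  // token has expired; a retry (with a refetched token) succeeds.
  static class FlakyTokenClient {
    private int attempts = 0;
    boolean createBlockOutputStream() throws TokenExpiredException {
      if (attempts++ == 0) {
        throw new TokenExpiredException(); // token expired on first use
      }
      return true; // fresh token accepted
    }
  }

  // Retry on the same datanode instead of immediately excluding it.
  static boolean writeWithRetry(FlakyTokenClient client, int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return client.createBlockOutputStream();
      } catch (TokenExpiredException e) {
        if (attempt == maxRetries) {
          return false; // give up; only now may the caller exclude the node
        }
        // In the real client this is where a fresh access token would be
        // refetched from the namenode before retrying.
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(writeWithRetry(new FlakyTokenClient(), 1)); // true
  }
}
```

With `maxRetries = 0` (today's effective behavior) the same client returns false and the healthy node would be excluded; a single retry is enough when the only problem was an expired token.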

        Attachments

        1. HDFS-15588-002.patch
          3 kB
          sr2020
        2. HDFS-15588-001.patch
          3 kB
          sr2020

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              sr2020 sr2020
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated: