  1. Hadoop HDFS
  2. HDFS-10688

BPServiceActor may run into a tight loop for sending block report when hitting IOException

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Currently, in BPServiceActor#offerService, when the DataNode hits a local IOException (as opposed to a RemoteException), it only logs the exception and immediately re-enters the while loop without sleeping:

            } catch(RemoteException re) {
              .......
              LOG.warn("RemoteException in offerService", re);
              try {
                long sleepTime = Math.min(1000, dnConf.heartBeatInterval);
                Thread.sleep(sleepTime);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
              }
            } catch (IOException e) {
              LOG.warn("IOException in offerService", e);
            }
      

      This tight loop can cause problems. For example, in a production cluster we saw a DataNode hit an exception while doing a Kerberos realm lookup, and the resulting tight loop caused the DataNode to send hundreds of DNS lookup queries per second.
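One way to avoid the tight loop is to apply the same capped sleep that the RemoteException branch already uses to the IOException branch as well. The following is a minimal, self-contained sketch of that idea, not the attached patch: the class name, the `Iteration` interface, and the `runLoop`, `sleepTimeMs`, and `sleepAfterException` helpers are all hypothetical stand-ins, and only the `Math.min(1000, heartBeatInterval)` cap is taken from the snippet above.

```java
import java.io.IOException;

public class OfferServiceBackoffSketch {

    /** Hypothetical stand-in for one pass through the offerService loop body. */
    interface Iteration {
        void run() throws IOException;
    }

    /** Same cap as the RemoteException branch: at most one second. */
    static long sleepTimeMs(long heartBeatIntervalMs) {
        return Math.min(1000, heartBeatIntervalMs);
    }

    /** Sleeps after a failure, restoring the interrupt flag if interrupted. */
    static void sleepAfterException(long heartBeatIntervalMs) {
        try {
            Thread.sleep(sleepTimeMs(heartBeatIntervalMs));
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Runs the body up to maxIterations times and returns the number of
     * IOExceptions seen. Each failure is followed by a sleep, so a
     * persistently failing body cannot spin at full speed.
     */
    static int runLoop(Iteration body, int maxIterations, long heartBeatIntervalMs) {
        int failures = 0;
        for (int i = 0; i < maxIterations
                && !Thread.currentThread().isInterrupted(); i++) {
            try {
                body.run();
            } catch (IOException e) {
                failures++;
                System.err.println("IOException in offerService: " + e);
                sleepAfterException(heartBeatIntervalMs);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Simulate a body that always fails, e.g. a failing realm lookup.
        int failures = runLoop(() -> {
            throw new IOException("simulated lookup failure");
        }, 3, 50);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(failures + " failures in " + elapsedMs + " ms");
    }
}
```

With the sleep in place, a body that fails on every pass is retried at most about once per `sleepTimeMs` interval, which bounds the rate of any side effects such as DNS lookups.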

      Attachments

        1. HDFS-10688.002.patch
          1 kB
          Chen Liang
        2. HDFS-10688.001.patch
          1 kB
          Chen Liang

          People

            Assignee: vagarychen Chen Liang
            Reporter: jingzhao Jing Zhao
            Votes: 0
            Watchers: 10
