Hadoop HDFS
HDFS-17553

DFSOutputStream.java#closeImpl should have configurable retries upon flushInternal failures


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1, 3.4.0
    • Fix Version/s: None
    • Component/s: dfsclient
    • Labels: None

    Description

      HDFS-15865 introduced an interrupt in the DataStreamer class to break out of
      the waitForAckedSeqno call when the timeout is exceeded, which throws an
      InterruptedIOException. That method is used by DFSOutputStream.java#flushInternal,
      one of whose callers is DFSOutputStream.java#closeImpl when closing a file.
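
      The failure mode can be reproduced outside Hadoop: a thread blocked
      waiting for acks, when interrupted by a streamer-side timeout, rethrows
      the interrupt as an InterruptedIOException. The standalone sketch below
      mirrors the shape of waitForAckedSeqno; class name, field names, and
      messages are illustrative, not copied from the Hadoop source.

{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

/** Standalone sketch (not Hadoop code) of the ack-wait interrupt path. */
public class AckWaitInterruptDemo {
    private final Object ackQueueLock = new Object();
    private long lastAckedSeqno = -1;

    /** Blocks until the given seqno is acked, like waitForAckedSeqno. */
    void waitForAckedSeqno(long seqno) throws IOException {
        synchronized (ackQueueLock) {
            while (lastAckedSeqno < seqno) {
                try {
                    ackQueueLock.wait();
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    // This is the exception that escapes flushInternal()
                    // and, from there, closeImpl().
                    throw new InterruptedIOException(
                        "Interrupted while waiting for acks");
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AckWaitInterruptDemo demo = new AckWaitInterruptDemo();
        Thread closer = new Thread(() -> {
            try {
                demo.waitForAckedSeqno(0); // acks never arrive
            } catch (IOException e) {
                System.out.println("closeImpl would see: " + e);
            }
        });
        closer.start();
        Thread.sleep(100);
        closer.interrupt(); // plays the role of the HDFS-15865 timeout
        closer.join();
    }
}
{code}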

      What we saw was that InterruptedIOExceptions were thrown during the flushInternal call while we were closing out a file; DFSClient did not handle them, and they propagated to the caller. There is a known issue, HDFS-4504, where a file that fails to close on the HDFS side does not get block recovery invoked, and the lease leaks until the DFSClient is recycled. In our HBase setups, DFSClients are long-lived in regionservers, which means these files stay stuck open until the corresponding regionservers are restarted.

      This issue surfaced during datanode decommission, which got stuck on the open files caused by the leakage described above. Since it is desirable to close an HDFS file as smoothly as possible, retrying flushInternal during closeImpl would help reduce such leakage. The number of retries could be driven by a dfsclient config. For example:
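
      A minimal, standalone sketch of the proposed retry follows, assuming a
      hypothetical config key dfs.client.close.flush.retries (not an existing
      Hadoop property; the real key name and default would be settled in the
      patch). The flushInternal stand-in simulates two interrupted attempts
      before succeeding:

{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;

/** Standalone sketch of retrying flushInternal before closeImpl gives up. */
public class CloseRetrySketch {
    private int failuresLeft = 2; // simulate two interrupted flushes

    /** Stand-in for DFSOutputStream#flushInternal. */
    private void flushInternal() throws IOException {
        if (failuresLeft-- > 0) {
            throw new InterruptedIOException("ack wait interrupted");
        }
    }

    /** What closeImpl could call instead of a bare flushInternal(). */
    void flushInternalWithRetries(int maxRetries) throws IOException {
        InterruptedIOException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                flushInternal();
                return; // acks received; proceed to complete the file
            } catch (InterruptedIOException e) {
                lastFailure = e;
                Thread.interrupted(); // clear the flag before retrying
                System.err.printf("flush attempt %d/%d failed: %s%n",
                    attempt + 1, maxRetries + 1, e.getMessage());
            }
        }
        throw lastFailure; // exhausted retries; closeImpl fails as today
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical: conf.getInt("dfs.client.close.flush.retries", 3)
        new CloseRetrySketch().flushInternalWithRetries(3);
        System.out.println("file closed after retries");
    }
}
{code}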
       



    People

    • Assignee: Unassigned
    • Reporter: Zinan Zhuang
    • Votes: 0
    • Watchers: 4
