  Hadoop HDFS / HDFS-127

DFSClient block read failures cause open DFSInputStream to become unusable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We read some Lucene indexes directly from HDFS, and for quite a long time we ran Hadoop version 0.15.3.

      When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
      2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
      at java.io.DataInputStream.read(DataInputStream.java:132)
      at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
      at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
      at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
      at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
      at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
      at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
      at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
      at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
      at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
      at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
      ...

      The investigation showed that the root cause was that we had exceeded the number of xcievers on the data nodes; that was fixed by raising the configured limit to 2k.
      However, one thing that bothered me was that even after the datanodes had recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the servers that were still running.
      Further investigation showed that the fix for HADOOP-1911 introduced another problem: a DFSInputStream instance can become unusable once the number of failures over the lifetime of that instance exceeds the configured threshold.

      The fix for this specific issue seems to be trivial: just reset the failure counter before reading the next block (a patch will be attached shortly).
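
To illustrate the difference between a lifetime failure counter and one that is reset before each block, here is a minimal sketch. It is not the DFSClient code and not the attached patch: the class `FailureCounterSketch`, the nested `SketchInputStream`, the threshold `MAX_FAILURES = 3`, and the simulated per-block error counts are all hypothetical names and values chosen for illustration.

```java
import java.io.IOException;

// Hypothetical model of a stream that gives up after too many block read failures.
// It only simulates the counting logic described above; it does not talk to HDFS.
class FailureCounterSketch {
    // Assumed threshold, standing in for the client's configured failure limit.
    static final int MAX_FAILURES = 3;

    static class SketchInputStream {
        private final boolean resetPerBlock;
        private int failures = 0;

        SketchInputStream(boolean resetPerBlock) {
            this.resetPerBlock = resetPerBlock;
        }

        // transientErrors: failed attempts before this block read would succeed.
        void readBlock(int transientErrors) throws IOException {
            if (resetPerBlock) {
                failures = 0; // the proposed fix: start each block with a clean count
            }
            for (int i = 0; i < transientErrors; i++) {
                failures++;
                if (failures >= MAX_FAILURES) {
                    throw new IOException("Could not obtain block");
                }
            }
            // Reaching this point means the block was read successfully.
        }
    }

    // Reads every block in turn on one stream and counts how many reads succeed.
    static int countSuccessfulReads(boolean resetPerBlock, int[] errorsPerBlock) {
        SketchInputStream in = new SketchInputStream(resetPerBlock);
        int ok = 0;
        for (int errors : errorsPerBlock) {
            try {
                in.readBlock(errors);
                ok++;
            } catch (IOException e) {
                // Keep using the same stream, as a long-lived index reader would.
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        int[] errorsPerBlock = {2, 2, 0, 1};
        System.out.println(countSuccessfulReads(false, errorsPerBlock)); // prints 2
        System.out.println(countSuccessfulReads(true, errorsPerBlock));  // prints 4
    }
}
```

With the lifetime counter, failures from earlier blocks are never forgotten, so once the threshold is reached a single transient error on any later block fails the read, even though no individual block ever exceeded the limit. With the per-block reset, each block gets the full retry budget.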

      This also seems to be related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.

        Attachments

        1. hdfs-127-regression-test.txt
          3 kB
          Todd Lipcon
        2. hdfs-127-branch20-redone-v2.txt
          13 kB
          Todd Lipcon
        3. hdfs-127-branch20-redone.txt
          13 kB
          Todd Lipcon
        4. h127_20091019b.patch
          0.8 kB
          Tsz Wo Nicholas Sze
        5. h127_20091019.patch
          1 kB
          Tsz Wo Nicholas Sze
        6. h127_20091016.patch
          4 kB
          Tsz Wo Nicholas Sze
        7. 4681.patch
          0.8 kB
          Igor Bolotin

              People

              • Assignee:
                ibolotin Igor Bolotin
                Reporter:
                ibolotin Igor Bolotin
              • Votes:
                3
                Watchers:
                17
