Hadoop HDFS / HDFS-127

DFSClient block read failures cause open DFSInputStream to become unusable


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: hdfs-client
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We use some Lucene indexes directly from HDFS, and for quite a long time we were running Hadoop version 0.15.3.

      When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
      2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
      at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
      at java.io.DataInputStream.read(DataInputStream.java:132)
      at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
      at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
      at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
      at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
      at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
      at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
      at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
      at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
      at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
      at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
      ...

      The investigation showed that the root cause was that we exceeded the number of xcievers on the data nodes; that was fixed by raising the configured limit to 2k.
      However, one thing that bothered me was that even after the datanodes recovered from the overload and most of the client servers had been shut down, we still observed errors in the logs of the running servers.
      Further investigation showed that the fix for HADOOP-1911 introduced another problem: a DFSInputStream instance can become unusable once the number of failures over the lifetime of that instance exceeds the configured threshold.
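
      To make the failure mode concrete, here is a minimal sketch of the pattern (not the actual DFSClient source; the field names, the threshold value and the method shapes are assumptions on my part):

          // Simplified illustration of the buggy behavior in DFSInputStream.
          class SketchInputStream {
              // assumed default for dfs.client.max.block.acquire.failures
              private final int maxBlockAcquireFailures = 3;
              // counts failures over the whole lifetime of the open stream
              private int failures = 0;

              void chooseDataNode() throws java.io.IOException {
                  if (failures >= maxBlockAcquireFailures) {
                      // once this is reached, every later read on this open stream
                      // fails, even after the datanodes have fully recovered
                      throw new java.io.IOException("Could not obtain block: ...");
                  }
                  // ... try a datanode; on error increment failures and retry ...
              }
          }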

      The fix for this specific issue seems trivial: just reset the failure counter before reading the next block (a patch will be attached shortly).
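
      In sketch form, the change is just to clear that counter whenever the stream moves on to a new block (again, illustrative names, not the exact patch):

          // Sketch of the proposed fix inside the same illustrative class:
          // clear the per-stream counter before each new block is read, so a
          // burst of old failures cannot permanently poison the open stream.
          void blockSeekTo(long target) throws java.io.IOException {
              failures = 0;        // reset before attempting the next block
              chooseDataNode();    // retries within this block still honor the threshold
              // ... connect to the chosen datanode and read the block ...
          }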

      This also seems related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.

      Attachments

        1. 4681.patch (0.8 kB) by Igor Bolotin
        2. h127_20091016.patch (4 kB) by Tsz-wo Sze
        3. h127_20091019.patch (1 kB) by Tsz-wo Sze
        4. h127_20091019b.patch (0.8 kB) by Tsz-wo Sze
        5. hdfs-127-branch20-redone.txt (13 kB) by Todd Lipcon
        6. hdfs-127-branch20-redone-v2.txt (13 kB) by Todd Lipcon
        7. hdfs-127-regression-test.txt (3 kB) by Todd Lipcon

            People

              Assignee: Igor Bolotin
              Reporter: Igor Bolotin
              Votes: 3
              Watchers: 17
