Description
While running a test, we found that WebHdfsFileSystem can create several thousand connections when doing positional reads of a 200MB file. For each connection, the client connects to the DataNode again and the DataNode creates a new DFSClient instance to handle the read request. This also leads to several thousand getBlockLocations calls to the NameNode.
The cause is in FSInputStream#read(long, byte[], int, int): each time the input stream reads some data, it seeks back to the old position, which resets its state to SEEK. The next read therefore has to re-establish the connection.
public int read(long position, byte[] buffer, int offset, int length)
    throws IOException {
  synchronized (this) {
    long oldPos = getPos();
    int nread = -1;
    try {
      seek(position);
      nread = read(buffer, offset, length);
    } finally {
      seek(oldPos);
    }
    return nread;
  }
}
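To illustrate the effect, here is a minimal, self-contained sketch. It is not WebHDFS code: `PreadReconnectDemo`, `LazyStream`, and `countConnections` are hypothetical names, and the stream assumes, as a simplification of the behavior described above, that every `seek` drops the connection and that the next sequential read reopens it. The positional read has the same shape as the FSInputStream method.

```java
class PreadReconnectDemo {

    /** Toy stream (hypothetical): connects lazily, and any seek drops the connection. */
    static class LazyStream {
        private final byte[] data;
        private long pos = 0;
        private boolean connected = false;
        int connections = 0; // how many times a "connection" was opened

        LazyStream(byte[] data) {
            this.data = data;
        }

        long getPos() {
            return pos;
        }

        void seek(long newPos) {
            pos = newPos;
            connected = false; // assumption: stands in for resetting the state to SEEK
        }

        /** Sequential read; opens a new connection if the previous one was dropped. */
        int read(byte[] buf, int off, int len) {
            if (pos >= data.length) {
                return -1; // EOF, no connection needed
            }
            if (!connected) {
                connections++;
                connected = true;
            }
            int n = Math.min(len, (int) (data.length - pos));
            System.arraycopy(data, (int) pos, buf, off, n);
            pos += n;
            return n;
        }

        /** Positional read with the same seek / read / seek-back shape as FSInputStream. */
        synchronized int read(long position, byte[] buf, int off, int len) {
            long oldPos = getPos();
            int nread = -1;
            try {
                seek(position);
                nread = read(buf, off, len);
            } finally {
                seek(oldPos); // drops the connection again
            }
            return nread;
        }
    }

    /** Reads the whole "file" with positional reads and returns the connection count. */
    static int countConnections(int fileSize, int chunkSize) {
        LazyStream in = new LazyStream(new byte[fileSize]);
        byte[] buf = new byte[chunkSize];
        long position = 0;
        int n;
        while ((n = in.read(position, buf, 0, buf.length)) > 0) {
            position += n;
        }
        return in.connections;
    }

    public static void main(String[] args) {
        // One connection per positional read: 1 MiB in 4 KiB chunks gives 256.
        System.out.println(countConnections(1 << 20, 4096));
    }
}
```

In this model the connection count equals the number of positional reads that return data, which matches the pattern reported above: a large file read in small chunks produces one reconnect per call.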