0. This patch does not apply since it has
So all changes intended for the client are considered to be for NNBench.
Has it been manually compiled or something?
1. I like that data-nodes confirm written blocks rather than the client.
I am not sure we are fixing the problem completely here.
If blockReport() happens before blockReceived(), the received block will be removed, won't it?
2. I think we should retain verification of the minimal block replication on the name-node as it was before.
Suppose we are writing to a file for a long time and in the end get a message that the first block was not written properly.
I think the client should rather fail on allocating the second block in that case and retry.
In order to accelerate data-node reporting of received blocks we should move it before the blockReport().
3. Exponential backoff seems a little aggressive. You start with 400 msec sleep and on the last (out of 5)
retry of allocating the next block the client will sleep for 32 seconds.
If the name-node is not busy, this will substantially slow down the process; if the name-node is busy,
the timeouts should already protect it from being overwhelmed. I think we should have more experimental data on
this issue before we apply that approach. This seems like a change of the general strategy which we should
consider for all communications rather than just for one case, and it should belong to a separate issue.
In this particular case the slowdown is not justified since when the data-node returns to the client everything
is successfully replicated, written and confirmed.
4. Local disk is faster than network, so if the disk is full or read-only there is no reason to send data over the wire,
since it will be redistributed again anyway. This again looks like an attempt to optimize, but has little to do
with solving the problem.
5. The default should be optimal for the most common usage scenario, and should be tested well.
10 handlers have handled the traffic well so far. Why change the default?
I don't see enough motivation for changes 2 through 5 yet. They should be discussed in separate issues.
-1 on including 2-5.
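
To make the arithmetic in point 3 concrete, here is a minimal sketch of the sleep schedule. It assumes the delay triples on each retry, which is one schedule consistent with the 400 msec start and the roughly 32 second last sleep quoted above; the multiplier and the class and method names are my assumptions, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: computes a geometric backoff schedule.
final class BackoffSchedule {
    /** Returns the sleep before each retry, in milliseconds. */
    static List<Long> delays(long initialMs, int factor, int retries) {
        List<Long> out = new ArrayList<>();
        long d = initialMs;
        for (int i = 0; i < retries; i++) {
            out.add(d);
            d *= factor; // assumed multiplier; the patch may use a different one
        }
        return out;
    }

    /** Sums the schedule to get the worst-case time spent sleeping. */
    static long total(List<Long> delays) {
        long sum = 0;
        for (long d : delays) {
            sum += d;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> d = delays(400, 3, 5);
        // prints "[400, 1200, 3600, 10800, 32400] total=48400 ms"
        System.out.println(d + " total=" + total(d) + " ms");
    }
}
```

Under that assumption a client that exhausts all five retries sleeps for about 48 seconds in total, which is the slowdown I am worried about when the name-node is not actually busy.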
I ran 1, and it works on my small cluster. But it needs
a) to move blockReceived() before blockReport(), maybe even before sendHeartbeat()
b) to be verified and confirmed with a successful NNBench run.
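
The ordering asked for in a) could look roughly like the sketch below. Everything here is a simplified stand-in: NameNodeStub, DataNodeLoopSketch, onBlockWritten() and runOnce() are hypothetical names, not the real data-node classes, and the real loop does not send a block report on every iteration. The only point is that pending blockReceived() notifications are flushed before sendHeartbeat() and blockReport().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the data-node service loop ordering; not the actual DataNode code.
final class DataNodeLoopSketch {
    /** Hypothetical stand-in for the name-node protocol; not the real interface. */
    interface NameNodeStub {
        void blockReceived(String blockId);
        void sendHeartbeat();
        void blockReport(List<String> allBlocks);
    }

    private final Queue<String> pendingReceived = new ArrayDeque<>();
    private final List<String> storedBlocks = new ArrayList<>();

    /** Called when a block has been fully written locally. */
    void onBlockWritten(String blockId) {
        storedBlocks.add(blockId);
        pendingReceived.add(blockId);
    }

    /** One loop iteration: confirm received blocks first, then heartbeat, then report. */
    void runOnce(NameNodeStub nn) {
        String b;
        while ((b = pendingReceived.poll()) != null) {
            nn.blockReceived(b); // name-node learns about new blocks before any report
        }
        nn.sendHeartbeat();
        nn.blockReport(new ArrayList<>(storedBlocks));
    }

    /** Records the call order against a logging stub. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        NameNodeStub recorder = new NameNodeStub() {
            public void blockReceived(String id) { log.add("blockReceived:" + id); }
            public void sendHeartbeat() { log.add("sendHeartbeat"); }
            public void blockReport(List<String> all) { log.add("blockReport:" + all.size()); }
        };
        DataNodeLoopSketch dn = new DataNodeLoopSketch();
        dn.onBlockWritten("blk_1");
        dn.runOnce(recorder);
        return log;
    }
}
```

With this order the name-node has already been told about a freshly written block by the time the block report arrives, so the report should not find a block the name-node does not yet know about.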