Github user clockfly commented on the pull request:
@miguno, I have several more observations:
1. Network still not efficient enough
As the test report shows, even after this fix the throughput is still bottlenecked by the network (CPU: 72%, network: 45%), since there is still CPU headroom (28%). That is strange, because only 45% of the theoretical network bandwidth is being used.
2. Uneven machine message receive latency
In the experiment, I noticed that there are always some machines whose message receive latency is much higher than the others'. For example, tuples generated on machine A are sent to tasks on machines B, C, and D; in one run, tasks on B take more time to receive messages, while in another run D may be the slowest.
My guess is that some machines have a longer Netty receiver queue than the others, and after some time the queue length on each machine becomes stable but unequal (new input = new output). The latency differs because the queue length differs. Changing max.spout.pending won't improve this, because it only controls the overall number of messages sent from A; it doesn't treat B, C, and D differently.
3. Better max.spout.pending?
I observed that once max.spout.pending is tuned to a large enough value, increasing it further only adds latency without improving throughput: when max.spout.pending doubles, the latency doubles.
Can we do flow control adaptively, so that we stop increasing max.spout.pending when there is no further throughput benefit?
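To make the idea concrete, here is a minimal sketch (not Storm code; the class and method names are made up) of an AIMD-style controller: it keeps probing a higher pending value while measured throughput improves, backs off when throughput regresses, and holds steady at the plateau.

```java
// Hypothetical sketch: adaptively tune max.spout.pending per measurement
// window. Thresholds (5% bands) and step size are illustrative assumptions.
public class AdaptivePending {
    private double lastThroughput = 0.0;
    private int pending;
    private final int step;

    public AdaptivePending(int initialPending, int step) {
        this.pending = initialPending;
        this.step = step;
    }

    // Called once per window with the throughput observed under the current
    // pending value; returns the pending value to use for the next window.
    public int adjust(double throughput) {
        if (throughput > lastThroughput * 1.05) {
            pending += step;                          // still gaining: probe higher
        } else if (throughput < lastThroughput * 0.95) {
            pending = Math.max(step, pending - step); // regressed: back off
        }                                             // else: plateau, hold steady
        lastThroughput = throughput;
        return pending;
    }

    public int current() { return pending; }
}
```

The key point is the plateau branch: once a larger pending window stops buying throughput, the controller stops growing it, which caps the latency cost described above.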
4. Potential deadlock when all intermediate buffers are full
Consider two workers: task1 (workerA) delivers messages to task3 (workerB), and task3 delivers to task2 (workerA). There is a loop! It is possible for all worker sender/receiver buffers to fill up and block.
The current work-around in Storm is tricky: it uses an unbounded receiver buffer (LinkedBlockingQueue) per worker to break the loop. But this is not good, because the receiver buffer can grow very long, and latency can become very high.
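One possible middle ground, sketched below (hypothetical design, not Storm's actual implementation), is a bounded receiver queue with a local overflow buffer: a sender never blocks, which breaks the circular wait, but unlike a fully unbounded LinkedBlockingQueue the bounded part keeps normal-case latency in check and the overflow size is observable for back-pressure decisions.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: non-blocking publish via a bounded queue + overflow.
public class OverflowReceiveQueue<T> {
    private final BlockingQueue<T> bounded;
    private final Queue<T> overflow = new ArrayDeque<>();

    public OverflowReceiveQueue(int capacity) {
        bounded = new ArrayBlockingQueue<>(capacity);
    }

    // Publish never blocks: if the bounded queue is full, spill to overflow
    // instead of making the caller wait (which is what closes the deadlock
    // cycle in the A -> B -> A scenario).
    public synchronized void publish(T msg) {
        drainOverflow();
        if (!bounded.offer(msg)) {
            overflow.add(msg);
        }
    }

    public synchronized T poll() {
        T msg = bounded.poll();
        drainOverflow(); // consumer progress refills from overflow, in order
        return msg;
    }

    private void drainOverflow() {
        while (!overflow.isEmpty() && bounded.offer(overflow.peek())) {
            overflow.remove();
        }
    }

    public synchronized int overflowSize() { return overflow.size(); }
}
```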
5. Is it necessary for each task to have a dedicated send queue thread?
Currently, each task has a dedicated send-queue thread that pushes data to the worker transfer queue. During profiling, these task send-queue threads were usually in the WAITING state. Maybe it would be a good idea to replace the dedicated threads with a shared thread pool?
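A rough sketch of the shared-pool idea (names like SharedSender are made up, not Storm internals): a small fixed pool drains batches from any task and forwards them to the worker transfer queue, so mostly-idle per-task threads are no longer needed.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one shared pool replaces N per-task send threads.
public class SharedSender {
    private final ExecutorService pool;
    private final BlockingQueue<Object> transferQueue;

    public SharedSender(int poolSize, BlockingQueue<Object> transferQueue) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        this.transferQueue = transferQueue;
    }

    // A task submits its pending batch; a pooled worker (not a dedicated
    // per-task thread) pushes the messages to the shared transfer queue.
    public Future<?> send(List<Object> taskBatch) {
        return pool.submit(() -> {
            for (Object msg : taskBatch) {
                transferQueue.put(msg); // blocks only this pooled worker
            }
            return null;
        });
    }

    public void shutdown() { pool.shutdown(); }
}
```

Since the per-task threads are mostly WAITING anyway, a pool sized well below the task count should sustain the same send rate with fewer context switches.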
6. Acker workload very high
In the test, I noticed that the acker tasks are very busy. Since each message is small (100 bytes), there is a huge amount of tuples that need to be acked.
Can this acker cost be reduced?
For example, we could group tuples at the spout into time slices, where every tuple in a slice shares the same root tuple id. Say the time slice is 100ms and contains 10,000 messages, all sharing the same root id; before sending to the acker task, we could first XOR all ack messages with the same root id locally on each worker. That way we might reduce both the acking network traffic and the acker task cost. The drawback is that when a message is lost, we need to replay all messages in that slice.
7. Worker receive thread blocked by task receiver queue
The worker receiver thread publishes messages to each task's receive queue sequentially and in a blocking way. If one task's receiver queue is full, the thread blocks and waits, stalling delivery to every other task on that worker.
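One way to avoid the head-of-line blocking, sketched below (hypothetical design, names are illustrative and not Storm's actual worker internals): use offer() instead of a blocking put(), park undeliverable messages in a per-task backlog, and drain that backlog first on the next delivery to the same task so per-task ordering is preserved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: non-blocking dispatch from the worker receive loop.
public class NonBlockingDispatcher {
    private final List<BlockingQueue<Object>> taskQueues;
    private final List<Queue<Object>> retry;

    public NonBlockingDispatcher(List<BlockingQueue<Object>> taskQueues) {
        this.taskQueues = taskQueues;
        this.retry = new ArrayList<>();
        for (int i = 0; i < taskQueues.size(); i++) {
            retry.add(new ArrayDeque<>());
        }
    }

    // Deliver msg to the given task without ever blocking the receive
    // thread; a full queue only delays that one task's messages.
    public void dispatch(int task, Object msg) {
        Queue<Object> backlog = retry.get(task);
        BlockingQueue<Object> q = taskQueues.get(task);
        // Preserve order: drain this task's backlog before the new message.
        while (!backlog.isEmpty() && q.offer(backlog.peek())) {
            backlog.remove();
        }
        if (backlog.isEmpty() && q.offer(msg)) {
            return;
        }
        backlog.add(msg);
    }

    public int backlogSize(int task) { return retry.get(task).size(); }
}
```

With this shape, one slow task accumulates a visible backlog (which could feed back-pressure metrics) while the other tasks on the worker keep receiving messages.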