While inserting a large data set into Kudu from Impala, JD and I observed the following issue: at some point the writes appear to be throttled, and a write either times out or is rejected. The C++ client then retries the operation. However, by this point the original write has already succeeded, so the retried write fails with "Row already exists in MemRowSet".
This behavior is very unfortunate, since Impala will believe that the data is corrupt even though the actual error is deeper in the communication between the client and Kudu.
I think we will need some additional information to track whether a timed-out or rejected write op will still be processed in Kudu even though the client is forced to retry.
This is critical because an insert will look as if it inserts the same row twice and aborts, even though the row was inserted only once, by the original attempt. This leaves the system in an inconsistent state from Impala's perspective.
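To make the failure mode concrete, here is a minimal toy model in Python. It is only a sketch, not Kudu or Impala code: `ToyTabletServer`, `request_id`, and `track_requests` are hypothetical names invented for illustration. It assumes the first attempt is applied on the server but its response is lost, so the client retries. Without any request tracking the retry hits the duplicate-key error; with a remembered request ID the server could instead return the original result.

```python
class AlreadyPresent(Exception):
    """Raised when a row with the same key already exists."""


class ToyTabletServer:
    """Hypothetical in-memory stand-in for a server; not Kudu's API."""

    def __init__(self, track_requests):
        self.rows = {}        # key -> value
        self.completed = {}   # request_id -> result of the first attempt
        self.track_requests = track_requests

    def insert(self, request_id, key, value):
        # If tracking is on and this request was already applied,
        # replay the stored result instead of applying it again.
        if self.track_requests and request_id in self.completed:
            return self.completed[request_id]
        if key in self.rows:
            raise AlreadyPresent(key)
        self.rows[key] = value
        if self.track_requests:
            self.completed[request_id] = "OK"
        return "OK"


def insert_with_lost_response(server, request_id, key, value):
    """First attempt is applied but its response is 'lost'; the client retries."""
    server.insert(request_id, key, value)         # applied, response dropped
    return server.insert(request_id, key, value)  # retry with the same request ID


def demo():
    """Run the same lost-response scenario with and without request tracking."""
    results = {}
    for track in (False, True):
        server = ToyTabletServer(track_requests=track)
        try:
            results[track] = insert_with_lost_response(server, 1, "k", "v")
        except AlreadyPresent:
            results[track] = "AlreadyPresent"
    return results
```

Calling `demo()` runs both variants. The untracked case reproduces the spurious duplicate-row error described above, while the tracked case lets the retry succeed.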