It would be fantastic if the following use case could be rolled up into a single operation.
Step 1: Client applies an Insert
Step 2: Client receives OperationResponse with RowError with Status ALREADY_PRESENT
Step 3: Client retrieves row for key columns specified in Step 1
Step 4: Client locally merges non-key column values from Step 1, with non-key column values retrieved in Step 3, with some merge operation (Details and examples below)
Step 5: Client applies an Update with merged values
Step 6: Client receives OperationResponse with no RowError
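To make the cost of the six steps concrete, here is a minimal sketch of the flow the client has to drive. It runs against a hypothetical in-memory map with a single non-key column, not the Kudu client API; the class name, method names, and the roundTrips counter are assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongBinaryOperator;

// Hypothetical in-memory stand-in for a table with one key column and one
// non-key column. This is NOT the Kudu client API.
class InsertThenUpdateSketch {
    private final Map<String, Long> table = new HashMap<>();
    int roundTrips = 0; // simulated client/server exchanges

    // Steps 1-2: returns false to model a RowError with status ALREADY_PRESENT.
    boolean insert(String key, long value) {
        roundTrips++;
        return table.putIfAbsent(key, value) == null;
    }

    // Step 3: fetch the stored value for the key (null if absent).
    Long get(String key) {
        roundTrips++;
        return table.get(key);
    }

    // Steps 5-6: overwrite the stored value.
    void update(String key, long value) {
        roundTrips++;
        table.put(key, value);
    }

    // The full flow as the client has to drive it.
    void insertOrMerge(String key, long incoming, LongBinaryOperator merge) {
        if (insert(key, incoming)) {
            return; // key was new: a single round trip
        }
        long existing = get(key);                            // step 3
        long merged = merge.applyAsLong(existing, incoming); // step 4, done locally
        update(key, merged);                                 // steps 5-6
    }
}
```

In this sketch a new key costs one simulated round trip, while a key that already exists costs three (failed insert, get, update).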
Merge operation details: I'm suggesting a few possible Merge operations, noting that all of them give the same result regardless of the order in which they are applied (which is what I mean by associative here), given that the starting value already in the table is either present or null.
So assuming a key of some unique identifier, or product code:
SUM: Useful for counting the number of times this combination of key columns has been seen before
MAX: Useful for setting timestamp values (newest), or highest price/value for an item seen
MIN: Useful for setting timestamp values (oldest), or lowest price/value for an item seen
SUB(TRACT): I haven't actually got a super useful use case for a subtracting counter, unless you want some sort of countdown or thresholding of scores (something happens when you reach zero or a negative score)
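The four operations above could be modelled roughly as follows. This is only a sketch: the MergeOp enum and its apply method are hypothetical names, and treating a null stored value as "store the incoming value as-is" is an assumption about how a missing row or null column might be handled.

```java
import java.util.function.LongBinaryOperator;

// Hypothetical model of the proposed merge operations; not part of any Kudu API.
enum MergeOp {
    SUM((current, incoming) -> current + incoming),
    MAX(Math::max),
    MIN(Math::min),
    SUB((current, incoming) -> current - incoming); // stored value is always the first operand

    private final LongBinaryOperator fn;

    MergeOp(LongBinaryOperator fn) {
        this.fn = fn;
    }

    // Assumption: a null stored value means there is nothing to merge with,
    // so the incoming value is stored unchanged.
    long apply(Long current, long incoming) {
        return current == null ? incoming : fn.applyAsLong(current, incoming);
    }
}
```

For example, MergeOp.SUM.apply(3L, 2L) yields 5, and MergeOp.MIN.apply(null, 7L) yields 7 under the null-handling assumption above.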
Sample table, for example, might be one with four columns:
STRING KEY unique_identifier, INT times_seen, TIMESTAMP first_seen, TIMESTAMP last_seen
Consider streaming a set of unique_identifiers into a Kudu table used as a lookup service, where the client could perform an Operation along the lines of:
Merge merge = table.newMerge("times_seen:SUM", "first_seen:MIN", "last_seen:MAX")
and then setting the values in the PartialRow for this Operation with, for example:
"abc", 2, 1445495695517000, 1445495708867000
Which would result in one of two things:
If key "abc" is not present in the table, it would simply be a plain insert.
If key "abc" is present in the table, 2 would be added to the times_seen column ($times_seen_in_table + 2), first_seen would be the result of min($first_seen_in_table, 1445495695517000), and last_seen would be the result of max($last_seen_in_table, 1445495708867000).
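The two outcomes above can be sketched against an in-memory map. Everything here is hypothetical (the class, the Seen row type, the merge method); it only illustrates the intended semantics for the sample table and says nothing about how a tablet server would implement it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory illustration of the proposed Merge semantics for the
// sample table (unique_identifier, times_seen, first_seen, last_seen).
class MergeSemanticsSketch {
    // Non-key columns of one row.
    static final class Seen {
        final long timesSeen;
        final long firstSeen;
        final long lastSeen;

        Seen(long timesSeen, long firstSeen, long lastSeen) {
            this.timesSeen = timesSeen;
            this.firstSeen = firstSeen;
            this.lastSeen = lastSeen;
        }
    }

    private final Map<String, Seen> table = new HashMap<>();

    // Plain insert if the key is absent; otherwise SUM / MIN / MAX per column.
    // Returns the row as stored after the operation.
    Seen merge(String key, Seen incoming) {
        return table.merge(key, incoming, (stored, in) -> new Seen(
                stored.timesSeen + in.timesSeen,
                Math.min(stored.firstSeen, in.firstSeen),
                Math.max(stored.lastSeen, in.lastSeen)));
    }
}
```

Calling merge("abc", new Seen(2, 1445495695517000L, 1445495708867000L)) on an empty table stores the row unchanged; a second call for "abc" folds the new values into the stored ones.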
So ideally, the flow would be:
Step 1: Client applies a Merge
Step 2: OperationResponse is returned to the client with no RowError. It might be good to have the OperationResponse say whether it was a plain insert or the result of a merge, but that's not super necessary.
This would save many, many failing inserts, gets, and updates going back and forth between servers and clients on constantly updating datasets, playing to Kudu's strengths even more.
For the merge operations, assuming that TServers are thread-safe for each key and apply these atomically, the operations must be order-independent (what I've loosely called associative); given a value N in the table and two merges of values A and B in quick succession:
E.g. SUM: (N + A) + B = (N + B) + A
or MIN/MAX: max(max(N, A), B) = max(max(N, B), A)
or SUB: (N - A) - B = (N - B) - A (noting that N is always first operand)
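The identities above can be spot-checked with a few lines of code. This is a sketch only: the class and method names are made up, and a spot check on sample values is not a proof. A "last write wins" overwrite is included purely as a contrasting example of an operation that does not have this property.

```java
import java.util.function.LongBinaryOperator;

// Hypothetical helper: checks that applying A then B to a stored value N gives
// the same result as applying B then A. The stored value is always the first operand.
class OrderIndependenceCheck {
    static boolean sameInEitherOrder(LongBinaryOperator op, long n, long a, long b) {
        long aThenB = op.applyAsLong(op.applyAsLong(n, a), b);
        long bThenA = op.applyAsLong(op.applyAsLong(n, b), a);
        return aThenB == bThenA;
    }

    public static void main(String[] args) {
        long n = 10, a = 3, b = 7;
        System.out.println("SUM: " + sameInEitherOrder(Long::sum, n, a, b));
        System.out.println("MAX: " + sameInEitherOrder(Math::max, n, a, b));
        System.out.println("MIN: " + sameInEitherOrder(Math::min, n, a, b));
        System.out.println("SUB: " + sameInEitherOrder((x, y) -> x - y, n, a, b));
        // Contrast: overwriting with the incoming value depends on arrival order.
        System.out.println("overwrite: " + sameInEitherOrder((x, y) -> y, n, a, b));
    }
}
```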
Another constraint would be that the Merge must contain values for all key columns, ensuring a single row is inserted/affected, although I suppose if an Insert was happening anyway, this would be true regardless.