Engines like Impala/Spark use many independent client instances, so we should provide a way to have read-your-writes across many independent client instances, which translates to providing a way to get linearizable behavior.
At first this can be done using the APIs that are already available. For instance if the objective is to be sure to have the results of a write in a a following scan, the following steps can be taken:
- After a write the engine should collect the last observed timestamps from kudu clients
- The engine's coordinator then takes the max of those timestamps, adds 1 and uses that as a snapshot scan timestamp.
One important pre-requisite of the behavior above is that scans be done in READ_AT_SNAPSHOT mode. Also the steps above currently don't actually guarantee the expected behavior, but should as the currently anomalies are taken care of (as part of KUDU-430).
In the immediate future we'll add APIs to the Kudu client so as to make the inner workings of getting this behavior oblivious to the engine. The steps will still be the same, i.e. timestamps or timestamp tokens will still be passed around, but the kudu client will encapsulate the choice of the timestamp for the scan.
Later we will add a way to obtain this behavior without timestamp propagation, either by doing a write-side commit-wait, where clients wait out the clock error after/during the last write thus making sure any future operation will have a higher timestamp; or by making read-side commit wait, where we provide an api on the kudu client for the engine to perform a similar call before the scan call to obtain a scan timestamp.