A fault tolerant (FT) scan needs the entirety of the primary key in its projection in order to work properly. Prior to 1.10.0, that was because:
- FT scans sorted their results in primary key order (note: within a tablet only; this sort is not global). These scans used the MergeIterator to achieve this sorting by comparing rows via their primary keys.
- Every FT scan RPC response included a "last primary key" which, in the event of failure, allowed the scan to be resumed from a particular key on another tserver.
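
The two requirements above can be sketched with a toy model. This is only an illustration, not Kudu's implementation: the key columns `host` and `ts`, the dict-based rows, and the `ft_scan` helper are all hypothetical, and `heapq.merge` stands in for the MergeIterator.

```python
import heapq

# Hypothetical composite primary key for the toy table.
KEY_COLUMNS = ("host", "ts")

def key_of(row):
    """Build the row's primary key as a tuple, in KEY_COLUMNS order."""
    return tuple(row[c] for c in KEY_COLUMNS)

def ft_scan(rowsets, batch_size, last_primary_key=None):
    """Yield (batch, last_primary_key) pairs in primary key order.

    Each rowset must already be sorted by primary key. Passing the
    last_primary_key of the last batch received resumes the scan just
    after that key, which is what a retry on another server would do.
    """
    merged = heapq.merge(*rowsets, key=key_of)
    if last_primary_key is not None:
        merged = (r for r in merged if key_of(r) > last_primary_key)
    batch = []
    for row in merged:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch, key_of(batch[-1])
            batch = []
    if batch:
        yield batch, key_of(batch[-1])
```

Both the merge and the resume step need every key column, which is why the key has to be in the server-side projection even if the client never asked for it.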
Two important caveats:
- The primary key columns did not need to be part of the response sent to the client. They only needed to be part of the server-side projection to satisfy the two requirements above, and could then be stripped out of the results before serialization. There was code in the tserver new scan path to add missing key columns to the projection of an FT scan so that clients needn't concern themselves with this.
- The order of the primary key columns in the projection didn't matter. Although non-obvious, this was because the same order was used in all MergeIterator comparisons and in all "last primary key" fields. Clients that relied on the "partial sort" behavior of an FT scan would no doubt have been surprised by the results, but the fault tolerant aspect of the scan wasn't affected.
1.10.0 implicitly removed that last caveat by requiring the primary key columns of an FT scan to be in table schema order. That's because of the MergeIterator changes made in KUDU-2466: the MergeIterator now also compares rowset bounds to primary keys, and rowset bounds are always stored in table schema order. This means that since 1.10.0, any FT scan whose server-side projection had mis-ordered primary key columns would fail. If you were lucky, the error would surface at scan start time and include either the text "key too short" or "Missing separator after composite key string component".
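
To see why the column order suddenly matters, here is a rough analogy, assuming a toy key encoding (plain tuples) rather than Kudu's real encoded keys. The `SCHEMA_KEY` columns and both helpers are hypothetical. Bounds are built in schema order, while the row key is built in whatever order the projection lists the key columns.

```python
# Hypothetical key order of the table schema: a string column, then an int.
SCHEMA_KEY = ("host", "ts")

def encode_key(row, key_order):
    """Toy encoding: a tuple of the key column values in the given order."""
    return tuple(row[c] for c in key_order)

def row_in_bounds(row, projection_key_order, lower, upper):
    """Compare a key built in projection order to bounds in schema order."""
    key = encode_key(row, projection_key_order)
    return lower <= key <= upper

row = {"host": "b", "ts": 5}
lower, upper = ("a", 1), ("m", 9)  # rowset bounds, always in SCHEMA_KEY order

# Matching orders compare component by component, as intended.
in_order = row_in_bounds(row, ("host", "ts"), lower, upper)

# Swapped orders compare a string component against an int component.
try:
    row_in_bounds(row, ("ts", "host"), lower, upper)
    mismatch_failed = False
except TypeError:
    mismatch_failed = True
```

In this toy model the mismatch shows up as a `TypeError`. That is only loosely analogous to the decoding errors quoted above, but the cause is the same: the two sides of the comparison disagree about which column comes first.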
What kind of FT scan could cause this?
- A scan whose projection included at least two primary key columns in an order different from their order in the table's schema.
- A scan whose projection didn't include all primary key columns, but whose predicates included one or more of the primary key columns missing from the projection. Predicates are accumulated in a hash map (keyed by column name) before being serialized to the wire, so when the tserver adds missing key columns from predicates into the scan projection, they're effectively in random order.
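
Both cases come down to the relative order of the key columns in the server-side projection. The sketch below shows one way to detect the problem and one way to normalize a projection. It is a hypothetical illustration under the assumption that only the relative order of key columns matters; the function names are invented here and this is not necessarily how Kudu handles it.

```python
def key_columns_in_schema_order(schema_key, projection):
    """True if the projection has every key column, in schema order.

    Only the relative order of key columns is checked; non-key columns
    may appear anywhere in between.
    """
    key_set = set(schema_key)
    keys_seen = [c for c in projection if c in key_set]
    return keys_seen == list(schema_key)

def server_side_projection(schema_key, client_projection):
    """Put all key columns first, in schema order, then the client's
    non-key columns in their original order.

    Key columns are taken from the schema, not from the incoming
    projection or predicates, so the order in which they arrived
    cannot leak into the result.
    """
    key_set = set(schema_key)
    non_key = [c for c in client_projection if c not in key_set]
    return list(schema_key) + non_key
```

For example, with a schema key of `("host", "ts")`, a client projection of `["value", "ts"]` would normalize to `["host", "ts", "value"]`, regardless of the order in which predicates mentioned the key columns.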
Diff scans are FT scans too, so they are also affected. However, the BDR Spark application is unaffected because it always projects the entire table schema verbatim.