When a table uses a multi-tier partitioning scheme with a large number of partitions, an AlterTable request that affects many partitions/tablets turns into a much larger UpdateConsensus RPC when the leader master pushes the corresponding update on the system tablet to follower masters.
I did some testing for this use case. With AlterTable RPC adding new range partitions, I observed the following:
- With range x 2 hash partitions, an incoming AlterTable RPC of 37070 bytes produced a corresponding UpdateConsensus request of 274278 bytes (~7x multiplication factor).
- With range x 10 hash partitions, the same 37070-byte AlterTable RPC produced a corresponding UpdateConsensus request of 1365438 bytes (~36x multiplication factor).
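The multiplication factors above follow directly from the measured sizes; a quick sanity check on the arithmetic (the byte counts are the ones observed in the experiment, nothing else is assumed):

```python
# Measured sizes from the experiment above, in bytes.
alter_table_size = 37070             # incoming AlterTable RPC
update_consensus_2_hash = 274278     # UpdateConsensus, range x 2 hash partitions
update_consensus_10_hash = 1365438   # UpdateConsensus, range x 10 hash partitions

# Multiplication factors relative to the original AlterTable request.
print(update_consensus_2_hash / alter_table_size)   # ~7x
print(update_consensus_10_hash / alter_table_size)  # ~36x
```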
With that, it's easy to hit the limit on the maximum RPC size (controlled via the --rpc_max_message_size flag) on larger Kudu clusters. If that happens, Kudu masters enter a continuous leader re-election cycle because follower masters don't receive any Raft heartbeats from their leader: the heartbeats are rejected at the lower RPC layer due to the maximum RPC size limit.
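As a workaround (not a fix for the amplification itself), the limit can be raised on all masters via --rpc_max_message_size. A sketch of what that might look like; the flag name is real, but the directory paths and the 128 MiB value are illustrative assumptions, not recommendations:

```shell
# Hypothetical example: start a Kudu master with a larger maximum RPC
# message size so oversized UpdateConsensus requests are not rejected.
# The paths and the 128 MiB value are placeholders for illustration.
kudu-master \
  --fs_wal_dir=/data/kudu/master/wal \
  --fs_data_dirs=/data/kudu/master/data \
  --rpc_max_message_size=$((128 * 1024 * 1024))
```

The same flag value must be applied consistently on every master, since any follower with a smaller limit will still reject the oversized requests.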