Instead of marking an index as permanently disabled in the partial index rebuilder when a failure occurs, we should let it keep retrying for a configurable amount of time. The reason is that the fail-fast approach with the lower RPC timeout will keep failing until the index region can be written to again, so a single failed attempt should not be treated as final. Retrying will allow us to ride out region moves without a long RPC timeout and thus without holding handler threads for long periods of time. We can base the decision to give up on the INDEX_DISABLE_TIMESTAMP value of an index as we walk through the scan results in MetaDataRegionObserver.
I'd propose we allow 30 minutes to get an index back online.
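
A minimal sketch of the check this implies, not the actual Phoenix code. The class, enum, and method names below are hypothetical, and it assumes INDEX_DISABLE_TIMESTAMP is available as a positive epoch-millisecond value recording when the index was first disabled:

```java
// Hypothetical helper for illustration only; none of these names exist in Phoenix.
final class IndexRebuildRetryPolicy {

    /** What the rebuilder should do after a failed rebuild attempt. */
    enum Action { RETRY_REBUILD, DISABLE_PERMANENTLY }

    /** Proposed default window: 30 minutes, in milliseconds. */
    static final long DEFAULT_RETRY_WINDOW_MS = 30L * 60L * 1000L;

    private final long retryWindowMs;

    IndexRebuildRetryPolicy(long retryWindowMs) {
        if (retryWindowMs < 0) {
            throw new IllegalArgumentException("retryWindowMs must be non-negative");
        }
        this.retryWindowMs = retryWindowMs;
    }

    /**
     * Decides whether a failed rebuild should be retried.
     *
     * @param indexDisableTimestampMs assumed to be the epoch-millisecond time at which
     *                                the index was first disabled
     * @param nowMs                   current time in epoch milliseconds
     */
    Action onRebuildFailure(long indexDisableTimestampMs, long nowMs) {
        long elapsedMs = nowMs - indexDisableTimestampMs;
        // Keep retrying while still inside the window; give up once it has passed.
        return elapsedMs <= retryWindowMs
                ? Action.RETRY_REBUILD
                : Action.DISABLE_PERMANENTLY;
    }
}
```

With the default window, a failure 5 minutes after the index was disabled would yield RETRY_REBUILD, while a failure 31 minutes after would yield DISABLE_PERMANENTLY. In practice the window would come from a configuration property rather than a constant.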