Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Instead of marking an index as permanently disabled in the partial index rebuilder when a failure occurs, we should let it retry for up to a configurable amount of time. The reason is that the fail-fast approach with the lower RPC timeout will keep failing until the index region can be written to. Retrying will allow us to ride out region moves without a long RPC timeout, and thus without holding handler threads for long periods of time. We can base the decision on the INDEX_DISABLE_TIMESTAMP value of an index as we walk through the scan results here in MetaDataRegionObserver:
do {
    results.clear();
    hasMore = scanner.next(results);
    if (results.isEmpty()) break;

    Result r = Result.create(results);
    byte[] disabledTimeStamp = r.getValue(PhoenixDatabaseMetaData.TABLE_FAMILY_BYTES,
        PhoenixDatabaseMetaData.INDEX_DISABLE_TIMESTAMP_BYTES);
    byte[] indexState = r.getValue(PhoenixDatabaseMetaData.TABLE_FAMILY_BYTES,
        PhoenixDatabaseMetaData.INDEX_STATE_BYTES);

    // Skip rows with no recorded disable timestamp.
    if (disabledTimeStamp == null || disabledTimeStamp.length == 0) {
        continue;
    }

    // TODO: if System.currentTimeMillis() - disabledTimeStamp > configurableAmount
    // then disable the index.
I'd propose we allow 30 minutes to get an index back online.
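As a rough illustration of the check described in the TODO, here is a minimal standalone sketch. The class `IndexRebuildWindow`, the method `shouldDisable`, and the constant `DEFAULT_MAX_RETRY_MS` are hypothetical names, not existing Phoenix code. It assumes the INDEX_DISABLE_TIMESTAMP value has already been decoded from its byte[] form into epoch milliseconds, and that a non-positive value means no failure has been recorded.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Phoenix: decides whether a failed index
// has been retried for longer than the allowed window.
final class IndexRebuildWindow {

    // Assumed default, taken from the 30 minutes proposed above.
    static final long DEFAULT_MAX_RETRY_MS = TimeUnit.MINUTES.toMillis(30);

    private IndexRebuildWindow() {}

    // disabledTimestampMs: decoded INDEX_DISABLE_TIMESTAMP (assumed epoch millis)
    // nowMs: current time, passed in so the check is easy to test
    // maxRetryMs: the configurable retry window
    static boolean shouldDisable(long disabledTimestampMs, long nowMs, long maxRetryMs) {
        if (disabledTimestampMs <= 0) {
            return false; // assumption: no failure recorded, nothing to time out
        }
        long elapsedMs = nowMs - disabledTimestampMs;
        return elapsedMs > maxRetryMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long tenMinutesAgo = now - TimeUnit.MINUTES.toMillis(10);
        long fortyMinutesAgo = now - TimeUnit.MINUTES.toMillis(40);

        // Still inside the window: keep retrying the rebuild.
        System.out.println(shouldDisable(tenMinutesAgo, now, DEFAULT_MAX_RETRY_MS));
        // Past the window: the index would be disabled.
        System.out.println(shouldDisable(fortyMinutesAgo, now, DEFAULT_MAX_RETRY_MS));
    }
}
```

Note that the elapsed time is computed as current time minus the disable timestamp; with the operands the other way round the difference is negative for any past failure and the threshold is never crossed.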