One of the clusters timed out taking a snapshot for a disabled table. The table is big enough, and the master operation takes more than 1 min to complete. However while trying to increase the timeout, we noticed that there are two parameters with very similar names configuring different things:
hbase.snapshot.master.timeout.millis is defined in SnapshotDescriptionUtils and is send to client side and used in disabled table snapshot.
hbase.snapshot.master.timeoutMillis is defined in SnapshotManager and used as the timeout for the procedure execution.
So, there are a couple of improvements that we can do:
- 1 min is too low for big tables. We need to set this to 5 min or 10 min by default. Even a 6T table which is medium sized fails.
- Unify the two timeouts into one. Decide on either of them, and deprecate the other. Use the biggest one for BC.
- Add the timeout to hbase-default.xml.
- Why do we even have a timeout for disabled table snapshots? The master is doing the work so we should not timeout in any case.