Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.0-alpha-1
-
None
-
Reviewed
Description
With the introduction of the new procedure store to save procs on a table, instead of proc WALs, we added some migration logic to enable a smooth upgrade of clusters under previous versions that used to have proc wal files as the underlying procedure store.
While running a fresh install with latest master branch version, I had noticed that once the old WALProcedureStore directory is detected, it creates and start a WPS instance here, passing a built-in ProcedureLoader here to perform the conversion from proc wal files to table based.
At the end of this conversion, the builtin loader deletes the proc wal dir here. This may cause a race condition where a running WALProcedureStore internal thread may be trying to delete the first proc wal file that may had already rolled, then failing with FNFE.
Even worse is that WALProcedureStore.syncLoop keeps trying indefinitely, flooding Master logs with messages such as below:
2020-01-14 06:13:23,331 INFO [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: The old WALProcedureStore wal directory file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs exists, migrating... 2020-01-14 06:13:23,361 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Starting WAL Procedure Store lease recovery 2020-01-14 06:13:23,453 INFO [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=1 2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Lease acquired for flushLogId=1 2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: No state logs to replay. 2020-01-14 06:13:23,454 INFO [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: The WALProcedureStore max pid is 0, and the max pid of all loaded procedures is -1 2020-01-14 06:13:23,455 INFO [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: Migration of WALProcedureStore finished 2020-01-14 06:13:23,455 INFO [master/hbase01:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered RegionProcedureStore lease in 784 msec 2020-01-14 06:13:23,471 TRACE [WALProcedureStoreSyncThread] wal.WALProcedureStore: no active procedures 2020-01-14 06:13:23,516 INFO [master/hbase01:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Loaded RegionProcedureStore in 61 msec 2020-01-14 06:13:23,516 INFO [master/hbase01:16000:becomeActiveMaster] procedure2.RemoteProcedureDispatcher: Instantiated, coreThreads=128 (allowCoreThreadTimeOut=true), queueMaxSize=32, operationDelay=150 2020-01-14 06:13:23,628 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=2 2020-01-14 06:13:23,634 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Remove all state logs with ID less than 1, since no active procedures 2020-01-14 06:13:23,634 TRACE [WALProcedureStoreSyncThread] wal.WALProcedureStore: Removing log=file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log 2020-01-14 06:13:23,634 INFO [WALProcedureStoreSyncThread] wal.ProcedureWALFile: Archiving file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log to file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/oldWALs/pv2-00000000000000000001.log 2020-01-14 06:13:23,634 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist ... 2020-01-14 06:13:23,635 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635) ... 2020-01-14 06:13:37,352 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
I believe a simple solution in this case is to explicitly call WALProcedureStore.stop method at the end of the migration, which stops WALProcedureStore internal syncLoop thread.
Am opening a PR soon.