Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-23694

After RegionProcedureStore completes migration of WALProcedureStore, still running WALProcedureStore.syncThread keeps trying to delete now inexistent log files.

    XMLWordPrintableJSON

Details

    • Reviewed

    Description

      With the introduction of the new procedure store to save procs on a table, instead of proc WALs, we added some migration logic to enable a smooth upgrade of clusters under previous versions that used to have proc wal files as the underlying procedure store.

      While running a fresh install with latest master branch version, I had noticed that once the old WALProcedureStore directory is detected, it creates and start a WPS instance here, passing a built-in ProcedureLoader here to perform the conversion from proc wal files to table based.

      At the end of this conversion, the builtin loader deletes the proc wal dir here. This may cause a race condition where a running WALProcedureStore internal thread may be trying to delete the first proc wal file that may had already rolled, then failing with FNFE.

      Even worse is that WALProcedureStore.syncLoop keeps trying indefinitely, flooding Master logs with messages such as below:

      2020-01-14 06:13:23,331 INFO  [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: The old WALProcedureStore wal directory file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs exists, migrating...
      2020-01-14 06:13:23,361 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
      2020-01-14 06:13:23,453 INFO  [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=1
      2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: Lease acquired for flushLogId=1
      2020-01-14 06:13:23,454 DEBUG [master/hbase01:16000:becomeActiveMaster] wal.WALProcedureStore: No state logs to replay.
      2020-01-14 06:13:23,454 INFO  [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: The WALProcedureStore max pid is 0, and the max pid of all loaded procedures is -1
      2020-01-14 06:13:23,455 INFO  [master/hbase01:16000:becomeActiveMaster] region.RegionProcedureStore: Migration of WALProcedureStore finished
      2020-01-14 06:13:23,455 INFO  [master/hbase01:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Recovered RegionProcedureStore lease in 784 msec
      2020-01-14 06:13:23,471 TRACE [WALProcedureStoreSyncThread] wal.WALProcedureStore: no active procedures
      2020-01-14 06:13:23,516 INFO  [master/hbase01:16000:becomeActiveMaster] procedure2.ProcedureExecutor: Loaded RegionProcedureStore in 61 msec
      2020-01-14 06:13:23,516 INFO  [master/hbase01:16000:becomeActiveMaster] procedure2.RemoteProcedureDispatcher: Instantiated, coreThreads=128 (allowCoreThreadTimeOut=true), queueMaxSize=32, operationDelay=150
      2020-01-14 06:13:23,628 INFO  [WALProcedureStoreSyncThread] wal.WALProcedureStore: Rolled new Procedure Store WAL, id=2
      2020-01-14 06:13:23,634 INFO  [WALProcedureStoreSyncThread] wal.WALProcedureStore: Remove all state logs with ID less than 1, since no active procedures
      2020-01-14 06:13:23,634 TRACE [WALProcedureStoreSyncThread] wal.WALProcedureStore: Removing log=file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
      2020-01-14 06:13:23,634 INFO  [WALProcedureStoreSyncThread] wal.ProcedureWALFile: Archiving file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log to file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/oldWALs/pv2-00000000000000000001.log
      2020-01-14 06:13:23,634 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
      java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist
      ...
      2020-01-14 06:13:23,635 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
      java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist
              at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
      ...
      2020-01-14 06:13:37,352 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to remove log: file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log
      java.io.FileNotFoundException: File file:/root/hbase-3.0.0-SNAPSHOT/hbase-data/MasterProcWALs/pv2-00000000000000000001.log does not exist
              at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
      

      I believe a simple solution in this case is to explicitly call WALProcedureStore.stop method at the end of the migration, which stops WALProcedureStore internal syncLoop thread. 

      Am opening a PR soon.

      Attachments

        Activity

          People

            wchevreuil Wellington Chevreuil
            wchevreuil Wellington Chevreuil
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: