HBASE-21358 (project: HBase)

Snapshot procedure fails but SnapshotManager thinks it is still running

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.2
    • Fix Version/s: None
    • Component/s: snapshots
    • Labels: None

      Description

      A snapshot procedure fails due to a chaos test action, but the snapshot manager still thinks it is running. The test client spins needlessly, polling for a snapshot that will never complete. We give up eventually, but we could be failing this a lot faster.

      On the integration client we are checking and re-checking:

      2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting current status of snapshot from master...
      2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40) Sleeping: 8571ms while waiting for snapshot completion.
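The check-and-sleep pattern visible in these log lines can be sketched roughly as below. This is a minimal illustration, not the `HBaseAdmin` implementation: the class `SnapshotPollSketch`, the method `waitForCompletion`, and its parameters are hypothetical names chosen for this example.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a bounded polling loop; not HBase client code.
final class SnapshotPollSketch {

    /**
     * Calls isDone repeatedly, pausing pauseMs between checks, until it
     * returns true or maxWaitMs has elapsed. Returns the number of checks
     * made on success and throws IllegalStateException on timeout.
     */
    static int waitForCompletion(BooleanSupplier isDone, long maxWaitMs, long pauseMs) {
        final long start = System.nanoTime();
        final long maxWaitNanos = maxWaitMs * 1_000_000L;
        int checks = 0;
        while (true) {
            checks++;
            if (isDone.getAsBoolean()) {
                return checks;
            }
            if (System.nanoTime() - start >= maxWaitNanos) {
                throw new IllegalStateException(
                    "snapshot not done after " + checks + " checks");
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated status source that reports completion on the third check.
        int checks = waitForCompletion(() -> ++calls[0] >= 3, 5_000, 1);
        System.out.println("done after " + checks + " checks");
    }
}
```

If the status source can only answer "done" or "not done", a failed snapshot is indistinguishable from a slow one, and a loop like this has to burn its whole wait budget before giving up.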

      This is what it looks like on the master side each time the client checks in:

      2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] master.MasterRpcServices: Checking to see if snapshot from request:{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH } is done
      2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] snapshot.SnapshotManager: Snapshoting '{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress!

      There is no running procedure for the snapshot. The procedure has failed. The snapshot manager does not take any useful action afterward but believes the snapshot to still be in progress.
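One way to picture the behavior being asked for is a status check that treats a failed procedure as a terminal state rather than as "in progress". The sketch below is purely illustrative and makes no claim about how `SnapshotManager` is implemented or should be fixed: `SnapshotStatusSketch`, `Tracker`, and `State` are hypothetical names, not HBase classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: these types are illustrative, not HBase classes.
final class SnapshotStatusSketch {
    enum State { UNKNOWN, IN_PROGRESS, DONE, FAILED }

    // Stand-in for whatever a manager tracks per snapshot request.
    static final class Tracker {
        private volatile State state = State.IN_PROGRESS;
        private volatile String failureCause; // set only on failure

        void complete() {
            state = State.DONE;
        }

        void fail(String cause) {
            failureCause = cause;
            state = State.FAILED;
        }

        String failureCause() {
            return failureCause;
        }
    }

    private final Map<String, Tracker> trackers = new ConcurrentHashMap<>();

    // Registers a new snapshot request and returns its tracker.
    Tracker start(String snapshotName) {
        Tracker t = new Tracker();
        trackers.put(snapshotName, t);
        return t;
    }

    // Reports the recorded state; FAILED is terminal and never shows as IN_PROGRESS.
    State check(String snapshotName) {
        Tracker t = trackers.get(snapshotName);
        return t == null ? State.UNKNOWN : t.state;
    }

    public static void main(String[] args) {
        SnapshotStatusSketch manager = new SnapshotStatusSketch();
        Tracker t = manager.start("example-snapshot");
        System.out.println(manager.check("example-snapshot"));
        t.fail("simulated chaos action");
        System.out.println(manager.check("example-snapshot") + ": " + t.failureCause());
    }
}
```

With a terminal `FAILED` state available, a polling client can stop on the first check after the failure instead of waiting out its full timeout.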

      I see related complaints from the hfile archiver task afterward: empty directories, failures to parse protobuf in descriptor files, and so on. It seems there was junk left in the filesystem by the failed snapshot. The master was soon restarted by a chaos action, and now I don't see these complaints, so the partially complete snapshot may have been cleaned up.

      This is with 1.3.2, but patched to include the multithreaded hfile archiving improvements from later versions.

            People

            • Assignee: Unassigned
            • Reporter: Andrew Kyle Purtell (apurtell)