Accumulo / ACCUMULO-2480

ha fail-failover failure


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: master, tserver
    • Labels: None
    • Environment:

      Running continuous ingest on a 74-node Hadoop 2.3 cluster with HA NameNodes (NN), Accumulo 1.6.0-SNAPSHOT.

      Description

      Ran service network stop on the active NN. The service failed to switch over since the fencing script on the standby failed to run (sshfence).

      After the network interface was re-established, the standby took over.

      However, Accumulo ingest began to show very long hold times, since the standby did not provide service for several minutes.

      The master attempted to shut down the tablet servers that were reporting hold time.

      The filesystem hook closed the filesystem, and the servers got stuck endlessly trying to write to the WAL.

      Even after the NN was active, attempts to get a new WAL continued to fail because the filesystem was closed.

      • Why didn't the tablet servers stop?
      • The WAL loop should be able to terminate if it sees an IOException indicating that the filesystem is closed.
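
The second point can be illustrated with a minimal Java sketch. It is not the Accumulo code or the fix that was committed. WalOpener, isFilesystemClosed and openWithRetry are hypothetical names introduced here, and the check assumes the closed-filesystem condition surfaces as an IOException whose message contains "Filesystem closed", which is worth verifying against the Hadoop version in use. The attempt limit is also an assumption; the loop described above retried endlessly.

```java
import java.io.IOException;

public class WalRetrySketch {

    /** Hypothetical stand-in for whatever creates a new write-ahead log. */
    interface WalOpener {
        void open() throws IOException;
    }

    /**
     * Returns true if the exception, or any of its causes, carries the
     * "Filesystem closed" message. The message text is an assumption.
     */
    static boolean isFilesystemClosed(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.toLowerCase().contains("filesystem closed")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Retries opener.open() on transient IOExceptions, but rethrows at once
     * when the filesystem is closed, since retrying cannot succeed.
     * Returns the number of attempts used on success.
     */
    static int openWithRetry(WalOpener opener, int maxAttempts, long sleepMillis)
            throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                opener.open();
                return attempt;
            } catch (IOException e) {
                if (isFilesystemClosed(e)) {
                    throw e; // fatal: leave the loop so the server can stop
                }
                last = e; // transient: remember it and try again
            }
            if (attempt < maxAttempts && sleepMillis > 0) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while retrying", ie);
                }
            }
        }
        throw last != null ? last : new IOException("no attempts were made");
    }
}
```

Matching on the exception message is brittle, but in this sketch it is the only signal assumed to be available. A caller that sees the rethrown exception can halt the process instead of looping.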

              People

              • Assignee: Eric C. Newton (ecn)
              • Reporter: Eric C. Newton (ecn)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 0.5h