ACCUMULO-777: isLockHeld needs better bullet-proofing against transient errors

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.5-incubating, 1.3.6, 1.4.0, 1.4.1
    • Fix Version/s: 1.4.2
    • Component/s: client
    • Labels: None
    • Environment: medium-sized cluster

    Description

      During a minor compaction, the tablet server double-checks its ZooKeeper lock before updating the METADATA table. In one unlucky moment, the ZooKeeper connection dropped during this check. The check failed even though the lock had not actually been lost. As a result, the root tablet remained hosted for another 4 weeks but did not flush mutations to disk. When memory filled, the operator noticed a long hold time and killed the server. This forced a log recovery of 98 1 GB logs, some of which were very old.
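The failure mode described above comes from treating "could not reach ZooKeeper" the same as "the lock is gone". A minimal sketch of one way to separate the two cases follows. It is not the actual Accumulo fix: `LockCheckSketch`, `LockProbe`, `TransientCheckException`, and `isLockHeldWithRetry` are hypothetical names invented for illustration, and the probe is assumed to return a definitive answer or throw when it cannot tell.

```java
public class LockCheckSketch {

    /** Hypothetical marker for "could not tell", e.g. a dropped ZooKeeper connection. */
    public static class TransientCheckException extends RuntimeException {
        public TransientCheckException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical single-attempt probe: returns true if the lock is held,
     * false if it is definitely lost, and throws TransientCheckException
     * when the answer is unknown.
     */
    @FunctionalInterface
    public interface LockProbe {
        boolean probe();
    }

    /**
     * Retries the probe with exponential backoff while it reports a transient
     * error. A definitive true or false is returned immediately. If every
     * attempt is transient, the last exception is rethrown so the caller must
     * decide what to do instead of silently assuming the lock was lost.
     */
    public static boolean isLockHeldWithRetry(LockProbe probe, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        TransientCheckException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return probe.probe(); // definitive answer, stop retrying
            } catch (TransientCheckException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleepBeforeRetry(backoff);
                    backoff *= 2; // double the wait before the next attempt
                }
            }
        }
        throw last; // still unknown after all attempts
    }

    private static void sleepBeforeRetry(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new TransientCheckException("interrupted while waiting to retry");
        }
    }
}
```

With this shape, a caller can treat an exhausted retry budget as fatal (for example, by halting the server) rather than continuing to host tablets without flushing, which is the state this issue describes. Whether halting is the right response is a design choice this sketch leaves to the caller.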

          People

            Assignee: Eric C. Newton (ecn)
            Reporter: Eric C. Newton (ecn)
            Votes: 0
            Watchers: 2
