ZooKeeper / ZOOKEEPER-4846

Failure to reload database due to missing ACL

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      ZooKeeper snapshots are fuzzy, as the server does not stop processing requests while ACLs and nodes are being streamed to disk.

      ACLs, notably, are streamed first, as a mapping between the full serialized ACL and an "ACL ID" referenced by the node.

      Consequently, a snapshot can very well contain ACL IDs which do not exist in the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless (if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO entries in the server logs.
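
      The ordering problem can be illustrated with a toy model. This is a minimal sketch, not ZooKeeper's serialization code: the class name, the plain maps, and the ACL strings are placeholders chosen for illustration. It copies the ACL map first, mutates the live state, then copies the nodes, and reports the ACL IDs that the snapshotted nodes reference but the snapshotted map lacks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a fuzzy snapshot; hypothetical, not ZooKeeper's actual code. */
public class FuzzySnapshotSketch {

    /** Returns ACL IDs referenced by snapshotted nodes but absent from the snapshotted ACL map. */
    static List<Long> danglingAclIds() {
        // Live server state: ACL ID -> serialized ACL, and node path -> ACL ID.
        // The ACL strings are placeholders, not a real ZooKeeper format.
        Map<Long, String> liveAcls = new LinkedHashMap<>();
        Map<String, Long> liveNodes = new LinkedHashMap<>();
        liveAcls.put(1L, "acl-one");
        liveNodes.put("/bar", 1L);

        // Step 1: the ACL map is streamed first.
        Map<Long, String> snapshotAcls = new LinkedHashMap<>(liveAcls);

        // Step 2: the server keeps processing requests, so a node with a new ACL appears.
        liveAcls.put(2L, "acl-two");
        liveNodes.put("/foo", 2L);

        // Step 3: the nodes are streamed afterwards and include the new node.
        Map<String, Long> snapshotNodes = new LinkedHashMap<>(liveNodes);

        // Collect every referenced ACL ID that the snapshotted map cannot resolve.
        List<Long> dangling = new ArrayList<>();
        for (Long aclId : snapshotNodes.values()) {
            if (!snapshotAcls.containsKey(aclId)) {
                dangling.add(aclId);
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        System.out.println("Dangling ACL IDs in snapshot: " + danglingAclIds());
    }
}
```

      In this model the snapshot ends up with node /foo pointing at ACL ID 2, which was allocated only after the map had been written.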

      With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in DataTree operations such as createNode and deleteNode, as opposed to just fetching them from request processors.

      This can result in fatal errors during the fastForwardFromEdits phase of restoring a database, when transactions are processed on top of an inconsistent data tree. The server then fails to start.

      The errors are thrown in this code path:

      // ReferenceCountedACLCache.java:90
      List<ACL> acls = longKeyMap.get(longVal);
      if (acls == null) {
          LOG.error("ERROR: ACL not available for long {}", longVal);
          throw new RuntimeException("Failed to fetch acls for " + longVal);
      }
      

      Here is a scenario leading to such a failure:

      • An existing node /foo, sporting a unique ACL, is deleted. This is recorded in transaction log $SNAP-1; said ACL is also deallocated;
      • Snapshot $SNAP is started;
      • The ACL map is serialized to $SNAP;
      • A new node /foo sporting the same unique ACL is created in a portion of the data tree which still has to be serialized;
      • Node /foo is serialized to $SNAP—but its ACL isn't;
      • The server is restarted;
      • The DataTree is initialized from $SNAP, including node /foo with a dangling ACL reference;
      • Transaction log $SNAP-1 is replayed, leading to a deleteNode("/foo");
      • getACL(node) throws a RuntimeException, preventing a successful restart.
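
      The scenario above can be replayed against a simplified model. This is a sketch under stated assumptions: AclCache, acquire, release, and fetch are hypothetical stand-ins for a reference-counted ACL cache, and the node tree is reduced to a path-to-ID map. Only the exception message mirrors the code path quoted earlier.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replay of the failure scenario; all names here are hypothetical. */
public class DanglingAclReplaySketch {

    /** Simplified stand-in for a reference-counted ACL cache. */
    static class AclCache {
        final Map<Long, String> idToAcl = new HashMap<>();
        final Map<String, Long> aclToId = new HashMap<>();
        final Map<Long, Integer> refCounts = new HashMap<>();
        long nextId = 1;

        /** Returns the ID for an ACL, allocating a fresh one if it is not cached. */
        long acquire(String acl) {
            Long id = aclToId.get(acl);
            if (id == null) {
                id = nextId++;
                aclToId.put(acl, id);
                idToAcl.put(id, acl);
            }
            refCounts.merge(id, 1, Integer::sum);
            return id;
        }

        /** Drops one reference; the ACL is deallocated when no node uses it. */
        void release(long id) {
            int remaining = refCounts.merge(id, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(id);
                aclToId.remove(idToAcl.remove(id));
            }
        }

        /** Eager lookup that fails hard on an unknown ID. */
        String fetch(long id) {
            String acl = idToAcl.get(id);
            if (acl == null) {
                throw new RuntimeException("Failed to fetch acls for " + id);
            }
            return acl;
        }
    }

    /** Runs the scenario and returns the outcome of replaying the delete. */
    static String run() {
        AclCache live = new AclCache();
        Map<String, Long> liveNodes = new HashMap<>();

        // An existing node /foo holds a unique ACL, then is deleted; the ACL is deallocated.
        liveNodes.put("/foo", live.acquire("unique-acl"));
        live.release(liveNodes.remove("/foo"));

        // The snapshot starts and the ACL map is serialized first (it is empty here).
        Map<Long, String> snapAcls = new HashMap<>(live.idToAcl);

        // /foo is recreated with the same ACL before the nodes are serialized.
        liveNodes.put("/foo", live.acquire("unique-acl"));
        Map<String, Long> snapNodes = new HashMap<>(liveNodes);

        // Restart: the state is rebuilt from the snapshot, dangling reference included.
        AclCache restored = new AclCache();
        restored.idToAcl.putAll(snapAcls);
        Map<String, Long> restoredNodes = new HashMap<>(snapNodes);

        // Replaying the logged delete eagerly fetches the ACL of /foo and fails.
        try {
            restored.fetch(restoredNodes.get("/foo"));
            restoredNodes.remove("/foo");
            return "replayed";
        } catch (RuntimeException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

      In this model the recreated node receives a fresh ID that the already-written ACL map never saw, so the eager lookup during replay is the first point at which the inconsistency becomes fatal.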

            People

              ztzg Damien Diederen