  ZooKeeper / ZOOKEEPER-4689

Node may not be accessible due to the inconsistent ACL reference map after SNAP sync (again)



    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.6.0, 3.7.0, 3.8.0
    • Fix Version/s: None
    • Component/s: server


      In ZooKeeper, we take a "fuzzy snapshot": we don't make a copy of the DataTree or grab a lock when serializing it. Instead, we note down the zxid at which we start serializing the DataTree, serialize it while it's still being mutated, and, after deserializing it, replay the transactions that arrived after the snapshot started. The idea is that those transactions are idempotent.
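      As a rough illustration (the names and structure below are ours, not ZooKeeper's), a fuzzy snapshot is only safe because replaying a transaction that may or may not already be captured in the snapshot converges to the same state:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a fuzzy snapshot (hypothetical names, not ZooKeeper code):
// the snapshot may or may not have captured a txn that ran concurrently with
// serialization, so on restore we replay every txn after the snapshot zxid.
// "Set path to value" is idempotent, so replaying it twice is harmless.
public class FuzzySnapshotSketch {
    static Map<String, String> restore(Map<String, String> snapshot,
                                       Map<String, String> txnsAfterSnapshotZxid) {
        Map<String, String> tree = new HashMap<>(snapshot);
        tree.putAll(txnsAfterSnapshotZxid); // replay; idempotent by design
        return tree;
    }
}
```

      Whether or not the concurrent write landed in the serialized tree, the restored tree is identical after replay.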

      ZooKeeper also interns ACLs: it keeps a [long -> ACL] map and stores the `long` in each node, since many nodes tend to share the same ACL.
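      The interning can be pictured like this (a simplified sketch; the real logic lives in ReferenceCountedACLCache, and the helper names and String-typed ACLs here are ours):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of ZooKeeper's interned ACL cache (hypothetical names;
// the real class is ReferenceCountedACLCache).
public class AclCacheSketch {
    private final Map<Long, List<String>> longKeyMap = new HashMap<>();
    private final Map<List<String>, Long> aclKeyMap = new HashMap<>();
    private long aclIndex = 0;

    // Intern an ACL list, reusing the existing key if we've seen it before.
    public synchronized long convertAcls(List<String> acls) {
        Long existing = aclKeyMap.get(acls);
        if (existing != null) {
            return existing;
        }
        long key = ++aclIndex;
        longKeyMap.put(key, acls);
        aclKeyMap.put(acls, key);
        return key; // this long is what each node stores
    }

    // Resolve a node's stored long back to its ACL; null means the node
    // points at an entry that is not in the cache -- exactly this bug.
    public synchronized List<String> convertLong(long key) {
        return longKeyMap.get(key);
    }
}
```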

      When serializing the DataTree, we first serialize the ACL cache and then the nodes. With the following sequence, a node can end up pointing to an ACL entry that never made it into the serialized cache:

      1. Serialize ACL
      2. Create node with new ACL
      3. Serialize node

      ZOOKEEPER-3306 fixed this by inserting the ACL into the cache when `DataTree.createNode` is called during transaction replay and the node already exists. However, it only inserts the ACL into the cache; it does not update the node's interned ACL reference to point to the new entry.

      On top of that, the long value we get for an ACL can differ across servers even when they process the same events in the same zxid order. Specifically, we keep an `aclIndex` pointing to the highest entry handed out so far and increment it whenever we intern an ACL we've never seen before.

      With ZOOKEEPER-2214, we started reference counting entries in the ACL cache and removing no-longer-used entries from it.

      Say the current aclIndex is 10. If we create a node with a previously unseen ACL and then delete that node, aclIndex increments to 11 while the entry itself is removed. However, when we deserialize the tree, we set aclIndex to the highest key present in the ACL cache, so it reverts to 10. The aclIndex inconsistency is harmless on its own, but it breaks the ZOOKEEPER-3306 patch.
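      A sketch of that divergence (hypothetical names; this models the behavior described above, not the actual ReferenceCountedACLCache code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the aclIndex divergence (names are ours, not ZooKeeper's).
public class AclIndexSketch {
    final Map<Long, String> cache = new HashMap<>();
    long aclIndex;

    AclIndexSketch(long start) { this.aclIndex = start; }

    // Intern: reuse an existing key, or mint the next one.
    long intern(String acl) {
        for (Map.Entry<Long, String> e : cache.entrySet()) {
            if (e.getValue().equals(acl)) {
                return e.getKey();
            }
        }
        cache.put(++aclIndex, acl);
        return aclIndex;
    }

    // Reference count dropped to zero: the entry goes away, but aclIndex
    // is NOT decremented (ZOOKEEPER-2214 behavior).
    void remove(long key) { cache.remove(key); }

    // On deserialization, aclIndex is rebuilt as the max key still present.
    static AclIndexSketch deserialize(Map<Long, String> snapshot) {
        long max = snapshot.keySet().stream()
                .mapToLong(Long::longValue).max().orElse(0L);
        AclIndexSketch s = new AclIndexSketch(max);
        s.cache.putAll(snapshot);
        return s;
    }
}
```

      With entries up to key 10, interning and then removing one ACL leaves the leader's aclIndex at 11, while a server that deserializes the resulting cache reverts to 10; the next new ACL then gets key 12 on the leader but key 11 on the follower.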

      Now if we follow the same scenario mentioned in ZOOKEEPER-3306:

      1. Leader creates ACL entry 11 and then deletes it when the owning node is deleted
      2. Server A starts a SNAP sync with the leader
      3. After the ACL map has been serialized to server A, a txn T1 creates a node N1 with a new ACL_1 that did not exist in the ACL map
      4. On the leader, after this txn, the ACL map contains 12 -> (ACL_1, COUNT: 1) and the data tree has N1 -> 12
      5. On server A, the fuzzy snapshot contains an ACL map with max ID 10, and N1 -> 12
      6. When replaying txn T1, server A adds 11 -> (ACL_1, COUNT: 1) to the ACL cache, but node N1 still points to 12.

      So N1 is left pointing to an ACL entry that does not exist, making the node inaccessible.

      There are two ways to fix this:

      1. Make aclIndex consistent across deserialization (either by serializing it in the snapshot or by taking care to decrement it when removing entries from the cache)
      2. Extend the ZOOKEEPER-3306 patch so that we also repoint the node's ACL at the new key when the previous entry no longer exists in the ACL table.
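      A sketch of what solution 2 amounts to (hypothetical names; this is not the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of solution 2: when createNode runs during replay and
// the node already exists, intern the txn's ACL (as ZOOKEEPER-3306 already
// does) AND repoint the node if its stored key no longer resolves.
public class Fix2Sketch {
    public static class Node {
        long aclKey;
        Node(long aclKey) { this.aclKey = aclKey; }
    }

    final Map<Long, String> cache = new HashMap<>();
    final Map<String, Long> reverse = new HashMap<>();
    long aclIndex;

    Fix2Sketch(long startIndex) { this.aclIndex = startIndex; }

    long intern(String acl) {
        Long existing = reverse.get(acl);
        if (existing != null) {
            return existing;
        }
        cache.put(++aclIndex, acl);
        reverse.put(acl, aclIndex);
        return aclIndex;
    }

    // Replay path: the node already exists from the fuzzy snapshot.
    void onReplayOfExistingNode(Node node, String txnAcl) {
        long key = intern(txnAcl);            // ZOOKEEPER-3306: entry exists
        if (!cache.containsKey(node.aclKey)) {
            node.aclKey = key;                // extra step: fix dangling ref
        }
    }
}
```

      In the scenario above, the follower replays T1 with aclIndex at 10, interns ACL_1 as key 11, notices N1's stored key 12 is dangling, and repoints N1 at 11.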


      I think solution 2 is nicer, as the aclIndex inconsistency itself is not a problem. With solution 1, we'd still be implicitly depending on aclIndex consistency and the ordering of events, which is harder to reason about and seems more fragile than solution 2.

      I'm going to send a patch for solution 2, but please let me know if you disagree and I'm happy to go with solution 1 instead.





              Assignee: Unassigned
              Reporter: Adam Yi



                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 1h