Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-15990

AssertionError in B+Tree under load

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.13
    • data structures
    • Fixed potential data corruption during rapid delete and update operations.
    • Release Notes Required

    Description

      Following exception has occurred in GridGain CE 8.7.29:

      class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-830802122, val2=844420635165316]], cacheId=-112787658, cacheName=TABLEDSS, indexName=TABLEDSS_TTL_DESC_IDX, msg=Runtime failure on search row: Row@28da1b19[ ... ]
      	at org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:927)
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:2103)
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removex(BPlusTree.java:1899)
      	at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.removex(H2TreeIndex.java:515)
      	at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.remove(GridH2Table.java:822)
      	at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:420)
      	at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2952)
      	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:437)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:2703)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1727)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1694)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2643)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:435)
      	at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2329)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2596)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2056)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1873)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1674)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.sendSingleRequest(GridNearAtomicAbstractUpdateFuture.java:319)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:496)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:454)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.map(GridNearAtomicAbstractUpdateFuture.java:267)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:1192)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:728)
      	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:2858)
      	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:2835)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1431)
      	at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:913)
      	at org.apache.ignite.internal.processors.platform.cache.PlatformCache.processInStreamOutLong(PlatformCache.java:521)
      	at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:66)
      	at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
      	at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.computeJobExecute(PlatformCallbackGateway.java:328)
      	at org.apache.ignite.internal.processors.platform.compute.PlatformClosureJob.execute0(PlatformClosureJob.java:75)
      	at org.apache.ignite.internal.processors.platform.compute.PlatformAbstractJob.execute(PlatformAbstractJob.java:79)
      	at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:570)
      	at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:7121)
      	at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:564)
      	at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
      	at org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1344)
      	at org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:2102)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1328)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4600(GridIoManager.java:157)
      	at org.apache.ignite.internal.managers.communication.GridIoManager$8.execute(GridIoManager.java:1213)
      	at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:54)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.lang.AssertionError: tail=Tail[pageId=0002ffff00000284, cnt= 1, lvl=2, sibling=null], res=NOT_FOUND
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:2083)
      	... 47 more 

      The reason behind it is a broken tree invariant that last leaf item must also be present in one of inner nodes (unless it’s a last element in the entire tree). In the assertion above it’s implied that all inner nodes chain from the leaf to the root have been checked and an item has not been found.

      Bug appears when Ignite node experiences slow inserts and very rapid deletions (caused by TTL, for example). Let’s examine the following tree configuration (assuming that 2 items in the node is the max):

                             [ R:0 ]
                           /         \
            [ X:0 | K:1 ]              ...
          /       |       \
      ...      [ K:1 ]      [ R:0 ] 

      Key K appears twice in the tree and update K=2 would be performed in two steps: inner node first, leaf node last. Code that does that can be found in BPlusTree.Put#found method. These is a time gap between these two modifications where none of tree’s pages are locked, and other threads are free to attempt any other tree updates. Corruption happens when other thread tries to remove R during the gap. Here are tree states during these operations:

      // First half of replace operation.
                             [ R:0 ]
                           /         \
            [ X:0 | K:2 ]              ...              // Replace happed in inner node at first.
          /       |       \
      ...      [ K:1 ]      [ R:0 ]
      
      
      // Remove R starts deleting an item from the leaf.
                             [ R:0 ]
                           /         \
            [ X:0 | K:2 ]              ...
          /       |       \
      ...      [ K:1 ]      [ ]                         // Leaf is empty, merge it about to happen.
      
      
      // Empty branch is merged with its left neighbour.
                      [ R:0 ]
                    /         \
            [ X:0 ]              ...                   // K:2 item has been removed during the merge.
          /         \
      ...             [ K:1 ]
      
      
      // R in the root is finally replaced with appropriate item and removal is complete.
                      [ K:1 ]                         // K:1 is copied from the leaf as the closest left item.
                    /         \
            [ X:0 ]              ...
          /         \
      ...             [ K:1 ]
      
      
      // K:2 thread wakes up and continues updating downwards, changing the value in the leaf. Operation completed.
                      [ K:1 ]
                    /         \
            [ X:0 ]              ...
          /         \
      ...             [ K:2 ] 

      As a result of these events, root points to a nonexistent value 1, which is most likely an invalid pointer. In case of H2 indexes, everything seems to be working, because actual columns are inlined and the pointer is never read. AssertionError from the beginning happens when one tries to delete K. As a last item in the leaf, it must have a copy in inner node, but it’s not the case since K:1 is not equal to K:2.

      Attachments

        Issue Links

          Activity

            People

              ibessonov Ivan Bessonov
              ibessonov Ivan Bessonov
              Sergey Chugunov Sergey Chugunov
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m