Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-15990

AssertionError in B+Tree under load

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.13
    • data structures
    • Fixed potential data corruption during rapid delete and update operations.
    • Release Notes Required

    Description

      Following exception has occurred in GridGain CE 8.7.29:

      class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-830802122, val2=844420635165316]], cacheId=-112787658, cacheName=TABLEDSS, indexName=TABLEDSS_TTL_DESC_IDX, msg=Runtime failure on search row: Row@28da1b19[ ... ]
      	at org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:927)
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:2103)
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removex(BPlusTree.java:1899)
      	at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.removex(H2TreeIndex.java:515)
      	at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.remove(GridH2Table.java:822)
      	at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:420)
      	at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2952)
      	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:437)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:2703)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1727)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1694)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2643)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:435)
      	at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2329)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2596)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2056)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1873)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1674)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.sendSingleRequest(GridNearAtomicAbstractUpdateFuture.java:319)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:496)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:454)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.map(GridNearAtomicAbstractUpdateFuture.java:267)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:1192)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:728)
      	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:2858)
      	at org.apache.ignite.internal.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:2835)
      	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1431)
      	at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:913)
      	at org.apache.ignite.internal.processors.platform.cache.PlatformCache.processInStreamOutLong(PlatformCache.java:521)
      	at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:66)
      	at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
      	at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.computeJobExecute(PlatformCallbackGateway.java:328)
      	at org.apache.ignite.internal.processors.platform.compute.PlatformClosureJob.execute0(PlatformClosureJob.java:75)
      	at org.apache.ignite.internal.processors.platform.compute.PlatformAbstractJob.execute(PlatformAbstractJob.java:79)
      	at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:570)
      	at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:7121)
      	at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:564)
      	at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
      	at org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1344)
      	at org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:2102)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1328)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4600(GridIoManager.java:157)
      	at org.apache.ignite.internal.managers.communication.GridIoManager$8.execute(GridIoManager.java:1213)
      	at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:54)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.lang.AssertionError: tail=Tail[pageId=0002ffff00000284, cnt= 1, lvl=2, sibling=null], res=NOT_FOUND
      	at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:2083)
      	... 47 more 

      The reason behind it is a broken tree invariant that last leaf item must also be present in one of inner nodes (unless it’s a last element in the entire tree). In the assertion above it’s implied that all inner nodes chain from the leaf to the root have been checked and an item has not been found.

      Bug appears when Ignite node experiences slow inserts and very rapid deletions (caused by TTL, for example). Let’s examine the following tree configuration (assuming that 2 items in the node is the max):

                             [ R:0 ]
                           /         \
            [ X:0 | K:1 ]              ...
          /       |       \
      ...      [ K:1 ]      [ R:0 ] 

      Key K appears twice in the tree and update K=2 would be performed in two steps: inner node first, leaf node last. Code that does that can be found in BPlusTree.Put#found method. These is a time gap between these two modifications where none of tree’s pages are locked, and other threads are free to attempt any other tree updates. Corruption happens when other thread tries to remove R during the gap. Here are tree states during these operations:

      // First half of replace operation.
                             [ R:0 ]
                           /         \
            [ X:0 | K:2 ]              ...              // Replace happed in inner node at first.
          /       |       \
      ...      [ K:1 ]      [ R:0 ]
      
      
      // Remove R starts deleting an item from the leaf.
                             [ R:0 ]
                           /         \
            [ X:0 | K:2 ]              ...
          /       |       \
      ...      [ K:1 ]      [ ]                         // Leaf is empty, merge it about to happen.
      
      
      // Empty branch is merged with its left neighbour.
                      [ R:0 ]
                    /         \
            [ X:0 ]              ...                   // K:2 item has been removed during the merge.
          /         \
      ...             [ K:1 ]
      
      
      // R in the root is finally replaced with appropriate item and removal is complete.
                      [ K:1 ]                         // K:1 is copied from the leaf as the closest left item.
                    /         \
            [ X:0 ]              ...
          /         \
      ...             [ K:1 ]
      
      
      // K:2 thread wakes up and continues updating downwards, changing the value in the leaf. Operation completed.
                      [ K:1 ]
                    /         \
            [ X:0 ]              ...
          /         \
      ...             [ K:2 ] 

      As a result of these events, root points to a nonexistent value 1, which is most likely an invalid pointer. In case of H2 indexes, everything seems to be working, because actual columns are inlined and the pointer is never read. AssertionError from the beginning happens when one tries to delete K. As a last item in the leaf, it must have a copy in inner node, but it’s not the case since K:1 is not equal to K:2.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            ibessonov Ivan Bessonov
            ibessonov Ivan Bessonov
            Sergey Chugunov Sergey Chugunov
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 20m
                20m

                Slack

                  Issue deployment