Geode / GEODE-6931

A failed RemotePutMessage can cause a PersistentReplicatesOfflineException to be thrown when no persistent members are offline

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: messaging
    • Labels: None

    Description

      One of the places from which a RemotePutMessage is sent is DistributedRegion.virtualPut.

      It is sent from this method in the following configuration:

      • two WAN sites
      • the member in the receiving site that processes the batch defines the region as replicate proxy
      • the other receiving site members define the region as replicate persistent

      DistributedRegion virtualPut is invoked by the GatewayReceiverCommand here:

      java.lang.Exception: Stack trace
      	at java.lang.Thread.dumpStack(Thread.java:1333)
      	at org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:341)
      	at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:162)
      	at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5549)
      	at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5200)
      	at org.apache.geode.internal.cache.tier.sockets.command.GatewayReceiverCommand.cmdExecute(GatewayReceiverCommand.java:429)
      

      In this case, requiresOneHopForMissingEntry called by virtualPut returns true since a proxy region with other persistent replicates can't generate a version tag. This causes RemotePutMessage.distribute to be called.

      If RemotePutMessage.distribute returns false (so didDistribute is false, meaning the distribution failed), a PersistentReplicatesOfflineException is thrown regardless of the actual exception on the remote member:

      if (!generateVersionTag && !didDistribute) {
        throw new PersistentReplicatesOfflineException();
      }
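
      The effect of that check can be illustrated with a small self-contained model. This is only a sketch, not Geode code: the RemoteOutcome enum, the distribute stand-in, and the String return value are assumptions made for illustration. It shows that once the remote result is reduced to a boolean, every failure cause looks the same to the caller.

```java
public class VirtualPutSketch {
  // Stand-in for Geode's exception of the same name (simplified for this sketch).
  static class PersistentReplicatesOfflineException extends RuntimeException {}

  // Hypothetical outcomes of processing the put on the remote member.
  enum RemoteOutcome { APPLIED, REPLICATE_OFFLINE, WAN_CONFLICT }

  // Stand-in for RemotePutMessage.distribute: only success or failure survives.
  static boolean distribute(RemoteOutcome outcome) {
    return outcome == RemoteOutcome.APPLIED;
  }

  // Applies the check quoted above and returns the name of what the caller sees.
  static String virtualPut(boolean generateVersionTag, RemoteOutcome outcome) {
    boolean didDistribute = distribute(outcome);
    try {
      if (!generateVersionTag && !didDistribute) {
        throw new PersistentReplicatesOfflineException();
      }
      return "ok";
    } catch (PersistentReplicatesOfflineException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    for (RemoteOutcome outcome : RemoteOutcome.values()) {
      System.out.println(outcome + " -> " + virtualPut(false, outcome));
    }
  }
}
```

      In this model a WAN conflict and a genuinely offline replicate both surface as PersistentReplicatesOfflineException.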
      

      One way didDistribute can be false is if the remote WAN site and the local WAN site update the same key at the same time. In that case a ConcurrentCacheModificationException can occur in the replicate persistent member (the one processing the RemotePutMessage).

      This exception is not logged anywhere, and RemotePutMessage operateOnRegion doesn't know anything about it.

      RemotePutMessage operateOnRegion running in the replicate persistent member calls:

      result = r.getDataView().putEntry(event, this.ifNew, this.ifOld, this.expectedOldValue,
          this.requireOldValue, this.lastModified, true);
      

      If putEntry returns false, operateOnRegion throws a RemoteOperationException, which is sent back to the caller and causes didDistribute to be false.

      The result can be false in the RemotePutMessage operateOnRegion method because of a ConcurrentCacheModificationException:

      org.apache.geode.internal.cache.versions.ConcurrentCacheModificationException: conflicting WAN event detected
      	at org.apache.geode.internal.cache.entries.AbstractRegionEntry.processGatewayTag(AbstractRegionEntry.java:1924)
      	at org.apache.geode.internal.cache.entries.AbstractRegionEntry.processVersionTag(AbstractRegionEntry.java:1443)
      	at org.apache.geode.internal.cache.entries.AbstractOplogDiskRegionEntry.processVersionTag(AbstractOplogDiskRegionEntry.java:165)
      	at org.apache.geode.internal.cache.entries.VersionedThinDiskLRURegionEntryHeapStringKey1.processVersionTag(VersionedThinDiskLRURegionEntryHeapStringKey1.java:378)
      	at org.apache.geode.internal.cache.AbstractRegionMap.processVersionTag(AbstractRegionMap.java:527)
      	at org.apache.geode.internal.cache.map.RegionMapPut.updateEntry(RegionMapPut.java:484)
      	at org.apache.geode.internal.cache.map.RegionMapPut.createOrUpdateEntry(RegionMapPut.java:256)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:300)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
      	at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
      	at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
      	at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2047)
      	at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5569)
      	at org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:386)
      	at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:162)
      	at org.apache.geode.internal.cache.tx.RemotePutMessage.operateOnRegion(RemotePutMessage.java:635)
      	at org.apache.geode.internal.cache.tx.RemoteOperationMessage.process(RemoteOperationMessage.java:195)
      

      This exception is caught in LocalRegion.virtualPut but not logged, so there is no evidence of it. LocalRegion.virtualPut just returns false in that case.
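
      The way the original cause disappears on the remote member can be modeled roughly as follows. This is a simplified sketch, not the Geode implementation: the exception classes are local unchecked stand-ins, the method bodies are reduced to the control flow described above, and the RemoteOperationException message is hypothetical.

```java
public class SwallowedConflictSketch {
  // Local stand-ins for the Geode exception types (simplified, unchecked).
  static class ConcurrentCacheModificationException extends RuntimeException {
    ConcurrentCacheModificationException(String message) { super(message); }
  }

  static class RemoteOperationException extends RuntimeException {
    RemoteOperationException(String message) { super(message); }
  }

  // Models the region map put that detects the conflicting WAN event.
  static boolean basicPut(boolean conflictingWanEvent) {
    if (conflictingWanEvent) {
      throw new ConcurrentCacheModificationException("conflicting WAN event detected");
    }
    return true;
  }

  // Models LocalRegion.virtualPut: the conflict is caught, not logged, and becomes false.
  static boolean virtualPut(boolean conflictingWanEvent) {
    try {
      return basicPut(conflictingWanEvent);
    } catch (ConcurrentCacheModificationException e) {
      return false;
    }
  }

  // Models RemotePutMessage.operateOnRegion: a false result becomes a generic failure.
  static void operateOnRegion(boolean conflictingWanEvent) {
    if (!virtualPut(conflictingWanEvent)) {
      // Hypothetical message; the point is that it carries no trace of the conflict.
      throw new RemoteOperationException("put was not applied");
    }
  }

  public static void main(String[] args) {
    try {
      operateOnRegion(true);
    } catch (RemoteOperationException e) {
      System.out.println("cause: " + e.getCause() + ", message: " + e.getMessage());
    }
  }
}
```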

      So, to the caller, it looks like a persistent replicate is offline when it isn't.

      A GatewayConflictResolver can help detect this case. If the resolver accepts the WAN event, the exceptions do not occur. If the resolver rejects the WAN event, the exceptions do occur.

      All the exceptions really mean is that the WAN event was rejected because it conflicted with a local event on the same key.
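
      A resolver that records what it rejects is one way to get that evidence. The sketch below is an assumption-laden illustration: so that it compiles on its own, it declares minimal local stand-ins for GatewayConflictResolver, TimestampedEntryEvent, and GatewayConflictHelper containing only the methods used here, and the reject-older-timestamp policy, the RecordingResolver class, and the event helper are hypothetical. A real resolver would implement org.apache.geode.cache.util.GatewayConflictResolver instead.

```java
import java.util.ArrayList;
import java.util.List;

public class LoggingResolverSketch {
  // Minimal local stand-ins for the Geode types, reduced to the methods used below.
  interface TimestampedEntryEvent {
    Object getKey();
    long getOldTimestamp();
    long getNewTimestamp();
  }

  interface GatewayConflictHelper {
    void disallowEvent();
  }

  interface GatewayConflictResolver {
    void onEvent(TimestampedEntryEvent event, GatewayConflictHelper helper);
  }

  // Hypothetical resolver: rejects WAN events older than the current entry and records them.
  static class RecordingResolver implements GatewayConflictResolver {
    final List<Object> rejectedKeys = new ArrayList<>();

    @Override
    public void onEvent(TimestampedEntryEvent event, GatewayConflictHelper helper) {
      if (event.getNewTimestamp() < event.getOldTimestamp()) {
        rejectedKeys.add(event.getKey()); // evidence that a WAN event was rejected
        helper.disallowEvent();
      }
    }
  }

  // Hypothetical helper that builds an event for the demo.
  static TimestampedEntryEvent event(Object key, long oldTimestamp, long newTimestamp) {
    return new TimestampedEntryEvent() {
      @Override public Object getKey() { return key; }
      @Override public long getOldTimestamp() { return oldTimestamp; }
      @Override public long getNewTimestamp() { return newTimestamp; }
    };
  }

  public static void main(String[] args) {
    RecordingResolver resolver = new RecordingResolver();
    resolver.onEvent(event("key-1", 200L, 100L), () -> System.out.println("event disallowed"));
    System.out.println("rejected keys: " + resolver.rejectedKeys);
  }
}
```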

      It would be nice if instead of RemotePutMessage operateOnRegion returning a generic RemoteOperationException, an actual ConcurrentCacheModificationException could be returned (or at least a RemoteOperationException with the ConcurrentCacheModificationException message). Short of that, logging the ConcurrentCacheModificationException and throwing something other than the PersistentReplicatesOfflineException in DistributedRegion virtualPut would be better.
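
      One possible shape for that improvement is sketched below. It is a hypothetical illustration, not a patch against Geode: the exception classes are local unchecked stand-ins, and the two-argument RemoteOperationException constructor, the log message, and the String results are assumptions. The idea is that the remote side logs the conflict and passes its message and cause along, so the caller can tell a rejected WAN event apart from offline replicates.

```java
import java.util.logging.Logger;

public class PropagatedConflictSketch {
  private static final Logger LOGGER =
      Logger.getLogger(PropagatedConflictSketch.class.getName());

  // Local stand-ins for the Geode exception types (simplified, unchecked).
  static class ConcurrentCacheModificationException extends RuntimeException {
    ConcurrentCacheModificationException(String message) { super(message); }
  }

  static class RemoteOperationException extends RuntimeException {
    RemoteOperationException(String message, Throwable cause) { super(message, cause); }
  }

  // Models the put that detects the conflicting WAN event.
  static void basicPut(boolean conflictingWanEvent) {
    if (conflictingWanEvent) {
      throw new ConcurrentCacheModificationException("conflicting WAN event detected");
    }
  }

  // Remote side: log the conflict and keep its message and cause in the reply.
  static void operateOnRegion(boolean conflictingWanEvent) {
    try {
      basicPut(conflictingWanEvent);
    } catch (ConcurrentCacheModificationException e) {
      LOGGER.warning("Put rejected: " + e.getMessage());
      throw new RemoteOperationException(e.getMessage(), e);
    }
  }

  // Caller side: report the real reason instead of assuming replicates are offline.
  static String virtualPut(boolean conflictingWanEvent) {
    try {
      operateOnRegion(conflictingWanEvent);
      return "applied";
    } catch (RemoteOperationException e) {
      if (e.getCause() instanceof ConcurrentCacheModificationException) {
        return "rejected: " + e.getMessage();
      }
      return "replicates offline";
    }
  }

  public static void main(String[] args) {
    System.out.println(virtualPut(false));
    System.out.println(virtualPut(true));
  }
}
```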

          People

            Assignee: Unassigned
            Reporter: Barrett Oglesby (boglesby)