  1. Ignite
  2. IGNITE-12845

GridNioServer can infinitely lose some events


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None

    Description

      With the selector optimization enabled (IGNITE_NO_SELECTOR_OPTS = false, the default), GridNioServer can lose some events for a channel, depending on the JDK version and OS. This can cause connected applications to hang. Reproducer: 

          public void testConcurrentLoad() throws Exception {
              startGrid(0);
      
              try (IgniteClient client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
                  ClientCache<Integer, Integer> cache = client.getOrCreateCache(DEFAULT_CACHE_NAME);
      
                  GridTestUtils.runMultiThreaded(
                      () -> {
                          for (int i = 0; i < 1000; i++)
                              cache.put(i, i);
                      }, 5, "run-async");
              }
          }
      

      This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13, 14) and on Windows with some JDK versions (tested with JDK 11, 14). It passes on Windows with JDK 8, on Linux systems, and whenever the system property IGNITE_NO_SELECTOR_OPTS = true is set.
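
For anyone who needs the workaround mentioned above, here is a minimal sketch of setting the property programmatically. It assumes the property is read once, when the NIO server classes are first loaded, so it has to be set before the node starts; the class and method names are hypothetical. Passing -DIGNITE_NO_SELECTOR_OPTS=true on the JVM command line should have the same effect.

```java
/** Hypothetical helper: disables the selector key-set optimization via a system property. */
class DisableSelectorOpts {
    /** Property name taken from the issue description. */
    static final String PROP = "IGNITE_NO_SELECTOR_OPTS";

    /** Must be called before any Ignite node is started (assumption: the value is read once). */
    static void apply() {
        System.setProperty(PROP, "true");
    }

    public static void main(String[] args) {
        apply();

        // Start the Ignite node only after this point, e.g. Ignition.start(...).
        System.out.println(PROP + "=" + System.getProperty(PROP));
    }
}
```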

      The root cause: the optimized SelectedSelectionKeySet always returns false from its contains() method. That method is used by sun.nio.ch.SelectorImpl.processReadyEvents():

      if (selectedKeys.contains(ski)) {
          if (ski.translateAndUpdateReadyOps(rOps)) {
              return 1;
          }
      } else {
          ski.translateAndSetReadyOps(rOps);
          if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
              selectedKeys.add(ski);
              return 1;
          }
      }
      

      So, with a fair implementation, if a selection key is already in the selected keys set, its ready-operation flags are updated. With SelectedSelectionKeySet, the ready-operation flags are always overwritten and the selection key is added again, even if it is already in the set. Some SelectorImpl implementations can pass several events for one selection key to processReadyEvents() (for example, the macOS implementation, KQueueSelectorImpl, works this way). In that case, duplicate selection keys are added to selectedKeys and all events except the last are lost.
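
The effect can be reproduced with a toy model. The sketch below is not the JDK or Ignite code: Key, AlwaysFalseKeySet, and processReadyEvent are hypothetical stand-ins that keep only the contains/update/set logic from the snippet above and omit the interest-ops check. It delivers a "read" event and then a "write" event for the same key, first into a HashSet and then into a set whose contains() always returns false.

```java
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Toy model of the selector logic quoted above; not the JDK or Ignite classes. */
class LostEventsDemo {
    static final int OP_READ = 1;   // same value as SelectionKey.OP_READ
    static final int OP_WRITE = 4;  // same value as SelectionKey.OP_WRITE

    /** Hypothetical stand-in for a selection key. */
    static class Key {
        int readyOps;
    }

    /** Mimics the optimized set: add() appends, contains() always reports false. */
    static class AlwaysFalseKeySet extends AbstractSet<Key> {
        private final List<Key> keys = new ArrayList<>();

        @Override public boolean add(Key k) { return keys.add(k); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<Key> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified version of the processReadyEvents() branch shown above. */
    static void processReadyEvent(Set<Key> selectedKeys, Key key, int rOps) {
        if (selectedKeys.contains(key))
            key.readyOps |= rOps;   // "update": merge with flags already reported
        else {
            key.readyOps = rOps;    // "set": overwrite whatever was there
            selectedKeys.add(key);
        }
    }

    /** Delivers "read" then "write" for one key; returns {entries in set, final readyOps}. */
    static int[] run(Set<Key> selectedKeys) {
        Key key = new Key();
        processReadyEvent(selectedKeys, key, OP_READ);
        processReadyEvent(selectedKeys, key, OP_WRITE);
        return new int[] {selectedKeys.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = run(new HashSet<>());
        int[] optimized = run(new AlwaysFalseKeySet());
        System.out.println("fair: entries=" + fair[0] + ", readyOps=" + fair[1]);
        System.out.println("optimized: entries=" + optimized[0] + ", readyOps=" + optimized[1]);
    }
}
```

With the HashSet the key appears once and carries both flags (readyOps = 5). With the always-false set the key appears twice and carries only OP_WRITE (readyOps = 4), so the "read" event is gone.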

      As a result, two bad things happen in GridNioServer:

      1. Some event flags are lost and the worker doesn't perform the corresponding action (in the attached reproducer, the "channel is ready for reading" event is lost and, after some point in time, the workers never read the channel again).
      2. Selector keys are duplicated with the same event flags (in the attached reproducer, this is the "channel is ready for writing" event). The duplication leads to incorrect handling of the GridSelectorNioSessionImpl#procWrite flag: it will be false in some cases while the selector key's interestedOps still contain OP_WRITE, and this operation is never removed.

      Possible solutions:

      • A fair implementation of the SelectedSelectionKeySet.contains() method (this solves all of the problems but can be resource-consuming).
      • Always set GridSelectorNioSessionImpl#procWrite to true when adding OP_WRITE to interestedOps (for example, in the AbstractNioClientWorker.registerWrite() method). In this case, some "channel is ready for reading" events (but not data) can still be lost, but not indefinitely, and the data will eventually be read. If events are reordered (first "channel is ready for writing", then "channel is ready for reading"), the write to the channel is processed only after all data has been read.
      • Exclude OP_WRITE from interestedOps even if GridSelectorNioSessionImpl#procWrite is false when there are no write requests in the queue (see the GridNioServer.stopPollingForWrite() method). This solution has the same shortcomings as the previous one.
      • A hybrid approach: use a probabilistic implementation of contains() (a Bloom filter, or just a check of the last element) and use one of the two previous solutions as a workaround for cases where contains() incorrectly returns false.
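
To make the "check the last element" variant of the hybrid approach concrete, here is a minimal sketch. LastElementKeySet is a hypothetical class, not the actual Ignite SelectedSelectionKeySet or the fix that was committed. It assumes that duplicate events for one key tend to arrive back to back, which this issue does not confirm; when they do not, contains() still returns a false negative, which is why the description pairs this option with one of the procWrite workarounds.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical array-backed key set whose contains() only checks the last added element. */
class LastElementKeySet<E> extends AbstractSet<E> {
    private Object[] keys = new Object[16];
    private int size;

    /** Appends without any duplicate check, like the optimized set described above. */
    @Override public boolean add(E e) {
        if (e == null)
            return false;

        if (size == keys.length)
            keys = Arrays.copyOf(keys, size * 2);

        keys[size++] = e;

        return true;
    }

    /** O(1) approximation: true only if {@code o} is the most recently added element. */
    @Override public boolean contains(Object o) {
        return size > 0 && keys[size - 1] == o;
    }

    @Override public int size() {
        return size;
    }

    /** Drops all references so the set can be reused for the next select cycle. */
    @Override public void clear() {
        Arrays.fill(keys, 0, size, null);
        size = 0;
    }

    @Override public Iterator<E> iterator() {
        return new Iterator<E>() {
            private int idx;

            @Override public boolean hasNext() {
                return idx < size;
            }

            @SuppressWarnings("unchecked")
            @Override public E next() {
                if (idx >= size)
                    throw new NoSuchElementException();

                return (E)keys[idx++];
            }
        };
    }
}
```

The check costs one comparison per event, so it keeps the allocation-free character of the optimized set, at the price of missing duplicates that are not adjacent.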

       

            People

              alex_pl Aleksey Plekhanov

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m