Ignite / IGNITE-8098

Getting affinity for topology version earlier than affinity is calculated because of data race


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.3
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

From time to time, an Ignite cluster with deployed services throws the following exception while some nodes are being restarted:

      java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=c770dbcf-2908-442d-8aa0-bf26a2aecfef, addrs=[10.44.162.169, 127.0.0.1], sockAddrs=[clrv0000041279.ic.ing.net/10.44.162.169:56500, /127.0.0.1:56500], discPort=56500, order=11, intOrder=8, lastExchangeTime=1520931375337, loc=true, ver=2.3.3#20180213-sha1:f446df34, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], head=AffinityTopologyVersion [topVer=15, minorTopVer=0], history=[AffinityTopologyVersion [topVer=11, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=1], AffinityTopologyVersion [topVer=12, minorTopVer=0], AffinityTopologyVersion [topVer=15, minorTopVer=0]]]

It looks like the cause of this issue is a data race in the GridServiceProcessor class: between the moment affinityReadyFuture(topVer).get() completes and the moment reassign(dep, topVer) runs, the affinity history can move past topVer (note head=15 vs. topVer=13 in the message above, with topVer=13 already evicted from the history), so the affinity lookup inside reassign() fails.
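
      For clarity, here is a minimal, self-contained model of that race (a sketch with made-up names; this is not Ignite code). A bounded history of versions is advanced by an "exchange" thread after the readiness check has already passed, so the later lookup fails the same way:

      import java.util.ArrayDeque;
      import java.util.Deque;

      // Hypothetical stand-in for the bounded affinity history kept in Ignite.
      class AffinityHistoryModel {
          private static final int MAX_HISTORY = 4;

          private final Deque<Integer> history = new ArrayDeque<>();

          // "Exchange" side: a new topology version is calculated, the oldest is evicted.
          synchronized void onNewTopologyVersion(int topVer) {
              history.addLast(topVer);

              while (history.size() > MAX_HISTORY)
                  history.removeFirst();
          }

          // "Service processor" side: look up affinity for a version awaited earlier.
          synchronized void affinity(int topVer) {
              if (!history.contains(topVer))
                  throw new IllegalStateException("Getting affinity for topology version " +
                      "earlier than affinity is calculated [topVer=" + topVer +
                      ", history=" + history + ']');
          }
      }

      public class RaceDemo {
          public static void main(String[] args) throws Exception {
              AffinityHistoryModel aff = new AffinityHistoryModel();

              aff.onNewTopologyVersion(13); // affinityReadyFuture(13).get() completes here.

              // The GET-to-REASSIGN window: other nodes join/leave and the exchange
              // side advances the history past topVer=13 before reassign() runs.
              Thread exchange = new Thread(() -> {
                  for (int v = 14; v <= 18; v++)
                      aff.onNewTopologyVersion(v);
              });

              exchange.start();
              exchange.join();

              Thread.sleep(100); // The injected delay from the reproduction step below.

              aff.affinity(13); // Throws IllegalStateException, as in the report.
          }
      }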

      How to reproduce:

1) To simulate the data race, update the following place in the source code:

      Class: GridServiceProcessor
      Method: @Override public void onEvent(final DiscoveryEvent evt, final DiscoCache discoCache) {
      Place:

      ....

      try {
          svcName.set(dep.configuration().getName());

          ctx.cache().internalCache(UTILITY_CACHE_NAME).context().affinity().
              affinityReadyFuture(topVer).get();

          // HERE (between GET and REASSIGN) add a delay, e.g. Thread.sleep(100):
          // try {
          //     Thread.sleep(100);
          // }
          // catch (InterruptedException e1) {
          //     e1.printStackTrace();
          // }

          reassign(dep, topVer);
      }
      catch (IgniteCheckedException ex) {
          if (!(ex instanceof ClusterTopologyCheckedException))
              LT.error(log, ex, "Failed to do service reassignment (will retry): " +
                  dep.configuration().getName());

          retries.add(dep);
      }

      ...

2) After that, imitate node start/shutdown iterations. To reproduce it I used GridServiceProcessorBatchDeploySelfTest (with the timeout on future.get() increased to avoid a timeout error); a standalone sketch of such a restart loop is shown below.
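
      For illustration, a standalone restart loop along these lines can stand in for the test (a minimal sketch, assuming Ignite 2.x on the classpath and the patched build from step 1; NoopService, the instance names and the iteration count are invented for the example, and this is not GridServiceProcessorBatchDeploySelfTest itself):

      import org.apache.ignite.Ignite;
      import org.apache.ignite.Ignition;
      import org.apache.ignite.configuration.IgniteConfiguration;
      import org.apache.ignite.services.Service;
      import org.apache.ignite.services.ServiceContext;

      public class RestartDriver {
          /** Trivial service so that GridServiceProcessor has a deployment to reassign. */
          public static class NoopService implements Service {
              @Override public void init(ServiceContext ctx) { /* No-op. */ }
              @Override public void execute(ServiceContext ctx) { /* No-op. */ }
              @Override public void cancel(ServiceContext ctx) { /* No-op. */ }
          }

          public static void main(String[] args) {
              // Stable server node that owns the deployed service for the whole run.
              Ignite server = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("server"));

              server.services().deployNodeSingleton("noop-svc", new NoopService());

              // Each start/stop bumps the topology version; with the sleep injected in
              // step 1, the GET-to-REASSIGN window is wide enough for the race to show up.
              for (int i = 0; i < 50; i++) {
                  Ignition.start(new IgniteConfiguration().setIgniteInstanceName("flapping-" + i));

                  Ignition.stop("flapping-" + i, true);
              }

              Ignition.stopAll(true);
          }
      }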


People

    Assignee: Unassigned
    Reporter: Andrei Aleksandrov (aealeksandrov)
    Votes: 0
    Watchers: 4
