Ignite / IGNITE-10898

Exchange coordinator failover breaks in some cases when node filter is used

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • None
    • 2.8
    • None
    • None

    Description

      Currently, if a node does not pass a cache's node filter, we do not store that cache's affinity on the node unless the node is the coordinator. This, however, may fail in the following scenario:
      1) A node that passes the node filter joins the cluster.
      2) During the join, the coordinator fails, and a new coordinator is selected for which the previous exchange has already completed.
      3) The new coordinator attempts to fetch the affinity, and the joining node resends its partitions single message. There are two problems here. First, the exchange fast-reply does not wait for the new affinity to be initialized, which results in an IllegalStateException. Second, such an attempt to fetch affinity may lead either to a deadlock or to incorrectly fetched affinity (basically, the coordinator must be in consensus with the other nodes that pass the node filter).
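
      The first problem in step 3 can be illustrated with a small, self-contained sketch. It is not Ignite code: FastReplySketch, AffinityHolder and their methods are hypothetical names, and a CompletableFuture stands in for whatever mechanism signals that affinity is initialized. The sketch only shows why a reply path that reads affinity before initialization fails, while a reply path that waits for initialization does not.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class FastReplySketch {
    /** Hypothetical holder for one cache's affinity on the new coordinator. */
    static final class AffinityHolder {
        private final CompletableFuture<List<String>> ready = new CompletableFuture<>();

        /** Called once the affinity has been fetched or calculated. */
        void initialize(List<String> assignment) {
            ready.complete(assignment);
        }

        /** Non-blocking read: fails if initialization has not finished yet. */
        List<String> assignmentNow() {
            List<String> assignment = ready.getNow(null);
            if (assignment == null)
                throw new IllegalStateException("Affinity is not initialized");
            return assignment;
        }

        /** Deferred read: completes only after initialize() has been called. */
        CompletableFuture<List<String>> assignmentWhenReady() {
            return ready;
        }
    }

    public static void main(String[] args) {
        AffinityHolder holder = new AffinityHolder();

        // A reply that is chained on the future waits for initialization.
        CompletableFuture<String> reply =
            holder.assignmentWhenReady().thenApply(a -> "reply with " + a);

        // A reply that reads the affinity immediately fails.
        try {
            holder.assignmentNow();
        } catch (IllegalStateException e) {
            System.out.println("fast reply failed: " + e.getMessage());
        }

        holder.initialize(List.of("data-1", "data-2"));
        System.out.println(reply.join());
    }
}
```

      Waiting in this way addresses only the IllegalStateException. It does not address the second problem, where the fetched affinity may be wrong or the fetch may deadlock.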

      The attached test reproduces the issue.

      I suggest always calculating and keeping affinity on all nodes, even those that do not pass the filter. In this case, there will be no need to fetch and recalculate affinity (initCoordinatorCaches will go away).
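
      A minimal sketch of the idea behind this suggestion, under stated assumptions: AffinitySketch and calculateAffinity are hypothetical names, nodes are plain string IDs, and a sorted round-robin assignment stands in for the real affinity function. The point is only that when the calculation is a deterministic function of the topology and the node filter, every node (including one that does not pass the filter) can compute and keep the same result locally, so a newly elected coordinator has nothing to fetch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

final class AffinitySketch {
    /**
     * Assigns each partition to one node that passes the filter.
     * The result depends only on the set of nodes and the filter,
     * not on which node performs the calculation.
     */
    static List<String> calculateAffinity(Collection<String> topology,
                                          Predicate<String> nodeFilter,
                                          int partitions) {
        List<String> owners = new ArrayList<>();
        for (String node : topology) {
            if (nodeFilter.test(node))
                owners.add(node);
        }
        Collections.sort(owners); // Fixed order makes the result deterministic.

        List<String> assignment = new ArrayList<>();
        if (owners.isEmpty())
            return assignment; // No node may own partitions of this cache.

        for (int p = 0; p < partitions; p++)
            assignment.add(owners.get(p % owners.size()));
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical filter: only nodes named "data-*" may hold the cache.
        Predicate<String> filter = id -> id.startsWith("data");
        List<String> topology = List.of("coord-1", "coord-2", "data-1", "data-2");

        // Every node, filtered or not, calculates and keeps the affinity.
        Map<String, List<String>> localAffinity = new TreeMap<>();
        for (String node : topology)
            localAffinity.put(node, calculateAffinity(topology, filter, 4));

        // If coord-1 fails, coord-2 already holds the same assignment as data-1.
        System.out.println(
            localAffinity.get("coord-2").equals(localAffinity.get("data-1")));
    }
}
```

      The trade-off in this sketch is that nodes outside the filter spend memory and CPU on affinity for caches they never host, in exchange for removing the fetch step during coordinator failover.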

          People

            DmitriyGovorukhin Dmitriy Govorukhin
            agoncharuk Alexey Goncharuk
            Votes: 0
            Watchers: 3

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 50m