  Flink / FLINK-21642

RequestReplyFunction recovery fails with a remote SDK

Details

    Description

      While extending our smoke e2e test to use the remote SDKs, I stumbled upon a bug in the RequestReplyFunction: we get an unknown state exception after recovery.

      The exact scenario that triggers the bug is:

      1. There is a request in flight.
      2. A failure occurs that causes the job to restart.
      3. On restore, we start with no managed state.
      4. But we try to re-send exactly the same ToFunction message to the SDK.
      5. That ToFunction contains the state definitions from the previous attempt (before the failure).
      6. The SDK processes this message normally (it has all the state definitions that it knows).
      7. The SDK responds with a state mutation.
      8. The PersistedRemoteFunctionValues fails with an unknown state exception.

       

      We need to treat the in-flight ToFunction message as a retryBatch after restore, instead of re-sending it as-is.
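
      The scenario and the proposed fix can be sketched with a small, self-contained toy model. This is only an illustration under stated assumptions, not the actual StateFun code: ManagedState, Request, Response, Sdk, resendAsIs, the retryBatch method and the state name "seen_count" are all hypothetical stand-ins. In particular, the sketch assumes that when the SDK receives a request that lacks a state it needs, it replies by asking the runtime to register that state, after which the batch is sent again.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the scenario above. Every type here is hypothetical, not a StateFun class. */
final class RecoverySketch {

    /** Stand-in for the runtime's managed state registry. */
    static final class ManagedState {
        private final Set<String> registered = new HashSet<>();

        void register(String name) { registered.add(name); }

        Set<String> names() { return new HashSet<>(registered); }

        /** Applies a mutation; fails for a state the runtime has not registered. */
        void mutate(String name) {
            if (!registered.contains(name)) {
                throw new IllegalStateException("unknown state: " + name);
            }
        }
    }

    /** Stand-in for a ToFunction message: the batch plus the state names it declares. */
    static final class Request {
        final List<String> batch;
        final Set<String> declaredStates;

        Request(List<String> batch, Set<String> declaredStates) {
            this.batch = batch;
            this.declaredStates = declaredStates;
        }
    }

    /** Stand-in for the SDK's reply: either a missing state or a state to mutate. */
    static final class Response {
        final String missingState; // non-null: the SDK asks the runtime to register it
        final String mutatedState; // non-null: the SDK wants to mutate this state

        Response(String missingState, String mutatedState) {
            this.missingState = missingState;
            this.mutatedState = mutatedState;
        }
    }

    /** Stand-in for the remote SDK, whose function needs one state, "seen_count". */
    static final class Sdk {
        Response handle(Request request) {
            return request.declaredStates.contains("seen_count")
                    ? new Response(null, "seen_count")
                    : new Response("seen_count", null);
        }
    }

    /** Buggy path: re-send the pre-failure message unchanged (steps 4 to 8). */
    static void resendAsIs(ManagedState restored, Request stale, Sdk sdk) {
        Response response = sdk.handle(stale);
        restored.mutate(response.mutatedState); // fails if the state is not registered
    }

    /** Proposed path: rebuild the message from the batch and the current managed state. */
    static int retryBatch(ManagedState restored, List<String> batch, Sdk sdk) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            Response response = sdk.handle(new Request(batch, restored.names()));
            if (response.missingState != null) {
                restored.register(response.missingState); // assumed registration handshake
                continue;
            }
            restored.mutate(response.mutatedState);
            return attempt;
        }
        throw new IllegalStateException("retry limit reached");
    }

    public static void main(String[] args) {
        Sdk sdk = new Sdk();
        List<String> batch = Collections.singletonList("msg-1");
        // The stale message still declares the state from before the failure.
        Request stale = new Request(batch, Collections.singleton("seen_count"));
        try {
            resendAsIs(new ManagedState(), stale, sdk);
        } catch (IllegalStateException e) {
            System.out.println("as-is resend failed: " + e.getMessage());
        }
        System.out.println("retryBatch attempts: " + retryBatch(new ManagedState(), batch, sdk));
    }
}
```

      In this model the as-is resend fails because the stale message declares a state that the restored runtime no longer knows, while the retry path rebuilds the message from the runtime's current (empty) view and lets the state be registered before the mutation is applied.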

       

            People

              igal Igal Shilman