Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30602 SPIP: Support push-based shuffle to improve shuffle efficiency
  3. SPARK-34840

Fix cases of corruption in merged shuffle blocks that are pushed

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.1.0
    • 3.1.2, 3.2.0
    • Shuffle
    • None

    Description

      The RemoteBlockPushResolver which handles the shuffle push blocks and merges them was introduced in #30062. We have identified 2 scenarios where the merged blocks get corrupted:

      1. StreamCallback.onFailure() is called more than once. Initially we assumed that the onFailure callback will be called just once per stream. However, we observed that this is called twice when a client connection is reset. When the client connection is reset then there are 2 events that get triggered in this order.
      • exceptionCaught. This event is propagated to StreamInterceptorStreamInterceptor.exceptionCaught() invokes callback.onFailure(streamId, cause). This is the first time StreamCallback.onFailure() will be invoked.
      • channelInactive. Since the channel closes, the channelInactive event gets triggered which again is propagated to StreamInterceptorStreamInterceptor.channelInactive() invokes callback.onFailure(streamId, new ClosedChannelException()). This is the second time StreamCallback.onFailure() will be invoked.
      1. The flag isWriting is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails.

      Also adding additional changes that improve the code.

      1. Using positional writes all the time because this simplifies the code and with microbenchmarking haven't seen any performance impact.
      2. Additional minor changes.

      Attachments

        Issue Links

          Activity

            People

              csingh Chandni Singh
              csingh Chandni Singh
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: