Spark / SPARK-34534

New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues


Details

    Description

      When `OneForOneBlockFetcher` is initialized, it builds a new RPC message, `FetchShuffleBlocks`, in place of `OpenBlocks` in order to use the adaptive feature. This introduces the following problem.

      `OneForOneBlockFetcher` initializes a `blockIds` String array to track successful chunk fetches. It uses the index into `blockIds` to fetch each chunk, and uses the same index to look up the blockId in `blockIds` when the chunk data returns. The order of `blockIds` must therefore be consistent with the fetchChunk index, but the chunks returned for the new `FetchShuffleBlocks` message are not in the same order as `blockIds`.

      As a result, the returned data does not match the blockId, which can lead to data correctness problems when the fetch is retried after a block chunk fetch fails.

      The code that fetches chunks in order, and the code that matches the blockId when data is returned, is as follows:

      However, the shuffle service fetches the blocks in a different order.

      So, when a chunk fetch fails, the retry fetches wrong block data because the blocks are in the wrong order.
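
The index mismatch described above can be illustrated with a small, self-contained sketch. It is not the Spark implementation: the class `ChunkOrderSketch` and its methods are hypothetical, and the assumption that the server returns chunks grouped by map id (in first-seen order) is made only for illustration. Block ids are assumed to have the form `shuffle_<shuffleId>_<mapId>_<reduceId>`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Spark code: shows how a client-side blockIds array
// can disagree with the order in which a server returns chunks.
class ChunkOrderSketch {

  // ASSUMPTION: the server returns chunks grouped by map id, keeping the
  // order in which each map id first appears in the request.
  static List<String> serverChunkOrder(String[] blockIds) {
    Map<String, List<String>> byMapId = new LinkedHashMap<>();
    for (String id : blockIds) {
      String[] parts = id.split("_"); // shuffle_<shuffleId>_<mapId>_<reduceId>
      byMapId.computeIfAbsent(parts[2], k -> new ArrayList<>()).add(id);
    }
    List<String> order = new ArrayList<>();
    for (List<String> group : byMapId.values()) {
      order.addAll(group);
    }
    return order;
  }

  // Chunk indices where blockIds[i] names a different block than the one
  // the server actually returns as chunk i.
  static List<Integer> mismatchedIndices(String[] blockIds) {
    List<String> served = serverChunkOrder(blockIds);
    List<Integer> mismatched = new ArrayList<>();
    for (int i = 0; i < blockIds.length; i++) {
      if (!blockIds[i].equals(served.get(i))) {
        mismatched.add(i);
      }
    }
    return mismatched;
  }

  public static void main(String[] args) {
    // Requested order interleaves map ids 1 and 2.
    String[] blockIds = {"shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_1_1"};
    System.out.println("served:     " + serverChunkOrder(blockIds));
    System.out.println("mismatched: " + mismatchedIndices(blockIds));
  }
}
```

Under these assumptions, chunk 1 holds the data of `shuffle_0_1_1`, while the client labels it `blockIds[1]`, which is `shuffle_0_2_0`. A request whose block ids are already grouped by map id produces no mismatch in this model.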

          People

            Assignee: Unassigned
            Reporter: yuhaiyang (haiyangyu)