Spark / SPARK-34534

New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues


      Description

      When `OneForOneBlockFetcher` is initialized, it now builds a new RPC message, `FetchShuffleBlocks`, in place of `OpenBlocks` in order to use the adaptive feature. This introduces the following problem.

      `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch success. It uses the index into `blockIds` to fetch each chunk, and uses the same index to look up the blockId in `blockIds` when the chunk data returns. The order of `blockIds` must therefore be consistent with the fetchChunk index, but the chunks returned for the new `FetchShuffleBlocks` message are not in the same order as `blockIds`.

      As a result, the returned data may not match the blockId, which can lead to data correctness problems when a fetch is retried after a block chunk fetch has failed.
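
      To make the index mismatch concrete, here is a minimal sketch rather than the Spark source. It assumes that the shuffle service returns chunks grouped by map id, in the order each map id first appears in the request. The class `ChunkOrderSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the ordering problem; not the Spark source.
class ChunkOrderSketch {

  // Assumption: the server streams chunks grouped by map id, in the order
  // each map id first appears in the request.
  static List<String> serverChunkOrder(String[] blockIds) {
    Map<String, List<String>> byMapId = new LinkedHashMap<>();
    for (String blockId : blockIds) {
      // Shuffle block ids look like "shuffle_<shuffleId>_<mapId>_<reduceId>".
      String mapId = blockId.split("_")[2];
      byMapId.computeIfAbsent(mapId, k -> new ArrayList<>()).add(blockId);
    }
    List<String> order = new ArrayList<>();
    for (List<String> group : byMapId.values()) {
      order.addAll(group);
    }
    return order;
  }

  // Returns the chunk indexes at which the client, matching by index,
  // would attribute the returned data to the wrong block id.
  static List<Integer> mismatchedIndexes(String[] blockIds) {
    List<String> served = serverChunkOrder(blockIds);
    List<Integer> mismatched = new ArrayList<>();
    for (int i = 0; i < blockIds.length; i++) {
      if (!blockIds[i].equals(served.get(i))) {
        mismatched.add(i);
      }
    }
    return mismatched;
  }

  public static void main(String[] args) {
    String[] blockIds = {"shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_1_1"};
    System.out.println(serverChunkOrder(blockIds));
    System.out.println(mismatchedIndexes(blockIds));
  }
}
```

      Under that assumption, a request whose map ids are interleaved is served in a different order than it was listed, so every index after the first regrouped block points at the wrong entry in `blockIds`.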

      The code that fetches chunks in order and matches the blockId when data returns is as follows:

      However, the fetch order in the shuffle service is different.

      So when a chunk fetch fails, the retry fetches the wrong block data because the blocks are in the wrong order.
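
      The retry case can be walked through with a small simulation. This is a sketch under stated assumptions, not the actual retry logic: it assumes that chunks received before the failure are attributed to block ids by index, and that the retry then requests every block id not yet marked as received and is served correctly. `RetrySketch` and `fetchWithOneFailure` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical walk-through of a retry; names and retry policy are assumptions.
class RetrySketch {

  // requested: the client-side blockIds order.
  // served:    the order in which the server actually returns chunks.
  // Chunks [0, failedIndex) succeed, chunk failedIndex fails, then the
  // client retries every block id it has not yet marked as received.
  // Returns a map from the block id the client believes it holds to the
  // block whose data it actually received.
  static Map<String, String> fetchWithOneFailure(
      List<String> requested, List<String> served, int failedIndex) {
    Map<String, String> received = new LinkedHashMap<>();
    // First attempt: data is attributed by index, not by content.
    for (int i = 0; i < failedIndex; i++) {
      received.put(requested.get(i), served.get(i));
    }
    // Retry (assumption): each remaining block id is fetched correctly.
    for (int i = failedIndex; i < requested.size(); i++) {
      received.put(requested.get(i), requested.get(i));
    }
    return received;
  }

  public static void main(String[] args) {
    List<String> requested =
        Arrays.asList("shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_1_1");
    List<String> served =
        Arrays.asList("shuffle_0_1_0", "shuffle_0_1_1", "shuffle_0_2_0");
    System.out.println(fetchWithOneFailure(requested, served, 2));
  }
}
```

      In this example the client ends up holding the data of `shuffle_0_1_1` twice (once under the label `shuffle_0_2_0`) and never receives the data of `shuffle_0_2_0`, which matches the data loss and correctness symptoms described above.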

       

       
