Spark / SPARK-34534

New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues


Details

    Description

      When `OneForOneBlockFetcher` is initialized, it now builds a new RPC message, `FetchShuffleBlocks`, in place of `OpenBlocks` in order to use the adaptive feature. This introduces the following problem.

      `OneForOneBlockFetcher` initializes a `blockIds` String array to track successful chunk fetches. It uses the index into `blockIds` to fetch each chunk, and when chunk data returns it uses the same index to look up the matching blockId. The order of `blockIds` must therefore be consistent with the fetchChunk index, but the order in which chunks are returned for the new `FetchShuffleBlocks` message is not the same as the order of `blockIds`.
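The index-based matching described above can be illustrated with a small simulation. This is a minimal sketch, not Spark code: `server_chunk_order` and `label_chunks` are hypothetical helpers, and the assumption that the server streams chunks grouped by map ID is made only for illustration (the report says only that the returned order differs from `blockIds`).

```python
def server_chunk_order(block_ids):
    """Hypothetical server: group the requested blocks by map ID and
    stream them group by group (an assumed ordering, for illustration)."""
    groups = {}
    for block_id in block_ids:
        # Block IDs here follow the form shuffle_<shuffleId>_<mapId>_<reduceId>.
        map_id = block_id.split("_")[2]
        groups.setdefault(map_id, []).append(block_id)
    return [b for group in groups.values() for b in group]


def label_chunks(block_ids, chunks):
    """Client side: chunk i is assumed to hold the data of block_ids[i]."""
    return [(block_ids[i], chunk) for i, chunk in enumerate(chunks)]


requested = ["shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_1_1"]

# Each chunk's payload is modeled as the ID of the block it really contains.
chunks = server_chunk_order(requested)
labeled = label_chunks(requested, chunks)

# Pairs of (label the client assigns, block the data actually belongs to).
mismatched = [(label, actual) for label, actual in labeled if label != actual]
print(mismatched)
```

In this model the second and third chunks are attributed to the wrong block IDs, because the client's array order and the server's streaming order disagree.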

      As a result, the returned data does not match the blockId, which can cause data correctness problems when the fetch is retried after a block chunk fetch fails.

      The code that fetches chunks in order, and the code that matches the blockId when data is returned, is as follows (see the attached screenshots):

      However, the shuffle service uses a different fetch order (see the attached screenshots).

      So, when a chunk fetch fails, the retry will fetch data for the wrong blocks because the blocks are in the wrong order.
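The retry consequence can be sketched with a simplified model. This is an illustration under stated assumptions, not the actual Spark implementation: the `fetch` helper is hypothetical, the server is assumed to stream chunks grouped by map ID, and on a failure at chunk index `i` the client is assumed to treat `block_ids[i:]` as still outstanding and request only those again.

```python
from collections import Counter


def server_chunk_order(block_ids):
    """Hypothetical server ordering: blocks grouped by map ID (assumed)."""
    groups = {}
    for block_id in block_ids:
        map_id = block_id.split("_")[2]
        groups.setdefault(map_id, []).append(block_id)
    return [b for group in groups.values() for b in group]


def fetch(block_ids, fail_at=None):
    """Simulate one fetch attempt.

    Returns (received, outstanding): received maps the label the client
    assigned to the block whose data actually arrived; outstanding lists
    the block IDs the client believes it still has to fetch.
    """
    received = {}
    for index, payload in enumerate(server_chunk_order(block_ids)):
        if index == fail_at:
            return received, block_ids[index:]
        received[block_ids[index]] = payload
    return received, []


requested = ["shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_1_1"]

first, outstanding = fetch(requested, fail_at=2)  # third chunk fails
retry, _ = fetch(outstanding)                     # retry what looks outstanding

delivered = Counter(list(first.values()) + list(retry.values()))
missing = [b for b in requested if delivered[b] == 0]
duplicated = [b for b in requested if delivered[b] > 1]
print("missing:", missing, "duplicated:", duplicated)
```

In this model one block's data is delivered twice and another block's data never arrives, which matches the data loss and correctness symptoms described in this report.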

       

       

      Attachments

        1. image-2021-02-25-11-31-59-110.png
          326 kB
          haiyangyu
        2. image-2021-02-25-11-30-03-834.png
          692 kB
          haiyangyu
        3. image-2021-02-25-11-28-31-255.png
          369 kB
          haiyangyu
        4. image-2021-02-25-11-27-34-429.png
          680 kB
          haiyangyu
        5. image-2021-02-25-11-17-12-714.png
          712 kB
          haiyangyu

          People

            Assignee: Unassigned
            Reporter: yuhaiyang (haiyangyu)