Spark / SPARK-34537

Repartition misses/duplicates data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      We have a SQL statement:

      INSERT OVERWRITE TABLE t1
      SELECT /*+ repartition(300) */ * FROM t2;

      The SQL metrics of the repartition ShuffleExchange (see the attached screenshots) show that the number of shuffle records written does not match the number of records read.

      In the result table, some rows are missing and some are duplicated.

      The output of InsertIntoHadoopFsRelationCommand is the same as the repartition Exchange's records read (reducer side),

      and the repartition Exchange's shuffle records written (mapper side) is the same as the Filter's output.

      So the repartition Exchange itself returns wrong data.
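
      One way to confirm the discrepancy independently of the UI metrics is to compare the two tables directly. The queries below are a minimal sketch, not part of the original report. They assume that t1 and t2 have the same schema, that t2 has not changed since the insert ran, and that no column has a type that set operations reject (for example, map columns).

      ```sql
      -- Row counts of the source and the result table should match.
      SELECT 't2' AS table_name, COUNT(*) AS row_count FROM t2
      UNION ALL
      SELECT 't1' AS table_name, COUNT(*) AS row_count FROM t1;

      -- Rows (counted with multiplicity) in t2 but not in t1: missing data.
      SELECT * FROM t2
      EXCEPT ALL
      SELECT * FROM t1;

      -- Rows (counted with multiplicity) in t1 but not in t2: duplicated data.
      SELECT * FROM t1
      EXCEPT ALL
      SELECT * FROM t2;
      ```

      If both EXCEPT ALL queries return no rows, the tables hold the same multiset of rows and the mismatch is limited to the reported metrics.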

       

      In our environment, both AQE and speculation are enabled.
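
      To narrow down whether either feature is involved, the settings can be inspected and AQE toggled from the SQL session. This is a sketch under assumptions: spark.sql.adaptive.enabled and spark.speculation are taken to be the relevant configuration keys, and speculation is assumed to be fixed at application launch rather than changeable per session.

      ```sql
      -- Show the current values of the two settings.
      SET spark.sql.adaptive.enabled;
      SET spark.speculation;

      -- Disable AQE for this session, then rerun the INSERT and compare the metrics.
      SET spark.sql.adaptive.enabled=false;

      -- spark.speculation is an application-level setting; to test without it,
      -- restart the application with: --conf spark.speculation=false
      ```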

      Attachments

        1. image-2021-02-25-19-47-10-005.png
          81 kB
          angerszhu
        2. image-2021-02-25-19-46-52-809.png
          212 kB
          angerszhu
        3. image-2021-02-25-19-43-49-687.png
          47 kB
          angerszhu


          People

            Assignee: Unassigned
            Reporter: angerszhu
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: