[SPARK-34537] Repartition miss/duplicated data - ASF JIRA

We have the following SQL statement:

INSERT OVERWRITE TABLE t1 
SELECT /*+ repartition(300) */ * from t2.

Below are the SQL metrics of the repartition ShuffleExchange. We can see that the number of shuffle records written is not the same as the number of records read.

In the result table, some rows are missing and some rows are duplicated.
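One way to confirm the missing and duplicated rows is to compare the result table against the source. The queries below are a rough sketch, not what was actually run for this report: they assume a hypothetical column `id` that uniquely identifies each row in t2, which the description does not state.

```sql
-- Assumption: `id` is a hypothetical column that is unique per row in t2.

-- Rows that appear more than once in the result table (duplicated).
SELECT id, COUNT(*) AS copies
FROM t1
GROUP BY id
HAVING COUNT(*) > 1;

-- Rows present in the source but absent from the result table (missing).
SELECT s.id
FROM t2 s
LEFT ANTI JOIN t1 r ON s.id = r.id;

-- Total row counts of the result and the source, side by side.
SELECT (SELECT COUNT(*) FROM t1) AS result_rows,
       (SELECT COUNT(*) FROM t2) AS source_rows;
```

If t2 has no unique column, grouping by all columns in the first query would serve the same purpose.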

We can see that InsertIntoHadoopFsRelationCommand's output is the same as the repartition Exchange's records read (reducer side),

and the repartition Exchange's shuffle records written (mapper side) is the same as the Filter's output.

So we can see that the repartition Exchange returns wrong data.

In our environment, AQE and speculation are enabled.