Description
(In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3)
I found that when AQE is enabled, the following code does not produce sorted output (.drop() has the same problem), unless spark.sql.optimizer.plannedWrite.enabled is set to false.
After further investigation, it turns out Spark actually sorts the wrong column in the following code.
dataset.sort("_1")
.select("_2", "_3")
.write()
.partitionBy("_2")
.text("output");
(the following workaround is no longer necessary)
However, if I insert an identity mapper between the select and the write, the output is sorted as expected.
dataset = dataset.sort("_1")
.select("_2", "_3");
dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())
.write()
.partitionBy("_2")
.text("output");
Below is the complete code that reproduces the problem.
Attachments
Issue Links
- is caused by SPARK-37287 Pull out dynamic partition and bucket sort from FileFormatWriter (Resolved)