[SPARK-44512] dataset.sort.select.write.partitionBy sorts wrong column - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Not A Problem
Affects Version/s: 3.4.1
Fix Version/s: None
Component/s: Optimizer, SQL
Labels: None

Description

(In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3)

I found ~~then when AQE is enabled,~~ that the following code does not produce sorted output (.drop() has the same problem) unless spark.sql.optimizer.plannedWrite.enabled is set to false.

After further investigation, it turns out that Spark actually sorts the wrong column in the following code.

dataset.sort("_1")
.select("_2", "_3")
.write()
.partitionBy("_2")
.text("output");
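
The symptom can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark code: the Row class, the sample data and all method names are hypothetical. It assumes, in line with the report above and the linked SPARK-37287, that the rows reaching the writer end up ordered by the partition column _2 only, instead of by the requested column _1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortColumnSketch {
    // Hypothetical stand-in for one Tuple3 row (_1, _2, _3); not a Spark type.
    public static final class Row {
        final int c1;
        final String c2;
        final String c3;

        Row(int c1, String c2, String c3) {
            this.c1 = c1;
            this.c2 = c2;
            this.c3 = c3;
        }
    }

    // Made-up sample data: two partition values ("a" and "b") with _1 out of order.
    public static List<Row> sampleRows() {
        return Arrays.asList(
            new Row(3, "a", "x"), new Row(1, "b", "y"),
            new Row(2, "a", "z"), new Row(0, "b", "w"));
    }

    // What the caller asked for: rows ordered by _1.
    public static List<Row> sortByFirst(List<Row> rows) {
        List<Row> out = new ArrayList<>(rows);
        out.sort(Comparator.comparingInt((Row r) -> r.c1));
        return out;
    }

    // Assumed effective behaviour: rows ordered by the partition column _2 only.
    public static List<Row> sortByPartition(List<Row> rows) {
        List<Row> out = new ArrayList<>(rows);
        out.sort(Comparator.comparing((Row r) -> r.c2));
        return out;
    }

    // Values of _1, in order, for the rows that belong to one partition value.
    public static List<Integer> firstColumnInPartition(List<Row> rows, String partition) {
        List<Integer> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.c2.equals(partition)) {
                out.add(r.c1);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> input = sampleRows();
        // Order of _1 inside partition "a" when sorting on the requested column.
        System.out.println(firstColumnInPartition(sortByFirst(input), "a"));
        // Order of _1 inside partition "a" when only the partition column is sorted.
        System.out.println(firstColumnInPartition(sortByPartition(input), "a"));
    }
}
```

In this sketch the first line printed is in ascending _1 order, while the second keeps the input order of _1 within the partition, which matches the unsorted output described above.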

(the following workaround is no longer necessary)
~~However, if I insert an identity mapper between select and write, the output would be sorted as expected.~~

~~dataset = dataset.sort("_1")~~
~~.select("_2", "_3");~~
~~dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())~~
~~.write()~~
~~.partitionBy("_2")~~
~~.text("output")~~

Below is the complete code that reproduces the problem.

Attachments

- Test-Details-for-Query-0.png (18 kB, 25/Jul/23 08:19, Yiu-Chung Lee)
- Test-Details-for-Query-1.png (18 kB, 25/Jul/23 08:20, Yiu-Chung Lee)

Issue Links

is caused by: SPARK-37287 Pull out dynamic partition and bucket sort from FileFormatWriter (Resolved)

People

Assignee: Unassigned
Reporter: Yiu-Chung Lee
Votes: 0
Watchers: 3

Dates

Created: 23/Jul/23 06:05
Updated: 13/Nov/23 03:34
Resolved: 13/Nov/23 02:19