  1. Spark
  2. SPARK-44512

dataset.sort.select.write.partitionBy sorts wrong column

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Not A Problem
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Optimizer, SQL
    • Labels: None

    Description

      (In this example the dataset is of type Tuple3, and the columns are named _1, _2 and _3)

       

      I found that when AQE is enabled, the following code does not produce sorted output (.drop() has the same problem) unless spark.sql.optimizer.plannedWrite.enabled is set to false.

      After further investigation, it turns out that Spark actually sorts the wrong column in the following code.

      dataset.sort("_1")
      .select("_2", "_3")
      .write()
      .partitionBy("_2")
      .text("output");
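
      For reference, a minimal sketch of how the setting mentioned above could be turned off when building the session. The class name, application name, and local master are illustrative assumptions and are not part of the original report; only the configuration key comes from the description.

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical driver class, used only to illustrate the configuration.
public class DisablePlannedWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SPARK-44512-sketch")   // assumed name
                .master("local[*]")              // assumed local run
                // Key taken from the report; "false" disables planned writes.
                .config("spark.sql.optimizer.plannedWrite.enabled", "false")
                .getOrCreate();

        // ... build the dataset and run the sort/select/write pipeline here ...

        spark.stop();
    }
}
```

      The same key could presumably also be passed on the command line with --conf when submitting the job.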

       
      (the following workaround is no longer necessary)
      However, if I insert an identity mapper between select and write, the output would be sorted as expected.

      dataset = dataset.sort("_1")
      .select("_2", "_3");
      dataset.map((MapFunction<Row, Row>) row -> row, dataset.encoder())
      .write()
      .partitionBy("_2")
      .text("output");

      Below is the complete code that reproduces the problem.

      Attachments

        1. Test-Details-for-Query-0.png
          18 kB
          Yiu-Chung Lee
        2. Test-Details-for-Query-1.png
          18 kB
          Yiu-Chung Lee

            People

              Assignee: Unassigned
              Reporter: Yiu-Chung Lee (leeyc0)
              Votes: 0
              Watchers: 3