Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-9978

Window functions require partitionBy to work as expected

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.4.1
    • 1.4.2, 1.5.0
    • PySpark
    • None

    Description

      I am trying to reproduce following SQL query:

      df.registerTempTable("df")
      sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()
      
      +----+--+
      |   x|rn|
      +----+--+
      |0.25| 1|
      | 0.5| 2|
      |0.75| 3|
      +----+--+
      

      using PySpark API. Unfortunately it doesn't work as expected:

      from pyspark.sql.window import Window
      from pyspark.sql.functions import rowNumber
      
      df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
      df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
      
      +----+--+
      |   x|rn|
      +----+--+
      | 0.5| 1|
      |0.25| 1|
      |0.75| 1|
      +----+--+
      

      As a workaround It is possible to call partitionBy without additional arguments:

      df.select(
          df["x"],
          rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
      ).show()
      
      +----+--+
      |   x|rn|
      +----+--+
      |0.25| 1|
      | 0.5| 2|
      |0.75| 3|
      +----+--+
      

      but as far as I can tell it is not documented and is rather counterintuitive considering SQL syntax. Moreover this problem doesn't affect Scala API:

      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.rowNumber
      
      case class Record(x: Double)
      val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
      df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
      
      +----+--+
      |   x|rn|
      +----+--+
      |0.25| 1|
      | 0.5| 2|
      |0.75| 3|
      +----+--+
      

      Attachments

        Activity

          People

            davies Davies Liu
            zero323 Maciej Szymkiewicz
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: