Spark / SPARK-17758

Spark Aggregate function LAST returns null on an empty partition

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: Input/Output
    • Environment: Spark 2.0.0

    Description

      My Environment
      Spark 2.0.0

      I have included the physical plan of my application below.
      Issue description
      The result of a query that uses the LAST function is incorrect.

      The output for the column that corresponds to the LAST function is null.

      My input data contains 3 rows.

      The application ran in 2 stages.

      The first stage consisted of 3 tasks:

      The first task/partition contains 2 rows
      The second task/partition contains 1 row
      The last task/partition contains 0 rows

      The query returns NULL for the LAST column, which I believe is due to the PARTIAL_LAST on the last (empty) partition.

      I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty partition should not return null.
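
      To illustrate the suspected mechanism, here is a minimal plain-Python sketch (not Spark code) of a partial/merge LAST over the partition layout described above. The row values, the function names, and the "value set" flag are all assumptions made for illustration; the flag is one possible way to let the merge step skip empty partitions and is not necessarily how Spark implements or fixed LAST.

```python
from functools import reduce

# Partition layout from the report: 2 rows, 1 row, 0 rows.
# The row values themselves are made up for this sketch.
PARTITIONS = [["a", "b"], ["c"], []]


def partial_last_naive(rows):
    """Return the last value seen in a partition, or None if it is empty."""
    last = None
    for value in rows:
        last = value
    return last


def merge_naive(left, right):
    """Always take the right-hand buffer, even if it came from an empty partition."""
    return right


def partial_last_tracked(rows):
    """Return (last_value, value_set); value_set stays False for an empty partition."""
    last, value_set = None, False
    for value in rows:
        last, value_set = value, True
    return last, value_set


def merge_tracked(left, right):
    """Take the right-hand buffer only if it actually saw a row."""
    return right if right[1] else left


# Merge the per-partition buffers in partition order.
naive = reduce(merge_naive, [partial_last_naive(p) for p in PARTITIONS])
tracked, _ = reduce(merge_tracked, [partial_last_tracked(p) for p in PARTITIONS])

print(naive)    # None
print(tracked)  # c
```

      In this sketch the naive merge lets the empty third partition overwrite the value from the second one, which matches the NULL seen in the report. The tracked variant keeps a genuine null apart from "no rows seen", so a partition whose last row really is null would still win the merge.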

      == Physical Plan ==
      InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
      +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
         +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
            +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
               +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
                  +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
                     +- Exchange SinglePartition
                        +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
                           +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
                              +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias
      

          People

            Assignee: hvanhovell Herman van Hövell
            Reporter: tafranky@gmail.com Franck Tago
            Votes: 0
            Watchers: 4