Spark / SPARK-20008

hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.2, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 1 instead of the expected 0.

      This was not the case with Spark 1.5.2. From a usage point of view this is an API behavior change, so I consider it a bug. It may be a boundary case; I am not sure.

      Workaround: for now I check that the counts are non-zero before this operation. That is not good for performance, hence this JIRA to track it.
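      A minimal sketch of that workaround (the helper name safeExcept and the session setup are mine, not part of the report):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the workaround described above: guard except() behind count
// checks so the empty-vs-empty case never reaches the buggy plan.
// Note the count() calls themselves trigger extra Spark jobs, which is
// the performance cost the report mentions.
def safeExcept(left: DataFrame, right: DataFrame): DataFrame = {
  if (left.count() == 0L) left              // empty \ anything = empty
  else if (right.count() == 0L) left.distinct() // keep except()'s dedup semantics
  else left.except(right)
}

val spark = SparkSession.builder().master("local[1]").appName("safeExcept").getOrCreate()

// With the guard, empty.except(empty) correctly yields an empty result:
println(safeExcept(spark.emptyDataFrame, spark.emptyDataFrame).count()) // prints 0
```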

      As Young Zhang explained in reply to my mail: starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using RDD operations directly.
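      To illustrate that planning change with made-up sample data: except() is planned as a left anti join plus a distinct, which can be approximated explicitly in the DataFrame API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("exceptAntiJoin").getOrCreate()
import spark.implicits._

// Hypothetical sample data (not from the report):
val left  = Seq(1, 2, 2, 3).toDF("id")
val right = Seq(2).toDF("id")

// except(): removes rows present in `right` from `left`, deduplicating the rest.
val viaExcept = left.except(right)

// Roughly the same result via an explicit null-safe left anti join + distinct:
val viaAntiJoin =
  left.join(right, left("id") <=> right("id"), "left_anti").distinct()

viaExcept.show()   // rows 1 and 3 (in some order)
viaAntiJoin.show() // rows 1 and 3 (in some order)
```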

      The same issue also occurs with sqlContext.

      scala> spark.version
      res25: String = 2.0.2

      scala> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
      == Physical Plan ==
      *HashAggregate(keys=[], functions=[], output=[])
      +- Exchange SinglePartition
         +- *HashAggregate(keys=[], functions=[], output=[])
            +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
               :- Scan ExistingRDD[]
               +- BroadcastExchange IdentityBroadcastMode
                  +- Scan ExistingRDD[]

      This arguably indicates a bug. My guess is that it is tied to the logic of comparing NULL = NULL (should it return true or false?), which causes this kind of confusion.
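      That guess can be checked directly: in SQL semantics, the plain comparison NULL = NULL evaluates to NULL (unknown) rather than true, while the null-safe operator <=> treats two NULLs as equal. A small sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nullEq").getOrCreate()

// Plain equality: NULL = NULL is NULL (unknown), not true.
spark.sql("SELECT NULL = NULL AS eq").show()     // eq is null

// Null-safe equality: <=> treats two NULLs as equal.
spark.sql("SELECT NULL <=> NULL AS nseq").show() // nseq is true
```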


          People

            Assignee: Xiao Li (smilegator)
            Reporter: Ravindra Bajpai (ravindra.bajpai)
            Votes: 0
            Watchers: 3
