Spark / SPARK-20008

hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.2, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 1 instead of the expected 0.

      This was not the case with Spark 1.5.2. From a usage point of view this is an API behavior change, so I consider it a bug. It may be a boundary case; I am not sure.

      Workaround: for now I check that the counts are non-zero before this operation. That is not good for performance, hence this JIRA to track it.
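      A minimal sketch of that workaround (the helper name safeExcept and the session setup are mine, not part of the report):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the workaround described above: guard except() behind count
// checks so the empty-vs-empty case never reaches the buggy plan.
// Note the count() calls themselves trigger extra Spark jobs, which is
// the performance cost the report mentions.
def safeExcept(left: DataFrame, right: DataFrame): DataFrame = {
  if (left.count() == 0L) left              // empty \ anything = empty
  else if (right.count() == 0L) left.distinct() // keep except()'s dedup semantics
  else left.except(right)
}

val spark = SparkSession.builder().master("local[1]").appName("safeExcept").getOrCreate()

// With the guard, empty.except(empty) correctly yields an empty result:
println(safeExcept(spark.emptyDataFrame, spark.emptyDataFrame).count()) // prints 0
```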

      As Young Zhang explained in reply to my mail: starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using RDD operations directly.
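      To illustrate that planning change with made-up sample data: except() is planned as a left anti join plus a distinct, which can be approximated explicitly in the DataFrame API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("exceptAntiJoin").getOrCreate()
import spark.implicits._

// Hypothetical sample data (not from the report):
val left  = Seq(1, 2, 2, 3).toDF("id")
val right = Seq(2).toDF("id")

// except(): removes rows present in `right` from `left`, deduplicating the rest.
val viaExcept = left.except(right)

// Roughly the same result via an explicit null-safe left anti join + distinct:
val viaAntiJoin =
  left.join(right, left("id") <=> right("id"), "left_anti").distinct()

viaExcept.show()   // rows 1 and 3 (in some order)
viaAntiJoin.show() // rows 1 and 3 (in some order)
```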

      The same issue also occurs with sqlContext.

      scala> spark.version
      res25: String = 2.0.2

      scala> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
      == Physical Plan ==
      *HashAggregate(keys=[], functions=[], output=[])
      +- Exchange SinglePartition
         +- *HashAggregate(keys=[], functions=[], output=[])
            +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
               :- Scan ExistingRDD[]
               +- BroadcastExchange IdentityBroadcastMode
                  +- Scan ExistingRDD[]

      This arguably indicates a bug. My guess is that it is tied to the logic of comparing NULL = NULL (should it return true or false?), which causes this kind of confusion.
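      That guess can be checked directly: in SQL semantics, the plain comparison NULL = NULL evaluates to NULL (unknown) rather than true, while the null-safe operator <=> treats two NULLs as equal. A small sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nullEq").getOrCreate()

// Plain equality: NULL = NULL is NULL (unknown), not true.
spark.sql("SELECT NULL = NULL AS eq").show()     // eq is null

// Null-safe equality: <=> treats two NULLs as equal.
spark.sql("SELECT NULL <=> NULL AS nseq").show() // nseq is true
```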


          People

            Assignee: Xiao Li (smilegator)
            Reporter: Ravindra Bajpai (ravindra.bajpai)
            Votes: 0
            Watchers: 3
