Details
Description
Spark shell session for reproduction:
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
Spark SQL represents binary values as plain JVM byte arrays, and JVM arrays are compared by reference rather than by value (their equals/hashCode come from java.lang.Object). Meanwhile, the DISTINCT operator deduplicates rows with a HashSet, checking membership via .contains, which relies on equals/hashCode. Together, these two facts cause rows holding equal binary contents to be treated as distinct, as the session above shows: four rows come back instead of two.
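The reference-vs-value comparison can be demonstrated without Spark at all. The sketch below (plain Scala, no Spark dependency; the object name and the wrap-in-Seq workaround are illustrative, not Spark's actual fix) shows that a HashSet keyed on byte arrays fails to deduplicate equal contents:

```scala
import scala.collection.mutable

// Illustrative standalone repro of the underlying JVM behavior.
object BinaryDistinctSketch {
  def main(args: Array[String]): Unit = {
    val a = "1".getBytes("UTF-8")
    val b = "1".getBytes("UTF-8")

    // Same contents, but array equality is reference equality.
    assert(java.util.Arrays.equals(a, b)) // value comparison: equal
    assert(a != b)                        // reference comparison: not equal

    // A HashSet of the raw arrays keeps both "duplicates" ...
    val raw = mutable.HashSet[Array[Byte]](a, b)
    println(raw.size) // 2

    // ... while wrapping each array in a value-comparable collection
    // (one possible workaround) deduplicates as expected.
    val wrapped = mutable.HashSet(a.toSeq, b.toSeq)
    println(wrapped.size) // 1
  }
}
```

This is why SPARK-5553 proposes backing BinaryType with a data structure that defines value-based equality and hashing.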
Attachments
Issue Links
- relates to: SPARK-5553 "Reimplement SQL binary type with more efficient data structure" (Closed)
- links to: