Details
Description
Spark shell session for reproduction:
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
Spark SQL represents binary values as plain JVM byte arrays, and JVM arrays are compared by reference rather than by value (their equals/hashCode come from java.lang.Object). Meanwhile, the DISTINCT operator deduplicates rows with a HashSet, checking membership via .contains, which relies on equals/hashCode. Together, these two facts cause rows holding equal binary contents to be treated as distinct, as the session above shows: four rows come back instead of two.
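The reference-vs-value comparison can be demonstrated without Spark at all. The sketch below (plain Scala, no Spark dependency; the object name and the wrap-in-Seq workaround are illustrative, not Spark's actual fix) shows that a HashSet keyed on byte arrays fails to deduplicate equal contents:

```scala
import scala.collection.mutable

// Illustrative standalone repro of the underlying JVM behavior.
object BinaryDistinctSketch {
  def main(args: Array[String]): Unit = {
    val a = "1".getBytes("UTF-8")
    val b = "1".getBytes("UTF-8")

    // Same contents, but array equality is reference equality.
    assert(java.util.Arrays.equals(a, b)) // value comparison: equal
    assert(a != b)                        // reference comparison: not equal

    // A HashSet of the raw arrays keeps both "duplicates" ...
    val raw = mutable.HashSet[Array[Byte]](a, b)
    println(raw.size) // 2

    // ... while wrapping each array in a value-comparable collection
    // (one possible workaround) deduplicates as expected.
    val wrapped = mutable.HashSet(a.toSeq, b.toSeq)
    println(wrapped.size) // 1
  }
}
```

This is why SPARK-5553 proposes backing BinaryType with a data structure that defines value-based equality and hashing.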
Attachments
Issue Links
- relates to: SPARK-5553 "Reimplement SQL binary type with more efficient data structure" (Closed)
- links to: