Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
Description
The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinayType this is not the case.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
collect_set('value).over(win) as "stringSet",
collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()