Spark / SPARK-31500

collect_set() of BinaryType returns duplicate elements


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
    • Fix Version/s: 2.4.6
    • Component/s: SQL
    • Labels:

      Description

      The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType, this is not the case: the returned array can contain duplicate byte arrays.

      Example:

      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.expressions.Window
      import spark.implicits._ // needed for toDF(); provided automatically in spark-shell

      case class R(id: String, value: String, bytes: Array[Byte])
      def makeR(id: String, value: String) = R(id, value, value.getBytes)
      val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()

      // In the example below "bytesSet" erroneously contains duplicates, while "stringSet" does not (as expected).
      df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate = false)

      // The same problem appears when using window functions.
      val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      val result = df.select(
        collect_set('value).over(win) as "stringSet",
        collect_set('bytes).over(win) as "bytesSet"
      )
      .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
      .show()
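      A plausible explanation, not stated in the issue itself: Scala's Array is a JVM array, whose equals and hashCode are identity-based, so any hash-set keyed on raw Array[Byte] values will fail to deduplicate byte arrays that are equal element-by-element. A minimal, Spark-free sketch:

      ```scala
      import scala.collection.mutable

      val a = "cat".getBytes
      val b = "cat".getBytes

      a == b            // false: Array equality is reference-based (inherited from Java)
      a.sameElements(b) // true: element-wise comparison

      val set = mutable.HashSet[Array[Byte]]()
      set += a
      set += b
      set.size          // 2: both arrays are kept, because equals/hashCode are identity-based
      ```

      If this is indeed the cause, a common workaround on affected versions is to collect a comparable representation instead, e.g. collect_set(base64('bytes)).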


            People

            • Assignee: planga82 Pablo Langa Blanco
            • Reporter: ewasserman Eric Wasserman
            • Votes: 0
            • Watchers: 6

              Dates

              • Created:
              • Updated:
              • Resolved: