Spark / SPARK-27134

array_distinct function does not work correctly with columns containing array of array

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels:
    • Environment: Spark 2.4, Scala 2.11.11

      Description

      The array_distinct function introduced in Spark 2.4 produces incorrect results when applied to an array column whose elements are themselves arrays. The output can still contain duplicate values, and values that were distinct in the input may be removed.

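      This is easily reproducible, e.g. with the following code:

      // Imports needed outside spark-shell; `spark` is assumed to be the active SparkSession.
      import org.apache.spark.sql.functions.{array_distinct, col}
      import spark.implicits._

      val df = Seq(
        Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
      ).toDF("Number_Combinations")

      // Apply array_distinct to the column of nested arrays.
      val dfWithDistinct = df.withColumn("distinct_combinations",
        array_distinct(col("Number_Combinations")))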

      The initial 'df' DataFrame contains one row, where column 'Number_Combinations' contains the following values:

      [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]

      The array_distinct function run on this column produces a new column containing the following values:

      [[1, 2], [1, 2], [1, 2]]

      This result contains three occurrences of the same value, [1, 2], and the distinct values [3, 4] and [4, 5] have been removed. The expected result is [[1, 2], [3, 4], [4, 5]].
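For comparison, the expected de-duplication can be sketched with plain Scala collections, without Spark. This only illustrates the semantics one would expect from array_distinct on nested arrays; it is not Spark's implementation. The object name `ArrayDistinctExpectation` and the helper `expectedDistinct` are hypothetical names introduced for this example.

```scala
// Illustrative only: models the expected behaviour with Scala collections.
// `ArrayDistinctExpectation` and `expectedDistinct` are hypothetical names,
// not part of Spark.
object ArrayDistinctExpectation {

  // Same values as the single row of the 'Number_Combinations' column.
  val input: Seq[Seq[Int]] =
    Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

  // Seq.distinct compares the inner sequences structurally and keeps
  // the first occurrence of each one, preserving input order.
  def expectedDistinct[A](xs: Seq[Seq[A]]): Seq[Seq[A]] = xs.distinct

  def main(args: Array[String]): Unit = {
    // prints List(List(1, 2), List(3, 4), List(4, 5))
    println(expectedDistinct(input))
  }
}
```

Structural equality of the inner sequences is the key assumption here: two inner arrays count as duplicates when they hold the same elements in the same order.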

              People

              • Assignee: Dilip Biswal (dkbiswal)
              • Reporter: Mike Trenaman (m1ke)
              • Votes: 0
              • Watchers: 1