Spark / SPARK-27134

array_distinct function does not work correctly with columns containing array of array

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Environment: Spark 2.4, Scala 2.11.11

    Description

      The array_distinct function introduced in Spark 2.4 produces incorrect results when applied to an array column whose elements are themselves arrays. The output can still contain duplicate values, and values that were distinct in the input may be removed.

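      This is easily reproducible, e.g. with this code:

      import org.apache.spark.sql.functions.{array_distinct, col}
      import spark.implicits._ // needed for toDF; assumes a SparkSession named spark (as in spark-shell)

      // One row whose single column holds an array of arrays
      val df = Seq(
        Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
      ).toDF("Number_Combinations")

      val dfWithDistinct = df.withColumn("distinct_combinations",
        array_distinct(col("Number_Combinations")))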

      The initial 'df' DataFrame contains one row, where column 'Number_Combinations' contains the following values:

      [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]

      The array_distinct function run on this column produces a new column containing the following values:

      [[1, 2], [1, 2], [1, 2]]

      This result contains three occurrences of the same value [1, 2], and the distinct values [3, 4] and [4, 5] have been removed. The expected result is [[1, 2], [3, 4], [4, 5]].
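
      For comparison, the semantics one would expect from array_distinct on nested arrays can be sketched with plain Scala collections. This is only an illustration of the expected behaviour, not Spark's implementation; the object name ExpectedDistinct is hypothetical, and the input is the same data as the 'Number_Combinations' column above.

```scala
// Plain Scala collections only; no Spark dependency.
// ExpectedDistinct is a hypothetical name used for illustration.
object ExpectedDistinct {
  // Same nested values as the 'Number_Combinations' column in the report.
  val input: Seq[Seq[Int]] =
    Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

  // Seq equality is structural, so distinct compares the inner sequences
  // element by element and keeps the first occurrence of each.
  val expected: Seq[Seq[Int]] = input.distinct

  def main(args: Array[String]): Unit =
    println(expected) // prints List(List(1, 2), List(3, 4), List(4, 5))
}
```

      Each distinct inner sequence is kept exactly once and in its original order, which is the result the reproduction above fails to produce on 2.4.0.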

            People

              Assignee: Dilip Biswal (dkbiswal)
              Reporter: Mike Trenaman (m1ke)
              Votes: 0
              Watchers: 1