Spark / SPARK-27134

array_distinct function does not work correctly with columns containing array of array


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Environment: Spark 2.4, Scala 2.11.11

    Description

      The array_distinct function introduced in Spark 2.4 produces incorrect results when applied to an array column whose elements are themselves arrays. The output can still contain duplicate values, and values that were distinct in the input can be removed.

      This is easily repeatable, e.g. with this code:

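      import org.apache.spark.sql.functions.{array_distinct, col}
      import spark.implicits._  // for toDF; `spark` is the SparkSession, as in spark-shell

      // One row whose single column holds an array of integer arrays
      val df = Seq(
        Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
      ).toDF("Number_Combinations")

      // Apply array_distinct to the nested-array column
      val dfWithDistinct = df.withColumn("distinct_combinations",
        array_distinct(col("Number_Combinations")))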

       

      The initial 'df' DataFrame contains one row, where column 'Number_Combinations' contains the following values:

      [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]

       

      The array_distinct function run on this column produces a new column containing the following values:

      [[1, 2], [1, 2], [1, 2]]

       

      As you can see, the result contains three occurrences of the same value [1, 2], and the distinct values [3, 4] and [4, 5] have been removed.
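
      For comparison, the result one would expect can be sketched with plain Scala collections, where distinct compares nested sequences by value and keeps the first occurrence of each. This only illustrates the intended semantics and is not how Spark implements array_distinct; the object name ExpectedDistinct is hypothetical.

```scala
// Plain-Scala illustration of the expected de-duplication (no Spark required).
// ExpectedDistinct is a hypothetical name used only for this sketch.
object ExpectedDistinct {
  // Same nested values as the Number_Combinations column in the report
  val input: Seq[Seq[Int]] =
    Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

  // distinct uses value equality on the inner sequences and preserves order
  val expected: Seq[Seq[Int]] = input.distinct

  def main(args: Array[String]): Unit = {
    println(expected) // prints List(List(1, 2), List(3, 4), List(4, 5))
  }
}
```

      On this basis, the distinct_combinations column would be expected to hold [[1, 2], [3, 4], [4, 5]].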

       

       


          People

            Assignee: dkbiswal Dilip Biswal
            Reporter: m1ke Mike Trenaman
            Votes: 0
            Watchers: 1
