[SPARK-27134] array_distinct function does not work correctly with columns containing array of array - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.1, 3.0.0
Component/s: SQL
Labels:
- correctness
Environment:

Spark 2.4, Scala 2.11.11

Description

The array_distinct function introduced in Spark 2.4 produces incorrect results when applied to an array column whose elements are themselves arrays. The output can still contain duplicate values, and values that were distinct in the input can be removed.

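This is easily reproducible, e.g. with this code (assuming an active SparkSession named spark, as in spark-shell):

// Required in a standalone application.
import org.apache.spark.sql.functions.{array_distinct, col}
import spark.implicits._

// One row whose array column holds three copies of [1, 2].
val df = Seq(
  Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
).toDF("Number_Combinations")

val dfWithDistinct = df.withColumn("distinct_combinations",
  array_distinct(col("Number_Combinations")))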

The initial 'df' DataFrame contains one row, where column 'Number_Combinations' contains the following values:

[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]

The array_distinct function run on this column produces a new column containing the following values:

[[1, 2], [1, 2], [1, 2]]

This output contains three occurrences of the same value [1, 2], and the distinct values [3, 4] and [4, 5] have been removed.
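
For comparison, the intended behaviour can be sketched with plain Scala collections, without Spark: removing duplicates from the same nested sequence should keep one copy of each inner array, in order of first occurrence. This is only an illustration of the expected semantics, not Spark's implementation; the names ExpectedDistinct, input and expected are made up for this sketch.

```scala
// Illustration only: plain Scala collections, no Spark involved.
// ExpectedDistinct, input and expected are hypothetical names for this sketch.
object ExpectedDistinct {
  // Same nested values as the Number_Combinations column in the report.
  val input: Seq[Seq[Int]] =
    Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

  // Seq.distinct compares inner sequences by value and keeps the first
  // occurrence of each, preserving order.
  val expected: Seq[Seq[Int]] = input.distinct

  def main(args: Array[String]): Unit =
    println(expected) // prints List(List(1, 2), List(3, 4), List(4, 5))
}
```

A correct array_distinct on the column would be expected to return the equivalent array, [[1, 2], [3, 4], [4, 5]].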

Issue Links

is caused by

SPARK-23912 High-order function: array_distinct(x) → array

Resolved

links to

GitHub Pull Request #24073

People

Assignee: Dilip Biswal

Reporter: Mike Trenaman

Votes: 0

Watchers: 1

Dates

Created: 12/Mar/19 10:39

Updated: 17/Jul/19 00:44

Resolved: 16/Mar/19 19:33