[SPARK-21459] Some aggregation functions change the case of nested field names - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Cannot Reproduce
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names.

For example, first() preserves the case of the field names, but collect_set() and collect_list() force the field names to lowercase.

Expected behavior: Field name case is preserved (or is at least consistent and documented)

Update: After trying different versions, I found that this problem occurs in the Spark 1.6.0 build shipped with Cloudera CDH, not in plain Spark. Plain Spark 1.6.0 does not support structs in aggregation operations such as collect_set at all.

Spark-shell session to reproduce:

// first and collect_set live in the functions object
import org.apache.spark.sql.functions.{collect_set, first}

case class Inner(Key: String, Value: String)
case class Outer(ID: Long, Pairs: Array[Inner])

val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
val df = sqlContext.createDataFrame(rdd)

scala> df
... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]

scala> df.groupBy("ID").agg(first("Pairs"))
... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
// Note that Key and Value preserve their original case

scala> df.groupBy("ID").agg(collect_set("Pairs"))
... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>]
// Note that key and value are now lowercased
Additionally, the column name (generated during aggregation) is inconsistent: first(Pairs)() versus collect_set(Pairs) - note the extra parentheses in the first name.
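
To check for the case change programmatically rather than by reading the printed schema, something like the following could be pasted into spark-shell. It is only a sketch: it assumes the shell-provided `sc` and `sqlContext`, and the helper `fieldNames` and the aliases `firstPairs` and `pairSet` are names made up here for illustration. The explicit aliases also avoid depending on the generated column names.

```scala
import org.apache.spark.sql.functions.{collect_set, first}
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

case class Inner(Key: String, Value: String)
case class Outer(ID: Long, Pairs: Array[Inner])

// Assumes spark-shell, which provides `sc` and `sqlContext`.
val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar"))))))

// Hypothetical helper: collect struct field names, descending through
// arrays and nested structs; other types contribute no names.
def fieldNames(dt: DataType): Seq[String] = dt match {
  case s: StructType =>
    s.fields.toSeq.flatMap(f => f.name +: fieldNames(f.dataType))
  case a: ArrayType => fieldNames(a.elementType)
  case _ => Seq.empty
}

// Alias each aggregate so the lookup below does not rely on
// generated names such as first(Pairs)().
val agg = df.groupBy("ID").agg(
  first("Pairs").as("firstPairs"),
  collect_set("Pairs").as("pairSet"))

// Field names of the original nested column, in declaration order.
val expected = fieldNames(df.schema("Pairs").dataType)

// Compare each aggregated column's field names case-sensitively.
Seq("firstPairs", "pairSet").foreach { c =>
  val actual = fieldNames(agg.schema(c).dataType)
  println(s"$c: fields=${actual.mkString(",")} casePreserved=${actual == expected}")
}
```

On a build affected by this issue, the `pairSet` line would be expected to report `casePreserved=false`; on plain Spark 1.6.0 the `collect_set` call itself may fail, as noted in the update above.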

People

Assignee: Unassigned

Reporter: David Allsopp

Votes: 0

Watchers: 3

Dates

Created: 18/Jul/17 15:50

Updated: 17/Oct/17 16:56

Resolved: 17/Oct/17 07:03