Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Cannot Reproduce
Affects Version: 1.6.0
None
None
Description
When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names.
For example, first() preserves the case of the field names, but collect_set() and collect_list() force the field names to lowercase.
Expected behavior: Field name case is preserved (or is at least consistent and documented).
Spark-shell session to reproduce:
case class Inner(Key:String, Value:String)
case class Outer(ID:Long, Pairs:Array[Inner])
val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
val df = sqlContext.createDataFrame(rdd)

scala> df
... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]

scala> df.groupBy("ID").agg(first("Pairs"))
... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
// Note that Key and Value preserve their original case

scala> df.groupBy("ID").agg(collect_set("Pairs"))
... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>]
// Note that key and value are now lowercased

Additionally, the column names generated during aggregation are inconsistent: first(Pairs)() versus collect_set(Pairs). Note the extra parentheses in the first name.
Update: After trying different versions, I found that this problem occurs in the Spark 1.6.0 build shipped with Cloudera CDH, not in plain Spark.
Plain Spark 1.6.0 does not support structs in aggregation functions such as collect_set at all.
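To check the field-name case programmatically rather than by reading the printed schemas, something like the following sketch may help. It walks a Spark SQL DataType and lists the nested field names, then reports names that differ only in case. The helper names nestedFieldNames and caseOnlyDifferences are hypothetical (they are not part of Spark), and the comparison assumes both types have the same shape so that fields can be paired by position.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: collect dotted paths of all struct fields nested
// anywhere inside a data type (structs, arrays, and maps are traversed).
def nestedFieldNames(dt: DataType, prefix: String = ""): Seq[String] = dt match {
  case s: StructType =>
    s.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      path +: nestedFieldNames(f.dataType, path)
    }
  case a: ArrayType => nestedFieldNames(a.elementType, prefix)
  case m: MapType =>
    nestedFieldNames(m.keyType, prefix) ++ nestedFieldNames(m.valueType, prefix)
  case _ => Seq.empty
}

// Hypothetical helper: pair up field paths by position and keep those
// that are equal ignoring case but not equal exactly.
// Assumes both types have the same structure.
def caseOnlyDifferences(a: DataType, b: DataType): Seq[(String, String)] =
  nestedFieldNames(a).zip(nestedFieldNames(b)).filter {
    case (x, y) => x != y && x.equalsIgnoreCase(y)
  }
```

In the session above, this could be applied to the data type of the aggregated column from each result, for example caseOnlyDifferences(df.groupBy("ID").agg(first("Pairs")).schema.fields(1).dataType, df.groupBy("ID").agg(collect_set("Pairs")).schema.fields(1).dataType). An empty result would mean the nested field names match exactly.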