Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21459

Some aggregation functions change the case of nested field names

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Cannot Reproduce
    • 1.6.0
    • None
    • Spark Core
    • None

    Description

      When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names.

      For example, first() preserves the case of the field names, but collect_set() and collect_list() force the field names to lowercase.

      Expected behavior: Field name case is preserved (or is at least consistent and documented)

      Spark-shell session to reproduce:

      Update: After trying different versions, I discovered that this problem occurs in the version of Spark 1.6.0 shipped with Cloudera CDH, not plain Spark.
      The plain Spark 1.6.0 does not support structs in aggregation operations such as collect_set at all.

      case class Inner(Key:String, Value:String)
      case class Outer(ID:Long, Pairs:Array[Inner])
      
      val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar")))))
      val df = sqlContext.createDataFrame(rdd)
      
      scala> df
      ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>]
      
      scala>df.groupBy("ID").agg(first("Pairs"))
      ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>]
      // Note that Key and Value preserve their original case
      
      scala>df.groupBy("ID").agg(collect_set("Pairs"))
      ... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>]
      // Note that key and value are now lowercased
      
      

      Additionally, the column name (generated during aggregation) is inconsistent: first(Pairs)() versus collect_set(Pairs) - note the extra parentheses in the first name.

      Attachments

        Activity

          People

            Unassigned Unassigned
            dallsoppuk David Allsopp
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: