Spark / SPARK-19214

Inconsistencies between DataFrame and Dataset APIs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.1.0

    Description

      I am not sure whether this has been reported already, but there are some confusing and annoying inconsistencies when writing the same expression with the Dataset and the DataFrame APIs.

      Consider the following minimal example executed in a Spark Shell:

      case class Point(x: Int, y: Int, z: Int)
      
      val ps = spark.createDataset(for {
        x <- 1 to 10
        y <- 1 to 10
        z <- 1 to 10
      } yield Point(x, y, z))
      
      // Problem 1:
      // count() produces different column names in the DataFrame and Dataset variants
      
      // count() on grouped DataFrame: field name is `count`
      ps.groupBy($"x").count().printSchema
      // root
      //  |-- x: integer (nullable = false)
      //  |-- count: long (nullable = false)
      
      // count() on grouped Dataset: field name is `count(1)`
      ps.groupByKey(_.x).count().printSchema
      // root
      //  |-- value: integer (nullable = true)
      //  |-- count(1): long (nullable = false)
      
      // Problem 2:
      // groupByKey names the key column differently depending on the key type.
      // This is especially confusing for simple key types (first case below),
      // where the key column is actually named `value` rather than `key`.
      
      // simple key types
      ps.groupByKey(p => p.x).count().printSchema
      // root
      //  |-- value: integer (nullable = true)
      //  |-- count(1): long (nullable = false)
      
      // complex key types
      ps.groupByKey(p => (p.x, p.y)).count().printSchema
      // root
      //  |-- key: struct (nullable = false)
      //  |    |-- _1: integer (nullable = true)
      //  |    |-- _2: integer (nullable = true)
      //  |-- count(1): long (nullable = false)
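
One possible way to get the same column names from both variants is to rename the result columns explicitly. The following is only a sketch for the Spark Shell, not a fix for the behaviour described above: it assumes Spark 2.x, relies on the `spark` session that the shell provides, and restates `Point` and `ps` so it can be pasted on its own. The names `byX` and `byXY` are made up for illustration.

```scala
// Spark Shell sketch (Spark 2.x): `spark` is the session provided by the shell.
import spark.implicits._

case class Point(x: Int, y: Int, z: Int)

val ps = spark.createDataset(for {
  x <- 1 to 10
  y <- 1 to 10
  z <- 1 to 10
} yield Point(x, y, z))

// Simple key: rename `value` and `count(1)` positionally with toDF.
val byX = ps.groupByKey(_.x).count().toDF("x", "count")

// Tuple key: name the struct column, then flatten its fields into
// top-level columns so the result lines up with groupBy($"x", $"y").
val byXY = ps.groupByKey(p => (p.x, p.y)).count()
  .toDF("key", "count")
  .select($"key._1".as("x"), $"key._2".as("y"), $"count")

byX.printSchema
byXY.printSchema
```

Note that `toDF` returns an untyped DataFrame, so this trades the typed `Dataset[(K, Long)]` result for predictable column names.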
      

          People

            Assignee: Unassigned
            Reporter: Alexander Alexandrov (aalexandrov)