[SPARK-19214] Inconsistencies between DataFrame and Dataset APIs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Won't Fix
Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.1.0
Fix Version/s: None
Component/s: None
Labels: None

Description

I am not sure whether this has been reported already, but the same expression yields confusingly inconsistent column names depending on whether it is written with the Dataset API or the DataFrame API.

Consider the following minimal example executed in a Spark Shell:

case class Point(x: Int, y: Int, z: Int)

val ps = spark.createDataset(for {
  x <- 1 to 10
  y <- 1 to 10
  z <- 1 to 10
} yield Point(x, y, z))

// Problem 1:
// count produces different fields in the Dataset / DataFrame variants

// count() on grouped DataFrame: field name is `count`
ps.groupBy($"x").count().printSchema
// root
//  |-- x: integer (nullable = false)
//  |-- count: long (nullable = false)

// count() on grouped Dataset: field name is `count(1)`
ps.groupByKey(_.x).count().printSchema
// root
//  |-- value: integer (nullable = true)
//  |-- count(1): long (nullable = false)

// Problem 2:
// groupByKey produces different `key` field name depending
// on the result type
// this is especially confusing in the first case below (simple key types)
// where the key field is actually named `value`

// simple key types
ps.groupByKey(p => p.x).count().printSchema
// root
//  |-- value: integer (nullable = true)
//  |-- count(1): long (nullable = false)

// complex key types
ps.groupByKey(p => (p.x, p.y)).count().printSchema
// root
//  |-- key: struct (nullable = false)
//  |    |-- _1: integer (nullable = true)
//  |    |-- _2: integer (nullable = true)
//  |-- count(1): long (nullable = false)
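One possible way to get the same column names from both APIs is to rename the result columns explicitly. The sketch below is not part of the original report: it assumes a local session (in the Spark Shell, `getOrCreate()` simply returns the existing one), and the target names `x`, `key`, and `count` are illustrative choices only.

```scala
import org.apache.spark.sql.SparkSession

// Same case class as in the example above.
case class Point(x: Int, y: Int, z: Int)

// Assumption: a local session; in the Spark Shell this reuses the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("SPARK-19214-sketch").getOrCreate()
import spark.implicits._

val ps = spark.createDataset(for {
  x <- 1 to 10
  y <- 1 to 10
  z <- 1 to 10
} yield Point(x, y, z))

// Simple key: rename `value` / `count(1)` positionally via toDF.
val byX = ps.groupByKey(_.x).count().toDF("x", "count")
println(byX.columns.mkString(", "))
// x, count

// Complex key: the struct column keeps its nested `_1` / `_2` fields;
// only the two top-level column names are set here.
val byXY = ps.groupByKey(p => (p.x, p.y)).count().toDF("key", "count")
println(byXY.columns.mkString(", "))
// key, count
```

`toDF` renames columns by position, so the number of names passed must match the number of columns, and the result is an untyped DataFrame rather than a typed Dataset.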

Issue Links

links to

[Github] Pull Request #16577 (aray)

People

Assignee: Unassigned
Reporter: Alexander Alexandrov
Votes: 0
Watchers: 4

Dates

Created: 13/Jan/17 14:04
Updated: 24/Jul/17 14:53
Resolved: 24/Jul/17 14:53