  Spark / SPARK-9999

Dataset API on top of Catalyst/DataFrame

    Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      The RDD API is very flexible, but as a result its execution is harder to optimize in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g., it is harder to use UDFs, and there are no strong types in Scala/Java).

      The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine.
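
      As an illustration of that goal, here is a minimal sketch written against the API that eventually shipped (Spark 1.6/2.0); the SparkSession entry point, the Person class, and the sample data are assumptions for the example, not part of this proposal.

      import org.apache.spark.sql.SparkSession

      case class Person(name: String, age: Long)

      object DatasetSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
          import spark.implicits._  // implicit encoders for primitives, case classes, tuples

          // Typed transformations on domain objects...
          val people = Seq(Person("Ann", 30L), Person("Bob", 17L)).toDS()
          val adultNames = people.filter(_.age >= 18).map(_.name.toUpperCase)

          // ...that still run through the Spark SQL (Catalyst/Tungsten) engine.
          adultNames.explain()
          adultNames.show()
          spark.stop()
        }
      }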

      Requirements

      • Fast - In most cases, the performance of Datasets should be equal to or better than working with RDDs. Encoders should be as fast or faster than Kryo and Java serialization, and unnecessary conversion should be avoided.
      • Typesafe - Similar to RDDs, objects and functions that operate on those objects should provide compile-time safety where possible. When converting from data whose schema is not known at compile time (for example, data read from an external source such as JSON), the conversion function should fail fast if there is a schema mismatch.
      • Support for a variety of object models - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box.
      • Java Compatible - Datasets should provide a single API that works in both Scala and Java. Where possible, shared types like Array will be used in the API. Where not possible, overloaded functions should be provided for both languages. Scala concepts, such as ClassTags, should not be required in the user-facing API.
      • Interoperates with DataFrames - Users should be able to seamlessly transition between Datasets and DataFrames, without writing conversion boilerplate. When names used in the input schema line up with fields in the given class, no extra mapping should be necessary (see the sketch just below this list). Libraries like MLlib should not need to provide different interfaces for accepting DataFrames and Datasets as input.
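
      A minimal sketch of the DataFrame/Dataset round trip described in the last bullet, assuming the Spark 2.x API; the Event class and its column names are illustrative, not from this proposal.

      import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

      case class Event(userId: Long, action: String)

      object InteropSketch {
        // Columns whose names line up with the case class fields need no extra mapping;
        // a schema mismatch fails fast at analysis time instead of deep inside a job.
        def toEvents(spark: SparkSession, df: DataFrame): Dataset[Event] = {
          import spark.implicits._
          df.as[Event]
        }

        // No conversion boilerplate in the other direction either.
        def backToDataFrame(events: Dataset[Event]): DataFrame = events.toDF()
      }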

      For a detailed outline of the complete proposed API: marmbrus/dataset-api
      For an initial discussion of the design considerations in this API: design doc

      The initial version of the Dataset API was merged in Spark 1.6. However, it will take a few more releases to flesh everything out.

        Issue Links

        1.
        Initial code generated encoder for product types Sub-task Resolved Michael Armbrust
         
        2.
        Initial code generated construction of Product classes from InternalRow Sub-task Resolved Michael Armbrust
         
        3.
        Initial API Draft Sub-task Resolved Michael Armbrust
         
        4.
        add encoder/decoder for external row Sub-task Resolved Wenchen Fan
         
        5.
        Java API support & test cases Sub-task Resolved Wenchen Fan
         
        6.
        Implement cogroup Sub-task Resolved Wenchen Fan
         
        7.
        Support for joining two datasets, returning a tuple of objects Sub-task Resolved Michael Armbrust
         
        8.
        GroupedIterator's hasNext is not idempotent Sub-task Resolved Unassigned
         
        9.
        groupBy on column expressions Sub-task Resolved Michael Armbrust
         
        10.
        Type-safe aggregations Sub-task Resolved Michael Armbrust
         
        11.
        add map/flatMap to GroupedDataset Sub-task Resolved Wenchen Fan
         
        12.
        User facing api for typed aggregation Sub-task Resolved Michael Armbrust
         
        13.
        Improve toString Function Sub-task Resolved Michael Armbrust
         
        14.
        add java test for typed aggregate Sub-task Resolved Wenchen Fan
         
        15.
        Support as on Classes defined in the REPL Sub-task Resolved Michael Armbrust
         
        16.
        add reduce to GroupedDataset Sub-task Resolved Wenchen Fan
         
        17.
        support typed aggregate in project list Sub-task Resolved Wenchen Fan
         
        18.
        split ExpressionEncoder into FlatEncoder and ProductEncoder Sub-task Resolved Wenchen Fan
         
        19.
        org.apache.spark.sql.AnalysisException: Can't extract value from a#12 Sub-task Resolved Wenchen Fan
         
        20.
        collect, first, and take should use encoders for serialization Sub-task Resolved Reynold Xin
         
        21.
        Kryo-based encoder for opaque types Sub-task Resolved Reynold Xin
         
        22.
        Dataset self join returns incorrect result Sub-task Resolved Wenchen Fan
         
        23.
        Java-based encoder for opaque types Sub-task Resolved Reynold Xin
         
        24.
        nice error message for missing encoder Sub-task Resolved Wenchen Fan
         
        25.
        Add Java tests for Kryo/Java encoders Sub-task Resolved Reynold Xin
         
        26.
        add type cast if the real type is different but compatible with encoder schema Sub-task Resolved Wenchen Fan
         
        27.
        Incorrect results are returned when using null Sub-task Resolved Wenchen Fan
         
        28.
        support typed aggregate for complex buffer schema Sub-task Resolved Wenchen Fan
         
        29.
        fix `nullable` of encoder schema Sub-task Resolved Wenchen Fan
         
        30.
        fix encoder life cycle for CoGroup Sub-task Resolved Wenchen Fan
         
        31.
        Encoder for JavaBeans / POJOs Sub-task Resolved Wenchen Fan
         
        32.
        [SQL] Adding joinType into joinWith Sub-task Closed Unassigned
         
        33.
        Add missing APIs in Dataset Sub-task Resolved Xiao Li
         
        34.
        [SQL] Support Persist/Cache and Unpersist in Dataset APIs Sub-task Resolved Xiao Li
         
        35.
        refactor MapObjects to make it less hacky Sub-task Resolved Wenchen Fan
         
        36.
        WrapOption should not have type constraint for child Sub-task Resolved Apache Spark
         
        37.
        throw exception if the number of fields does not line up for Tuple encoder Sub-task Resolved Wenchen Fan
         
        38.
        use true as default value for propagateNull in NewInstance Sub-task Resolved Apache Spark
         
        39.
        Add BINARY to Encoders Sub-task Resolved Apache Spark
         
        40.
        Eliminate serialization for back to back operations Sub-task Resolved Michael Armbrust
         
        41.
        Move encoder definition into Aggregator interface Sub-task Resolved Reynold Xin
         
        42.
        Explicit APIs in Scala for specifying encoders Sub-task Resolved Reynold Xin
         

          Activity

          hvanhovell Herman van Hovell added a comment -

          This sounds interesting.

          In order to get this working, we need more information on the (black-box) operators used. So some analysis capability, or some predefined building blocks (SQL-lite, if you will), are probably needed. Apache Flink uses static code analysis and annotations to achieve a similar goal:
          http://flink.apache.org/news/2015/06/24/announcing-apache-flink-0.9.0-release.html
          https://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/programming_guide.html#semantic-annotations

          Any other ideas?

          rxin Reynold Xin added a comment -

          This needs to be designed first. I'm not sure static code analysis is a great idea, since it often fails. I'm open to ideas though.

          saurfang Sen Fang added a comment -

          Another idea is to do something similar to the F# TypeProvider approach: http://fsharp.github.io/FSharp.Data/
          I haven't looked into this extensively just yet, but as far as I understand it, this uses compile-time macros to generate classes based on data sources. In that sense, it is slightly similar to protobuf, where you generate a Java class based on a schema definition. This makes the DataFrame type-safe at the very upstream. With a bit of IDE plugin support, you would even be able to have autocomplete and type checking as you write code, which would be very nice. I'm not sure whether it would scale to propagate this type information downstream (in an aggregation or a transformed DataFrame), though. As I understand it, macros and type providers in Scala provide similar capabilities.

          sandyr Sandy Ryza added a comment -

          To ask the obvious question: what are the reasons that the RDD API couldn't be adapted to these purposes? If I understand correctly, a summary of the differences is that Datasets:
          1. Support encoders for conversion to schema'd / efficiently serializable data
          2. Have a GroupedDataset concept
          3. Execute on Catalyst instead of directly on top of the DAGScheduler

          How difficult would it be to add encoders on top of RDDs, as well as a GroupedRDD? Is there anything in the RDD API contract that says RDDs can't be executed on top of Catalyst?

          Surely this creates some dependency hell as well given that SQL depends on core, but surely that's better than exposing an entirely new API that looks almost like the original one.

          srowen Sean Owen added a comment -

          I had a similar question about how much more this is than the current RDD API. For example, is the idea that, with the help of caller-provided annotations and/or some code analysis, you could deduce more about operations and optimize them further? A lot of the API already covers the basics, like assuming reduce functions are associative, etc.

          I get transformations on domain objects in the style of Spark SQL but I can already "groupBy(customer.name)" in a normal RDD.
          I can also go sorta easily from DataFrames to RDDs and back.

          So I assume it's about static analysis of user functions, in the main?
          Or about getting to/from a Row faster?

          rxin Reynold Xin added a comment -

          Sandy Ryza I thought a lot about doing this on top of the existing RDD API for a while, and that was my preference. However, we would need to break the RDD API, which breaks all existing applications.

          sandyr Sandy Ryza added a comment -

          Reynold Xin where are the places where the API would need to break?

          rxin Reynold Xin added a comment -

          The big ones are:
          1. encoders (which breaks almost every function that has a type parameter that's not T)
          2. "partitions" (partitioning is a physical concept, and shouldn't be required as part of API semantics)
          3. groupBy ...
          ...

          marmbrus Michael Armbrust added a comment -

          Other compatibility-breaking things include getting rid of ClassTags from the public API (a common complaint from Java users) and not using a separate class for Java users (JavaRDD).

          sandyr Sandy Ryza added a comment -

          If I understand correctly, it seems like there are ways to work around each of these issues that, necessarily, make the API dirtier, but avoid the need for a whole new public API.

          • groupBy: deprecate the old groupBy and add a groupWith or groupby method that returns a GroupedRDD.
          • partitions: have -1 be a special value that means "determined by the planner"
          • encoders: what are the main obstacles to addressing this with an EncodedRDD that extends RDD?

          Regarding the issues Michael brought up:
          I'd love to get rid of class tags from the public API as well as take out JavaRDD, but these seem more like "nice to have" than core to the proposal. Am I misunderstanding?

          All of these of course add ugliness, but I think it's really easy to underestimate the cost of introducing a new API. Applications everywhere become legacy and need to be rewritten to take advantage of new features. Code examples and training materials everywhere become invalidated. Can we point to systems that have successfully made a transition like this at this point in their maturity?

          rxin Reynold Xin added a comment - - edited

          Sandy Ryza Your concern is absolutely valid, but I don't think your EncodedRDD proposal works. For one, the map function (and every other function that returns a type different from the RDD's own T) will break. For two, the whole concept of PairRDDFunctions should go away with this new API.

          As I said, it's actually my preference to just use the RDD API. But if you take a look at what's needed here, it'd break too many functions. So we have the following choices:

          1. Don't create a new API, and break the RDD API. People then can't update to newer versions of Spark unless they rewrite their apps. We did this with the SchemaRDD -> DataFrame change, which went well – but SchemaRDD wasn't really an advertised API back then.

          2. Create a new API, and keep RDD API intact. People can update to new versions of Spark, but they can't take full advantage of all the Tungsten/DataFrame work immediately unless they rewrite their apps. Maybe we can implement the RDD API later in some cases using the new API so legacy apps can still take advantage whenever possible (e.g. inferring encoder based on classtags when possible).

          Also the RDD API as I see it today is actually a pretty good way for developers to provide data (i.e. used for data sources). If we break it, we'd still need to come up with a new data input API.

          rxin Reynold Xin added a comment -

          BTW another possible approach that we haven't discussed is that we can start with an experimental new API, and in Spark 2.0 rename it to RDD. I'm less in favor of this because it still means applications can't update to Spark 2.0 without rewriting.

          marmbrus Michael Armbrust added a comment -

          I think improving Java compatibility and getting rid of the ClassTags is more than a nice to have. Having a separate class hierarchy for Java/Scala makes it very hard for people to build higher level libraries that work with both Scala and Java. As a result, I think Java adoption suffers. ClassTags are burdensome for both Scala and Java users.

          In order to make encoders work the way we want, nearly every function that takes a ClassTag today will need to be changed to take an encoder. As Reynold Xin points out, I think that kind of compatibility breaking is actually more damaging for a project of Spark's maturity than providing a higher-level API parallel to RDDs.

          That said, I think source compatibility for common code between RDDs -> Datasets would be great to make sure users can make the transition with as little pain as possible.

          sandyr Sandy Ryza added a comment -

          Thanks for the explanation Reynold Xin and Michael Armbrust. I understand the problem and don't have any great ideas for an alternative workable solution.

          sandyr Sandy Ryza added a comment -

          Maybe you all have thought through this as well, but I had some more thoughts on the proposed API.

          Fundamentally, it seems weird to me that the user is responsible for having a matching Encoder around every time they want to map to a class of a particular type. In 99% of cases, the Encoder used to encode any given type will be the same, and it seems more intuitive to me to specify this up front.

          To be more concrete, suppose I want to use case classes in my app and have a function that can auto-generate an Encoder from a class object (though this might be a little bit time consuming because it needs to use reflection). With the current proposal, any time I want to map my Dataset to a Dataset of some case class, I need to either have a line of code that generates an Encoder for that case class, or have an Encoder already lying around. If I perform this operation within a method, I need to pass the Encoder down to the method and include it in the signature.

          Ideally I would be able to register an EncoderSystem up front that caches Encoders and generates new Encoders whenever it sees a new class used. This still of course requires the user to pass in type information when they call map, but it's easier for them to get this information than an actual encoder. If there's not some principled way to get this working implicitly with ClassTags, the user could just pass in classOf[MyCaseClass] as the second argument to map.

          marmbrus Michael Armbrust added a comment -

          Sandy Ryza did you look at the test cases in Scala and Java linked from the attached design doc?

          In Scala, users should never have to think about Encoders as long as their data can be represented as primitives, case classes, tuples, or collections. Implicits (provided by sqlContext.implicits._) automatically pass the required information to the function.

          In Java, the compiler is not helping us out as much, so the user must do as you suggest. The prototype shows ProductEncoder.tuple(Long.class, Long.class), but we will have a similar interface that works for class objects for POJOs / JavaBeans. The problem with doing this using a registry (like kryo in RDDs today) is that then you aren't finding out the object type until you have an example object from realizing the computation. That is often too late to do the kinds of optimizations that we are trying to enable. Instead we'd like to statically realize the schema at Dataset construction time.

          Encoders are just an encapsulation of the required information and provide an interface if we ever want to allow someone to specify a custom encoder.

          Regarding the performance concerns with reflection, the implementation that is already present in Spark master (SPARK-10993 and SPARK-11090) is based on catalyst expressions. Reflection is done once on the driver, and the existing code generation caching framework is taking care of caching generated encoder bytecode on the executors.
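
          (For illustration, a short sketch of the two ways an encoder gets supplied, written against the Encoders factory in the released API rather than the ProductEncoder prototype mentioned above; the Point class is an assumption for the example.)

          import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

          case class Point(x: Double, y: Double)

          object EncoderSketch {
            def run(spark: SparkSession): Unit = {
              import spark.implicits._

              // Scala: an Encoder[Point] is derived implicitly, so the schema is
              // known when the Dataset is constructed, not when data is first seen.
              val points = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()

              // Explicit construction, which is what Java callers (and any future
              // custom encoders) would plug into.
              val pointEncoder: Encoder[Point] = Encoders.product[Point]
              val pairEncoder: Encoder[(Long, Long)] =
                Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)

              println(pointEncoder.schema)  // schema is available before any data is realized
              println(pairEncoder.schema)
              points.show()
            }
          }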

          sandyr Sandy Ryza added a comment -

          The problem with doing this using a registry (like kryo in RDDs today) is that then you aren't finding out the object type until you have an example object from realizing the computation.

          My suggestion was that the user would still need to pass the class object, so this shouldn't be a problem, unless I'm misunderstanding.

          Thanks for the pointer to the test suite. So am I to understand correctly that with Scala implicits magic I can do the following without any additional boilerplate?

          import <some basic sql stuff>
          
          case class MyClass1(<some fields>)
          case class MyClass2(<some fields>)
          
          val ds : Dataset[MyClass1] = ...
          val ds2: Dataset[MyClass2] = ds.map(funcThatConvertsFromMyClass1ToMyClass2)
          

          and in Java, imagining those case classes above were POJOs, we'd be able to support the following?

          Dataset<MyClass2> ds2 = ds1.map(funcThatConvertsFromMyClass1ToMyClass2, MyClass2.class);
          

          If that's the case, then that resolves my concerns above.

          Lastly, though, IIUC, it seems like for all the common cases, we could register an object with the SparkContext that converts from ClassTag to Encoder, and the RDD API would work. Where does that break down?

          marmbrus Michael Armbrust added a comment -

          Yeah, that Scala code should work. Regarding the Java version, the only difference is that the API I have in mind would be Encoder.for(MyClass2.class). Passing in an encoder instead of a raw Class[_] gives us some extra indirection in case we want to support custom encoders some day.

          I'll add that we can also play reflection tricks in cases where things are not erased for Java, and this is the part of the proposal that is the least thought out at the moment. Any help making this part as powerful/robust as possible would be greatly appreciated.

          I think it is possible that in the long term we will do as you propose and remake the RDD API as a compatibility layer, with the option to infer the encoder based on the ClassTag. The problem with this being the primary implementation is erasure.

          scala> import scala.reflect._
          
          scala> classTag[(Int, Int)].erasure.getTypeParameters
          res0: Array[java.lang.reflect.TypeVariable[Class[_$1]]] forSome { type _$1 } = Array(T1, T2)
          

          We've lost the types of _1 and _2, so we are going to have to fall back on runtime reflection again, per tuple. By contrast, the encoders that are checked into master can extract primitive ints without any additional boxing and encode them directly into Tungsten buffers.
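
          (By way of contrast, a sketch of what the encoder path keeps, assuming the shipped spark.implicits._ is in scope; this is an editorial illustration, not part of the original comment.)

          import org.apache.spark.sql.{Encoder, SparkSession}

          object ErasureVsEncoder {
            def demo(spark: SparkSession): Unit = {
              import spark.implicits._
              // Unlike a ClassTag, the implicitly derived encoder still carries the
              // element types of the tuple, so _1 and _2 can be written as primitive
              // ints without per-tuple reflection or boxing.
              val enc: Encoder[(Int, Int)] = implicitly[Encoder[(Int, Int)]]
              println(enc.schema)  // two int fields, resolved at compile time
            }
          }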

          sandyr Sandy Ryza added a comment -

          So ClassTags would work for case classes and Avro specific records, but wouldn't work for tuples (or anywhere else types get erased). Blrgh. I wonder if the former is enough? Tuples are pretty useful though.

          marmbrus Michael Armbrust added a comment -

          Yeah, I think tuples are a pretty important use case. Perhaps more importantly, though, I think having a concept of encoders instead of relying on JVM types future-proofs the API by giving us more control. If you look closely at the test case examples, there are some pretty crazy macro examples (e.g., R(a = 1, b = 2L)) where we actually create something like named tuples that codegen, at compile time, the logic required to encode the user's results directly into Tungsten format without needing to allocate an intermediate object.

          matei Matei Zaharia added a comment -

          Beyond tuples, you'll also want encoders for other generic classes, such as Seq[T]. They're the cleanest mechanism to get the most type info. Also, from a software engineering point of view it's nice to avoid a central object where you register stuff to allow composition between libraries (basically, see the problems that the Kryo registry creates today).
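
          (A sketch of that composition under the same assumptions as the earlier examples: encoders for nested and generic types are resolved implicitly at each call site, with no global registry; the Order class is illustrative.)

          import org.apache.spark.sql.SparkSession

          case class Order(id: Long, items: Seq[String])

          object GenericEncoderSketch {
            def demo(spark: SparkSession): Unit = {
              import spark.implicits._
              // The derived Encoder[Order] also knows how to encode the Seq[String] field.
              val orders = Seq(Order(1L, Seq("book", "pen")), Order(2L, Seq("lamp"))).toDS()
              orders.printSchema()  // items shows up as an array column
            }
          }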

          nchammas Nicholas Chammas added a comment -

          Arriving a little late to this discussion. Quick question for Reynold/Michael:

          Will Python (and R) get this API in time for 1.6, or is that planned for a later release? Once the Scala API is ready, I'm guessing that the Python version will mostly be a lightweight wrapper around that API.

          sandyr Sandy Ryza added a comment -

          Nicholas Chammas it's not clear that it makes sense to add a similar API for Python and R. The main point of the Dataset API, as I understand it, is to extend DataFrames to take advantage of Java / Scala's static typing systems. This means recovering compile-time type safety, integration with existing Java / Scala object frameworks, and Scala syntactic sugar like pattern matching. Python and R are dynamically typed, so they can't take advantage of these.

          nchammas Nicholas Chammas added a comment - - edited

          Sandy Ryza - Hmm, so are you saying that, generally speaking, Datasets will provide no performance advantages over DataFrames, and that they will just help in terms of catching type errors early?

          Python and R are dynamically typed so can't take advantage of these.

          I can't speak for R, but Python has supported type hints since 3.0. More recently, Python 3.5 introduced a typing module to standardize how type hints are specified, which facilitates the use of static type checkers like mypy. PySpark could definitely offer a statically type checked API, but practically speaking it would have to be limited to Python 3+.

          I suppose people don't generally expect static type checking when they use Python, so perhaps it makes sense not to support Datasets in PySpark.

          maver1ck Maciej Bryński added a comment - - edited

          I think we could also check types in Python.
          As I understand it, Dataset should have a performance advantage over RDD. Am I right?

          smilegator Xiao Li added a comment -

          Agree. The major performance gain of Dataset should come from the Catalyst optimizer.

          nchammas Nicholas Chammas added a comment -

          If you are referring to my comment, note that I am asking about Dataset vs. DataFrame, not Dataset vs. RDD.

          rxin Reynold Xin added a comment -

          Nicholas Chammas Dataset actually will be slightly slower than DataFrame due to the conversion necessary from/to user-defined types. We do codegen all the conversions, but they are still conversions.
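
          (To make that conversion cost concrete, a hypothetical comparison assuming the shipped API; Person and the column name are illustrative, and this is an editorial sketch rather than part of the original comment.)

          import org.apache.spark.sql.{Dataset, SparkSession}
          import org.apache.spark.sql.functions.upper

          case class Person(name: String, age: Long)

          object ConversionCost {
            def demo(spark: SparkSession, people: Dataset[Person]): Unit = {
              import spark.implicits._

              // DataFrame-style: stays in the optimized binary representation.
              val viaColumns = people.select(upper($"name"))

              // Dataset-style: each row is decoded into a Person for the lambda,
              // then the result is re-encoded.
              val viaLambda = people.map(_.name.toUpperCase)

              viaColumns.explain()
              viaLambda.explain()  // compare the physical plans
            }
          }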

          smilegator Xiao Li added a comment -

          Will you publish the exact performance penalty? Is it obvious? For example, larger than 1% in general workloads? I know it depends on the workloads.

          rxin Reynold Xin added a comment -

          We haven't measured it yet, and as you said it is highly workload dependent.

          smilegator Xiao Li added a comment -

          Thank you!

          I think users might need to understand the potential performance difference when deciding between DataFrame and Dataset. I will suggest that my performance team measure it. Let me know if you have any opinions. Thanks again!

          maver1ck Maciej Bryński added a comment -

          Reynold Xin
          What about the Python API? What's the target version?
          https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html

          Quote:
          "As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically:
          Python Support."

          rxin Reynold Xin added a comment - - edited

          After thinking about that more, I don't think it will happen any time soon. We simply don't see a strong benefit in Python to having a type-safe way to work with data. After all, Python itself has no compile-time type safety. And many of the runtime benefits of Dataset are already available in Python via Row's dictionary-like interface.

          maver1ck Maciej Bryński added a comment -

          OK.
          So what about this patch?
          https://issues.apache.org/jira/browse/SPARK-13594

          Should we break backward compatibility in the Python DataFrame API?

          rxin Reynold Xin added a comment -

          Unfortunately I think that's still necessary. The problem is that "map" should return the same type (i.e. DataFrame), not a different type.

          We could certainly use monkey patching to add a compatibility package, though. That might be worth doing.

          nchammas Nicholas Chammas added a comment -

          Python itself has no compile time type safety.

          Practically speaking, this is no longer true. You can get a decent measure of "compile" time type safety using recent additions to Python (both the language itself and the ecosystem).

          Specifically, optional static type checking has been a focus in Python since 3.5+, and according to Python's BDFL both Google and Dropbox are updating large parts of their codebases to use Python's new typing features. Static type checkers for Python like mypy are already in use and are backed by several core Python developers, including Guido van Rossum (Python's creator/BDFL).

          So I don't think Datasets are a critical feature for PySpark just yet, and it will take some time for the general Python community to learn and take advantage of Python's new optional static typing features and tools, but I would keep this on the radar.


            People

            • Assignee:
              marmbrus Michael Armbrust
              Reporter:
              rxin Reynold Xin
            • Votes:
              4
              Watchers:
              69
