[SPARK-6116] DataFrame API improvement umbrella ticket (Spark 1.5) - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: SQL
Labels:
- DataFrame

Target Version/s: 1.5.0
Sprint: Spark 1.5 doc/QA sprint

Description

An umbrella ticket for DataFrame API improvements for Spark 1.5.

SPARK-9576 is the ticket for Spark 1.6.

Issue Links

relates to

SPARK-6956 Improve DataFrame API compatibility with Pandas

Resolved

SPARK-5166 Stabilize Spark SQL APIs

Resolved

SPARK-7239 Statistic functions for DataFrames

Resolved

SPARK-8002 Support virtual columns in SQL and DataFrames

Resolved

Sub-Tasks

1.

describe function for summary statistics

Resolved

Andrey Zagrebin

2.

DataFrame.dropna support

Resolved

Reynold Xin

3.

DataFrame.fillna

Resolved

Reynold Xin

4.

Add RDD methods to DataFrame to preserve schema

Resolved

Joseph K. Bradley

5.

SQLContext.implicits should provide automatic conversion for RDD[Row]

Closed

Unassigned

6.

DataFrame.na.replace value support in Scala/Java

Resolved

Reynold Xin

7.

DataFrame.na.replace value support for Python

Resolved

Adrian Wang

8.

SQLContext.emptyDataFrame should contain 0 rows, not 1 row

Resolved

Reynold Xin

9.

SQLContext.registerFunction -> SQLContext.udf.register

Resolved

Davies Liu

10.

Alias DataFrame.na.fill/drop in Python

Resolved

Reynold Xin

11.

Make DataFrame.rdd a lazy val

Resolved

Cheng Lian

12.

Decide on semantics for string identifiers in DataFrame API

Resolved

Reynold Xin

13.

not able to resolve dot('.') in field name

Resolved

Wenchen Fan

14.

Join on two tables (generated from same one) is broken

Resolved

Reynold Xin

15.

Create a DataFrame join API to facilitate equijoin and self join

Resolved

Reynold Xin

16.

Missing alias function on Python DataFrame

Resolved

Yin Huai

17.

Drop __getattr__ on pyspark.sql.DataFrame

Closed

Unassigned

18.

Stabilize Spark SQL data type API followup

Resolved

Reynold Xin

19.

Stabilize data types

Resolved

Reynold Xin

20.

UDF clean up

Resolved

Reynold Xin

21.

Remove PrimitiveType

Resolved

Reynold Xin

22.

Rename NativeType -> AtomicType

Resolved

Reynold Xin

23.

Clean up Python data type hierarchy

Resolved

Davies Liu

24.

Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python

Resolved

Wenchen Fan

25.

Expression for monotonically increasing IDs

Resolved

Reynold Xin

26.

Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable

Closed

Unassigned

27.

Support math functions in DataFrames

Resolved

Burak Yavuz

28.

Support math functions in DataFrames in Python

Resolved

Burak Yavuz

29.

SQLContext.range()

Resolved

Adrian Wang

30.

Correlation methods for DataFrame

Closed

Burak Yavuz

31.

Add a Column expression for partition ID

Resolved

Reynold Xin

32.

Add randomSplit method to DataFrame

Resolved

Burak Yavuz

33.

Add approximate stratified sampling to DataFrame

Resolved

Xiangrui Meng

34.

collect and take return different results

Resolved

Cheng Hao

35.

Make repartition and coalesce a part of the query plan

Resolved

Burak Yavuz

36.

Support math functions in R DataFrame

Resolved

Qian Huang

37.

Support fillna / dropna in R DataFrame

Resolved

Sun Rui

38.

Create Column expression for array/struct creation

Resolved

Reynold Xin

39.

Add a method for dropping a column in Java/Scala

Resolved

Rakesh Chalasani

40.

withColumn is very slow on dataframe with large number of columns

Resolved

Wenchen Fan

41.

Audit missing Hive functions

Resolved

Reynold Xin

42.

Add Pandas' shift method to the Dataframe API

Closed

Unassigned

43.

Random number generators for DataFrames

Resolved

Burak Yavuz

44.

Add a between function in Column

Resolved

Chen Song

45.

Add bitwise operations to DataFrame DSL

Resolved

Shiti Saxena

46.

Improve the output from DataFrame.show()

Resolved

Chen Song

47.

Add rollup and cube support to DataFrame Java/Scala DSL

Resolved

Cheng Hao

48.

Add Column expression for conditional statements (if, case)

Resolved

Chen Song

49.

Window function support in Scala/Java DataFrame DSL

Resolved

Cheng Hao

50.

Add DataFrame.dropDuplicates

Resolved

Reynold Xin

51.

Move mathfunctions into functions

Resolved

Burak Yavuz

52.

Add coalesce Spark SQL function to PySpark API

Resolved

Olivier Girardot

53.

By default retain group by columns in aggregate

Resolved

Reynold Xin

54.

Provide DataFrame.zip (analog of RDD.zip) to merge two data frames

Closed

Ram Sriharsha

55.

pyspark.sql.types.StructType and Row should implement __iter__()

Closed

Unassigned

56.

pyspark.sql.types.StructType.fromJson() is incorrectly named

Closed

Unassigned

57.

Add drop column to Python DataFrame API

Resolved

Reynold Xin

58.

Break dataframe.py into multiple files

Resolved

Davies Liu

59.

Add explode expression

Resolved

Michael Armbrust

60.

Don't split by dot if within backticks for DataFrame attribute resolution

Resolved

Wenchen Fan

61.

Document all SQL/DataFrame public methods with @since tag

Resolved

Reynold Xin

62.

Document all PySpark SQL/DataFrame public methods with @since tag

Resolved

Davies Liu

63.

DataFrameReader and DataFrameWriter for input/output API

Resolved

Reynold Xin

64.

make explode support struct type

Closed

Unassigned

65.

DataFrame reader/writer API in Python

Resolved

Davies Liu

66.

Figure out what to do with insertInto w.r.t. DataFrameWriter API

Closed

Yin Huai

67.

Add standard deviation aggregate expression

Closed

Unassigned

68.

Add rollup and cube support to DataFrame Python DSL

Resolved

Davies Liu

69.

Window function support in Python DataFrame DSL

Resolved

Davies Liu

70.

Better error for unresolved window functions.

Resolved

Michael Armbrust

71.

DataFrame.ntile() should only accept Int as parameter

Resolved

Davies Liu

72.

Add JavaDoc style deprecation for deprecated DataFrame methods

Resolved

Reynold Xin

73.

Support SQLContext.range(end)

Resolved

Animesh Baranawal

74.

Improve DataFrame Python exception

Closed

Davies Liu

75.

crosstab should use 0 instead of null for pairs that don't appear

Resolved

Reynold Xin

76.

Add methods to facilitate equi-join on multiple join keys

Resolved

L. C. Hsieh

77.

Python DataFrame: support passing a list into describe

Resolved

Amey Chaugule

78.

Improve DataFrame.show() output

Resolved

Akhil Thatipamula

79.

Improve frequent items documentation

Resolved

Burak Yavuz

80.

DataFrameReader/Writer in Python does not match Scala

Resolved

Davies Liu

81.

Add Column.alias to Scala/Java API

Resolved

Reynold Xin

82.

Design an easier way to construct schema for both Scala and Python

Resolved

Ilya Ganelin

83.

Improve Python reader/writer interface doc and testing

Resolved

Reynold Xin

84.

Better AnalysisException for writing DataFrame with identically named columns

Resolved

Animesh Baranawal

85.

DataFrame Python API: Alias replace in DataFrameNaFunctions

Resolved

Reynold Xin

86.

Improve error message reporting for DataFrame and SQL

Resolved

Michael Armbrust

87.

DataFrame hint for broadcast join

Resolved

Reynold Xin

88.

Python DataFrameReader/Writer should mirror scala

Resolved

Cheolsoo Park

89.

Reconcile callUDF and callUdf

Resolved

Benjamin Fradet

90.

Prevent accidental use of "and" and "or" to build invalid expressions in Python

Resolved

Davies Liu

91.

For PySpark's DataFrame API, we need to throw exceptions when users try to use and/or/not

Resolved

Davies Liu

92.

expr function to convert SQL expression into a Column

Resolved

Joseph Batchik

93.

dataframe left joins are not working as expected in pyspark

Resolved

Davies Liu

94.

partitionBy in Python DataFrame reader/writer interface should not default to empty tuple

Resolved

Reynold Xin

95.

Add a "pretty" parameter to show

Resolved

Shixiong Zhu

96.

DataFrame Python API should work with column which has non-ascii character in it

Resolved

Davies Liu

97.

In should not take Any not Column

Resolved

Unassigned

98.

Maintain binary compatibility for in function

Closed

Unassigned

99.

Good errors for invalid input to ExpectsInput expressions

Resolved

Michael Armbrust

100.

Hide JVM stack trace for IllegalArgumentException in Python

Resolved

L. C. Hsieh

101.

Rename inSet to isin to match Pandas function

Resolved

Reynold Xin

People

Assignee: Reynold Xin

Reporter: Reynold Xin

Votes: 0

Watchers: 13

Dates

Created: 02/Mar/15 21:08

Updated: 11/Sep/15 17:20

Resolved: 11/Sep/15 10:20

Time Tracking

Not Specified