[SPARK-41279] Feature parity: DataFrame API in Spark Connect - ASF JIRA

Details

Type: Umbrella
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: Connect
Labels: None

Epic Link: Spark Connect

Description

Implement DataFrame API in Spark Connect.

Issue Links

split from	SPARK-39674 Initial protobuf definition for Spark Connect API		Resolved

Sub-Tasks

1.	Implement DataFrame.withColumn(s)		Resolved	Rui Wang
2.	Support Collect() in Python client		Resolved	Rui Wang
3.	Support Alias for every Relation		Resolved	Rui Wang
4.	SELECT * shouldn't be empty project list in proto.		Resolved	Rui Wang
5.	Refactor server side tests to only use DataFrame API		Resolved	Rui Wang
6.	Initial DSL framework for protobuf testing		Resolved	Rui Wang
7.	Implement `DataFrame.fillna` and `DataFrame.na.fill`		Resolved	Ruifeng Zheng
8.	Python: rename LogicalPlan.collect to LogicalPlan.to_proto		Resolved	Rui Wang
9.	Input relation can be optional for Project in Connect proto		Resolved	Rui Wang
10.	[Python] Implement `DataFrame.sample`		Resolved	Ruifeng Zheng
11.	Support Repartition in Connect DSL		Resolved	Rui Wang
12.	Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile`		Resolved	Ruifeng Zheng
13.	Support CreateView in Connect DSL		Resolved	Rui Wang
14.	Implement `DataFrame.SelectExpr` in Python client		Resolved	Rui Wang
15.	Add Deduplicate to Connect proto		Resolved	Rui Wang
16.	Implement `DataFrame.summary`		Resolved	Ruifeng Zheng
17.	Show detailed differences in dataframe comparison		Resolved	Ruifeng Zheng
18.	Add Intersect to Connect proto and DSL		Resolved	Unassigned
19.	Reimplement df.stat.{cov, corr} with built-in sql functions		Resolved	Ruifeng Zheng
20.	Reimplement `frequentItems` with dataframe operations		Resolved	Ruifeng Zheng
21.	DataFrame `withColumnsRenamed` can be implemented through `RenameColumns` proto		Resolved	Rui Wang
22.	Implement `DataFrame.stat.cov`		Resolved	Ruifeng Zheng
23.	Add .agg() to Connect DSL		Resolved	Rui Wang
24.	Add groupby to connect DSL and test more than one grouping expressions		Resolved	Rui Wang
25.	Support toDF(columnNames) in Connect DSL		Resolved	Rui Wang
26.	Compatible `take`, `head` and `first` API in Python client		Resolved	Rui Wang
27.	Improve SET operation support in the proto and the server		Resolved	Rui Wang
28.	Reimplement `crosstab` with dataframe operations		Resolved	Ruifeng Zheng
29.	Implement DataFrame.CreateGlobalView in Python client		Resolved	Rui Wang
30.	Implement `DataFrame.sparkSession` in Python client		Resolved	Rui Wang
31.	Update relations.proto to follow Connect Proto development guidance		Resolved	Rui Wang
32.	Throw exception for Collect() and recommend to use toPandas()		Resolved	Rui Wang
33.	Complete Support for Except and Intersect in Python client		Resolved	Rui Wang
34.	Implement `DataFrame.dropna` and `DataFrame.na.drop`		Resolved	Ruifeng Zheng
35.	Add WHERE to Connect proto and DSL		Resolved	Rui Wang
36.	Add as(alias: String) to connect DSL		Resolved	Rui Wang
37.	Add a dedicated logical plan for `Summary`		Resolved	Ruifeng Zheng
38.	`columns` API should use `schema` API to avoid data fetching		Resolved	Rui Wang
39.	Support SelectExpr which apply Projection by expressions in Strings in Connect DSL		Resolved	Rui Wang
40.	Implement `DataFrame.stat.corr`		Resolved	Ruifeng Zheng
41.	Implement DataFrame cross join		Resolved	Xinrong Meng
42.	Explain API can support different modes		Resolved	Rui Wang
43.	Support Join UsingColumns in proto		Resolved	Rui Wang
44.	Remove `str` from Aggregate expression type		Resolved	Rui Wang
45.	Implement `DataFrame.sortWithinPartitions`		Resolved	Ruifeng Zheng
46.	Implement `DataFrame.show`		Resolved	Ruifeng Zheng
47.	Support List[Column] for Join's on argument.		Resolved	Rui Wang
48.	Add limit and offset to Connect DSL		Resolved	Rui Wang
49.	Add Sample to proto and DSL		Resolved	Rui Wang
50.	Implement `DataFrame.__repr__` and `DataFrame.dtypes`		Resolved	Ruifeng Zheng
51.	Implement `DataFrame.isEmpty`		Resolved	Ruifeng Zheng
52.	Connect Proto should carry unparsed identifiers		Resolved	Rui Wang
53.	Reimplement `summary` with dataframe operations		Resolved	Ruifeng Zheng
54.	Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`		Resolved	Ruifeng Zheng
55.	DataFrame.to_pandas should not return optional pandas dataframe		Resolved	Rui Wang
56.	Improve `on` in Join in Python client		Resolved	Rui Wang
57.	Add missing `limit(n)` in DataFrame.head		Resolved	Ruifeng Zheng
58.	Complete Support for Union in Python client		Resolved	Rui Wang
59.	Implement `DataFrame.drop`		Resolved	Ruifeng Zheng
60.	Extend support for Join Relation		Resolved	Rui Wang
61.	Dataframe.transform in Python client support		Resolved	Martin Grund
62.	StructType should contain a list of StructField and each field should have a name		Resolved	Rui Wang
63.	AnalyzeResult should use struct for schema		Resolved	Rui Wang
64.	Change default serialization from 'broken' CSV to Spark DF JSON		Resolved	Martin Grund
65.	Import more from connect proto package to avoid calling `proto.` for Connect DSL		Resolved	Rui Wang
66.	Support other data type conversion in the DataTypeProtoConverter		Resolved	Unassigned
67.	Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset		Resolved	Rui Wang
68.	Add ClientType to proto to indicate which client sends a request		Resolved	Rui Wang
69.	Make AnalyzePlan support multiple analysis tasks		Resolved	Ruifeng Zheng
70.	Removing unused code in connect		Resolved	Deng Ziming
71.	`DataFrame.explain` should print and return None		Resolved	Ruifeng Zheng
72.	Support string sql expressions in DF.where()		Resolved	Martin Grund
73.	Add missing docs for DataFrame API		Resolved	Rui Wang
74.	Improve `DataFrame.count()`		Resolved	Rui Wang
75.	Implement DataFrame.toDF		Resolved	Rui Wang
76.	Implement DataFrame.withColumnRenamed		Resolved	Rui Wang
77.	Implement `DataFrame.replace` and `DataFrame.na.replace`		Resolved	Ruifeng Zheng
78.	Add missing avg() to DF group		Resolved	Martin Grund
79.	Bug in Deduplicate Python transformation		Resolved	Martin Grund
80.	Improve Documentation for Take, Tail, Limit and Offset		Resolved	Rui Wang
81.	Add orderBy and drop_duplicates		Resolved	Ruifeng Zheng
82.	Make `Groupby.{min, max, sum, avg, mean}` compatible with PySpark		Resolved	Ruifeng Zheng
83.	Implement `DataFrame.hint`		Resolved	Deng Ziming
84.	Implement `DataFrame.repartitionByRange`		Resolved	Deng Ziming
85.	DF.groupby.agg() API should be compatible		Resolved	Martin Grund
86.	Support DataFrame TempView		Resolved	Rui Wang
87.	Implement `DataFrame.cube`		Resolved	Ruifeng Zheng
88.	Should use SQLExpression for str arguments in Projection		Resolved	Unassigned
89.	Implement DataFrame.describe		Resolved	Jiaan Geng
90.	Implement DataFrame.colRegex		Resolved	Ruifeng Zheng
91.	Implement `DataFrame.melt` and `DataFrame.unpivot`		Resolved	Ruifeng Zheng
92.	Implement DataFrame.randomSplit		Resolved	Jiaan Geng
93.	Implement DataFrame.subtract		Resolved	Jiaan Geng
94.	Implement DataFrame.to		Resolved	Jiaan Geng
95.	pyspark_types_to_proto_types should support StructType.		Resolved	Jiaan Geng
96.	Factor GroupedData out to group.py		Resolved	Hyukjin Kwon
97.	Implement `DataFrame.rollup`		Resolved	Ruifeng Zheng
98.	Implement `GroupedData.pivot`		Resolved	Ruifeng Zheng
99.	pyspark_types_to_proto_types should support MapType		Resolved	Jiaan Geng
100.	Implement the command logic for print and _repr_html_		Resolved	Hyukjin Kwon
101.	pyspark_types_to_proto_types should support ArrayType		Resolved	Jiaan Geng
102.	Implement `GroupedData.{min, max, avg, sum}`		Resolved	Ruifeng Zheng
103.	Support multiple arguments in groupBy.max(...)		Resolved	Hyukjin Kwon
104.	Support multiple arguments in groupBy.avg(...)		Resolved	Hyukjin Kwon
105.	Support multiple arguments in groupBy.min(...)		Resolved	Hyukjin Kwon
106.	Support multiple arguments in groupBy.sum(...)		Resolved	Apache Spark
107.	Implement `DataFrame.freqItems` and `DataFrame.stat.freqItems`		Resolved	Unassigned
108.	Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy`		Resolved	Ruifeng Zheng
109.	Support star in groupBy.agg()		Resolved	Ruifeng Zheng
110.	groupBy(...).agg(...).sort does not actually sort the output		Resolved	Martin Grund
111.	Make getitem support filter and select		Resolved	Ruifeng Zheng
112.	Implement `GroupedData.mean`		Resolved	Ruifeng Zheng
113.	DataFrame.join creating ambiguous column names		Resolved	Ruifeng Zheng
114.	Implement Dataframe.rdd getNumPartitions		Resolved	Unassigned
115.	Fix `isnan` function		Resolved	Ruifeng Zheng
116.	DataFrame windowspec functions : unresolved columns		Resolved	Ruifeng Zheng
117.	DataFrame.show(): 'Column' object is not callable		Resolved	Ruifeng Zheng
118.	Fix DataFrame.describe		Resolved	Jiaan Geng
119.	DataFrame.collect() output parity with pyspark		Resolved	Ruifeng Zheng
120.	DataFrame hint parameter can be str, float or int		Resolved	Sandeep Singh
121.	`DataFrame.collect` should handle None/NaN properly		Resolved	Ruifeng Zheng
122.	DataFrame.show formatting int as double		Resolved	Ruifeng Zheng
123.	Implement DataFrame.sort, sortWithinPartitions Ordering		Resolved	Ruifeng Zheng
124.	Fix DataFrame.sample parameters		Resolved	Sandeep Singh
125.	DataFrame.groupBy requires all cols be Column or str		Resolved	Ruifeng Zheng
126.	DataFrame.transform: Only Column or String can be used for projections		Resolved	Ruifeng Zheng
127.	Implement DataFrame.explain format to be similar to PySpark		Resolved	Jiaan Geng
128.	DataFrame dropDuplicates should throw error on non list argument		Resolved	Hyukjin Kwon
129.	Throw proper errors in Dataset.to()		Resolved	Jiaan Geng
130.	Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument		Resolved	Sandeep Singh
131.	Make StructType support metadata and Implement `DataFrame.withMetadata`		Resolved	Ruifeng Zheng
132.	Enable the doctest for `DataFrame.hint`		Resolved	Ruifeng Zheng
133.	DataFrame.createDataFrame converting int to bigint		Resolved	Ruifeng Zheng
134.	Handle Function `rand()`		Resolved	Hyukjin Kwon
135.	Python: connect client lost column data with pyarrow.Table.to_pylist		Resolved	Jiaan Geng
136.	Add `DataFrame.writeTo` to the unsupported list		Resolved	Ruifeng Zheng
137.	Add the unsupported list for `GroupedData`		Resolved	Ruifeng Zheng
138.	Make `withMetadata` reuse the `withColumns` proto		Resolved	Ruifeng Zheng
139.	Function `slice` should handle string in params		Resolved	Hyukjin Kwon
140.	Fix Function `nth_value` functions output		Resolved	Unassigned
141.	`DataFrame.collect` should support nested types		Resolved	Apache Spark
142.	Function `sampleby` return parity		Resolved	Jiaan Geng
143.	`DataFrame.intersect` doctest output has different order		Resolved	Jiaan Geng
144.	Support DataFrame hint parameter to be list		Resolved	Ruifeng Zheng
145.	DataFrame.unionByName output is wrong		Resolved	Sandeep Singh
146.	Implement DataFrame `semanticHash`		Resolved	Unassigned
147.	Better type errors when passing wrong parameters		In Progress	Unassigned
148.	Implement DataFrame.observe		Resolved	Jiaan Geng
149.	Parity in Error types between pyspark and connect functions		Resolved	Sandeep Singh
150.	Implement DataFrame `sameSemantics`		Resolved	Unassigned
151.	Implement DataFrame `toLocalIterator`		Resolved	Takuya Ueshin
152.	createDataFrame supports column with map type.		Resolved	Unassigned
153.	Decouple plan transformation and validation on server side		Open	Unassigned
154.	DataFrame.join: ambiguous column		Resolved	Ruifeng Zheng
155.	DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed		Resolved	Takuya Ueshin
156.	DataFrame.createDataFrame datatype conversion error		Resolved	Ruifeng Zheng
157.	DataFrame.show() fix map printing		Resolved	Ruifeng Zheng
158.	DataFrame mapfield, structlist invalid type		Resolved	Ruifeng Zheng
159.	Implement DataFrame `pandas_api`		Resolved	Sandeep Singh
160.	DataFrame `toPandas` parity in return types		Resolved	Hyukjin Kwon
161.	Support StreamingQueryListener for DataFrame.observe		Resolved	Jiaan Geng
162.	Parity in String representation of Column		Resolved	Hyukjin Kwon
163.	Parity in String representation of higher_order_function's output		Resolved	Ruifeng Zheng
164.	Different exception message in DataFrame.unpivot		Resolved	Takuya Ueshin
165.	Fix map_filter and map_zip_with output order		Resolved	Jiaan Geng
166.	Factor data conversion `arrow -> rows` out to `conversion.py`		Resolved	Ruifeng Zheng
167.	Make `from_arrow_schema` support nested types		Resolved	Ruifeng Zheng
168.	Different result in nested lambda function		Resolved	Ruifeng Zheng
169.	Failed to test ClientE2ETestSuite with maven		Resolved	Yang Jie
170.	DataFrame.createTempView - SparkConnectGrpcException: requirement failed		Resolved	Takuya Ueshin
171.	Support left_outer join		Resolved	Ruifeng Zheng
172.	Different exception in DataFrame.sample		Resolved	Takuya Ueshin
173.	DataFrame.drop should handle duplicated columns properly		Resolved	Ruifeng Zheng
174.	Make `DataFrame.select` support `a.*`		Resolved	Ruifeng Zheng
175.	Union avoid calling `output` before analysis		Resolved	Ruifeng Zheng
176.	Refactor the AnalyzePlan RPC and add `session.version`		Resolved	Ruifeng Zheng
177.	Implement DataFrame.registerTempTable		Resolved	Takuya Ueshin
178.	Fix toPandas to handle timezone and map types properly.		Resolved	Takuya Ueshin
179.	Implement cache, persist, unpersist, and storageLevel		Resolved	Takuya Ueshin
180.	make mapInPandas / mapInArrow support "is_barrier"		Resolved	Weichen Xu
181.	Fix the comparison of the results with Arrow optimization enabled/disabled.		Resolved	Takuya Ueshin
182.	Fix createDataFrame from pandas with map type		Resolved	Takuya Ueshin
183.	Fix the error message of createDataFrame from np.array(0)		Resolved	Takuya Ueshin
184.	Fix test_createDataFrame_with_single_data_type.		Resolved	Takuya Ueshin
185.	Fix createDataFrame from pandas to respect session timezone.		Resolved	Takuya Ueshin
186.	Fix DataFrame.collect with null struct.		Resolved	Takuya Ueshin
187.	Implement eager evaluation.		Resolved	Takuya Ueshin
188.	Decouple handle command and send response on server side		Open	Unassigned
189.	Implement DataFrame.foreach		Resolved	Hyukjin Kwon
190.	Implement DataFrame.foreachPartition		Resolved	Hyukjin Kwon
191.	Investigate the behavior difference in self-join		Open	Unassigned

People

Assignee: Ruifeng Zheng

Reporter: Hyukjin Kwon

Shepherd: Hyukjin Kwon

Votes: 0

Watchers: 4

Dates

Created: 28/Nov/22 01:25

Updated: 29/Aug/23 04:15