  1. Spark
  2. SPARK-41279

Feature parity: DataFrame API in Spark Connect

Details

    • Type: Umbrella
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Connect
    • Labels: None

    Description

      Implement DataFrame API in Spark Connect.

        Sub-Tasks

          1.
          Implement DataFrame.withColumn(s) Sub-task Resolved Rui Wang
          2.
          Support Collect() in Python client Sub-task Resolved Rui Wang
          3.
          Support Alias for every Relation Sub-task Resolved Rui Wang
          4.
          SELECT * shouldn't be empty project list in proto. Sub-task Resolved Rui Wang
          5.
          Refactor server side tests to only use DataFrame API Sub-task Resolved Rui Wang
          6.
          Initial DSL framework for protobuf testing Sub-task Resolved Rui Wang
          7.
          Implement `DataFrame.fillna` and `DataFrame.na.fill` Sub-task Resolved Ruifeng Zheng
          8.
          Python: rename LogicalPlan.collect to LogicalPlan.to_proto Sub-task Resolved Rui Wang
          9.
          Input relation can be optional for Project in Connect proto Sub-task Resolved Rui Wang
          10.
          [Python] Implement `DataFrame.sample` Sub-task Resolved Ruifeng Zheng
          11.
          Support Repartition in Connect DSL Sub-task Resolved Rui Wang
          12.
          Implement `DataFrame.approxQuantile` and `DataFrame.stat.approxQuantile` Sub-task Resolved Ruifeng Zheng
          13.
          Support CreateView in Connect DSL Sub-task Resolved Rui Wang
          14.
          Implement `DataFrame.SelectExpr` in Python client Sub-task Resolved Rui Wang
          15.
          Add Deduplicate to Connect proto Sub-task Resolved Rui Wang
          16.
          Implement `DataFrame.summary` Sub-task Resolved Ruifeng Zheng
          17.
          Show detailed differences in dataframe comparison Sub-task Resolved Ruifeng Zheng
          18.
          Add Intersect to Connect proto and DSL Sub-task Resolved Unassigned
          19.
          Reimplement df.stat.{cov, corr} with built-in sql functions Sub-task Resolved Ruifeng Zheng
          20.
          Reimplement `frequentItems` with dataframe operations Sub-task Resolved Ruifeng Zheng
          21.
          DataFrame `withColumnsRenamed` can be implemented through `RenameColumns` proto Sub-task Resolved Rui Wang
          22.
          Implement `DataFrame.stat.cov` Sub-task Resolved Ruifeng Zheng
          23.
          Add .agg() to Connect DSL Sub-task Resolved Rui Wang
          24.
          Add groupby to connect DSL and test more than one grouping expressions Sub-task Resolved Rui Wang
          25.
          Support toDF(columnNames) in Connect DSL Sub-task Resolved Rui Wang
          26.
          Compatible `take`, `head` and `first` API in Python client Sub-task Resolved Rui Wang
          27.
          Improve SET operation support in the proto and the server Sub-task Resolved Rui Wang
          28.
          Reimplement `crosstab` with dataframe operations Sub-task Resolved Ruifeng Zheng
          29.
          Implement DataFrame.CreateGlobalView in Python client Sub-task Resolved Rui Wang
          30.
          Implement `DataFrame.sparkSession` in Python client Sub-task Resolved Rui Wang
          31.
          Update relations.proto to follow Connect Proto development guidance Sub-task Resolved Rui Wang
          32.
          Throw exception for Collect() and recommend to use toPandas() Sub-task Resolved Rui Wang
          33.
          Complete Support for Except and Intersect in Python client Sub-task Resolved Rui Wang
          34.
          Implement `DataFrame.dropna` and `DataFrame.na.drop` Sub-task Resolved Ruifeng Zheng
          35.
          Add WHERE to Connect proto and DSL Sub-task Resolved Rui Wang
          36.
          Add as(alias: String) to connect DSL Sub-task Resolved Rui Wang
          37.
          Add a dedicated logical plan for `Summary` Sub-task Resolved Ruifeng Zheng
          38.
          `columns` API should use `schema` API to avoid data fetching Sub-task Resolved Rui Wang
          39.
          Support SelectExpr which apply Projection by expressions in Strings in Connect DSL Sub-task Resolved Rui Wang
          40.
          Implement `DataFrame.stat.corr` Sub-task Resolved Ruifeng Zheng
          41.
          Implement DataFrame cross join Sub-task Resolved Xinrong Meng
          42.
          Explain API can support different modes Sub-task Resolved Rui Wang
          43.
          Support Join UsingColumns in proto Sub-task Resolved Rui Wang
          44.
          Remove `str` from Aggregate expression type Sub-task Resolved Rui Wang
          45.
          Implement `DataFrame.sortWithinPartitions` Sub-task Resolved Ruifeng Zheng
          46.
          Implement `DataFrame.show` Sub-task Resolved Ruifeng Zheng
          47.
          Support List[Column] for Join's on argument. Sub-task Resolved Rui Wang
          48.
          Add limit and offset to Connect DSL Sub-task Resolved Rui Wang
          49.
          Add Sample to proto and DSL Sub-task Resolved Rui Wang
          50.
          Implement `DataFrame.__repr__` and `DataFrame.dtypes` Sub-task Resolved Ruifeng Zheng
          51.
          Implement `DataFrame.isEmpty` Sub-task Resolved Ruifeng Zheng
          52.
          Connect Proto should carry unparsed identifiers Sub-task Resolved Rui Wang
          53.
          Reimplement `summary` with dataframe operations Sub-task Resolved Ruifeng Zheng
          54.
          Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab` Sub-task Resolved Ruifeng Zheng
          55.
          DataFrame.to_pandas should not return optional pandas dataframe Sub-task Resolved Rui Wang
          56.
          Improve `on` in Join in Python client Sub-task Resolved Rui Wang
          57.
          Add missing `limit(n)` in DataFrame.head Sub-task Resolved Ruifeng Zheng
          58.
          Complete Support for Union in Python client Sub-task Resolved Rui Wang
          59.
          Implement `DataFrame.drop` Sub-task Resolved Ruifeng Zheng
          60.
          Extend support for Join Relation Sub-task Resolved Rui Wang
          61.
          Dataframe.transform in Python client support Sub-task Resolved Martin Grund
          62.
          StructType should contain a list of StructField and each field should have a name Sub-task Resolved Rui Wang
          63.
          AnalyzeResult should use struct for schema Sub-task Resolved Rui Wang
          64.
          Change default serialization from 'broken' CSV to Spark DF JSON Sub-task Resolved Martin Grund
          65.
          Imports more from connect proto package to avoid calling `proto.` for Connect DSL Sub-task Resolved Rui Wang
          66.
          Support other data type conversion in the DataTypeProtoConverter Sub-task Resolved Unassigned
          67.
          Adopt `optional` keyword from proto3 which offers `hasXXX` to differentiate if a field is set or unset Sub-task Resolved Rui Wang
          68.
          Add ClientType to proto to indicate which client sends a request Sub-task Resolved Rui Wang
          69.
          Make AnalyzePlan support multiple analysis tasks Sub-task Resolved Ruifeng Zheng
          70.
          Removing unused code in connect Sub-task Resolved Deng Ziming
          71.
          `DataFrame.explain` should print and return None Sub-task Resolved Ruifeng Zheng
          72.
          Support string sql expressions in DF.where() Sub-task Resolved Martin Grund
          73.
          Add missing docs for DataFrame API Sub-task Resolved Rui Wang
          74.
          Improve `DataFrame.count()` Sub-task Resolved Rui Wang
          75.
          Implement DataFrame.toDF Sub-task Resolved Rui Wang
          76.
          Implement DataFrame.withColumnRenamed Sub-task Resolved Rui Wang
          77.
          Implement `DataFrame.replace` and `DataFrame.na.replace` Sub-task Resolved Ruifeng Zheng
          78.
          Add missing avg() to DF group Sub-task Resolved Martin Grund
          79.
          Bug in Deduplicate Python transformation Sub-task Resolved Martin Grund
          80.
          Improve Documentation for Take, Tail, Limit and Offset Sub-task Resolved Rui Wang
          81.
          Add orderBy and drop_duplicates Sub-task Resolved Ruifeng Zheng
          82.
          Make `Groupby.{min, max, sum, avg, mean}` compatible with PySpark Sub-task Resolved Ruifeng Zheng
          83.
          Implement `DataFrame.hint` Sub-task Resolved Deng Ziming
          84.
          Implement `DataFrame.repartitionByRange` Sub-task Resolved Deng Ziming
          85.
          DF.groupby.agg() API should be compatible Sub-task Resolved Martin Grund
          86.
          Support DataFrame TempView Sub-task Resolved Rui Wang
          87.
          Implement `DataFrame.cube` Sub-task Resolved Ruifeng Zheng
          88.
          Should use SQLExpression for str arguments in Projection Sub-task Resolved Unassigned
          89.
          Implement DataFrame.describe Sub-task Resolved Jiaan Geng
          90.
          Implement DataFrame.colRegex Sub-task Resolved Ruifeng Zheng
          91.
          Implement `DataFrame.melt` and `DataFrame.unpivot` Sub-task Resolved Ruifeng Zheng
          92.
          Implement DataFrame.randomSplit Sub-task Resolved Jiaan Geng
          93.
          Implement DataFrame.subtract Sub-task Resolved Jiaan Geng
          94.
          Implement DataFrame.to Sub-task Resolved Jiaan Geng
          95.
          pyspark_types_to_proto_types should support StructType. Sub-task Resolved Jiaan Geng
          96.
          Factor GroupedData out to group.py Sub-task Resolved Hyukjin Kwon
          97.
          Implement `DataFrame.rollup` Sub-task Resolved Ruifeng Zheng
          98.
          Implement `GroupedData.pivot` Sub-task Resolved Ruifeng Zheng
          99.
          pyspark_types_to_proto_types should support MapType Sub-task Resolved Jiaan Geng
          100.
          Implement the command logic for print and _repr_html_ Sub-task Resolved Hyukjin Kwon
          101.
          pyspark_types_to_proto_types should support ArrayType Sub-task Resolved Jiaan Geng
          102.
          Implement `GroupedData.{min, max, avg, sum}` Sub-task Resolved Ruifeng Zheng
          103.
          Support multiple arguments in groupBy.max(...) Sub-task Resolved Hyukjin Kwon
          104.
          Support multiple arguments in groupBy.avg(...) Sub-task Resolved Hyukjin Kwon
          105.
          Support multiple arguments in groupBy.min(...) Sub-task Resolved Hyukjin Kwon
          106.
          Support multiple arguments in groupBy.sum(...) Sub-task Resolved Apache Spark
          107.
          Implement `DataFrame.freqItems` and `DataFrame.stat.freqItems` Sub-task Resolved Unassigned
          108.
          Implement `DataFrame.sampleBy` and `DataFrame.stat.sampleBy` Sub-task Resolved Ruifeng Zheng
          109.
          Support star in groupBy.agg() Sub-task Resolved Ruifeng Zheng
          110.
          groupBy(...).agg(...).sort does not actually sort the output Sub-task Resolved Martin Grund
          111.
          Make getitem support filter and select Sub-task Resolved Ruifeng Zheng
          112.
          Implement `GroupedData.mean` Sub-task Resolved Ruifeng Zheng
          113.
          DataFrame.join creating ambiguous column names Sub-task Resolved Ruifeng Zheng
          114.
          Implement Dataframe.rdd getNumPartitions Sub-task Resolved Unassigned
          115.
          Fix `isnan` function Sub-task Resolved Ruifeng Zheng
          116.
          DataFrame windowspec functions : unresolved columns Sub-task Resolved Ruifeng Zheng
          117.
          DataFrame.show(): 'Column' object is not callable Sub-task Resolved Ruifeng Zheng
          118.
          Fix DataFrame.describe Sub-task Resolved Jiaan Geng
          119.
          DataFrame.collect() output parity with pyspark Sub-task Resolved Ruifeng Zheng
          120.
          DataFrame hint parameter can be str, float or int Sub-task Resolved Sandeep Singh
          121.
          `DataFrame.collect` should handle None/NaN properly Sub-task Resolved Ruifeng Zheng
          122.
          DataFrame.show formatting int as double Sub-task Resolved Ruifeng Zheng
          123.
          Implement Dataframe.sort, sortWithinPartitions Ordering Sub-task Resolved Ruifeng Zheng
          124.
          Fix DataFrame.sample parameters Sub-task Resolved Sandeep Singh
          125.
          DataFrame.groupBy requires all cols be Column or str Sub-task Resolved Ruifeng Zheng
          126.
          DataFrame.transform: Only Column or String can be used for projections Sub-task Resolved Ruifeng Zheng
          127.
          Implement DataFrame.explain format to be similar to PySpark Sub-task Resolved Jiaan Geng
          128.
          DataFrame dropDuplicates should throw error on non list argument Sub-task Resolved Hyukjin Kwon
          129.
          Throw proper errors in Dataset.to() Sub-task Resolved Jiaan Geng
          130.
          Window.rowsBetween should handle `float("-inf")` and `float("+inf")` as argument Sub-task Resolved Sandeep Singh
          131.
          Make StructType support metadata and Implement `DataFrame.withMetadata` Sub-task Resolved Ruifeng Zheng
          132.
          Enable the doctest for `DataFrame.hint` Sub-task Resolved Ruifeng Zheng
          133.
          DataFrame.createDataFrame converting int to bigint Sub-task Resolved Ruifeng Zheng
          134.
          Handle Function `rand()` Sub-task Resolved Hyukjin Kwon
          135.
          Python: connect client lost column data with pyarrow.Table.to_pylist Sub-task Resolved Jiaan Geng
          136.
          Add `DataFrame.writeTo` to the unsupported list Sub-task Resolved Ruifeng Zheng
          137.
          Add the unsupported list for `GroupedData` Sub-task Resolved Ruifeng Zheng
          138.
          Make `withMetadata` reuse the `withColumns` proto Sub-task Resolved Ruifeng Zheng
          139.
          Function `slice` should handle string in params Sub-task Resolved Hyukjin Kwon
          140.
          Fix Function `nth_value` functions output Sub-task Resolved Unassigned
          141.
          `DataFrame.collect` should support nested types Sub-task Resolved Apache Spark
          142.
          Function `sampleby` return parity Sub-task Resolved Jiaan Geng
          143.
          `DataFrame.intersect` doctest output has different order Sub-task Resolved Jiaan Geng
          144.
          Support DataFrame hint parameter to be list Sub-task Resolved Ruifeng Zheng
          145.
          DataFrame.unionByName output is wrong Sub-task Resolved Sandeep Singh
          146.
          Implement DataFrame `semanticHash` Sub-task Resolved Unassigned
          147.
          Better type errors when passing wrong parameters Sub-task In Progress Unassigned
          148.
          Implement DataFrame.observe Sub-task Resolved Jiaan Geng
          149.
          Parity in Error types between pyspark and connect functions Sub-task Resolved Sandeep Singh
          150.
          Implement DataFrame `sameSemantics` Sub-task Resolved Unassigned
          151.
          Implement DataFrame `toLocalIterator` Sub-task Resolved Takuya Ueshin
          152.
          createDataFrame supports column with map type. Sub-task Resolved Unassigned
          153.
          Decouple plan transformation and validation on server side Sub-task Open Unassigned
          154.
          DataFrame.join: ambiguous column Sub-task Resolved Ruifeng Zheng
          155.
          DataFrame.createOrReplaceGlobalTempView - SparkConnectException: requirement failed Sub-task Resolved Takuya Ueshin
          156.
          DataFrame.createDataFrame datatype conversion error Sub-task Resolved Ruifeng Zheng
          157.
          DataFrame.show() fix map printing Sub-task Resolved Ruifeng Zheng
          158.
          DataFrame mapfield,structlist invalid type Sub-task Resolved Ruifeng Zheng
          159.
          Implement DataFrame `pandas_api` Sub-task Resolved Sandeep Singh
          160.
          DataFrame `toPandas` parity in return types Sub-task Resolved Hyukjin Kwon
          161.
          Support StreamingQueryListener for DataFrame.observe Sub-task Resolved Jiaan Geng
          162.
          Parity in String representation of Column Sub-task Resolved Hyukjin Kwon
          163.
          Parity in String representation of higher_order_function's output Sub-task Resolved Ruifeng Zheng
          164.
          Different exception message in DataFrame.unpivot Sub-task Resolved Takuya Ueshin
          165.
          Fix map_filter and map_zip_with output order Sub-task Resolved Jiaan Geng
          166.
          Factor data conversion `arrow -> rows` out to `conversion.py` Sub-task Resolved Ruifeng Zheng
          167.
          Make `from_arrow_schema` support nested types Sub-task Resolved Ruifeng Zheng
          168.
          Different result in nested lambda function Sub-task Resolved Ruifeng Zheng
          169.
          Failed to test ClientE2ETestSuite with maven Sub-task Resolved Yang Jie
          170.
          DataFrame.createTempView - SparkConnectGrpcException: requirement failed Sub-task Resolved Takuya Ueshin
          171.
          Support left_outer join Sub-task Resolved Ruifeng Zheng
          172.
          Different exception in DataFrame.sample Sub-task Resolved Takuya Ueshin
          173.
          DataFrame.drop should handle duplicated columns properly Sub-task Resolved Ruifeng Zheng
          174.
          Make `DataFrame.select` support `a.*` Sub-task Resolved Ruifeng Zheng
          175.
          Union avoid calling `output` before analysis Sub-task Resolved Ruifeng Zheng
          176.
          Refactor the AnalyzePlan RPC and add `session.version` Sub-task Resolved Ruifeng Zheng
          177.
          Implement DataFrame.registerTempTable Sub-task Resolved Takuya Ueshin
          178.
          Fix toPandas to handle timezone and map types properly. Sub-task Resolved Takuya Ueshin
          179.
          Implement cache, persist, unpersist, and storageLevel Sub-task Resolved Takuya Ueshin
          180.
          make mapInPandas / mapInArrow support "is_barrier" Sub-task Resolved Weichen Xu
          181.
          Fix the comparison of the results with Arrow optimization enabled/disabled. Sub-task Resolved Takuya Ueshin
          182.
          Fix createDataFrame from pandas with map type Sub-task Resolved Takuya Ueshin
          183.
          Fix the error message of createDataFrame from np.array(0) Sub-task Resolved Takuya Ueshin
          184.
          Fix test_createDataFrame_with_single_data_type. Sub-task Resolved Takuya Ueshin
          185.
          Fix createDataFrame from pandas to respect session timezone. Sub-task Resolved Takuya Ueshin
          186.
          Fix DataFrame.collect with null struct. Sub-task Resolved Takuya Ueshin
          187.
          Implement eager evaluation. Sub-task Resolved Takuya Ueshin
          188.
          Decouple handle command and send response on server side Sub-task Open Unassigned
          189.
          Implement DataFrame.foreach Sub-task Resolved Hyukjin Kwon
          190.
          Implement DataFrame.foreachPartition Sub-task Resolved Hyukjin Kwon
          191.
          Investigate the behavior difference in self-join Sub-task Open Unassigned

            People

              podongfeng Ruifeng Zheng
              gurwls223 Hyukjin Kwon
              Hyukjin Kwon
              Votes: 0
              Watchers: 4
