Details
Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version/s: 3.4.0
Description
PySpark allows transforming a DataFrame via the Pandas and Arrow APIs:

from typing import Iterator
import pandas
import pyarrow

def map_arrow(iter: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
    return iter

def map_pandas(iter: Iterator[pandas.DataFrame]) -> Iterator[pandas.DataFrame]:
    return iter

df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")
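For concreteness, a minimal runnable sketch of the iterator-based mapping API, assuming a local SparkSession; the toy DataFrame, the column name "value", and the keep_even function are illustrative only:

from typing import Iterator
import pandas
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(10).withColumnRenamed("id", "value")

# Each element of the iterator is one batch of rows as a pandas DataFrame.
def keep_even(batches: Iterator[pandas.DataFrame]) -> Iterator[pandas.DataFrame]:
    for pdf in batches:
        yield pdf[pdf["value"] % 2 == 0]

df.mapInPandas(keep_even, schema="value long").show()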
A grouped DataFrame currently supports only the Pandas API:
import pandas

def apply_pandas(df: pandas.DataFrame) -> pandas.DataFrame:
    return df

df.groupBy("id").applyInPandas(apply_pandas, schema="...")
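Unlike the map variants, applyInPandas hands each call all rows of one group as a single pandas DataFrame rather than an iterator of batches. A small sketch of that contract, with illustrative data and column names:

import pandas
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

# The whole group arrives as one pandas DataFrame.
def subtract_mean(pdf: pandas.DataFrame) -> pandas.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()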
A similar method for the Arrow API would be useful, especially given that Arrow is used by many other libraries.
An Arrow-based method would allow processing the grouped DataFrame with any Arrow-based API, e.g. Polars:
from typing import Iterator
import polars
import pyarrow

def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
    return df

def apply_arrow(iter: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
    for batch in iter:
        # Convert each Arrow batch to a Polars DataFrame, process it,
        # and emit the result as Arrow batches again.
        df = polars.from_arrow(pyarrow.Table.from_batches([batch]))
        for b in apply_polars(df).to_arrow().to_batches():
            yield b

df.groupBy("id").applyInArrow(apply_arrow, schema="...")
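Until such a method exists, the closest workaround is to route the Polars logic through applyInPandas, paying an extra pandas conversion on both sides of every group; a sketch of that round trip, assuming Polars' from_pandas/to_pandas converters and an illustrative apply_pandas_via_polars wrapper:

import pandas
import polars

def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
    return df

# Workaround available today: convert pandas -> Polars -> pandas for every group.
def apply_pandas_via_polars(pdf: pandas.DataFrame) -> pandas.DataFrame:
    return apply_polars(polars.from_pandas(pdf)).to_pandas()

df.groupBy("id").applyInPandas(apply_pandas_via_polars, schema="...")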
https://stackoverflow.com/questions/71606278/is-there-an-apache-arrow-equivalent-of-the-spark-pandas-udf
https://stackoverflow.com/questions/73203318/how-to-transform-spark-dataframe-to-polars-dataframe