Spark / SPARK-40559

Add applyInArrow to pyspark.sql.GroupedData



    Description

      PySpark allows transforming a DataFrame via the Pandas and Arrow APIs:

      from typing import Iterator
      import pandas
      import pyarrow
      
      def map_arrow(it: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
          return it
      
      def map_pandas(it: Iterator[pandas.DataFrame]) -> Iterator[pandas.DataFrame]:
          return it
      
      df.mapInArrow(map_arrow, schema="...")
      df.mapInPandas(map_pandas, schema="...")
      
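      The iterator contract of these map functions (DataFrame chunks in, DataFrame chunks out, one iterator per partition) can be exercised outside Spark. A minimal pandas-only sketch; the column name "v" and the filter are illustrative, not part of the proposal:

```python
from typing import Iterator

import pandas as pd

def map_pandas(it: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Keep only positive values; Spark would feed one such iterator
    # per partition and concatenate the yielded chunks.
    for pdf in it:
        yield pdf[pdf["v"] > 0]

chunks = [pd.DataFrame({"v": [-1, 2]}), pd.DataFrame({"v": [3]})]
result = pd.concat(map_pandas(iter(chunks)), ignore_index=True)
```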

      A grouped DataFrame currently supports only the Pandas API:

      def apply_pandas(df: pandas.DataFrame) -> pandas.DataFrame:
          return df
      
      df.groupBy("id").applyInPandas(apply_pandas, schema="...")
      
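      Outside Spark, the per-group semantics of applyInPandas can be mimicked with pandas' own groupby. A sketch under illustrative assumptions (columns "id" and "v", a demeaning transform):

```python
import pandas as pd

def apply_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    # Demean "v" within the group; applyInPandas calls the
    # function once per group, with the whole group as one frame.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

df = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 3.0, 5.0]})
result = pd.concat(apply_pandas(g) for _, g in df.groupby("id"))
```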

      A similar method for the Arrow API would be useful, especially given that Arrow is used by many other libraries.

      An Arrow-based method would allow processing the DataFrame with any Arrow-based API, e.g. Polars:

      import polars
      import pyarrow
      from typing import Iterator
      
      def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
          return df
      
      def apply_arrow(it: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
          for batch in it:
              df = polars.from_arrow(pyarrow.Table.from_batches([batch]))
              for b in apply_polars(df).to_arrow().to_batches():
                  yield b
      
      df.groupBy("id").applyInArrow(apply_arrow, schema="...")
      

      https://stackoverflow.com/questions/71606278/is-there-an-apache-arrow-equivalent-of-the-spark-pandas-udf
      https://stackoverflow.com/questions/73203318/how-to-transform-spark-dataframe-to-polars-dataframe

      People

            Assignee: Enrico Minack (enricomi)
            Reporter: Enrico Minack (enricomi)