Spark / SPARK-40559

Add applyInArrow to pyspark.sql.GroupedData



    Description

      PySpark allows transforming a DataFrame via the Pandas and Arrow APIs:

      from typing import Iterator
      import pandas
      import pyarrow
      
      def map_arrow(it: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
          return it
      
      def map_pandas(it: Iterator[pandas.DataFrame]) -> Iterator[pandas.DataFrame]:
          return it
      
      df.mapInArrow(map_arrow, schema="...")
      df.mapInPandas(map_pandas, schema="...")
      
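      The iterator contract of these map functions (DataFrame chunks in, DataFrame chunks out, one iterator per partition) can be exercised outside Spark. A minimal pandas-only sketch; the column name "v" and the filter are illustrative, not part of the proposal:

```python
from typing import Iterator

import pandas as pd

def map_pandas(it: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Keep only positive values; Spark would feed one such iterator
    # per partition and concatenate the yielded chunks.
    for pdf in it:
        yield pdf[pdf["v"] > 0]

chunks = [pd.DataFrame({"v": [-1, 2]}), pd.DataFrame({"v": [3]})]
result = pd.concat(map_pandas(iter(chunks)), ignore_index=True)
```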

      A grouped DataFrame currently supports only the Pandas API:

      def apply_pandas(df: pandas.DataFrame) -> pandas.DataFrame:
          return df
      
      df.groupBy("id").applyInPandas(apply_pandas, schema="...")
      
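      Outside Spark, the per-group semantics of applyInPandas can be mimicked with pandas' own groupby. A sketch under illustrative assumptions (columns "id" and "v", a demeaning transform):

```python
import pandas as pd

def apply_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    # Demean "v" within the group; applyInPandas calls the
    # function once per group, with the whole group as one frame.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

df = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 3.0, 5.0]})
result = pd.concat(apply_pandas(g) for _, g in df.groupby("id"))
```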

      A similar method for the Arrow API would be useful, especially given that Arrow is used by many other libraries.

      An Arrow-based method would allow processing the DataFrame with any Arrow-based API, e.g. Polars:

      import polars
      import pyarrow
      from typing import Iterator
      
      def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
          return df
      
      def apply_arrow(it: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
          for batch in it:
              df = polars.from_arrow(pyarrow.Table.from_batches([batch]))
              for b in apply_polars(df).to_arrow().to_batches():
                  yield b
      
      df.groupBy("id").applyInArrow(apply_arrow, schema="...")
      

      https://stackoverflow.com/questions/71606278/is-there-an-apache-arrow-equivalent-of-the-spark-pandas-udf
      https://stackoverflow.com/questions/73203318/how-to-transform-spark-dataframe-to-polars-dataframe

      People

            Assignee: Enrico Minack (enricomi)
            Reporter: Enrico Minack (enricomi)