Spark / SPARK-12774

DataFrame.mapPartitions apply function operates on a Pandas DataFrame instead of a generator of rows


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions in both Spark and PySpark. The function f that is applied to each partition must operate on an iterator of rows, which is very inefficient in Python. It would be more logical and efficient if the apply function f operated on a Pandas DataFrame instead and also returned a DataFrame. This avoids unnecessary row-by-row iteration in Python, which is slow.

      Currently:

      import pandas as pd

      def apply_function(rows):
          # Materialise the partition's row iterator as a Pandas DataFrame
          df = pd.DataFrame(list(rows))
          df = df % 100   # Do something on df
          # Convert back to a list of rows for Spark
          return df.values.tolist()

      table = sqlContext.read.parquet("")
      table = table.mapPartitions(apply_function)
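
      The speed difference referred to above can be illustrated outside Spark. The sketch below is purely illustrative and not part of the proposal: it compares a per-row Python loop with the equivalent vectorised Pandas operation on the same data; exact timings depend on the machine and data.

      import time
      import pandas as pd

      data = [(i, i * 2) for i in range(1000000)]
      df = pd.DataFrame(data, columns=["a", "b"])

      start = time.time()
      looped = [(a % 100, b % 100) for a, b in data]   # per-row iteration in Python
      print("per-row loop: %.3fs" % (time.time() - start))

      start = time.time()
      vectorised = df % 100                            # single vectorised operation
      print("vectorised:   %.3fs" % (time.time() - start))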
      

      The new apply function would accept a Pandas DataFrame and return a Pandas DataFrame:

      def apply_function(df):
          df = df % 100   # Do something on df
          return df
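
      Until such an API exists, the proposed behaviour can be approximated on top of the current row-iterator interface. The helper below is a hypothetical sketch, not a Spark API: the name pandas_map_partitions and the column handling are illustrative, and it assumes each partition fits in memory as a Pandas DataFrame.

      import pandas as pd

      def pandas_map_partitions(df, func):
          # Hypothetical helper: adapt a Pandas DataFrame -> DataFrame function
          # to the row-iterator mapPartitions interface available today.
          columns = df.columns  # column names of the PySpark DataFrame

          def wrapper(rows):
              pdf = pd.DataFrame(list(rows), columns=columns)  # one partition as Pandas
              result = func(pdf)                               # user code stays vectorised
              for row in result.values.tolist():               # back to plain rows for Spark
                  yield row

          return df.mapPartitions(wrapper)

      # Example: the Pandas-style apply_function above could then be used as
      # result = pandas_map_partitions(table, apply_function)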
      

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: joshlk (Josh)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: