SPARK-16449: unionAll raises "Task not serializable"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: AWS EMR, Jupyter notebook

    Description

      Goal: take the output of `describe` on a large DataFrame, use a loop to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, build those results into a two-row DataFrame, and then merge the two DataFrames with `unionAll`.

      Issue: although the two DataFrames have the same column names, in the same order and with the same dtypes, the `unionAll` fails with "Task not serializable". However, if I build two test rows from dummy data, `unionAll` works fine. Likewise, if I collect my results and then turn them straight back into DataFrames, `unionAll` succeeds.

      Step-by-step code and output with comments can be seen here: https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
      The issue appears to be in the way the loop in code block 6 builds the rows before parallelizing them, but those rows look no different from the test rows that do work. I have reproduced this on multiple datasets, so downloading the notebook and pointing it at any data of your own should replicate it.
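
      For context, here is a minimal sketch of the workflow described above, written against the Spark 1.6 DataFrame API. The sample data, column names, and the `sqlContext` handle are placeholders rather than code from the attached notebook, and the statistics are collected before the second DataFrame is built, i.e. the variant reported above as succeeding.

```python
# Minimal sketch (not the notebook code): append skewness/kurtosis rows
# to the output of describe() and merge them with unionAll.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="unionAll-sketch")
sqlContext = SQLContext(sc)

# Placeholder data standing in for the large DataFrame.
df = sqlContext.createDataFrame(
    [(1.0, 10.0), (2.0, 20.0), (3.0, 35.0), (4.0, 70.0)], ["a", "b"])

desc = df.describe()  # columns: summary, a, b (all string-typed)

# One row per statistic, laid out like describe()'s rows.
stat_rows = []
for name, func in [("skewness", F.skewness), ("kurtosis", F.kurtosis)]:
    vals = df.agg(*[func(c) for c in df.columns]).collect()[0]
    stat_rows.append([name] + [str(v) for v in vals])

stats = sqlContext.createDataFrame(stat_rows, desc.columns)

# Schemas match by position and type, so this unionAll succeeds here;
# the reported failure arises when the rows are built and parallelized
# inside the loop rather than collected first, as in the notebook.
combined = desc.unionAll(stats)
combined.show()
```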

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: jlevy (Jeff Levy)
    Votes: 0
    Watchers: 2
