Spark / SPARK-29040

Support pyspark.createDataFrame from a pyarrow.Table

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      PySpark's createDataFrame currently supports creating a Spark DataFrame from a pandas DataFrame, using Arrow for the transfer if it is enabled. This could be extended to accept a pyarrow.Table, which has the added benefit of handling columns with nested struct types efficiently.

      It is possible to convert a pyarrow.Table with nested columns into a pandas.DataFrame, but the struct values become Python dictionaries, which is not a performant way to parallelize the data.

      Time and date columns would need special handling, since PySpark currently relies on pandas to convert Arrow data of these types to the required Spark internal format.

      This follows from a mailing list discussion at http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html

          People

            Assignee: Unassigned
            Reporter: Bryan Cutler (bryanc)
            Votes: 5
            Watchers: 8
