Details

    • Type: Sub-task
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      The current code for creating a Spark DataFrame from a Pandas DataFrame uses `to_records` to convert the DataFrame to a list of records and then converts each record to a list. It then makes a number of calls to serialize this data and transfer it to the JVM. This process is inefficient and discards all schema metadata, so another pass over the data is required to infer types.
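
As a rough illustration of why the schema is lost, the conversion described above can be sketched with Pandas alone. The sample data, column names, and dtypes here are hypothetical, and the snippet is an approximation of the conversion step rather than the actual PySpark code:

```python
import pandas as pd

# Hypothetical sample data; column names and dtypes are illustrative only.
pdf = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int32"),
    "score": pd.Series([0.5, 1.5, 2.5], dtype="float32"),
})

# Approximation of the described path: records first, then plain Python values.
rows = [r.tolist() for r in pdf.to_records(index=False)]

# The Pandas dtypes are still known on the DataFrame...
pandas_dtypes = pdf.dtypes.astype(str).tolist()

# ...but each row now holds only generic Python objects.
value_types = [type(v).__name__ for v in rows[0]]

print(pandas_dtypes)
print(value_types)
```

After this step the rows carry only generic Python values, so the original `int32` and `float32` widths can no longer be recovered from the data itself and would have to be inferred again.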

      Using Apache Arrow, the Pandas DataFrame could be efficiently converted to Arrow data and transferred directly to the JVM to create the Spark DataFrame. This should improve performance, and because the Pandas schema is carried along with the data, the Spark DataFrame will get the correct types.

      Issues with poor type inference have come up before, causing confusion and frustration among users because it is not clear why the conversion fails or why it doesn't use the same types as Pandas. Fixing this with Apache Arrow will solve another pain point for Python users, and the following JIRAs could be closed:

        Issue Links

          Activity

          bryanc Bryan Cutler added a comment -

          I will work on this pending SPARK-13534 being merged.

          apachespark Apache Spark added a comment -

          User 'BryanCutler' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19459


            People

            • Assignee:
              Unassigned
              Reporter:
              bryanc Bryan Cutler
            • Votes:
              0
              Watchers:
              6

              Dates

              • Created:
                Updated: