Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21187 Complete support for remaining Spark data types in Arrow Converters
  3. SPARK-13534

Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersConvert to IssueLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.0
    • 2.3.0
    • PySpark
    • None

    Description

      The current code path for accessing Spark DataFrame data in Python using PySpark passes through an inefficient serialization-deserialiation process that I've examined at a high level here: https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects are being deserialized in pure Python as a list of tuples, which are then converted to pandas.DataFrame using its from_records alternate constructor. This also uses a large amount of memory.

      For flat (no nested types) schemas, the Apache Arrow memory layout (https://github.com/apache/arrow/tree/master/format) can be deserialized to pandas.DataFrame objects with comparatively small overhead compared with memcpy / system memory bandwidth – Arrow's bitmasks must be examined, replacing the corresponding null values with pandas's sentinel values (None or NaN as appropriate).

      I will be contributing patches to Arrow in the coming weeks for converting between Arrow and pandas in the general case, so if Spark can send Arrow memory to PySpark, we will hopefully be able to increase the Python data access throughput by an order of magnitude or more. I propose to add an new serializer for Spark DataFrame and a new method that can be invoked from PySpark to request a Arrow memory-layout byte stream, prefixed by a data header indicating array buffer offsets and sizes.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            bryanc Bryan Cutler
            wesm Wes McKinney
            Reynold Xin Reynold Xin
            Votes:
            6 Vote for this issue
            Watchers:
            72 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment