Apache Arrow / ARROW-288

Implement Arrow adapter for Spark Datasets

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++, Java - Vectors
    • Labels: None

      Description

      It would be valuable for applications that use Arrow to be able to

      • Convert between Spark DataFrames/Datasets and Java Arrow vectors
      • Send / Receive Arrow record batches / Arrow file format RPCs to / from Spark
      • Allow PySpark to use Arrow for messaging in UDF evaluation

        Issue Links

          Activity

          Frederick Reiss added a comment -

          Wes McKinney, are you planning to do this yourself? Would you like some help with this task?

          Julien Le Dem added a comment -

          Frederick Reiss Hi Frederick, I'll let Wes confirm but I believe that this JIRA is not immediately planned. Help is always welcome. We're happy to assist with reviews and answering questions.

          Wes McKinney added a comment -

          Yes, we'd happily accept help with this. There is lots of work to do both in Arrow (e.g. integration testing / conforming the Java and C++ implementations) and Spark.

          Jacek Laskowski added a comment -

          I've scheduled a Spark/Scala meetup for next week, and this looks like an issue we could help with. We have no experience with Arrow, but we're quite comfortable with Spark SQL's Datasets.

          Could you, Wes McKinney or Julien Le Dem, describe the small concrete steps needed for this task? They could also just be subtasks of the "umbrella" task. Thanks.

          Frederick Reiss added a comment - edited

          Apologies for my delay in replying here; it's been a very hectic week.

          Along the lines of what Jacek Laskowski says above, I think it would be good to break this overall task into smaller, bite-size chunks.

          One top-level question that we'll need to answer before we can break things down properly: Should we use Arrow's Java APIs or Arrow's C++ APIs to perform the conversion?

          If we use the Java APIs to convert the data, then the "collect Dataset to Arrow" path will go roughly like this:

          1. Determine that the Spark Dataset can indeed be expressed in Arrow format.
          2. Obtain low-level access to the internal columnar representation of the Dataset.
          3. Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
          4. Ship the Arrow buffer over the Py4j socket to the Python process as an array of bytes.
          5. Cast the array of bytes to a Python Arrow array.

          All these steps will be contingent on Spark accepting a dependency on Arrow's Java API. This last point might be a bit tricky, given that the API doesn't have any users right now. At the least, we would need to break out some testing/documentation activities to create greater confidence in the robustness of the Java APIs.

          If we use Arrow's C++ API to do the conversion, the flow would go as follows:

          1. Determine that the Spark Dataset can be expressed in Arrow format.
          2. Obtain low-level access to the internal columnar representation of the Dataset.
          3. Ship chunks of column values over the Py4j socket to the Python process as arrays of primitive types.
          4. Insert the column values into an Arrow buffer on the Python side, using C++ APIs.

          Note that the last step here could potentially be implemented against Pandas dataframes instead of Arrow as a short-term expedient.

          A third possibility is to use Parquet as an intermediate format:

          1. Determine that the Spark Dataset can be expressed in Arrow format.
          2. Write the Dataset to a Parquet file in a location that the Python process can access.
          3. Read the Parquet file back into an Arrow buffer in the Python process using C++ APIs.

          This approach would involve a lot less code, but it would of course require creating and deleting temporary files.

          Wes McKinney added a comment -

          hi Frederick Reiss and Jacek Laskowski – we've made significant progress on the C++ side toward full interoperability with the Arrow Java library. We still need to do some integration testing, but it would be great to start exploring the technical plan for making this happen. I was just talking with Reynold Xin about this the other day, so there may be someone on the Spark side who could help with this effort, too.

          The first step is to convert a Spark Dataset into one or more Arrow record batches, including metadata conversion, and then to convert back. The Java <> C++ data movement itself is a comparatively minor task, because it is just a matter of sending a serialized byte buffer through the existing protocol. We can test this out in Python using the Arrow <> pandas bridge, which has already been completed.

          Let me know if anyone will have the bandwidth to work on this and we can coordinate. thanks!

          Frederick Reiss added a comment -

          Bryan Cutler, holdenk, and Xusen Yin are looking into this item on our end. We should have an update around the end of this week.


            People

            • Assignee: Unassigned
            • Reporter: Wes McKinney
            • Votes: 3
            • Watchers: 16

              Dates

              • Created:
                Updated:
