SPARK-19489: Stable serialization format for external & native code integration

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core, SQL
    • Labels: None
    • Target Version/s:

      Description

      As a Spark user, I want access to a (semi-)stable, high-performance serialization format so I can integrate Spark with my application written in native code (C, C++, Rust, etc.).


          Activity

          wesmckinn Wes McKinney added a comment -

          I'm really glad to see this is becoming a priority in 2017.

          In the same way that Google internally standardized on "RecordIO" and "ColumnIO" record-oriented and column-oriented serialization formats (and protocol buffers for anything not fitting those models), it may make sense to support both orientation-styles in a binary protocol to support different kinds of native code plugins. Spark SQL is internally record-oriented (but maybe in-memory columnar someday?), but native code plugins may be column-oriented.

          I've been helping lead efforts in Apache Arrow to have a stable lightweight/zero-copy column-oriented binary format for Python, R, and Java applications – some of the results on integration with pandas and Parquet are encouraging:

          http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
          http://wesmckinney.com/blog/arrow-streaming-columnar/
          http://wesmckinney.com/blog/python-parquet-update/

          The initial Spark-Arrow work in SPARK-13534 is also promising, but having these kinds of fast IPC tools more deeply integrated into the Spark SQL execution engine (especially being able to collect results from task executors as serialized column batches) would unlock significantly higher performance.

          I'll be interested to learn more about the broader requirements of external serialization formats and the different types of use cases.

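          Below is a minimal sketch of the Arrow streaming IPC exchange described in the comment above, using pyarrow; the column names and values are illustrative only and are not taken from any Spark API.

            import pyarrow as pa

            # A column-oriented record batch: a schema plus equal-length columns.
            batch = pa.RecordBatch.from_arrays(
                [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
                names=["id", "label"],
            )

            # Serialize the batch with the Arrow streaming IPC format into an
            # in-memory buffer; sent over a socket or pipe, the same bytes could
            # be consumed by a C++ or Rust process via Arrow's library for that
            # language.
            sink = pa.BufferOutputStream()
            writer = pa.ipc.new_stream(sink, batch.schema)
            writer.write_batch(batch)
            writer.close()
            buf = sink.getvalue()

            # Read the stream back; the reader references the columnar buffers
            # in place instead of deserializing row by row.
            reader = pa.ipc.open_stream(buf)
            for received in reader:
                print(received.num_rows, received.schema)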
          maropu Takeshi Yamamuro added a comment -

          Aha, I see.

          rxin Reynold Xin added a comment -

          The ticket doesn't dictate either.

          maropu Takeshi Yamamuro added a comment -

          Does this ticket mean we need to create a new format for that purpose, or do we need to integrate Spark with existing efficient formats? If you mean existing ones, do you have any candidates for now?


            People

            • Assignee: Unassigned
            • Reporter: rxin Reynold Xin
            • Votes: 6
            • Watchers: 40
