I'm really glad to see this is becoming a priority in 2017.
In the same way that Google internally standardized on "RecordIO" and "ColumnIO" as its record-oriented and column-oriented serialization formats (with protocol buffers for anything that doesn't fit those models), it may make sense for a binary protocol to support both orientations, since different kinds of native-code plugins will want different layouts. Spark SQL is internally record-oriented (though perhaps in-memory columnar someday?), while native-code plugins may be column-oriented.
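To make the distinction concrete, here is a minimal sketch in plain Python of the same small table in the two layouts (the data is invented purely for illustration):

    # Record-oriented: one object per row, with fields interleaved;
    # natural for streaming whole records through an operator.
    records = [
        {"id": 1, "name": "a", "score": 0.5},
        {"id": 2, "name": "b", "score": 0.7},
    ]

    # Column-oriented: one contiguous array per field; natural for
    # vectorized analytics and for formats like ColumnIO or Arrow.
    columns = {
        "id": [1, 2],
        "name": ["a", "b"],
        "score": [0.5, 0.7],
    }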
I've been helping lead efforts in Apache Arrow to define a stable, lightweight, zero-copy column-oriented binary format for Python, R, and Java applications; some of the early results on integration with pandas and Parquet are encouraging.
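As a rough illustration of what that integration looks like from Python, here is a sketch against today's pyarrow API (file name and data are invented):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})

    # pandas -> Arrow: each column becomes an Arrow array; for many
    # numeric types this conversion involves little or no copying.
    table = pa.Table.from_pandas(df)

    # Arrow <-> Parquet round trip through the columnar format.
    pq.write_table(table, "scores.parquet")
    roundtrip = pq.read_table("scores.parquet").to_pandas()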
The initial Spark-Arrow work in SPARK-13534 is also promising, but having these kinds of fast IPC tools more deeply integrated into the Spark SQL execution engine (especially being able to collect results from task executors as serialized column batches) would unlock significantly higher performance.
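To sketch what "serialized column batches" could look like on the wire, here is the Arrow IPC stream format driven from Python (names and data are invented; this illustrates the mechanism only, not the actual Spark integration):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array([0.5, 0.7, 0.9])],
        names=["id", "score"],
    )

    # Producer side (an executor, say): serialize column batches into
    # a single contiguous buffer using the Arrow stream format.
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()

    # Consumer side (the driver, say): read the batches back; the
    # column data can be referenced in place rather than deserialized.
    reader = pa.ipc.open_stream(buf)
    for received in reader:
        print(received.num_rows, received.schema.names)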
I'll be interested to learn more about the broader requirements for external serialization formats and the range of use cases they need to serve.