As with Parquet, this issue aims to make Apache Spark 2.3 depend on the latest Apache ORC 1.4. This brings two immediate benefits:
- Stability: Apache ORC 1.4.0 includes many fixes, and we can rely more on the ORC community.
- Maintainability: This reduces the Hive dependency and lets us remove old legacy code later.
Later, we can get the following two key benefits by adding a new ORCFileFormat:
- Usability: Users can use ORC data sources without the hive module, i.e., without -Phive.
- Speed: Using Spark ColumnarBatch and ORC RowBatch together is faster than the current implementation in Spark.
SPARK-20901 Feature parity for ORC with Parquet
SPARK-20682 Add new ORCFileFormat based on Apache ORC
SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core