Description
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:
1 a single file as the output of each task, which reduces the NameNode's load
2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
3 light-weight indexes stored within the file
  - skip row groups that don't pass predicate filtering
  - seek to a given row
4 block-mode compression based on data type
  - run-length encoding for integer columns
  - dictionary encoding for string columns
5 concurrent reads of the same file using separate RecordReaders
6 ability to split files without scanning for markers
7 a bound on the amount of memory needed for reading or writing
8 metadata stored using Protocol Buffers, which allows addition and removal of fields
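To make items 3 and 4 concrete, here is a minimal Python sketch of the ideas behind them: run-length encoding of an integer column, dictionary encoding of a string column, and a per-row-group min/max index used to skip groups that cannot match a predicate. It illustrates the concepts only; it is not ORC's actual on-disk layout or encoding scheme, and every function name and sample value below is hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal integers into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dict_encode(strings):
    """Replace each string with its position in a dictionary of distinct values."""
    dictionary = sorted(set(strings))
    position = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [position[s] for s in strings]


def build_index(values, rows_per_group):
    """Record (start_row, min, max) for each group of rows_per_group rows."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append((start, min(group), max(group)))
    return index


def groups_to_read(index, lower, upper):
    """Return the start rows of groups whose [min, max] range overlaps [lower, upper]."""
    return [start for start, lo, hi in index if hi >= lower and lo <= upper]


# Hypothetical sample columns, chosen only for illustration.
ages = [21, 21, 21, 30, 30, 45, 45, 45, 45, 60]
cities = ["Oslo", "Rome", "Oslo", "Oslo", "Rome"]

encoded_ages = rle_encode(ages)
dictionary, codes = dict_encode(cities)
index = build_index(ages, rows_per_group=5)

# Predicate: 40 <= age <= 50. Only groups whose range overlaps it are read.
selected = groups_to_read(index, lower=40, upper=50)
```

With this sample data, the first row group has a maximum age of 30, so a reader evaluating `40 <= age <= 50` can skip it without decoding any of its rows. That is the kind of saving predicate pushdown against the file's indexes is meant to provide.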
Spark SQL currently supports Parquet; adding ORC support would give users more options.
Issue Links
- duplicates SPARK-2883 Spark Support for ORCFile format (Resolved)