SPARK-3720: Support ORC in Spark SQL


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Target Version/s:

      Description

      The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:

      1. a single file as the output of each task, which reduces the NameNode's load
      2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
      3. light-weight indexes stored within the file
         • skip row groups that don't pass predicate filtering
         • seek to a given row
      4. block-mode compression based on data type
         • run-length encoding for integer columns
         • dictionary encoding for string columns
      5. concurrent reads of the same file using separate RecordReaders
      6. ability to split files without scanning for markers
      7. bound the amount of memory needed for reading or writing
      8. metadata stored using Protocol Buffers, which allows addition and removal of fields

      Spark SQL currently supports Parquet; supporting ORC as well would give users more options.
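      To illustrate the proposal, a hypothetical ORC API could mirror the Parquet methods Spark SQL 1.1 already exposes (`parquetFile` / `saveAsParquetFile`). The sketch below is illustrative only: `orcFile` and `saveAsOrcFile` are assumed names for this example, not methods taken from the attached patch.

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.hive.HiveContext

      // Hypothetical sketch: an ORC API shaped like the existing Parquet one.
      // `saveAsOrcFile` and `orcFile` are illustrative names, not real methods here.
      val sc = new SparkContext("local", "orc-sketch")
      val hiveContext = new HiveContext(sc)
      import hiveContext._

      // Write a query result out as ORC, analogous to saveAsParquetFile:
      val people = sql("SELECT name, age FROM people")
      people.saveAsOrcFile("hdfs:///tmp/people.orc") // hypothetical

      // Read it back, analogous to parquetFile, and query it:
      val loaded = hiveContext.orcFile("hdfs:///tmp/people.orc") // hypothetical
      loaded.registerTempTable("people_orc")
      sql("SELECT name FROM people_orc WHERE age > 21").collect()
      ```

      Mirroring the Parquet entry points would let users switch columnar formats without restructuring their code.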

        Attachments

        1. orc.diff
          41 kB
          Zhan Zhang


              People

              • Assignee: Unassigned
              • Reporter: scwf (Fei Wang)
              • Votes: 1
              • Watchers: 10

                Dates

                • Created:
                  Updated:
                  Resolved: