Spark / SPARK-3720

Support ORC in Spark SQL

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:

      1. a single file as the output of each task, which reduces the NameNode's load
      2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
      3. light-weight indexes stored within the file, which make it possible to:
           • skip row groups that don't pass predicate filtering
           • seek to a given row
      4. block-mode compression based on data type:
           • run-length encoding for integer columns
           • dictionary encoding for string columns
      5. concurrent reads of the same file using separate RecordReaders
      6. ability to split files without scanning for markers
      7. a bound on the amount of memory needed for reading or writing
      8. metadata stored using Protocol Buffers, which allows addition and removal of fields
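
      To make items 3 and 4 more concrete, here is a minimal Python sketch of the three ideas they name: run-length encoding, dictionary encoding, and skipping row groups by consulting a min/max index. It is a simplified illustration only and does not reproduce ORC's actual on-disk encodings or index layout. All function names (`rle_encode`, `dict_encode`, `build_index`, `scan_greater_than`) and the `group_size` parameter are hypothetical, chosen for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def dict_encode(strings):
    """Replace each string with an index into a list of distinct values."""
    dictionary = []   # distinct strings, in first-seen order
    index_of = {}     # string -> position in `dictionary`
    codes = []        # one small integer per input string
    for s in strings:
        if s not in index_of:
            index_of[s] = len(dictionary)
            dictionary.append(s)
        codes.append(index_of[s])
    return dictionary, codes


def build_index(rows, group_size):
    """Record (start, min, max) for each group of `group_size` rows."""
    index = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        index.append((start, min(group), max(group)))
    return index


def scan_greater_than(rows, index, group_size, threshold):
    """Return rows > threshold, skipping groups whose max rules them out."""
    out = []
    groups_read = 0
    for start, _lo, hi in index:
        if hi <= threshold:
            continue  # no row in this group can pass the predicate
        groups_read += 1
        out.extend(v for v in rows[start:start + group_size] if v > threshold)
    return out, groups_read
```

      In this sketch, a predicate such as "value > 8" only has to read the row groups whose recorded maximum exceeds 8. That is the general idea behind skipping row groups that fail predicate filtering; a real reader would work from the statistics stored in the file rather than from an in-memory list.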

      Spark SQL currently supports Parquet; adding ORC support would give users more options.

      Attachments

        1. orc.diff (41 kB, Zhan Zhang)

            People

              Assignee: Unassigned
              Reporter: Fei Wang (scwf)
              Votes: 1
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: