ORC-744

LazyIO of non-filter columns



    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.7.0
    • Component/s: Reader



      This feature request started as a result of a large search that is performed with the following characteristics:

      • The search fields are not part of the partition, bucket, or sort fields.
      • The table is a very large table.
      • The predicates result in very few rows compared to the scan size.
      • The search columns are a significant subset of selection columns in the query.

      Initial analysis showed that we could gain a significant benefit by lazily reading the non-search columns only when we have a match. We explore the design and some benchmarks in the subsequent sections.


      This builds further on ORC-577, which currently only restricts deserialization for some selected data types but does not improve IO.

      On a high level the design includes the following components:

      • SArg to Filter: Converts Search Arguments passed down into filters for efficient application during scans.
      • Read: Performs the lazy read using the filters.
        • Read Filter Columns: Read the filter columns from the file.
        • Apply Filter: Apply the filter on the read filter columns.
        • Read Select Columns: If the filter selects at least one row, read the remaining columns.
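The read flow above can be sketched as follows. This is only an illustration of the control flow on a hypothetical batch of decoded values; the real implementation operates on ORC row batches and stripes, and none of these names are the actual reader API:

```java
import java.util.function.IntPredicate;

// Minimal sketch of the two-phase lazy read described above.
// Phase 1: the filter column is read and the filter applied.
// Phase 2: the remaining columns are read only if any row matched.
class LazyReadSketch {
    // Apply the filter to the already-read filter column, recording matches.
    static boolean[] applyFilter(int[] filterColumn, IntPredicate filter) {
        boolean[] selected = new boolean[filterColumn.length];
        for (int i = 0; i < filterColumn.length; i++) {
            selected[i] = filter.test(filterColumn[i]);
        }
        return selected;
    }

    // The IO for the non-filter columns is skipped entirely
    // unless at least one row survived the filter.
    static boolean shouldReadSelectColumns(boolean[] selected) {
        for (boolean s : selected) {
            if (s) return true;
        }
        return false;
    }
}
```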


      This issue has the following tasks that provide further details on the design of the respective components:

      1. ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
      2. ORC-742: LazyIO of non-filter columns
      3. ORC-743: Conversion of SArg to Filter
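As a rough illustration of the SArg-to-Filter idea (ORC-743), a pushed-down IN predicate can be turned into a row-level filter, and disjunctions composed with OR. The class and method names below are hypothetical, not the actual conversion code; a hash-set membership test per row is one plausible way to handle large IN lists:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongPredicate;

// Hypothetical conversion of IN (v1, v2, ...) search arguments into
// row-level filters, with OR-composition mirroring SArg disjunctions.
class SArgToFilterSketch {
    // Build a filter for IN (values...) using a hash set for O(1) lookups.
    static LongPredicate inClause(long... values) {
        Set<Long> allowed = new HashSet<>();
        for (long v : values) {
            allowed.add(v);
        }
        return allowed::contains;
    }

    // Combine two filters as a disjunction (SArg OR).
    static LongPredicate or(LongPredicate a, LongPredicate b) {
        return v -> a.test(v) || b.test(v);
    }
}
```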



      We evaluated this approach against a search job with the following stats:

      • Table
        • Size: ~420 TB
        • Data fields: ~120
        • Partition fields: 3
      • Scan
        • Search fields: 3 data fields with large (~1000-value) IN clauses combined with OR.
        • Select fields: 16 data fields (includes the 3 search fields), 1 partition field
        • Search:
          • Size: ~180 TB
          • Records: 3.99 T
        • Selected:
          • Size: ~100 MB
          • Records: 1 M

      We observed the following reductions relative to runs without the patch:

      Test               IO Reduction %   CPU Reduction %
      Select 16 columns  45               47
      SELECT *           70               87
      • The savings become more significant as the number of select columns grows relative to the number of search columns.
      • When the filter selects most of the data, no significant penalty was observed from performing two IO passes instead of one.
        • There is, however, a penalty from applying the filter to the selected records.






            Assignee: Pavan Lanka (planka)
            Reporter: Pavan Lanka (planka)



