Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-744

LazyIO of non-filter columns

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.7.0
    • 1.7.0
    • Reader

    Description

      Background

      This feature request started as a result of a large search that is performed with the following characteristics:

      • The search fields are not part of partition, bucket or sort fields.
      • The table is a very large table.
      • The predicates result in very few rows compared to the scan size.
      • The search columns are a significant subset of selection columns in the query.

      Initial analysis showed that we could have a significant benefit by lazily reading the non-search columns only when we have a match. We explore the design and some benchmarks in subsequent sections.

      Design

      This builds further on ORC-577 which currently only restricts deserialization for some selected data types but does not improve on IO.

      On a high level the design includes the following components:

      • SArg to Filter: Converts Search Arguments passed down into filters for efficient application during scans.
      • Read: Performs the lazy read using the filters.
        • Read Filter Columns: Read the filter columns from the file.
        • Apply Filter: Apply the filter on the read filter columns.
        • Read Select Columns: If filter selects at least a row then read the remaining columns.

       

      This issue has the following tasks that provides further details on the design of the respective components:

      1. ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
      2. ORC-742: LazyIO of non-filter columns
      3. ORC-743: Conversion of SArg to Filter

       

      Tests

      We evaluated this approach against a search job with the following stats:

      • Table
        • Size: ~420 TB
        • Data fields: ~120
        • Partition fields: 3
      • Scan
        • Search fields: 3 data fields with large (~ 1000 value) IN clauses compounded by OR.
        • Select fields: 16 data fields (includes the 3 search fields), 1 partition field
        • Search:
          • Size: ~180 TB
          • Records: 3.99 T
        • Selected:
          • Size: ~100 MB
          • Records: 1 M

      We have observed the following reductions compared with the absence of the patch:

      Test IO Reduction % CPU Reduction %
      Select 16 columns 45 47
      SELECT * 70 87
      • The savings are more significant as you increase the number of select columns with respect to the search columns
      • When the filter selects most data, no significant penalty observed as a result of 2 IO compared with a single IO
        • We do have a penalty as a result of the filter application on the selected records.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            planka Pavan Lanka
            planka Pavan Lanka
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment