ORC / ORC-577

Allow row-level filtering

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.7.0
    • Component/s: None
    • Labels:

      Description

      Currently, ORC filters at three levels:

      • File level
      • Stripe (64 to 256 MB) level
      • Row group (10k row) level

      The filters are specified as Sargs (Search Arguments), which have a relatively small vocabulary. Furthermore, they can only skip a set of rows when they can guarantee that no row in the set passes the filter.
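
      As a rough illustration of that guarantee, the sketch below models a row group by its min/max statistics and checks whether an equality predicate allows the group to be skipped. The class and method names (RowGroupSkipSketch, Stats, canSkipEquals) are hypothetical and are not part of the ORC API; real Sarg evaluation covers more predicate types and more kinds of statistics.

```java
/** Illustrative only: hypothetical names, not ORC classes. */
final class RowGroupSkipSketch {

    /** Assumed min/max statistics for one column in one row group. */
    static final class Stats {
        final long min;
        final long max;

        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * For the predicate "col = literal", the whole group can be skipped only
     * when the literal lies outside [min, max], so no row can possibly match.
     */
    static boolean canSkipEquals(Stats stats, long literal) {
        return literal < stats.min || literal > stats.max;
    }

    public static void main(String[] args) {
        Stats group = new Stats(1, 100);
        // 500 is outside the range, so the group is provably empty for this predicate.
        System.out.println(canSkipEquals(group, 500));
        // 50 is inside the range, so the group must be read even if no row equals 50.
        System.out.println(canSkipEquals(group, 50));
    }
}
```

      The second call shows the gap this issue targets: the statistics cannot rule the group out, so every row in it is read even when few or none match.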

      There are some use cases where the user needs to read a subset of the columns and apply more detailed row-level filters. I'd suggest that we add a new method in Reader.Options:

      setRowFilter(String[] filterColumnNames, Consumer<VectorizedRowBatch> filterCallback)

      Here the columns named in filterColumnNames are read and expanded first, then the filter is run, and the rest of the data is read only if the predicate returns true.
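
      The read order described above can be sketched in a self-contained way. This is a minimal model under stated assumptions, not the ORC implementation: Batch is a hypothetical stand-in for VectorizedRowBatch (it borrows the idea of a selected-row index array), and readWithFilter and greaterThan are illustrative helpers. Since a Consumer returns nothing, the sketch assumes the callback reports surviving rows by updating the batch in place.

```java
import java.util.Arrays;
import java.util.function.Consumer;

/** Illustrative only: hypothetical names, not ORC classes. */
final class RowFilterSketch {

    /** Stand-in for a row batch holding one already-decoded filter column. */
    static final class Batch {
        final long[] filterColumn; // values of the column named in filterColumnNames
        final int[] selected;      // indices of surviving rows once a filter has run
        boolean selectedInUse;     // true when 'selected' is meaningful
        int size;                  // number of rows currently visible in the batch

        Batch(long[] filterColumn) {
            this.filterColumn = filterColumn;
            this.selected = new int[filterColumn.length];
            this.size = filterColumn.length;
        }
    }

    /** Example callback: keep rows whose filter-column value exceeds the threshold. */
    static Consumer<Batch> greaterThan(long threshold) {
        return batch -> {
            int kept = 0;
            for (int row = 0; row < batch.size; row++) {
                if (batch.filterColumn[row] > threshold) {
                    batch.selected[kept++] = row;
                }
            }
            batch.selectedInUse = true;
            batch.size = kept;
        };
    }

    /**
     * Models the proposed order: the filter column is available first, the
     * callback runs, and the other column is fetched only for surviving rows.
     */
    static String[] readWithFilter(long[] filterColumn, String[] otherColumn,
                                   Consumer<Batch> filter) {
        Batch batch = new Batch(filterColumn);
        filter.accept(batch);
        String[] result = new String[batch.size];
        for (int i = 0; i < batch.size; i++) {
            int row = batch.selectedInUse ? batch.selected[i] : i;
            result[i] = otherColumn[row]; // stands in for the deferred column read
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ids = {5, 42, 7, 100};
        String[] names = {"a", "b", "c", "d"};
        System.out.println(Arrays.toString(readWithFilter(ids, names, greaterThan(10))));
    }
}
```

      In this model the saving comes from never touching otherColumn for rejected rows; in a real reader the equivalent saving would be skipped decoding of the non-filter columns.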

        Attachments

        1. RowFilterBenchTimestamp.out
          0.9 kB
          Panagiotis Garefalakis
        2. RowFilterBenchString.out
          0.8 kB
          Panagiotis Garefalakis
        3. RowFilterBenchDouble.out
          0.9 kB
          Panagiotis Garefalakis
        4. RowFilterBenchDecimal.out
          2 kB
          Panagiotis Garefalakis
        5. RowFilterBenchBoolean.out
          0.9 kB
          Panagiotis Garefalakis

              People

              • Assignee: Panagiotis Garefalakis (pgaref)
              • Reporter: Owen O'Malley (omalley)
              • Votes: 0
              • Watchers: 13

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 20m