Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-22731

MJ probe decode with row-level filtering

Log workAgile BoardRank to TopRank to BottomAdd voteVotersWatch issueWatchersCreate sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Hive, llap

      Description

      Currently, RecordReaders such as ORC support filtering at coarser-grained levels, namely: File, Stripe (64 to 256mb), and Row group (10k row) level. They only filter sets of rows if they can guarantee that none of the rows can pass a filter (usually given as searchable argument).

      However, a significant amount of time can be spend decoding rows with multiple columns that are not even used in the final result. See figure where original is what happens today and in LazyDecode we skip decoding rows that do not match the key.

      To enable a more fine-grained filtering in the particular case of a MapJoin we could utilize the key HashTable created from the smaller table to skip deserializing row columns at the larger table that do not match any key and thus save CPU time. 
      This Jira investigates this direction. 

        Attachments

        1. HIVE-22731.WIP.patch
          32 kB
          Panagiotis Garefalakis
        2. decode_time_bars.pdf
          13 kB
          Panagiotis Garefalakis

        Issue Links

          Activity

          $i18n.getText('security.level.explanation', $currentSelection) Viewable by All Users
          Cancel

            People

            • Assignee:
              pgaref Panagiotis Garefalakis Assign to me
              Reporter:
              pgaref Panagiotis Garefalakis

              Dates

              • Created:
                Updated:

              Time Tracking

              Estimated:
              Original Estimate - Not Specified
              Not Specified
              Remaining:
              Remaining Estimate - 0h
              0h
              Logged:
              Time Spent - 11h 50m
              11h 50m

                Issue deployment