HIVE-11500 (Hive; sub-task of HIVE-9452 Use HBase to store Hive metadata)

implement file footer / splits cache in HBase metastore


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Metastore
    • Labels: None

    Description

      We need to cache file metadata (e.g. ORC file footers) for split generation in the HBase metastore. On FSes that support fileId, such entries stay valid permanently and only need to be removed lazily when the ORC file is erased or compacted. Potentially we could also cache some information about splits (e.g. grouping based on location, which would stay valid for a short time).
      The cache should be queryable by table. Partition predicate pushdown should be supported, and bucket pruning too if it is added. Given that we cannot cache file lists (we have to check the FS for new or changed files anyway), and that passing data about partitions etc. to split generation is harder than passing paths, we will probably just filter by paths and fileIds. It might be different for splits.
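
      As a rough illustration of the fileId-keyed lookup described above, here is a minimal in-memory sketch. It is not the proposed implementation: the class name FooterCacheSketch and its methods are hypothetical, a HashMap stands in for the HBase table, and footers are treated as opaque bytes. The caller is assumed to have already listed the files on the FS and obtained their fileIds.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a footer cache keyed by fileId (not Hive code). */
class FooterCacheSketch {
    // fileId -> serialized footer bytes. A real version would be backed by
    // an HBase table; a HashMap is used here only to show the access pattern.
    private final Map<Long, ByteBuffer> footers = new HashMap<>();

    /** Stores the footer for a file; the entry stays valid while the fileId exists. */
    void put(long fileId, ByteBuffer footer) {
        footers.put(fileId, footer.asReadOnlyBuffer());
    }

    /**
     * Looks up footers for fileIds obtained from a fresh FS listing.
     * The result has one slot per requested id, in the same order;
     * a null slot is a cache miss and the caller must read the file itself.
     */
    List<ByteBuffer> get(List<Long> fileIds) {
        List<ByteBuffer> result = new ArrayList<>(fileIds.size());
        for (Long id : fileIds) {
            ByteBuffer cached = footers.get(id);
            // duplicate() gives each caller an independent position/limit.
            result.add(cached == null ? null : cached.duplicate());
        }
        return result;
    }

    /** Lazy cleanup: drops entries for files known to be erased or compacted. */
    void evict(Collection<Long> deadFileIds) {
        footers.keySet().removeAll(deadFileIds);
    }

    int size() {
        return footers.size();
    }
}
```

      In this sketch the file list is never cached: split generation lists the files every time, asks the cache only for the fileIds it found, and reads footers from the FS for the misses. Filtering by table, partition predicate, or bucket is left out.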

      In later phases, it would be nice to save the (first category above) results of expensive work done by jobs, e.g. data size after decompression/decoding per column, etc. to avoid surprises when ORC encoding is very good, or very bad. Perhaps it can even be lazily generated. Here's a pony: 🐴

      Attachments

        1. HBase metastore split cache.pdf
          119 kB
          Sergey Shelukhin

        Issue Links

        Activity


          People

            Assignee: Sergey Shelukhin (sershe)
            Reporter: Sergey Shelukhin (sershe)

            Dates

              Created:
              Updated:
