Hive / HIVE-5834

Avoid reading ORC footers for files which will not be split in OrcInputFormat::getSplits()

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: Tez
    • Labels:
    • Release Note:
      Avoid reading ORC footers for files where data and footer are in the same HDFS block

      Description

      OrcInputFormat::getSplits() fires off a SplitGenerator task for every file in the task.

      For any file that occupies only one HDFS block, the footer and the data are on the same node. Such a file will also never need a further split as long as its total size is < context.maxSize.

      Reading that footer locally, on the node that holds the data, is faster than reading it during split generation and sending it from the AM.
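The condition described above can be sketched as a small predicate. This is only an illustration of the reasoning, not the code committed for this issue: the class name `FooterReadHeuristic`, the method `canSkipFooterRead`, and its parameters are hypothetical, and `maxSplitSize` merely stands in for `context.maxSize`. The block size and maximum split size in `main` are assumed values.

```java
// Hypothetical sketch: names and values are illustrative, not taken from the Hive patch.
final class FooterReadHeuristic {
    private FooterReadHeuristic() {}

    /**
     * Returns true when split generation could skip reading the ORC footer:
     * the file fits in a single HDFS block (so footer and data are co-located)
     * and it is smaller than the maximum split size (so it is never split further).
     */
    static boolean canSkipFooterRead(long fileLength, long blockSize, long maxSplitSize) {
        if (fileLength <= 0 || blockSize <= 0) {
            return false; // unknown or empty sizes: be conservative and read the footer
        }
        boolean singleBlock = fileLength <= blockSize;
        boolean neverSplit = fileLength < maxSplitSize; // stand-in for context.maxSize
        return singleBlock && neverSplit;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed HDFS block size
        long maxSize = 256L * 1024 * 1024;   // assumed maximum split size

        // A 70 MB file fits in one block and stays under the maximum split size.
        System.out.println(canSkipFooterRead(70L * 1024 * 1024, blockSize, maxSize));
        // A 300 MB file spans several blocks, so its footer is still read up front.
        System.out.println(canSkipFooterRead(300L * 1024 * 1024, blockSize, maxSize));
    }
}
```

Both checks are needed: a file can fit in one block and still exceed a small maximum split size, in which case it would need to be split and the footer is required during split generation.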

        Activity

        Gopal V added a comment -

        Tested with a count(1) with a filter

        For a table of 1500 x 70 MB ORC files:

        Before = 26 seconds
        After = 18 seconds

        For a table of 23699 x ~2 MB ORC files:

        Before = 32.9 seconds
        After = 23.0 seconds

        Gunther Hagleitner added a comment -

        Nice find. LGTM.

        Gunther Hagleitner added a comment -

        Committed to branch. Thanks Gopal!


          People

          • Assignee: Gopal V
          • Reporter: Gopal V
          • Votes: 0
          • Watchers: 2

            Dates

            • Created:
              Updated:
              Resolved:
