Hive / HIVE-7052

Optimize split calculation time

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.14.0
    • None
    • hive + tez

    • Reviewed

    Description

      When running a TPC-DS query (query_27) on a 200 GB dataset (ORC format), a significant amount of time was spent in split computation.

      Profiling revealed that:
      1. A lot of time was spent in Configuration's substituteVars (regex-based variable substitution), called from the HiveInputFormat.getSplits() method.
      2. A FileSystem was created repeatedly in OrcInputFormat.generateSplitsInfo().
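
The two findings above share a pattern: work that does not depend on the individual path is repeated for every path. The following is a minimal, JDK-only sketch of the usual remedy, not the code from the attached patches. All names here (SplitSetupSketch, FsHandle, substitute, planSplits) are hypothetical stand-ins: substitute imitates regex-based variable expansion, and FsHandle imitates an expensive file system object. The sketch resolves the substituted value once before the loop and reuses one handle per scheme and authority.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Hive or Hadoop.
final class SplitSetupSketch {
    // Compiled once and shared, so no per-call pattern compilation.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Counters that make the amount of repeated work visible.
    static final AtomicInteger substitutions = new AtomicInteger();
    static final AtomicInteger handlesCreated = new AtomicInteger();

    /** Stand-in for regex-based variable expansion of a config value. */
    static String substitute(String raw, Map<String, String> vars) {
        substitutions.incrementAndGet();
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown variables are left as written.
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Stand-in for an expensive file system handle. */
    static final class FsHandle {
        final String key;
        FsHandle(String key) {
            this.key = key;
            handlesCreated.incrementAndGet();
        }
    }

    /** Counts the paths it could plan, doing path-independent work only once. */
    static int planSplits(List<URI> paths, String rawDir, Map<String, String> vars) {
        // Hoisted: the substituted value is the same for every path.
        String dir = substitute(rawDir, vars);
        // One handle per scheme and authority instead of one per path.
        Map<String, FsHandle> handles = new HashMap<>();
        int planned = 0;
        for (URI path : paths) {
            String key = path.getScheme() + "://" + path.getAuthority();
            FsHandle fs = handles.computeIfAbsent(key, FsHandle::new);
            if (fs != null && !dir.isEmpty()) {
                planned++;
            }
        }
        return planned;
    }

    public static void main(String[] args) {
        List<URI> paths = Arrays.asList(
            URI.create("hdfs://nn1:8020/warehouse/a.orc"),
            URI.create("hdfs://nn1:8020/warehouse/b.orc"),
            URI.create("hdfs://nn2:8020/warehouse/c.orc"));
        Map<String, String> vars = Collections.singletonMap("base", "/warehouse");
        int planned = planSplits(paths, "${base}/store_sales", vars);
        System.out.println("planned=" + planned
            + " substitutions=" + substitutions.get()
            + " handles=" + handlesCreated.get());
    }
}
```

In this sketch the number of substitutions and handle creations depends on the number of distinct config values and file systems, not on the number of input paths, which is the property the profiling results suggest was missing.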

      I will attach the profiler snapshots soon.

      Attachments

        1. HIVE-7052-profiler-1.png
          221 kB
          Rajesh Balamohan
        2. HIVE-7052-profiler-2.png
          142 kB
          Rajesh Balamohan
        3. HIVE-7052-v3.patch
          5 kB
          Rajesh Balamohan
        4. HIVE-7052-v7.patch
          5 kB
          Rajesh Balamohan
        5. HIVE-7052.7.patch
          5 kB
          Prasanth Jayachandran

            People

              Assignee: Rajesh Balamohan (rajesh.balamohan)
              Reporter: Rajesh Balamohan (rajesh.balamohan)
              Votes: 0
              Watchers: 4
