HIVE-4486

FetchOperator slows down SMB map joins by 50% when there are many partitions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.11.1, 0.12.0
    • Component/s: Query Processor
    • Labels:
      None
    • Environment:

      Ubuntu LXC 12.10

    • Release Note:
      Avoid creating new HiveConf() within the FetchOperator loop

      Description

      While looking at log files for SMB joins in Hive, I noticed that the join operator itself accounted for only a small fraction of the time spent. Most of the time went to parsing configuration files.

      To confirm, I put log lines in the HiveConf constructor and eventually made the following edit to the code:

      --- ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
      +++ ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
      @@ -648,8 +648,7 @@ public ObjectInspector getOutputObjectInspector() throws HiveException {
          * @return list of file status entries
          */
         private FileStatus[] listStatusUnderPath(FileSystem fs, Path p) throws IOException {
      -    HiveConf hiveConf = new HiveConf(job, FetchOperator.class);
      -    boolean recursive = hiveConf.getBoolVar(HiveConf.ConfVars.HADOOPMAPREDINPUTDIRRECURSIVE);
      +    boolean recursive = false;
           if (!recursive) {
             return fs.listStatus(p);
           }
      

      I then re-ran my query to compare timings.

                       Before          After
      Cumulative CPU   731.07 sec      386.0 sec
      Total time       347.66 seconds  218.855 seconds
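
To put the two rows on a common scale, the relative reduction can be computed directly from the figures in the table. The snippet below is only a small arithmetic sketch; the class and method names are made up for illustration. It works out to roughly a 47% drop in cumulative CPU and a 37% drop in total time.

```java
import java.util.Locale;

public class Speedup {
    // Percentage by which `after` is smaller than `before`.
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the Before/After timings in this report.
        double cpu = percentReduction(731.07, 386.0);
        double wall = percentReduction(347.66, 218.855);
        System.out.println(String.format(Locale.ROOT, "Cumulative CPU reduction: %.1f%%", cpu));
        System.out.println(String.format(Locale.ROOT, "Total time reduction: %.1f%%", wall));
    }
}
```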

      The query used was:

      INSERT OVERWRITE LOCAL DIRECTORY
      '/grid/0/smb/'
      select inv_item_sk
      from
           inventory inv
           join store_sales ss on (ss.ss_item_sk = inv.inv_item_sk)
      limit 100000
      ;
      

      This was run on a scale=2 TPC-DS data set in which both store_sales and inventory are bucketed into 4 buckets, with store_sales split into 7 partitions and inventory into 261 partitions.

      78% of all CPU time was spent within new HiveConf(). The YourKit profiler runs are attached.
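
The experimental edit in the diff hard-codes the flag to measure the cost, which drops the setting altogether. One way to keep the setting while still paying the construction cost only once is to read the flag a single time and reuse it for every path. The sketch below illustrates that pattern with hypothetical stand-ins (ExpensiveConf, Lister, listUncached). None of them are Hive classes, and this is not necessarily how the attached patch does it.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CachedFlagSketch {
    // Hypothetical stand-in for a costly configuration object such as HiveConf.
    static final class ExpensiveConf {
        static final AtomicInteger constructions = new AtomicInteger();
        private final boolean recursive;

        ExpensiveConf(boolean recursive) {
            constructions.incrementAndGet(); // imagine config files being parsed here
            this.recursive = recursive;
        }

        boolean isRecursive() {
            return recursive;
        }
    }

    // Per-call construction, mirroring the shape of the lines removed in the diff.
    static String listUncached(String path, boolean recursiveSetting) {
        boolean recursive = new ExpensiveConf(recursiveSetting).isRecursive();
        return (recursive ? "recursive:" : "flat:") + path;
    }

    // Hypothetical lister that reads the flag once and reuses it for every path.
    static final class Lister {
        private final boolean recursive;

        Lister(boolean recursiveSetting) {
            this.recursive = new ExpensiveConf(recursiveSetting).isRecursive();
        }

        String list(String path) {
            return (recursive ? "recursive:" : "flat:") + path;
        }
    }

    public static void main(String[] args) {
        int partitions = 261; // inventory partition count from the report

        for (int i = 0; i < partitions; i++) {
            listUncached("p" + i, false);
        }
        System.out.println("uncached constructions: " + ExpensiveConf.constructions.getAndSet(0));

        Lister lister = new Lister(false);
        for (int i = 0; i < partitions; i++) {
            lister.list("p" + i);
        }
        System.out.println("cached constructions: " + ExpensiveConf.constructions.get());
    }
}
```

With 261 paths, the uncached variant constructs the configuration object 261 times and the cached variant constructs it once, while both return the same listing result for a given flag value.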

        Attachments

        1. HIVE-4486.patch
          0.8 kB
          Gopal Vijayaraghavan
        2. smb-profile.html
          275 kB
          Gopal Vijayaraghavan

              People

              • Assignee:
                gopalv Gopal Vijayaraghavan
                Reporter:
                gopalv Gopal Vijayaraghavan
              • Votes:
                0
                Watchers:
                4
