Hive / HIVE-4488

BucketizedHiveInputFormat is pessimistic with SMB split generation


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels: None
    • Environment: Ubuntu LXC

    Description

      BucketizedHiveInputFormat generates fewer splits than it could for a sort-merge-bucket (SMB) join in which both tables are partitioned.

      When debugging query82 from the TPC-DS spec, there were 7 partitions on the lhs (store_sales) and 8 partitions on the rhs (inventory), with 1 bucket each.

      Only 7 splits are generated for the map stage, instead of a potential 56 (7 x 8), one per pair of lhs and rhs partitions.

      13/05/01 07:08:22 INFO mapred.FileInputFormat: Total input paths to process : 1
      13/05/01 07:08:22 INFO io.BucketizedHiveInputFormat: 7 bucketized splits generated from 344 original splits.
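
      The 7-versus-56 arithmetic can be illustrated with a minimal, self-contained sketch. It does not use any Hive or Hadoop classes: the Task class, the method names, and the placeholder partition names are all hypothetical, and splitting per partition pair is assumed to be valid only for an inner join.

```java
import java.util.ArrayList;
import java.util.List;

public class SmbSplitCountSketch {

    /** Hypothetical stand-in for one map task: a big-table partition plus its merge-queue inputs. */
    static final class Task {
        final String bigTablePartition;
        final List<String> mergeQueuePartitions;

        Task(String bigTablePartition, List<String> mergeQueuePartitions) {
            this.bigTablePartition = bigTablePartition;
            this.mergeQueuePartitions = mergeQueuePartitions;
        }
    }

    /** Behaviour described in this issue: one task per lhs partition, all rhs partitions in its merge queue. */
    static List<Task> onePerBigTablePartition(List<String> lhs, List<String> rhs) {
        List<Task> tasks = new ArrayList<>();
        for (String l : lhs) {
            tasks.add(new Task(l, rhs));
        }
        return tasks;
    }

    /** Finer-grained alternative (assumption): one task per (lhs, rhs) partition pair. */
    static List<Task> onePerPartitionPair(List<String> lhs, List<String> rhs) {
        List<Task> tasks = new ArrayList<>();
        for (String l : lhs) {
            for (String r : rhs) {
                tasks.add(new Task(l, List.of(r)));
            }
        }
        return tasks;
    }

    /** Builds placeholder partition names such as "store_sales#0"; these are not real partition values. */
    static List<String> partitions(String table, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(table + "#" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> lhs = partitions("store_sales", 7);
        List<String> rhs = partitions("inventory", 8);
        System.out.println(onePerBigTablePartition(lhs, rhs).size()); // 7
        System.out.println(onePerPartitionPair(lhs, rhs).size());     // 56
    }
}
```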
      

      The body of the loop that generates the splits is as follows:

              InputSplit[] iss = inputFormat.getSplits(newjob, 0);
              if (iss != null && iss.length > 0) {
                numOrigSplits += iss.length;
                result.add(new BucketizedHiveInputSplit(iss, inputFormatClass
                    .getName()));
              }
      

      As the code above shows, the more granular (per-file/per-partition) splits returned by getSplits() are all added to a single bucketized split.

      Logically, in our mapper we get

      store_sales(2003)/000000_1
      join MergeQueue(
        inv(1998-01-01)/000000_0
        inv(1998-01-08)/000000_0
        inv(1998-01-15)/000000_0
        inv(1998-01-22)/000000_0
        inv(1998-01-29)/000000_0
        inv(1998-02-05)/000000_0
        inv(1998-02-12)/000000_0
        inv(1998-02-19)/000000_0
        inv(1998-02-26)/000000_0
        )
      

      Ideally, we could have used a CombineFileInputFormat to get node locality for the merge queue inputs (viz. BucketizedHiveInputSplit).

      This would be far better at generating splits and at getting more out of short-circuit reads.
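
      As a rough illustration of the locality idea, the sketch below picks the host that holds a replica of the most inputs for one task. It is a simplified stand-in, not how CombineFileInputFormat is implemented; the class, the method names, and the host names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LocalitySketch {

    /** Counts, per host, how many of the given inputs have a replica on that host. */
    static Map<String, Integer> localInputsPerHost(List<Set<String>> replicaHostsPerInput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> hosts : replicaHostsPerInput) {
            for (String host : hosts) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the host with the most local inputs; ties go to the alphabetically first host. */
    static String preferredHost(List<Set<String>> replicaHostsPerInput) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : localInputsPerHost(replicaHostsPerInput).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical replica placement: one big-table file and two merge-queue files.
        List<Set<String>> inputs = List.of(
            Set.of("node1", "node2"),   // big-table bucket file
            Set.of("node2", "node3"),   // merge-queue input A
            Set.of("node2", "node4"));  // merge-queue input B
        System.out.println(preferredHost(inputs)); // node2
    }
}
```

      Scheduling the task on the preferred host would let all three inputs in this example be read locally, which is the case where short-circuit reads help most.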

          People

            Assignee: Unassigned
            Reporter: Gopal Vijayaraghavan (gopalv)