Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-2088

Improve Hive Query split calculation performance over large partitions (potentially 5x speedup)

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      While working on improving start up time for query spanning over large number of partitions I found a particular inefficiency in CombineHiveInputFormat.

      It seems that for each input partition we create a CombineFilter to make sure that files are combined within a particular partition.

      CombineHiveInputFormat:

      Path filterPath = path;
      if (!poolSet.contains(filterPath))

      { LOG.info("CombineHiveInputSplit creating pool for " + path + "; using filter path " + filterPath); combine.createPool(job, new CombineFilter(filterPath)); poolSet.add(filterPath); }

      These filters are passed then passed to CombineFileInputFormat along with all the input paths.
      CombineFileInputFormat computes a list of all the files in the input paths.
      It them loops over each filter and then checks whether a particular file belongs to a particular filter.

      ConbineFileInputFormat:

      for (MultiPathFilter onepool : pools) {
      ArrayList<Path> myPaths = new ArrayList<Path>();

      // pick one input path. If it matches all the filters in a pool,
      // add it to the output set
      for (int i = 0; i < paths.length; i++) {
      ...

      This results in computation of the order O(p*f) where p is the number of partitions and f is the total number of files in all partitions.
      For a case of 10,000 partitions with 10 files in each partition, this results in 1,000,000,000 iterations.

      We can replace this code with processing splits for one input path at a time without passing any filters at all.

      Do you happen to see a case where this approach will not work?

      Attachments

        Activity

          People

            vaggarw Vaibhav Aggarwal
            vaggarw Vaibhav Aggarwal
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: