Hive / HIVE-6872

Explore options of optimizing FileSinkOperator-->getDynOutPaths()

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved

    Description

      1. Download hive-testbench from https://github.com/cartershanklin/hive-testbench
      2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned"
      3. Data population for most tables that are partitioned, bucketed, and sorted runs very slowly, even at a scale factor of 10 on a 20-node cluster.

      The bottleneck appears to be in FileSinkOperator-->getDynOutPaths(), where it closes the FSPath writers. Each call takes roughly 150-200 ms.

      set hive.enforce.bucketing=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
      set hive.exec.max.dynamic.partitions.pernode=4096;

      With the above settings, one of the data loads (for the web_sales table) spent roughly 4096 * 150 ms = 614 seconds (about 10 minutes) just closing the writers sequentially.

      The purpose of this JIRA is to explore options for optimizing the FileSinkOperator-->getDynOutPaths() code path. This would be especially beneficial for ETL-type workloads.
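
      One option that could be explored is closing the writers concurrently instead of one at a time. The following is only a minimal sketch of that idea and not what the attached patches do: SlowWriter, closeSequentially, closeInParallel, and estimateSequentialSeconds are hypothetical names invented for illustration, and whether Hive's actual record writers can safely be closed from several threads is an assumption that would need to be verified.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCloseSketch {

    /** Hypothetical stand-in for a writer whose close() is slow (not a Hive class). */
    static final class SlowWriter implements Closeable {
        private final long closeMillis;
        private volatile boolean closed;

        SlowWriter(long closeMillis) { this.closeMillis = closeMillis; }

        @Override
        public void close() throws IOException {
            try {
                Thread.sleep(closeMillis); // simulate the per-writer close latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while closing", e);
            }
            closed = true;
        }

        boolean isClosed() { return closed; }
    }

    /** Back-of-the-envelope cost of closing writers one by one, in seconds. */
    static double estimateSequentialSeconds(int writers, long millisPerClose) {
        return writers * millisPerClose / 1000.0;
    }

    /** Baseline: close writers one after another; total time is about n * latency. */
    static void closeSequentially(List<? extends Closeable> writers) throws IOException {
        for (Closeable w : writers) {
            w.close();
        }
    }

    /** Alternative: submit each close() to a bounded thread pool and wait for all of them. */
    static void closeInParallel(List<? extends Closeable> writers, int threads) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (Closeable w : writers) {
                Callable<Void> task = () -> { w.close(); return null; };
                pending.add(pool.submit(task));
            }
            for (Future<Void> f : pending) {
                try {
                    f.get(); // surfaces any failure raised inside close()
                } catch (ExecutionException e) {
                    throw new IOException("writer close failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("interrupted while waiting for close", e);
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        // The figures from this issue: 4096 writers at about 150 ms each.
        System.out.println("estimated sequential close (s): " + estimateSequentialSeconds(4096, 150));

        List<SlowWriter> writers = new ArrayList<>();
        for (int i = 0; i < 64; i++) {
            writers.add(new SlowWriter(20)); // scaled-down latency for a quick demo
        }
        long start = System.nanoTime();
        closeInParallel(writers, 16);
        System.out.println("parallel close took (ms): " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

      The pool is bounded so that a task with thousands of open writers does not open thousands of concurrent filesystem connections; the right pool size would have to be determined by measurement on a real cluster.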

      Attachments

        1. HIVE-6782-v3.patch
          13 kB
          Rajesh Balamohan
        2. HIVE-6782-v4.patch
          13 kB
          Rajesh Balamohan

          People

            Assignee: Rajesh Balamohan (rajesh.balamohan)
            Reporter: Rajesh Balamohan (rajesh.balamohan)
            Votes: 0
            Watchers: 4
