[HIVE-6872] Explore options of optimizing FileSinkOperator-->getDynOutPaths() - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

1. Download hive-testbench from https://github.com/cartershanklin/hive-testbench
2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned"
3. Data population for most tables that are partitioned, bucketed, and sorted runs very slowly, even at a scale factor of 10 on a 20-node cluster.

The bottleneck seems to be in FileSinkOperator-->getDynOutPaths(), where it tries to close the FSPath writers. Every call takes about 150-200 ms.

set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=4096;

With the above settings, one of the data loads (for the web_sales table) spent roughly 4096 * 150 ms, or about 600 seconds, just closing the writers sequentially.
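The estimate above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the class and method names are hypothetical, and it assumes all 4096 writers allowed by hive.exec.max.dynamic.partitions.pernode are open and are closed one after another at the reported 150-200 ms each.

```java
public class SequentialCloseCost {
    /** Total wall-clock seconds when writers are closed one after another. */
    static double sequentialCloseSeconds(int writers, double millisPerClose) {
        return writers * millisPerClose / 1000.0;
    }

    public static void main(String[] args) {
        // Assumption: one open writer per allowed dynamic partition.
        int writers = 4096;
        // Lower and upper bounds from the reported 150-200 ms per close.
        System.out.println("low  (150 ms): " + sequentialCloseSeconds(writers, 150) + " s");
        System.out.println("high (200 ms): " + sequentialCloseSeconds(writers, 200) + " s");
    }
}
```

At 150 ms per close the total is about 614 seconds, and at 200 ms it is about 819 seconds, so the "600 seconds" figure is the optimistic end of the range.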

The purpose of this JIRA is to identify options for optimizing the FileSinkOperator-->getDynOutPaths() code path. This would be especially beneficial for ETL-type workloads.
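One option that might be worth exploring is closing the writers concurrently instead of one at a time. The sketch below is not the attached patch and does not use Hive's classes: ParallelWriterClose and closeAll are hypothetical names, the writers are modeled as plain java.io.Closeable objects, and it assumes the underlying writers can safely be closed from different threads, which would need to be verified for the actual record writers and file system client.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelWriterClose {
    /**
     * Closes every writer on a fixed-size thread pool and waits for all of
     * them. The first IOException is rethrown; later failures are attached
     * to it as suppressed exceptions.
     */
    static void closeAll(List<? extends Closeable> writers, int threads)
            throws IOException, InterruptedException {
        if (writers.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (Closeable writer : writers) {
                Callable<Void> task = () -> {
                    writer.close();
                    return null;
                };
                pending.add(pool.submit(task));
            }
            IOException first = null;
            for (Future<Void> f : pending) {
                try {
                    f.get(); // wait for this close to finish
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    IOException io = (cause instanceof IOException)
                            ? (IOException) cause
                            : new IOException(cause);
                    if (first == null) {
                        first = io;
                    } else {
                        first.addSuppressed(io);
                    }
                }
            }
            if (first != null) {
                throw first;
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

If the close calls are dominated by waiting on I/O, a pool of N threads could in principle cut the sequential close time by up to a factor of N, but the real gain would depend on the cluster and would have to be measured.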

Attachments

- HIVE-6782-v3.patch, 15/Apr/14 13:09, 13 kB, Rajesh Balamohan
- HIVE-6782-v4.patch, 15/Apr/14 13:20, 13 kB, Rajesh Balamohan

People

Assignee: Rajesh Balamohan

Reporter: Rajesh Balamohan

Votes: 0

Watchers: 4

Dates

Created: 09/Apr/14 10:56

Updated: 15/Apr/14 23:39