[HIVE-15546] Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: Hive
Labels: None

Description

When running on blob stores (such as S3), where metadata operations (such as listStatus) are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.

The method performs a listStatus on each input path to check whether that path is empty. If it is, a dummy file is created for the corresponding partition. All of this is done sequentially, which can be very slow when there are many empty partitions. Even when every partition has input data, the sequential listStatus calls can take a long time.

We should either:

(1) Just remove the logic to check if each input path is empty, and handle any edge cases accordingly.

(2) Multi-thread the listStatus calls.
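
Option (2) could look roughly like the following. This is a minimal sketch, not the code from the attached patches: it uses the local filesystem through java.nio as a stand-in for the Hadoop listStatus call, and the class name ParallelEmptyCheck, the helper isEmpty, and the method findEmptyPaths are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

public class ParallelEmptyCheck {

    // Hypothetical stand-in for the costly listStatus call:
    // returns true if the directory contains no entries.
    static boolean isEmpty(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return !entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Submits one emptiness check per input path to a fixed-size pool,
    // then collects the results and returns the empty paths in input order.
    static List<Path> findEmptyPaths(List<Path> inputPaths, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            Map<Path, Future<Boolean>> pending = new LinkedHashMap<>();
            for (Path p : inputPaths) {
                pending.put(p, pool.submit(() -> isEmpty(p)));
            }
            List<Path> empty = new ArrayList<>();
            for (Map.Entry<Path, Future<Boolean>> e : pending.entrySet()) {
                if (e.getValue().get()) {
                    empty.add(e.getKey());
                }
            }
            return empty;
        } finally {
            // Always release the worker threads, even if a check failed.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build two throwaway "partitions": one with a file, one empty.
        Path root = Files.createTempDirectory("partitions");
        Path full = Files.createDirectory(root.resolve("part=1"));
        Path empty = Files.createDirectory(root.resolve("part=2"));
        Files.createFile(full.resolve("data.txt"));
        System.out.println(findEmptyPaths(Arrays.asList(full, empty), 4));
    }
}
```

In this sketch the pool is shut down in a finally block so that worker threads do not outlive the call. Leaked worker threads appear to be the kind of problem the linked HIVE-16949 describes. The dummy-file creation for empty partitions is left out here and would still need to happen for each path returned.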

Attachments

HIVE-15546.6.patch, 23/Jan/17 22:11, 9 kB, Sahil Takiar
HIVE-15546.5.patch, 18/Jan/17 23:11, 9 kB, Sahil Takiar
HIVE-15546.4.patch, 18/Jan/17 03:15, 9 kB, Sahil Takiar
HIVE-15546.3.patch, 06/Jan/17 23:33, 6 kB, Sahil Takiar
HIVE-15546.2.patch, 06/Jan/17 18:26, 6 kB, Sahil Takiar
HIVE-15546.1.patch, 05/Jan/17 22:37, 0.7 kB, Sahil Takiar

Issue Links

breaks: HIVE-16949 Leak of threads from Get-Input-Paths and Get-Input-Summary thread pool (Closed)

relates to: HIVE-21546 hiveserver2 - “mapred.FileInputFormat: Total input files to process” - why single threaded? (Open)

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 9

Dates

Created: 05/Jan/17 22:32

Updated: 01/Apr/19 08:01

Resolved: 24/Jan/17 22:48