[PIG-2573] Automagically setting parallelism based on input file size does not work with HCatalog - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11
Component/s: None
Labels: None

Description

PIG-2334 was helpful in understanding this issue. In short, the input file size is only computed if the path begins with a whitelisted prefix, currently:

/
hdfs:
file:
s3n:

Because HCatalog locations use the form dbname.tablename, which matches none of these prefixes, the input file size is not computed and the size-based parallelism optimization breaks.
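The prefix check described above can be sketched as follows. This is a minimal illustration in Python, not Pig's actual code: the function name and the example locations are hypothetical, and only the four prefixes come from this report.

```python
# Prefixes listed in this report; locations starting with one of them
# are the only ones whose size gets computed.
SIZED_PREFIXES = ("/", "hdfs:", "file:", "s3n:")


def size_is_computed(location: str) -> bool:
    """Return True if the location starts with a whitelisted prefix.

    Hypothetical helper that mirrors the behavior described in the report.
    """
    return location.startswith(SIZED_PREFIXES)


# Hypothetical example locations.
print(size_is_computed("hdfs://namenode/data/logs"))  # whitelisted prefix
print(size_is_computed("mydb.mytable"))               # HCatalog-style name
```

An HCatalog-style name such as mydb.mytable fails the check, so its size never enters the total.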

DETAILS:

I discovered this issue while comparing two runs of the same script, one loading regular HDFS paths and one loading HCatalog db.table names. I happened to notice the difference in the "Setting number of reducers" log line.

When loading HDFS files, the number of reducers is set to 99:

2012-03-08 01:33:56,522 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=98406674162
2012-03-08 01:33:56,522 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 99

When loading with an HCatalog db.table name, the number of reducers is set to 1:

2012-03-08 01:06:02,283 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
2012-03-08 01:06:02,283 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
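The two reducer counts are consistent with a simple estimate: divide the total input size by BytesPerReducer, round up, and clamp the result between 1 and maxReducers. The sketch below assumes that formula, which is inferred from the log lines rather than copied from Pig's source; the function name is hypothetical.

```python
def estimate_reducers(total_input_bytes: int,
                      bytes_per_reducer: int = 1_000_000_000,
                      max_reducers: int = 999) -> int:
    """Estimate the reducer count from the total input size.

    Assumed formula: ceiling division, clamped to [1, max_reducers].
    Defaults match the values shown in the log lines.
    """
    # Integer ceiling division avoids floating-point rounding.
    needed = -(-total_input_bytes // bytes_per_reducer)
    return min(max_reducers, max(1, needed))


print(estimate_reducers(98406674162))  # HDFS run: 99
print(estimate_reducers(0))            # HCatalog run: 1
```

With totalInputFileSize=0 the estimate bottoms out at one reducer, which is why the HCatalog run loses all size-based parallelism.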

Possible fix: Pig should just ask the loader for the size of its inputs rather than special-casing certain location types.
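One way to picture the proposed fix is a sketch in which each loader reports its own input size and Pig only sums the answers. Everything here is hypothetical (the interface, the class names, the method name) and is not taken from the attached patches.

```python
from typing import Iterable, Optional, Protocol


class SizedLoader(Protocol):
    """Hypothetical interface: a loader that can report its own input size."""

    def input_size_bytes(self) -> Optional[int]:
        """Return the input size in bytes, or None if it is unknown."""
        ...


class FixedSizeLoader:
    """Stand-in loader for illustration; a real one would query its storage."""

    def __init__(self, location: str, size_bytes: Optional[int]) -> None:
        self.location = location
        self.size_bytes = size_bytes

    def input_size_bytes(self) -> Optional[int]:
        return self.size_bytes


def total_input_size(loaders: Iterable[SizedLoader]) -> int:
    """Sum the sizes reported by the loaders, ignoring unknown ones."""
    sizes = (loader.input_size_bytes() for loader in loaders)
    return sum(size for size in sizes if size is not None)


# The location format no longer matters; only the reported size does.
hcat = FixedSizeLoader("mydb.mytable", 98406674162)
print(total_input_size([hcat]))
```

Under this design, an HCatalog loader that knows its table size contributes to the total exactly as an HDFS path would.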

Attachments

- PIG-2573_get_size_from_stats_if_possible_2.diff (12/Mar/12 23:24, 20 kB, Travis Crawford)
- PIG-2573_get_size_from_stats_if_possible_3.diff (13/Mar/12 19:03, 20 kB, Travis Crawford)
- PIG-2573_get_size_from_stats_if_possible_4.diff (13/Mar/12 23:17, 22 kB, Travis Crawford)
- PIG-2573_get_size_from_stats_if_possible.diff (11/Mar/12 05:14, 19 kB, Travis Crawford)
- PIG-2573_move_getinputbytes_to_loadfunc.diff (08/Mar/12 21:07, 12 kB, Travis Crawford)

People

Assignee: Travis Crawford

Reporter: Travis Crawford

Votes: 0

Watchers: 6

Dates

Created: 08/Mar/12 01:54

Updated: 22/Feb/13 04:53

Resolved: 16/Mar/12 22:11