Description
Pig-on-Tez jobs reading larger HCatalog tables hit PIG-4443, while running against plain HDFS data with even triple the number of splits (100K+ splits and tasks) does not hit that issue.
HCatBaseInputFormat.java:

    // Call getSplits on the underlying InputFormat and create an
    // HCatSplit for each underlying split. numSplits is 0 for our purposes.
    org.apache.hadoop.mapred.InputSplit[] baseSplits = inputFormat.getSplits(jobConf, 0);
    for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
        splits.add(new HCatSplit(partitionInfo, split, allCols));
    }
Each HCatSplit duplicates the partition schema and the table schema, so the total serialized size of the splits grows with the number of splits times the schema size.
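The cost of that duplication can be illustrated with a minimal, self-contained sketch. The `FatSplit` class below is hypothetical (it is not the actual HCatSplit code); it simply embeds a full copy of a schema string in every split, the way HCatSplit carries its own copies of the partition and table schemas, and shows that serializing each split independently pays for the schema once per split:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SplitSizeSketch {

    // Hypothetical stand-in for a split that, like HCatSplit, carries
    // its own copy of the schema rather than a shared reference.
    static class FatSplit implements Serializable {
        final String schema; // duplicated in every split
        final long offset;

        FatSplit(String schema, long offset) {
            this.schema = schema;
            this.offset = offset;
        }
    }

    // Serialize each split separately (as a per-split write() would)
    // and sum the bytes; the duplicated schema is paid for every time.
    static long totalSerializedSize(FatSplit[] splits) {
        long total = 0;
        try {
            for (FatSplit s : splits) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                    oos.writeObject(s);
                }
                total += bos.size();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // A ~10 KB schema string, roughly the scale of a wide table schema.
        String schema = new String(new char[10_000]).replace('\0', 'c');
        FatSplit[] splits = new FatSplit[1_000];
        for (int i = 0; i < splits.length; i++) {
            splits[i] = new FatSplit(schema, i);
        }
        // With 100K+ splits the duplicated schema dominates the payload,
        // which is what pushes Pig-on-Tez jobs into PIG-4443 territory.
        System.out.println("total serialized size of 1000 splits: "
                + totalSerializedSize(splits) + " bytes");
    }
}
```

The fix direction implied by the description is the inverse of this sketch: serialize the shared schema once and let each split refer to it, so split payload stays proportional to the number of splits rather than splits times schema size.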
Attachments
Issue Links
- is related to: HIVE-11344 HIVE-9845 makes HCatSplit.write modify the split so that PartInfo objects are unusable after it (Closed)
- relates to: PIG-4443 Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits (Closed)