[HIVE-9845] HCatSplit repeats information making input split data size huge - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: HCatalog
Labels: None

Description

Pig on Tez jobs with larger tables hit PIG-4443. Running directly on HDFS data does not hit that issue, even with triple the number of splits (100K+ splits and tasks).

HCatBaseInputFormat.java:

      //Call getSplit on the InputFormat, create an
      //HCatSplit for each underlying split
      //NumSplits is 0 for our purposes
      org.apache.hadoop.mapred.InputSplit[] baseSplits =
        inputFormat.getSplits(jobConf, 0);

      for(org.apache.hadoop.mapred.InputSplit split : baseSplits) {
        splits.add(new HCatSplit(
            partitionInfo,
            split,allCols));
      }

Each HCatSplit duplicates the partition schema and the table schema, so the total serialized split data grows with the number of splits.
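
The size effect can be illustrated with a minimal, Hadoop-free sketch. Everything in it is hypothetical: `ToySplit`, `toySchema`, `duplicatedSize` and `sharedSize` are stand-ins invented for illustration, not HCatalog classes, and the "shared" layout is only one possible way to avoid the repetition, not necessarily what the attached patches do. It compares the serialized size when every split carries its own copy of a schema string against the size when the schema is written once.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SplitSizeSketch {

    /** Hypothetical stand-in for a split: a path plus an optional schema copy. */
    static final class ToySplit {
        final String path;
        final String schema; // null when the schema is stored once, outside the split

        ToySplit(String path, String schema) {
            this.path = path;
            this.schema = schema;
        }

        void write(DataOutputStream out) throws IOException {
            out.writeUTF(path);
            out.writeBoolean(schema != null);
            if (schema != null) {
                out.writeUTF(schema);
            }
        }
    }

    /** Builds a made-up schema string such as "col0:string,col1:string,...". */
    static String toySchema(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            if (i > 0) sb.append(',');
            sb.append("col").append(i).append(":string");
        }
        return sb.toString();
    }

    /** Writes an optional shared schema, then every split; returns the byte count. */
    static int serializedSize(String sharedSchema, List<ToySplit> splits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (sharedSchema != null) out.writeUTF(sharedSchema);
            for (ToySplit s : splits) s.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    private static List<ToySplit> buildSplits(int numSplits, String perSplitSchema) {
        List<ToySplit> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ToySplit("/warehouse/t/part-" + i, perSplitSchema));
        }
        return splits;
    }

    /** Every split carries its own schema copy (the pattern described above). */
    static int duplicatedSize(int numSplits, String schema) {
        return serializedSize(null, buildSplits(numSplits, schema));
    }

    /** The schema is written once and the splits carry only their paths. */
    static int sharedSize(int numSplits, String schema) {
        return serializedSize(schema, buildSplits(numSplits, null));
    }

    public static void main(String[] args) {
        String schema = toySchema(100);
        System.out.println("duplicated: " + duplicatedSize(1000, schema) + " bytes");
        System.out.println("shared:     " + sharedSize(1000, schema) + " bytes");
    }
}
```

In this toy model the duplicated layout costs one extra schema copy per split beyond the first, so the overhead scales linearly with the split count. Real HCatSplit serialization involves more than a single string, so the numbers are indicative only.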

Attachments

- HIVE-9845.6.patch, 06/May/15 16:50, 14 kB, Sushanth Sowmyan
- HIVE-9845.5.patch, 05/May/15 18:33, 14 kB, Mithun Radhakrishnan
- HIVE-9845.4.patch, 10/Apr/15 18:25, 13 kB, Mithun Radhakrishnan
- HIVE-9845.3.patch, 31/Mar/15 18:12, 13 kB, Mithun Radhakrishnan
- HIVE-9845.1.patch, 20/Mar/15 21:36, 10 kB, Mithun Radhakrishnan

Issue Links

- is related to: HIVE-11344, "HIVE-9845 makes HCatSplit.write modify the split so that PartInfo objects are unusable after it" (Closed)
- relates to: PIG-4443, "Write inputsplits in Tez to disk if the size is huge and option to compress pig input splits" (Closed)

People

Assignee: Mithun Radhakrishnan

Reporter: Rohini Palaniswamy

Votes: 0

Watchers: 5

Dates

Created: 03/Mar/15 22:55

Updated: 22/Jul/15 21:29

Resolved: 06/May/15 21:05