  1. Hive
  2. HIVE-9845

HCatSplit repeats information making input split data size huge


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: HCatalog
    • Labels: None

    Description

      Pig-on-Tez jobs that read larger tables hit PIG-4443. Jobs running on HDFS data do not hit that issue, even with triple the number of splits (100K+ splits and tasks).

      HCatBaseInputFormat.java:
            // Call getSplit on the InputFormat, create an
            // HCatSplit for each underlying split
            // NumSplits is 0 for our purposes
            org.apache.hadoop.mapred.InputSplit[] baseSplits =
                inputFormat.getSplits(jobConf, 0);

            for (org.apache.hadoop.mapred.InputSplit split : baseSplits) {
              splits.add(new HCatSplit(partitionInfo, split, allCols));
            }
      

      Each HCatSplit carries its own copy of the partition schema and the table schema, so the serialized split data grows with the number of splits.
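
      The size effect can be illustrated with a small, self-contained sketch. This is not HCatalog code and not the fix from the attached patches: the class SplitSizeSketch, its methods, and the example paths and schema string are all hypothetical stand-ins. It only compares the serialized size when a schema string is written with every split against writing it once for all splits.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in HCatalog.
class SplitSizeSketch {

    // Serializes numSplits splits, writing the schema string with every split.
    static int sizeWithDuplicatedSchema(int numSplits, String schema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("/warehouse/t/part-" + i); // stand-in for the underlying split
                out.writeUTF(schema);                   // repeated for each split
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Serializes the schema string once, followed by numSplits splits without it.
    static int sizeWithSharedSchema(int numSplits, String schema) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(schema);                       // written a single time
            for (int i = 0; i < numSplits; i++) {
                out.writeUTF("/warehouse/t/part-" + i);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // Assumed example: a wide table with 200 string columns.
        StringBuilder schema = new StringBuilder();
        for (int c = 0; c < 200; c++) {
            schema.append("col").append(c).append(":string,");
        }
        int numSplits = 1000;
        System.out.println("duplicated schema: "
            + sizeWithDuplicatedSchema(numSplits, schema.toString()) + " bytes");
        System.out.println("shared schema:     "
            + sizeWithSharedSchema(numSplits, schema.toString()) + " bytes");
    }
}
```

      In this sketch the overhead of the duplicated form is (numSplits - 1) copies of the schema, so it scales with both table width and split count.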

      Attachments

        1. HIVE-9845.1.patch
          10 kB
          Mithun Radhakrishnan
        2. HIVE-9845.3.patch
          13 kB
          Mithun Radhakrishnan
        3. HIVE-9845.4.patch
          13 kB
          Mithun Radhakrishnan
        4. HIVE-9845.5.patch
          14 kB
          Mithun Radhakrishnan
        5. HIVE-9845.6.patch
          14 kB
          Sushanth Sowmyan

            People

              Assignee: Mithun Radhakrishnan
              Reporter: Rohini Palaniswamy