  1. Hive
  2. HIVE-4868

When an MR job reads an ORC file, some Mappers may get no data to process

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Let's say a stripe of an ORC file is 256 MB and we set the split size for an MR job to 64 MB. Right now, splits are created based on byte ranges.
      Here is an example:

      |<-The start of a stripe                |<-The end of a stripe
      v                                       v
      |---------------------------------------|
         ^                        ^ 
         |<- The start of a split |<- The end of a split
      

      So, for some Mappers, the byte range of the split may contain no stripe start. Those Mappers will process 0 records. We can improve how splits are created for ORC.
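
      The effect can be illustrated with a minimal sketch. It assumes, as the description implies, that a stripe is processed by the Mapper whose split byte range contains the stripe's start offset. The helper names (make_splits, stripes_per_split) and the two-stripe file are hypothetical, and the file header and footer are ignored for simplicity; this is not Hive's actual split code.

```python
def make_splits(file_length, split_size):
    """Cut [0, file_length) into consecutive byte ranges of at most split_size bytes."""
    return [(start, min(start + split_size, file_length))
            for start in range(0, file_length, split_size)]


def stripes_per_split(stripe_starts, splits):
    """For each split, list the stripes whose start offset lies in [start, end)."""
    return [[s for s in stripe_starts if start <= s < end]
            for (start, end) in splits]


MB = 1024 * 1024
stripe_size = 256 * MB          # stripe size from the description
split_size = 64 * MB            # split size from the description
file_length = 2 * stripe_size   # assumption: a file holding two full stripes

stripe_starts = list(range(0, file_length, stripe_size))
splits = make_splits(file_length, split_size)
assignment = stripes_per_split(stripe_starts, splits)

# Splits with no stripe start correspond to Mappers that read no records.
empty = sum(1 for stripes in assignment if not stripes)
print(f"{empty} of {len(splits)} splits contain no stripe start")
```

      Under these assumptions only the splits that begin at a stripe boundary receive a stripe; the remaining splits still launch a Mapper that has nothing to read.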


          Activity

          yhuai Yin Huai added a comment -

          Assign to me first. If anyone wants to work on it, feel free to take it.

          yhuai Yin Huai added a comment -

          HIVE-5102 will address this issue.


            People

            • Assignee:
              yhuai Yin Huai
              Reporter:
              yhuai Yin Huai
            • Votes:
              0
              Watchers:
              1
