Hive / HIVE-6326

Split generation in ORC may generate wrong split boundaries because of unaccounted padded bytes


Details

    Description

      HIVE-5091 added padding to ORC files to prevent ORC stripes from straddling HDFS block boundaries. The length of these padding bytes is not stored in the stripe information. OrcInputFormat.getSplits() uses stripeInformation.getLength() for split computation, and stripeInformation.getLength() is the sum of the index length, data length, and stripe footer length. It does not account for the padding bytes, which may result in wrong split boundaries.

      The fix is to treat the offset of the next stripe as the end of the current stripe, so that the length used for the current stripe includes the padding bytes as well.
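
      The difference between the stored stripe length and the offset-based length can be illustrated with a minimal sketch. It is not the code from the attached patches: the Stripe class, the StripeExtents class and the byte values are hypothetical stand-ins chosen for illustration, and the fallback for the last stripe (which has no successor) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class StripeExtents {
    /** Hypothetical stand-in for ORC stripe metadata; not Hive's StripeInformation. */
    static final class Stripe {
        final long offset; // byte offset at which the stripe starts in the file
        final long length; // index + data + stripe footer bytes, padding excluded

        Stripe(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Length taken from stored metadata only; padding after the stripe is missed. */
    static long storedLength(List<Stripe> stripes, int i) {
        return stripes.get(i).length;
    }

    /**
     * Length measured up to the next stripe's offset, so trailing padding is included.
     * Assumption: the last stripe has no successor, so its stored length is used.
     */
    static long effectiveLength(List<Stripe> stripes, int i) {
        Stripe current = stripes.get(i);
        if (i + 1 < stripes.size()) {
            return stripes.get(i + 1).offset - current.offset;
        }
        return current.length;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = new ArrayList<>();
        stripes.add(new Stripe(3, 900));    // data ends at byte 903
        stripes.add(new Stripe(1024, 800)); // starts after 121 padding bytes

        for (int i = 0; i < stripes.size(); i++) {
            System.out.println("stripe " + i
                + " stored=" + storedLength(stripes, i)
                + " effective=" + effectiveLength(stripes, i));
        }
    }
}
```

      In this made-up layout the first stripe has a stored length of 900 bytes but spans 1021 bytes up to the start of the next stripe; the 121-byte gap is the padding that a split boundary computed from stored lengths alone would miss.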

      Attachments

        1. HIVE-6326.1.patch
          13 kB
          Prasanth Jayachandran
        2. HIVE-6326.2.patch
          14 kB
          Prasanth Jayachandran
        3. HIVE-6326.3.patch
          0.9 kB
          Prasanth Jayachandran
        4. HIVE-6326.4.patch
          1 kB
          Prasanth Jayachandran

            People

              prasanth_j Prasanth Jayachandran
              Votes: 0
              Watchers: 6
