Hive / HIVE-20604

Minor compaction disables ORC column stats

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Transactions
    • Labels:
      None

      Description

        @Override
        public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter
              getRawRecordWriter(Path path, Options options) throws IOException {
          final Path filename = AcidUtils.createFilename(path, options);
          final OrcFile.WriterOptions opts =
              OrcFile.writerOptions(options.getTableProperties(), options.getConfiguration());
          if (!options.isWritingBase()) {
            opts.bufferSize(OrcRecordUpdater.DELTA_BUFFER_SIZE)
                .stripeSize(OrcRecordUpdater.DELTA_STRIPE_SIZE)
                .blockPadding(false)
                .compress(CompressionKind.NONE)
                // Stride 0 turns off the ORC row index; this is the setting the issue is about.
                .rowIndexStride(0)
            ;
          }
      

      Setting rowIndexStride(0) makes StripeStatistics.getColumnStatistics() return statistics objects whose values are meaningless; for example, IntegerColumnStatistics reports min/max as MIN_LONG/MAX_LONG.

      This prevents inferring the minimum ROW_ID for a split from the stripe statistics, and it also produces inefficient files.
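
      To illustrate why such placeholder statistics are a problem for a reader, here is a minimal sketch of the check a consumer would need before trusting a min/max pair. RowIdStats, isPlaceholder and minRowId are hypothetical names introduced for illustration; they are not part of Hive or ORC. The sketch assumes the placeholder range is exactly MIN_LONG/MAX_LONG as described above, and additionally treats an inverted range (min > max) as unusable, which is an assumption of this example.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; not a Hive or ORC class. */
final class RowIdStats {
    private RowIdStats() {}

    /**
     * Returns true when the min/max pair carries no information: either the
     * full long range described in this issue, or an inverted range
     * (the latter is an assumption of this sketch).
     */
    static boolean isPlaceholder(long min, long max) {
        return (min == Long.MIN_VALUE && max == Long.MAX_VALUE) || min > max;
    }

    /** Minimum ROW_ID for a stripe, or empty when the statistics cannot be trusted. */
    static OptionalLong minRowId(long min, long max) {
        return isPlaceholder(min, max) ? OptionalLong.empty() : OptionalLong.of(min);
    }

    public static void main(String[] args) {
        // Stripe with real statistics: the minimum can be used directly.
        System.out.println(minRowId(0L, 4_999L));
        // Stripe written with the row index disabled: nothing can be inferred.
        System.out.println(minRowId(Long.MIN_VALUE, Long.MAX_VALUE));
    }
}
```

      With usable statistics the minimum ROW_ID is available without reading any rows; with the placeholder range the caller has to fall back to scanning the data, which is the cost this issue describes.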

        Attachments

        1. HIVE-20604.01.patch
          2 kB
          Eugene Koifman

              People

              • Assignee: Eugene Koifman (ekoifman)
              • Reporter: Eugene Koifman (ekoifman)
              • Votes: 0
              • Watchers: 4
