Hive / HIVE-20604

Minor compaction disables ORC column stats

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 4.0.0-alpha-1
    • Component/s: Transactions
    • Labels: None

    Description

        @Override
        public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter
              getRawRecordWriter(Path path, Options options) throws IOException {
          final Path filename = AcidUtils.createFilename(path, options);
          final OrcFile.WriterOptions opts =
              OrcFile.writerOptions(options.getTableProperties(), options.getConfiguration());
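          if (!options.isWritingBase()) {
            opts.bufferSize(OrcRecordUpdater.DELTA_BUFFER_SIZE)
                .stripeSize(OrcRecordUpdater.DELTA_STRIPE_SIZE)
                .blockPadding(false)
                .compress(CompressionKind.NONE)
                .rowIndexStride(0);  // stride 0 is the setting this issue is about
          }
          // ... (excerpt ends here; the rest of the method is not shown)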

      Setting rowIndexStride(0) makes StripeStatistics.getColumnStatistics() still return statistics objects, but with meaningless values: for example, min/max for IntegerColumnStatistics are set to MIN_LONG/MAX_LONG.

      This interferes with the ability to infer the minimum ROW_ID for a split, and it also creates inefficient files.
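
      To illustrate why placeholder statistics get in the way of inferring a minimum ROW_ID, here is a minimal, self-contained sketch. It is not code from Hive or ORC. The class RowIdStatsSketch and the method usableMinimum are hypothetical, and it assumes that MIN_LONG/MAX_LONG correspond to Long.MIN_VALUE/Long.MAX_VALUE. A reader that wants a trustworthy minimum has to treat such a pair as "unknown" and fall back to scanning the data.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of Hive or ORC: models the min/max pair that
// an integer column's stripe statistics would report.
class RowIdStatsSketch {

    /**
     * Returns the reported minimum only when the pair looks like real data.
     * Assumption: "meaningless" stats show up as the full long range
     * (Long.MIN_VALUE / Long.MAX_VALUE) or as an inverted range (min > max).
     */
    static OptionalLong usableMinimum(long min, long max) {
        boolean fullRangePlaceholder = min == Long.MIN_VALUE && max == Long.MAX_VALUE;
        boolean invertedRange = min > max;
        if (fullRangePlaceholder || invertedRange) {
            return OptionalLong.empty(); // caller must scan rows instead
        }
        return OptionalLong.of(min);
    }

    public static void main(String[] args) {
        // Placeholder stats: no minimum can be inferred.
        System.out.println(usableMinimum(Long.MIN_VALUE, Long.MAX_VALUE).isPresent());
        // Real stats: the minimum is usable directly.
        System.out.println(usableMinimum(5L, 42L).getAsLong());
    }
}
```

      The check is deliberately conservative: a column whose values genuinely span the full long range would also be reported as unknown, which costs a scan but never returns a wrong minimum.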

      Attachments

        1. HIVE-20604.01.patch (2 kB, Eugene Koifman)

            People

              Assignee: Eugene Koifman (ekoifman)
              Reporter: Eugene Koifman (ekoifman)
              Votes: 0
              Watchers: 5