Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-5129

[Rust][Parquet] Column writer bug: check dictionary encoder when adding a new data page

    XMLWordPrintableJSON

Details

    Description

      As part of my weekly routine, I glanced over code in Parquet column writer and found that the way we check when to add a new data page is buggy. The idea is checking the current encoder and deciding if we have written enough bytes for a page to construct. The problem is that we only check value encoder, regardless whether or not dictionary encoder is enabled.

      Here is how we do it now: actual check (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378) and the buggy function (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).

      In the case of sparse column and dictionary encoder we would write a single data page, even though we would have accumulated a large enough number of bytes for more than one page in encoder (value encoder will be empty, so it will always less than constant limit).

      I forgot that parquet-cpp has `current_encoder` as either value encoder or dictionary encoder (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544), but in parquet-rs we have them separate.

      So the fix could be something like this:

      /// Returns true if there is enough data for a data page, false otherwise.
      #[inline]
      fn should_add_data_page(&self) -> bool {
        match self.dict_encoder {
          Some(ref encoder) => {
            encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
          },
          None => {
            self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
          }
        }
      }
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              sadikovi Ivan Sadikov
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m