PARQUET-1935

[C++][Parquet] nullptr access violation when writing arrays of non-nullable values

Details

    Description

      I'm updating ParquetSharp to build against Arrow 2.0.0 (currently using Arrow 1.0.1). One of our unit tests now throws a nullptr access violation.

      I have narrowed it down to writing arrays of non-nullable values (in this case, the column contains int[]). If the values are nullable, the test passes.

      The Parquet file schema is as follows:

      • GroupNode("schema", LogicalType.None, Repetition.Required)
        • GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)
          • GroupNode("list", LogicalType.None, Repetition.Repeated)
            • PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)
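
      As a rough illustration of what this schema implies for the leaf column, the sketch below derives the maximum definition and repetition levels from the repetition of each node on the path, using the usual Dremel-style rule (every optional or repeated node adds one definition level, every repeated node adds one repetition level). The names Repetition, MaxLevels and ComputeMaxLevels are hypothetical and are not ParquetSharp or parquet-cpp types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only; not part of any Parquet library.
enum class Repetition { Required, Optional, Repeated };

struct MaxLevels {
  int16_t def_level;
  int16_t rep_level;
};

// Walks the path from the first node below the schema root down to the leaf.
// Optional and repeated nodes each add one definition level;
// repeated nodes additionally add one repetition level.
MaxLevels ComputeMaxLevels(const std::vector<Repetition>& path) {
  assert(!path.empty());
  MaxLevels levels{0, 0};
  for (Repetition r : path) {
    if (r != Repetition::Required) ++levels.def_level;
    if (r == Repetition::Repeated) ++levels.rep_level;
  }
  return levels;
}
```

      For the path array_of_ints_column (Optional), list (Repeated), item (Required), this gives a maximum definition level of 2 and a maximum repetition level of 1, so a definition level of 0 would denote a null array.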

      The test crashes when calling TypedColumnWriter::WriteBatchSpaced with the following arguments:

      • num_values = 1
      • def_levels = {0}
      • rep_levels = {0}
      • valid_bits = {0}
      • valid_bit_offset = 0
      • values = {} (i.e. nullptr)
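
      To make the arguments above concrete, here is a minimal sketch of how a validity bitmap such as valid_bits = {0} can be read, assuming least-significant-bit-first ordering as used by Arrow bitmaps. GetBitSketch and CountValidSlots are hypothetical helpers written for this illustration, not library functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reads bit i of a bitmap, least-significant bit first.
bool GetBitSketch(const uint8_t* bits, int64_t i) {
  assert(bits != nullptr);
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Counts the slots marked valid, i.e. how many entries a consumer
// of the values buffer would actually need to read.
int64_t CountValidSlots(const uint8_t* valid_bits, int64_t offset,
                        int64_t num_slots) {
  int64_t count = 0;
  for (int64_t i = 0; i < num_slots; ++i) {
    if (GetBitSketch(valid_bits, offset + i)) ++count;
  }
  return count;
}
```

      With valid_bits = {0}, valid_bit_offset = 0 and a single slot, no slot is marked valid, which is consistent with passing no values at all.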

      This call is effectively trying to write a null array and therefore (to my understanding) does not need to pass any values. Yet further down the call stack, the implementation tries to read one value out of values (which is nullptr).

      I believe the problem lies with

        void MaybeCalculateValidityBits(
          const int16_t* def_levels,
          int64_t batch_size,
          int64_t* out_values_to_write,
          int64_t* out_spaced_values_to_write,
          int64_t* null_count) {
          if (bits_buffer_ == nullptr) {
            if (!level_info_.HasNullableValues()) {
              *out_values_to_write = batch_size;
              *out_spaced_values_to_write = batch_size;
              *null_count = 0;
            } else {
              for (int x = 0; x < batch_size; x++) {
                *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
                *out_spaced_values_to_write +=
                    def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
              }
              *null_count = *out_values_to_write - *out_spaced_values_to_write;
            }
            return;
          }
      
          // ...
        }
      

      In particular, level_info_.HasNullableValues() returns false because the array elements cannot be null. My understanding is that this is wrong, since the arrays themselves are nullable.
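
      The difference between the two branches can be sketched in isolation. The code below is a simplified stand-in, not the parquet-cpp implementation: LevelInfoSketch, WriteCounts, CountWithShortcut and CountFromDefLevels are hypothetical names, and the level values used in the usage note (def_level = 2, repeated_ancestor_def_level = 2) are assumptions derived from the schema above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the leaf column's level information.
struct LevelInfoSketch {
  int16_t def_level;                    // maximum definition level of the leaf
  int16_t repeated_ancestor_def_level;  // level from which the enclosing list has a slot
};

struct WriteCounts {
  int64_t values_to_write;
  int64_t spaced_values_to_write;
};

// Mirrors the shortcut branch: every entry in the batch is assumed
// to carry a value, regardless of its definition level.
WriteCounts CountWithShortcut(int64_t batch_size) {
  return WriteCounts{batch_size, batch_size};
}

// Mirrors the level-driven branch: only entries whose definition level
// reaches the relevant threshold are counted.
WriteCounts CountFromDefLevels(const std::vector<int16_t>& def_levels,
                               const LevelInfoSketch& info) {
  assert(info.repeated_ancestor_def_level <= info.def_level);
  WriteCounts counts{0, 0};
  for (int16_t d : def_levels) {
    counts.values_to_write += d == info.def_level ? 1 : 0;
    counts.spaced_values_to_write +=
        d >= info.repeated_ancestor_def_level ? 1 : 0;
  }
  return counts;
}
```

      For def_levels = {0} and batch_size = 1, the shortcut reports one value to write, while counting from the definition levels reports zero. Under these assumptions, only the shortcut would lead to a read from the empty values buffer.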

      This code appears to have been introduced by ARROW-9603.

            People

              emkornfield@gmail.com Micah Kornfield
              GPSnoopy Tanguy Fautre

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h