Parquet / PARQUET-1935

[C++][Parquet] nullptr access violation when writing arrays of non-nullable values

    Details

      Description

      I'm updating ParquetSharp to build against Arrow 2.0.0 (it currently uses Arrow 1.0.1). One of our unit tests now throws a nullptr access violation.

      I have narrowed it down to writing arrays of non-nullable values (in this case the column contains int[]). If the values are nullable, the test passes.

      The Parquet file schema is as follows:

      • GroupNode("schema", LogicalType.None, Repetition.Required)
        • GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)
          • GroupNode("list", LogicalType.None, Repetition.Repeated)
            • PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)
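
      The definition and repetition levels implied by this schema can be worked out by hand. The sketch below is a simplified, standalone model rather than the Parquet C++ API: `Repetition`, `Levels`, `ComputeLevels` and `ArrayOfIntsLevels` are hypothetical names, and the rule used for `repeated_ancestor_def_level` is an assumption about how Arrow's level info is derived.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the schema concepts; not the Parquet C++ types.
enum class Repetition { Required, Optional, Repeated };

struct Levels {
  int16_t def_level = 0;                    // max definition level at the leaf
  int16_t rep_level = 0;                    // max repetition level at the leaf
  int16_t repeated_ancestor_def_level = 0;  // def level of the closest repeated ancestor (assumed rule)
};

// Walks the path from the first child of the root down to the leaf.
Levels ComputeLevels(const std::vector<Repetition>& path) {
  Levels levels;
  for (Repetition r : path) {
    switch (r) {
      case Repetition::Required:
        break;  // required nodes add no level
      case Repetition::Optional:
        ++levels.def_level;
        break;
      case Repetition::Repeated:
        ++levels.def_level;
        ++levels.rep_level;
        levels.repeated_ancestor_def_level = levels.def_level;
        break;
    }
  }
  return levels;
}

// array_of_ints_column (optional) -> list (repeated) -> item (required)
Levels ArrayOfIntsLevels() {
  return ComputeLevels(
      {Repetition::Optional, Repetition::Repeated, Repetition::Required});
}
```

      Under this model the leaf ends up with `def_level == 2`, `rep_level == 1` and `repeated_ancestor_def_level == 2`: the required item adds no definition level of its own, even though the enclosing list is optional.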

      The test crashes when calling TypedColumnWriter::WriteBatchSpaced with the following arguments:

      • num_values = 1
      • def_levels = {0}
      • rep_levels = {0}
      • valid_bits = {0}
      • valid_bit_offset = 0
      • values = {} (i.e. nullptr)

      This call is effectively trying to write a null array and therefore, to my understanding, does not need to pass any values. Yet further down the call stack, the implementation tries to read one value out of values (which is nullptr).
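
      To make the "no values needed" reasoning concrete, here is a minimal sketch of how a definition level maps to a slot for the schema above. It assumes a maximum definition level of 2 for the item leaf (one for the optional list, one for the repeated group); `Slot`, `Classify` and `CountLeafValues` are hypothetical helpers, not part of Parquet.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed maximum definition level for the "item" leaf in the schema above.
constexpr int16_t kMaxDefLevel = 2;

enum class Slot { NullList, EmptyList, Item };

// Interprets one definition level for an optional list of required items.
Slot Classify(int16_t def_level) {
  assert(def_level >= 0 && def_level <= kMaxDefLevel);
  if (def_level == 0) return Slot::NullList;   // the list itself is null
  if (def_level == 1) return Slot::EmptyList;  // the list exists but has no items
  return Slot::Item;                           // an actual int value is present
}

// Counts how many leaf values a batch of definition levels requires.
int64_t CountLeafValues(const std::vector<int16_t>& def_levels) {
  int64_t count = 0;
  for (int16_t d : def_levels) {
    if (Classify(d) == Slot::Item) ++count;
  }
  return count;
}
```

      With `def_levels = {0}`, as in the failing call, `CountLeafValues` yields zero, so under this interpretation an empty `values` buffer should be acceptable.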

      I believe the problem lies with

        void MaybeCalculateValidityBits(
          const int16_t* def_levels,
          int64_t batch_size,
          int64_t* out_values_to_write,
          int64_t* out_spaced_values_to_write,
          int64_t* null_count) {
          if (bits_buffer_ == nullptr) {
            if (!level_info_.HasNullableValues()) {
              *out_values_to_write = batch_size;
              *out_spaced_values_to_write = batch_size;
              *null_count = 0;
            } else {
              for (int x = 0; x < batch_size; x++) {
                *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
                *out_spaced_values_to_write +=
                    def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
              }
              *null_count = *out_values_to_write - *out_spaced_values_to_write;
            }
            return;
          }
      
          // ...
        }
      

      In particular, level_info_.HasNullableValues() returns false given that the arrays cannot contain null-values. My understanding is that this is wrong, since the arrays themselves are nullable.
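
      The suspected mismatch can be reproduced outside of Arrow with a small model of the early-return branch quoted above. This is a sketch under stated assumptions, not the library code: `LevelInfoModel`, `Counts`, `CountLikeQuotedBranch` and `CountFromDefLevels` are hypothetical names, the level values (def_level = 2, repeated_ancestor_def_level = 2) are assumed for this schema, and `HasNullableValues()` is assumed to compare the two levels.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the level info for the "item" leaf.
struct LevelInfoModel {
  int16_t def_level;
  int16_t repeated_ancestor_def_level;
  // Assumed definition: the leaf is nullable only if it adds a def level
  // beyond its closest repeated ancestor.
  bool HasNullableValues() const { return repeated_ancestor_def_level < def_level; }
};

struct Counts {
  int64_t values_to_write = 0;
  int64_t spaced_values_to_write = 0;
};

// Mirrors the counting in the quoted branch (bits_buffer_ == nullptr case).
Counts CountLikeQuotedBranch(const LevelInfoModel& info,
                             const std::vector<int16_t>& def_levels) {
  Counts c;
  const int64_t batch_size = static_cast<int64_t>(def_levels.size());
  if (!info.HasNullableValues()) {
    // Every level is treated as a value, even a null or empty list.
    c.values_to_write = batch_size;
    c.spaced_values_to_write = batch_size;
  } else {
    for (int16_t d : def_levels) {
      c.values_to_write += d == info.def_level ? 1 : 0;
      c.spaced_values_to_write += d >= info.repeated_ancestor_def_level ? 1 : 0;
    }
  }
  return c;
}

// Counts only the levels that actually carry a leaf value.
int64_t CountFromDefLevels(const LevelInfoModel& info,
                           const std::vector<int16_t>& def_levels) {
  int64_t n = 0;
  for (int16_t d : def_levels) n += d == info.def_level ? 1 : 0;
  return n;
}
```

      For `def_levels = {0}` with the assumed levels, the modelled branch reports one value to write while the definition levels call for none, which would explain a read from a null `values` pointer.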

      This code appears to have been introduced by ARROW-9603.

              People

              • Assignee:
                emkornfield@gmail.com Micah Kornfield
                Reporter:
                GPSnoopy Tanguy Fautre

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1h