PARQUET-1935

[C++][Parquet] nullptr access violation when writing arrays of non-nullable values


Details

    Description

      I'm updating ParquetSharp to build against Arrow 2.0.0 (currently using Arrow 1.0.1). One of our unit tests is now throwing a nullptr access violation.

      I have narrowed it down to writing arrays of non-nullable values (in this case the column contains int[]). If the values are nullable, the test passes.

      The Parquet file schema is as follows:

      • GroupNode("schema", LogicalType.None, Repetition.Required)
        • GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)
          • GroupNode("list", LogicalType.None, Repetition.Repeated)
            • PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)
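
For reference, the maximum definition and repetition levels implied by this schema can be worked out by walking the path from the column group down to the leaf. The sketch below is a standalone illustration, not ParquetSharp or Arrow code: `Repetition`, `Levels` and `ComputeLevels` are hypothetical names, and it assumes the usual Parquet rule that each optional or repeated node adds one definition level and each repeated node adds one repetition level.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Parquet repetition types.
enum class Repetition { Required, Optional, Repeated };

// Maximum levels for a leaf column.
struct Levels {
  int16_t max_def_level = 0;
  int16_t max_rep_level = 0;
};

// Walks the path from the first node below the schema root down to the leaf.
// Optional and Repeated nodes each add one definition level;
// Repeated nodes additionally add one repetition level.
Levels ComputeLevels(const std::vector<Repetition>& path) {
  Levels levels;
  for (Repetition r : path) {
    if (r != Repetition::Required) ++levels.max_def_level;
    if (r == Repetition::Repeated) ++levels.max_rep_level;
  }
  return levels;
}
```

For the path array_of_ints_column (Optional), list (Repeated), item (Required), this gives a maximum definition level of 2 and a maximum repetition level of 1. Under that reading, a definition level of 0 denotes a null array, 1 an empty array, and 2 an actual int value.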

      The test crashes when calling TypedColumnWriter::WriteBatchSpaced with the following arguments:

      • num_values = 1
      • def_levels = {0}
      • rep_levels = {0}
      • valid_bits = {0}
      • valid_bit_offset = 0
      • values = {} (i.e. nullptr)

      This call is effectively trying to write a null array, and therefore (to my understanding) does not need to pass any values. Yet further down the call stack, the implementation tries to read one value out of values (which is nullptr).
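
To illustrate why no value should be needed here, the following sketch counts the slots that a validity bitmap marks as valid. It assumes the least-significant-bit-first order that Arrow validity bitmaps use; `GetBit` and `CountValidSlots` are hypothetical helpers written for this example, not functions from the Parquet API.

```cpp
#include <cstdint>

// Returns true if bit `i` is set, assuming least-significant-bit-first
// order within each byte (assumed to match Arrow validity bitmaps).
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Counts the slots marked valid in [offset, offset + length).
// Hypothetical helper for illustration only.
inline int64_t CountValidSlots(const uint8_t* valid_bits, int64_t offset,
                               int64_t length) {
  int64_t count = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (GetBit(valid_bits, offset + i)) ++count;
  }
  return count;
}
```

With valid_bits = {0}, valid_bit_offset = 0 and num_values = 1, `CountValidSlots` returns 0, which matches the expectation that the call carries no actual values.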

      I believe the problem lies with

        void MaybeCalculateValidityBits(
          const int16_t* def_levels,
          int64_t batch_size,
          int64_t* out_values_to_write,
          int64_t* out_spaced_values_to_write,
          int64_t* null_count) {
          if (bits_buffer_ == nullptr) {
            if (!level_info_.HasNullableValues()) {
              *out_values_to_write = batch_size;
              *out_spaced_values_to_write = batch_size;
              *null_count = 0;
            } else {
              for (int x = 0; x < batch_size; x++) {
                *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
                *out_spaced_values_to_write +=
                    def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
              }
              *null_count = *out_values_to_write - *out_spaced_values_to_write;
            }
            return;
          }
      
          // ...
        }
      

      In particular, level_info_.HasNullableValues() returns false because the array elements cannot be null. My understanding is that taking this shortcut is wrong here, since the arrays themselves are nullable.
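
The discrepancy can be reproduced in isolation with a small model of the branch taken when no validity buffer is present. This is a sketch under stated assumptions, not the library's implementation: `LevelInfoModel` and both functions are hypothetical, the field names are borrowed from the snippet above, and the body of `HasNullableValues()` is a guess at how the library defines it.

```cpp
#include <cstdint>

// Simplified stand-in for the writer's level information.
struct LevelInfoModel {
  int16_t def_level;                    // maximum definition level of the leaf
  int16_t repeated_ancestor_def_level;  // level at which the closest repeated
                                        // ancestor holds an element
  // Assumption: the leaf is nullable only if it adds a definition level
  // beyond its closest repeated ancestor.
  bool HasNullableValues() const {
    return repeated_ancestor_def_level < def_level;
  }
};

// Counts a value only where the definition level reaches the maximum.
int64_t ValuesToWriteByDefLevel(const LevelInfoModel& info,
                                const int16_t* def_levels,
                                int64_t batch_size) {
  int64_t count = 0;
  for (int64_t i = 0; i < batch_size; ++i) {
    if (def_levels[i] == info.def_level) ++count;
  }
  return count;
}

// Mirrors the shortcut in the snippet above: a non-nullable leaf is
// assumed to have one value per level entry.
int64_t ValuesToWriteWithShortcut(const LevelInfoModel& info,
                                  const int16_t* def_levels,
                                  int64_t batch_size) {
  if (!info.HasNullableValues()) return batch_size;
  return ValuesToWriteByDefLevel(info, def_levels, batch_size);
}
```

For the schema in this report, assuming def_level = 2 and repeated_ancestor_def_level = 2, a batch with def_levels = {0} yields 1 from `ValuesToWriteWithShortcut` but 0 from `ValuesToWriteByDefLevel`. The first result would explain the attempt to read one value from a null values pointer.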

      This code appears to have been introduced by ARROW-9603.

          People

            emkornfield@gmail.com Micah Kornfield
            GPSnoopy Tanguy Fautre
