PARQUET-1936

WriteBatchSpaced writes incorrect value for parquet when input contains NULL list


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: parquet-cpp

    Description

      When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced either produces a file that parquet-tools fails to read (case 1 below) or writes the first value of the first list in place of the last value of the last list (case 2 below).

      CASE 1
      Data (3 lists):
      [
         "one"
      ]
      null
      [
         "two"
      ]
       
      Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

      1. num_values: 3
      2. def_levels: [3, 0, 3]
      3. rep_levels: [0, 0, 0]
      4. valid_bits: 0x05 (bit representation 101)
      5. valid_bits_offset: 0
      6. values: ["one", nullptr, "two"]
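
      The definition and repetition levels listed above can be derived from the list data with a small sketch. It assumes a three-level list schema (optional list, repeated group, optional BYTE_ARRAY element), which is consistent with a maximum definition level of 3 but is not confirmed here because the schema is only available as an attachment. ListRow, Levels, and ComputeLevels are hypothetical names for illustration, not part of the Parquet API, and null elements inside a list are not modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row type: one list of strings, or a NULL list.
struct ListRow {
  bool is_null;
  std::vector<std::string> items;
};

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

// Assumed level meanings: 0 = NULL list, 1 = empty list, 3 = present element.
Levels ComputeLevels(const std::vector<ListRow>& rows) {
  Levels out;
  for (const ListRow& row : rows) {
    if (row.is_null) {
      // A NULL list still contributes one level entry.
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
    } else if (row.items.empty()) {
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
    } else {
      for (std::size_t i = 0; i < row.items.size(); ++i) {
        out.def_levels.push_back(3);
        // The first element starts a new list (0); later ones continue it (1).
        out.rep_levels.push_back(i == 0 ? 0 : 1);
      }
    }
  }
  return out;
}
```

      For the three lists in this case (["one"], null, ["two"]), the sketch yields def_levels [3, 0, 3] and rep_levels [0, 0, 0], matching the parameters above.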

      When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values) with these parameters, parquet-tools reports an error on the resulting Parquet file (the screenshot of the error is not reproduced here).

      Additionally, if I add another list into the data that I write, then the last element of that additional list is incorrectly written as the first element of the first list. See below.
       
      CASE 2
      Data (4 lists):
      [
         "one"
      ]
      null
      [
         "two"
      ]
      [
         "three",
         "four"
      ]
       
      Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

      1. num_values: 5
      2. def_levels: [3, 0, 3, 3, 3]
      3. rep_levels: [0, 0, 0, 0, 1]
      4. valid_bits: 0x1D (decimal 29, bit representation 11101)
      5. valid_bits_offset: 0
      6. values: ["one", nullptr, "two", "three", "four"]
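
      The valid_bits values in both cases can be cross-checked with a short sketch. It assumes the bitmap is packed least-significant bit first and that a slot counts as valid only when its definition level equals the maximum (3 here). BuildValidBits is a hypothetical helper for illustration, not part of the Parquet API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs one validity bit per definition level,
// least-significant bit first. A bit is set when the level equals
// max_def_level, i.e. when the slot holds a real value.
std::vector<uint8_t> BuildValidBits(const std::vector<int16_t>& def_levels,
                                    int16_t max_def_level) {
  std::vector<uint8_t> bits((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

      Under these assumptions, def_levels [3, 0, 3] gives the single byte 0x05 (bits 101), and def_levels [3, 0, 3, 3, 3] gives 0x1D (bits 11101, decimal 29).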

      In the output Parquet file (screenshot not reproduced here), the "four" in the last list shows up as "one".

      Attachments

        1. NULL list 1.png
          71 kB
          Ruta Dhaneshwar
        2. NULL list 3.png
          69 kB
          Ruta Dhaneshwar
        3. NULL list 4.png
          80 kB
          Ruta Dhaneshwar
        4. NULL list 2.png
          772 kB
          Ruta Dhaneshwar
        5. schema.png
          65 kB
          Ruta Dhaneshwar


          People

            Assignee: Unassigned
            Reporter: Ruta Dhaneshwar
