PARQUET-1936

WriteBatchSpaced writes incorrect value for parquet when input contains NULL list

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-cpp
    • Labels:
      None

      Description

      When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced either produces a file that cannot be read back without an error (case 1 below) or writes a wrong value: the last element of the last list comes out as the first element of the first list (case 2 below).

      CASE 1
      Data (3 lists):
      [
         "one"
      ]
      null
      [
         "two"
      ]
       
      Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

      1. num_values: 3
      2. def_levels: [3, 0, 3]
      3. rep_levels: [0, 0, 0]
      4. valid_bits: 0x05 (bit representation 101)
      5. valid_bits_offset: 0
      6. values: ["one", nullptr, "two"]

      When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values) with these parameters, running parquet-tools on the resulting Parquet file reports an error (see the attached screenshots).

      Additionally, if I append another list to the data, the last element of that list is written incorrectly: it comes out as the first element of the first list. See below.
       
      CASE 2
      Data (4 lists):
      [
         "one"
      ]
      null
      [
         "two"
      ]
      [
         "three",
         "four"
      ]
       
      Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

      1. num_values: 5
      2. def_levels: [3, 0, 3, 3, 3]
      3. rep_levels: [0, 0, 0, 0, 1]
      4. valid_bits: 0x1D (decimal 29, bit representation 11101)
      5. valid_bits_offset: 0
      6. values: ["one", nullptr, "two", "three", "four"]

      Output Parquet file (see the attached screenshots):
       
      Here we see that the "four" in the last list actually shows up as "one". 

        Attachments

        1. NULL list 1.png
          71 kB
          Ruta Dhaneshwar
        2. NULL list 2.png
          772 kB
          Ruta Dhaneshwar
        3. NULL list 3.png
          69 kB
          Ruta Dhaneshwar
        4. NULL list 4.png
          80 kB
          Ruta Dhaneshwar
        5. schema.png
          65 kB
          Ruta Dhaneshwar

            People

            • Assignee:
              Unassigned
            • Reporter:
              Ruta Dhaneshwar