Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced either produces a file that parquet-tools fails to read (case 1 below) or writes the wrong data: the last value of the last list comes out as the first value of the first list (case 2 below).
CASE 1
Data (3 lists):
[
"one"
]
null
[
"two"
]
Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
- num_values: 3
- def_levels: [3, 0, 3]
- rep_levels: [0, 0, 0]
- valid_bits: 0x05 (bit representation 101)
- valid_bits_offset: 0
- values: ["one", nullptr, "two"]
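For reference, the valid_bits value above can be reproduced with a small standalone sketch. It assumes the bitmap is packed least-significant bit first (slot i maps to bit i % 8 of byte i / 8), which matches the 0x05 shown for [valid, null, valid]. BuildValidBits is a hypothetical helper written for this illustration, not part of the Parquet API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not a Parquet function): packs one validity bit per
// slot, least-significant bit first. A set bit means the slot holds a value.
std::vector<uint8_t> BuildValidBits(
    const std::vector<std::optional<std::string>>& slots) {
  std::vector<uint8_t> bits((slots.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (slots[i].has_value()) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

Calling BuildValidBits on the three slots "one", null, "two" sets bits 0 and 2, giving a single byte with value 0x05.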
When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get an error when running parquet-tools on the resulting Parquet file:
Additionally, if I add another list to the data, the write succeeds but the last element of that additional list comes out with the value of the first element of the first list. See below.
CASE 2
Data (4 lists):
[
"one"
]
null
[
"two"
]
[
"three",
"four"
]
Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:
- num_values: 5
- def_levels: [3, 0, 3, 3, 3]
- rep_levels: [0, 0, 0, 0, 1]
- valid_bits: 0x1D (decimal 29, bit representation 11101)
- valid_bits_offset: 0
- values: ["one", nullptr, "two", "three", "four"]
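The def_levels and rep_levels above are consistent with the standard three-level Parquet list layout, sketched below. This is an illustration only: it assumes a schema of an optional list containing a repeated group with an optional element (maximum definition level 3), and ComputeLevels, Levels, Element, and List are hypothetical names, not Parquet types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Element = std::optional<std::string>;         // nullable list element
using List = std::optional<std::vector<Element>>;   // nullable list

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

// Assumed schema: optional list -> repeated group -> optional element.
// Definition levels: 0 = null list, 1 = empty list, 2 = null element,
// 3 = present element. Repetition level 0 starts a new row; 1 continues
// the current list.
Levels ComputeLevels(const std::vector<List>& rows) {
  Levels out;
  for (const List& row : rows) {
    if (!row.has_value()) {          // null list: one entry at def level 0
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
      continue;
    }
    if (row->empty()) {              // empty list: one entry at def level 1
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
      continue;
    }
    for (std::size_t i = 0; i < row->size(); ++i) {
      out.def_levels.push_back((*row)[i].has_value() ? 3 : 2);
      out.rep_levels.push_back(i == 0 ? 0 : 1);
    }
  }
  return out;
}
```

Under these assumptions, the four lists in this case yield def_levels [3, 0, 3, 3, 3] and rep_levels [0, 0, 0, 0, 1], matching the parameters listed above.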
Output Parquet file:
Here we see that the "four" in the last list actually shows up as "one".