Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 0.16.0, 0.17.0
- None
Description
IpcFormatWriter assigns dictionary IDs once, when it writes the schema message. When it later writes dictionary batches, it re-collects the dictionaries from the given batch and assigns dictionary IDs a second time. For example, with 5 dictionaries, the first dictionary is assigned ID 0 in the schema but is written with ID 5 in its dictionary batch.
The following test fails with "'_error_or_value11.status()' failed with Key error: No record of dictionary type with id 9":
TEST_F(TestMetadata, DoPutDictionaries) {
  ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
  std::shared_ptr<Schema> schema = ExampleDictSchema();
  BatchVector expected_batches;
  ASSERT_OK(ExampleDictBatches(&expected_batches));
  ASSERT_OK_AND_ASSIGN(auto writer, arrow::ipc::NewStreamWriter(sink.get(), schema));
  for (auto& batch : expected_batches) {
    ASSERT_OK(writer->WriteRecordBatch(*batch));
  }
  ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
  arrow::io::BufferReader source(buf);
  ASSERT_OK_AND_ASSIGN(auto reader, arrow::ipc::RecordBatchStreamReader::Open(&source));
  AssertSchemaEqual(schema, reader->schema());
  for (auto& batch : expected_batches) {
    ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
    AssertBatchesEqual(*actual, *batch);
  }
}
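The double assignment described above can be illustrated with a small toy model that does not depend on Arrow. Everything in it (`ToyIdAssigner`, `CollectFresh`, `CollectMemoized`, `DemoDoubleAssignment`, the field names `d0` to `d4`) is hypothetical and only mimics the bookkeeping. It is not the writer's actual code, and the memoized variant is one possible remedy, not necessarily the fix that was applied.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the writer's dictionary ID bookkeeping.
class ToyIdAssigner {
 public:
  // Models the reported behaviour: every collection pass hands out
  // fresh IDs, even for dictionaries that were already seen.
  std::vector<int64_t> CollectFresh(const std::vector<std::string>& fields) {
    std::vector<int64_t> ids;
    for (const auto& name : fields) {
      (void)name;  // the field identity is ignored, which is the problem
      ids.push_back(next_id_++);
    }
    return ids;
  }

  // One possible remedy (an assumption): remember the ID given to each
  // field and reuse it on later passes.
  std::vector<int64_t> CollectMemoized(const std::vector<std::string>& fields) {
    std::vector<int64_t> ids;
    for (const auto& name : fields) {
      auto it = known_.find(name);
      if (it == known_.end()) {
        it = known_.emplace(name, next_id_++).first;
      }
      ids.push_back(it->second);
    }
    return ids;
  }

 private:
  int64_t next_id_ = 0;
  std::unordered_map<std::string, int64_t> known_;
};

// Walks through the "5 dictionaries" case from the description.
inline void DemoDoubleAssignment() {
  const std::vector<std::string> fields = {"d0", "d1", "d2", "d3", "d4"};

  ToyIdAssigner buggy;
  const auto schema_ids = buggy.CollectFresh(fields);  // schema pass
  const auto batch_ids = buggy.CollectFresh(fields);   // dictionary batch pass
  assert(schema_ids.front() == 0);  // ID recorded in the schema
  assert(batch_ids.front() == 5);   // ID written with the dictionary batch

  ToyIdAssigner memoized;
  const auto first_pass = memoized.CollectMemoized(fields);
  const auto second_pass = memoized.CollectMemoized(fields);
  assert(first_pass == second_pass);  // IDs are stable across passes
}
```

Calling `DemoDoubleAssignment()` from a `main` function exercises both variants. In the toy model, a reader that only knows the IDs from the schema pass (0 through 4) would find no match for the IDs from the second pass (5 through 9), which is the kind of mismatch the key error above reports.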
Issue Links
- duplicates: ARROW-9660 [C++] IPC - dictionaries in maps (Resolved)