Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 1.0.0
Description
I created the following record batch, which has a single column of type map<dict, string>, where dict is defined as dict<int8, string>:
    arrow::MapBuilder map_builder(arrow::default_memory_pool(),
                                  std::make_shared<arrow::StringDictionaryBuilder>(),
                                  std::make_shared<arrow::StringBuilder>());
    auto key_builder = dynamic_cast<arrow::StringDictionaryBuilder *>(map_builder.key_builder());
    auto item_builder = dynamic_cast<arrow::StringBuilder *>(map_builder.item_builder());

    // Add a first row with k<i>=v<i> for i 0..14;
    ASSERT_OK(map_builder.Append());
    for (int i = 0; i < 15; ++i) {
      ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
      ASSERT_OK(item_builder->Append("v" + std::to_string(i)));
    }

    // Add a second row with k<i>=w<i> for i 0..14;
    ASSERT_OK(map_builder.Append());
    for (int i = 0; i < 15; ++i) {
      ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
      ASSERT_OK(item_builder->Append("w" + std::to_string(i)));
    }

    std::shared_ptr<arrow::Array> array;
    ASSERT_OK(map_builder.Finish(&array));
    std::shared_ptr<arrow::Schema> schema = arrow::schema({arrow::field("s", array->type())});
    std::shared_ptr<arrow::RecordBatch> batch = arrow::RecordBatch::Make(schema, array->length(), {array});
When this batch is sent through an IPC round trip, the writer does the following:
- On IpcFormatWriter::Start(): The memo records one entry in field_to_id_ and id_to_type_, with dict id = 0.
- On IpcFormatWriter::CollectDictionaries: The memo records a new entry in field_to_id_ and id_to_type_ with id = 1, and also records that id in id_to_dictionary_. At this point there are 2 entries, and the entry with id = 0 has no associated dict.
- On IpcFormatWriter::WriteDictionaries: It writes the dict with id = 1.
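The writer-side state described above can be replayed on a toy model. This is only a sketch: ToyMemo, IdsWithoutDictionary and ReplayWriterSteps are hypothetical names, the map types are simplified assumptions rather than Arrow's real data structures, and the field labels are made up for illustration. It shows a consistency check that flags any dict id that has a type but no dictionary values.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the writer-side dictionary memo. The member names echo the
// ones mentioned in the report; the key and value types are assumptions.
struct ToyMemo {
  std::map<std::string, std::int64_t> field_to_id;  // field label -> dict id
  std::map<std::int64_t, std::string> id_to_type;   // dict id -> value type name
  std::map<std::int64_t, std::vector<std::string>> id_to_dictionary;  // dict id -> values
};

// Returns every id that has a registered type but no dictionary values.
std::vector<std::int64_t> IdsWithoutDictionary(const ToyMemo& memo) {
  std::vector<std::int64_t> missing;
  for (const auto& entry : memo.id_to_type) {
    if (memo.id_to_dictionary.count(entry.first) == 0) {
      missing.push_back(entry.first);
    }
  }
  return missing;
}

// Replays the writer steps from the report on the toy memo.
ToyMemo ReplayWriterSteps() {
  ToyMemo memo;
  // Start(): the schema walk registers the map key field under id 0.
  memo.field_to_id["map key (schema)"] = 0;
  memo.id_to_type[0] = "string";
  // CollectDictionaries: a second entry appears under id 1, this time
  // together with its dictionary values (shortened here for brevity).
  memo.field_to_id["map key (batch)"] = 1;
  memo.id_to_type[1] = "string";
  memo.id_to_dictionary[1] = {"k0", "k1", "k2"};
  return memo;
}
```

Calling IdsWithoutDictionary(ReplayWriterSteps()) yields a single id, 0, which is the orphaned entry that the schema advertises but that never receives a dictionary batch.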
When reading:
- GetSchema eventually gets to the nested dictionary in FieldFromFlatBuffer
- The recovered dict id is 0.
- This adds to the memo the field_to_id and id_to_type with id = 0
- My round trip code calls "ReadAll".
- RecordBatchStreamReaderImpl::ReadNext attempts to load the initial dicts
- It recovers id = 1
- The process aborts because id = 1 is not in the memo. The failing call is dictionary_memo->GetDictionaryType(id, &value_type)
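The reader-side failure can be sketched the same way. The names below (ToyTypeMemo, LookupDictionaryType, ReaderCanResolve) are hypothetical and are not Arrow's API. The lookup only imitates the shape of the GetDictionaryType(id, &value_type) call quoted above and returns false where the real process aborts.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for the reader-side memo: dict id -> value type name.
using ToyTypeMemo = std::map<std::int64_t, std::string>;

// Looks up the value type registered for a dictionary id.
// Returns false and leaves *value_type untouched when the id is unknown.
bool LookupDictionaryType(const ToyTypeMemo& memo, std::int64_t id,
                          std::string* value_type) {
  auto it = memo.find(id);
  if (it == memo.end()) {
    return false;
  }
  *value_type = it->second;
  return true;
}

// Replays the read path: the schema registers schema_id, then a dictionary
// batch arrives carrying batch_id. Returns whether the batch can be resolved.
bool ReaderCanResolve(std::int64_t schema_id, std::int64_t batch_id) {
  ToyTypeMemo memo;
  memo[schema_id] = "string";  // done while reconstructing the schema
  std::string value_type;
  return LookupDictionaryType(memo, batch_id, &value_type);
}
```

With the ids from this report, ReaderCanResolve(0, 1) is false, while ReaderCanResolve(0, 0) is true, which matches the struct case where only id = 0 was used.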
A similar example with a dict inside a "struct" worked fine and only used dict id = 0. So it looks like something goes wrong when gathering the schema for the map, unless I did not construct the map correctly?
Issue Links
- is duplicated by ARROW-8749 [C++] IpcFormatWriter writes dictionary batches with wrong ID (Resolved)
- links to