Apache Arrow / ARROW-9660

[C++] IPC - dictionaries in maps

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 2.0.0
    • Component/s: C++

    Description

      I created the following record batch which has a single column with a type of map<dict, string> where dict is defined as: dict<int8,string>:

       

      arrow::MapBuilder map_builder(arrow::default_memory_pool(),
          std::make_shared<arrow::StringDictionaryBuilder>(),
          std::make_shared<arrow::StringBuilder>());
      auto key_builder = 
          dynamic_cast<arrow::StringDictionaryBuilder *>(map_builder.key_builder());
      auto item_builder = 
          dynamic_cast<arrow::StringBuilder *>(map_builder.item_builder());
      
      // Add a first row with k<i>=v<i> for i 0..14;
      ASSERT_OK(map_builder.Append());
      for (int i = 0; i < 15; ++i) {
        ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
        ASSERT_OK(item_builder->Append("v" + std::to_string(i)));
      }
      // Add a second row with k<i>=w<i> for i 0..14;
      ASSERT_OK(map_builder.Append());
      for (int i = 0; i < 15; ++i) {
        ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
        ASSERT_OK(item_builder->Append("w" + std::to_string(i)));
      }
      std::shared_ptr<arrow::Array> array;
      ASSERT_OK(map_builder.Finish(&array));
      std::shared_ptr<arrow::Schema> schema = 
          arrow::schema({arrow::field("s", array->type())});
      std::shared_ptr<arrow::RecordBatch> batch = 
          arrow::RecordBatch::Make(schema, array->length(), {array});
      

      When this batch is sent through an IPC round trip, the writer does the following:

      1. On IpcFormatWriter::Start(): The memo records one entry for field_to_id and id_to_type_ where the dict id = 0.
      2. On IpcFormatWriter::CollectDictionaries: The memo records a new entry for field_to_id and id_to_type with id = 1 and also records it in id_to_dictionary_. At this point there are 2 entries, and the entry with id = 0 has no associated dict.
      3. On IpcFormatWriter::WriteDictionaries: It writes the dict with id = 1.

      When reading:

      1. GetSchema eventually gets to the nested dictionary in FieldFromFlatBuffer
      2. The recovered dict id is 0.
      3. This adds to the memo the field_to_id and id_to_type with id = 0
      4. My round trip code calls "ReadAll".
      5. RecordBatchStreamReaderImpl::ReadNext attempts to load the initial dicts
      6. It recovers id = 1
      7. The process aborts because id = 1 is not in the memo: dictionary_memo->GetDictionaryType(id, &value_type)

      A similar example with a dict inside a "struct" worked fine and only used dict id = 0. So it looks like something goes wrong when gathering the schema for the map, unless I did not construct the map correctly?
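
      The id mismatch described in the write and read steps can be illustrated with a toy model. This is a minimal sketch using only the standard library, not Arrow's code: ToyMemo, RoundTripResult, SimulateRoundTrip, DemonstrateMismatch and the reuse_schema_id flag are hypothetical names introduced here, and the model assumes ids are simply assigned in registration order. It only shows why a reader that registered id 0 from the schema cannot resolve a dictionary written with id 1; it says nothing about how the actual fix was implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for a dictionary memo. NOT Arrow's DictionaryMemo:
// it only tracks which dictionary ids have a registered value type.
class ToyMemo {
 public:
  // Register a new dictionary field and return the next free id.
  int64_t AddField(const std::string& value_type) {
    const int64_t id = static_cast<int64_t>(id_to_type_.size());
    id_to_type_[id] = value_type;
    return id;
  }

  // Register a field under an id recovered from a serialized schema.
  void AddFieldWithId(int64_t id, const std::string& value_type) {
    id_to_type_[id] = value_type;
  }

  // True when a value type is known for this dictionary id.
  bool HasType(int64_t id) const { return id_to_type_.count(id) != 0; }

 private:
  std::map<int64_t, std::string> id_to_type_;
};

struct RoundTripResult {
  int64_t schema_id;       // id carried by the schema message
  int64_t dictionary_id;   // id carried by the dictionary message
  bool reader_found_type;  // could the reader resolve dictionary_id?
};

// Hypothetical model of the round trip. With reuse_schema_id == false the
// writer registers the same field twice, as in the reported write steps.
RoundTripResult SimulateRoundTrip(bool reuse_schema_id) {
  ToyMemo writer;
  const int64_t schema_id = writer.AddField("string");  // schema time
  const int64_t dictionary_id =
      reuse_schema_id ? schema_id : writer.AddField("string");  // collection time

  ToyMemo reader;
  reader.AddFieldWithId(schema_id, "string");  // reader only sees the schema id
  const bool found = reader.HasType(dictionary_id);
  return {schema_id, dictionary_id, found};
}

// Usage: walks both paths and checks the outcome described in the report.
void DemonstrateMismatch() {
  const RoundTripResult buggy = SimulateRoundTrip(/*reuse_schema_id=*/false);
  assert(buggy.schema_id == 0 && buggy.dictionary_id == 1);
  assert(!buggy.reader_found_type);  // the reader never registered id 1

  const RoundTripResult consistent = SimulateRoundTrip(/*reuse_schema_id=*/true);
  assert(consistent.reader_found_type);
}
```

      In this model the lookup fails exactly when the id written with the dictionary differs from the id written in the schema, which matches the abort at GetDictionaryType in reading step 7.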

       

            People

              Assignee: Antoine Pitrou (apitrou)
              Reporter: Pierre Belzile (belzilep)
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 20m