Apache Arrow / ARROW-8791

[Rust] Creating StringDictionaryBuilder with existing dictionary values

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 1.0.0
    • Component/s: Rust

    Description

      It might be useful to create a DictionaryArray that uses the same dictionary values, and therefore the same key assignments, as another array. One use case would be more efficient comparison between arrays when it is known that they share a dictionary, since the keys can then be compared directly. Another could be more efficient grouping operations across multiple chunks (i.e. a `Vec<DictionaryArray>`).
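To make the two use cases concrete, here is a minimal std-only sketch. `DictArray`, `eq_shared_dictionary`, `eq_by_value`, and `count_per_value` are hypothetical names invented for illustration; they are not part of the arrow crate. The sketch assumes the shared dictionary contains no duplicate values.

```rust
/// Hypothetical, simplified stand-in for a dictionary-encoded string array:
/// `keys[i]` is an index into `values`. This is NOT arrow's `DictionaryArray`.
struct DictArray<'a> {
    keys: Vec<u8>,
    values: &'a [&'a str],
}

/// Element-wise equality when both arrays are known to share one dictionary
/// (assumed to hold no duplicate values): comparing the integer keys suffices.
fn eq_shared_dictionary(a: &DictArray<'_>, b: &DictArray<'_>) -> Vec<bool> {
    a.keys.iter().zip(b.keys.iter()).map(|(x, y)| x == y).collect()
}

/// Generic fallback that resolves every key to its string before comparing.
fn eq_by_value(a: &DictArray<'_>, b: &DictArray<'_>) -> Vec<bool> {
    a.keys
        .iter()
        .zip(b.keys.iter())
        .map(|(x, y)| a.values[*x as usize] == b.values[*y as usize])
        .collect()
}

/// Grouping across chunks: with a shared dictionary the key can index
/// directly into one table of counters, with no per-chunk remapping.
fn count_per_value(chunks: &[DictArray<'_>], dict_len: usize) -> Vec<usize> {
    let mut counts = vec![0usize; dict_len];
    for chunk in chunks {
        for &k in &chunk.keys {
            counts[k as usize] += 1;
        }
    }
    counts
}
```

Without a shared dictionary, the same key could mean different strings in different chunks, so both operations would have to fall back to comparing or hashing the string values.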

       

      A possible implementation could look like this:

       

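      impl<K> StringDictionaryBuilder<K>
      where
          K: ArrowDictionaryKeyType,
      {
          /// Creates a builder whose dictionary is pre-populated with
          /// `dictionary_values`, so that the value at index `i` gets key `i`.
          pub fn new_with_dictionary(
              keys_builder: PrimitiveBuilder<K>,
              dictionary_values: &StringArray,
          ) -> Result<Self> {
              let dict_len = dictionary_values.len();
              let mut values_builder = StringBuilder::with_capacity(
                  dict_len,
                  dictionary_values.value_data().len(),
              );
              // Lookup table from the value's bytes to its dictionary key.
              let mut map: HashMap<Box<[u8]>, K::Native> =
                  HashMap::with_capacity(dict_len);
              for i in 0..dict_len {
                  if dictionary_values.is_valid(i) {
                      let value = dictionary_values.value(i);
                      map.insert(
                          value.as_bytes().into(),
                          // Fails if the dictionary is too large for the key type.
                          K::Native::from_usize(i)
                              .ok_or(ArrowError::DictionaryKeyOverflowError)?,
                      );
                      // Assumes the builder's append methods return `Result`,
                      // so errors are propagated instead of silently dropped.
                      values_builder.append_value(value)?;
                  } else {
                      // Keep null slots so the remaining indices stay aligned.
                      values_builder.append_null()?;
                  }
              }
              Ok(Self {
                  keys_builder,
                  values_builder,
                  map,
              })
          }
      }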
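The intended behaviour of such a seeded builder can be sketched with the standard library alone. `SimpleDictBuilder` below is a hypothetical model, not the arrow builder: it ignores nulls and key-type overflow, and it uses `String` keys in its map rather than `Box<[u8]>`.

```rust
use std::collections::HashMap;

/// Hypothetical std-only model of a string dictionary builder.
struct SimpleDictBuilder {
    keys: Vec<usize>,
    values: Vec<String>,
    map: HashMap<String, usize>,
}

impl SimpleDictBuilder {
    /// Seeds the builder so that each existing value keeps its index as key.
    /// If a value occurs twice, the later index wins in the lookup map.
    fn new_with_dictionary(dictionary_values: &[&str]) -> Self {
        let mut values = Vec::with_capacity(dictionary_values.len());
        let mut map = HashMap::with_capacity(dictionary_values.len());
        for (i, v) in dictionary_values.iter().enumerate() {
            map.insert(v.to_string(), i);
            values.push(v.to_string());
        }
        Self { keys: Vec::new(), values, map }
    }

    /// Appends a value and returns its key; unseen values extend the dictionary.
    fn append(&mut self, value: &str) -> usize {
        let key = match self.map.get(value) {
            Some(&k) => k,
            None => {
                let k = self.values.len();
                self.values.push(value.to_string());
                self.map.insert(value.to_string(), k);
                k
            }
        };
        self.keys.push(key);
        key
    }
}
```

In this model, appending a value that is already in the seeded dictionary reuses its original index, while a new value is added at the end, so keys produced against the seeded part stay comparable with the array the dictionary came from.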

      I don't really like that the map has to be reconstructed here. Passing in the HashMap directly might be more efficient, but exposing the `Box<[u8]>` encoding of its keys is probably not a good idea.
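One way to explore that trade-off, purely as an illustration, is to wrap the map in an opaque type so that callers can build it once and hand it around without ever seeing the `Box<[u8]>` keys. `DictionaryIndex`, `from_values`, and `key_of` are hypothetical names and not an existing arrow API; whether cloning such a table is actually cheaper than rebuilding it would need to be measured.

```rust
use std::collections::HashMap;

/// Hypothetical opaque lookup table from string values to dictionary keys.
/// The field is private, so the `Box<[u8]>` key encoding is not exposed.
#[derive(Clone)]
pub struct DictionaryIndex {
    map: HashMap<Box<[u8]>, usize>,
}

impl DictionaryIndex {
    /// Builds the table once from the dictionary values.
    pub fn from_values(values: &[&str]) -> Self {
        let mut map: HashMap<Box<[u8]>, usize> =
            HashMap::with_capacity(values.len());
        for (i, v) in values.iter().enumerate() {
            map.insert(v.as_bytes().into(), i);
        }
        Self { map }
    }

    /// Looks up the key for a value without revealing how keys are stored.
    pub fn key_of(&self, value: &str) -> Option<usize> {
        self.map.get(value.as_bytes()).copied()
    }
}
```

A builder constructor could then accept a `DictionaryIndex` by value (for example a clone of one built earlier) instead of a raw `HashMap`, which would keep the internal representation free to change.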


            People

              Assignee: Jörn Horstmann (jhorstmann)
              Reporter: Jörn Horstmann (jhorstmann)
              Votes: 0
              Watchers: 3


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 20m
