  1. Apache Arrow
  2. ARROW-2176

[C++] Extend DictionaryBuilder to support delta dictionaries



    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: C++


      The IPC format allows sending additional dictionary batches with a previously seen id and an isDelta flag, which extend the existing dictionary with new entries. Right now, the DictionaryBuilder (as well as the IPC writer and reader) does not support generating delta dictionaries.
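To illustrate the reader-side semantics described above, here is a minimal sketch in plain C++ that does not use the Arrow library. `DictionaryBatch` and `DictionaryMemo` are hypothetical names made up for this example, not Arrow's actual types. The sketch also assumes that a non-delta batch simply redefines the dictionary for its id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an IPC dictionary batch; not Arrow's real type.
struct DictionaryBatch {
  int64_t id;                       // dictionary id referenced by the schema
  bool is_delta;                    // mirrors the isDelta flag of the format
  std::vector<std::string> values;  // dictionary entries carried by the batch
};

// Hypothetical reader-side store: dictionary id -> accumulated entries.
class DictionaryMemo {
 public:
  void Apply(const DictionaryBatch& batch) {
    std::vector<std::string>& dict = dicts_[batch.id];
    if (!batch.is_delta) {
      // Assumption of this sketch: a non-delta batch (re)defines the dictionary.
      dict.clear();
    }
    // A delta batch only appends, so indices handed out earlier stay valid.
    dict.insert(dict.end(), batch.values.begin(), batch.values.end());
  }

  const std::string& Lookup(int64_t id, int32_t index) const {
    const std::vector<std::string>& dict = dicts_.at(id);
    assert(index >= 0 && static_cast<std::size_t>(index) < dict.size());
    return dict[static_cast<std::size_t>(index)];
  }

  std::size_t Size(int64_t id) const { return dicts_.at(id).size(); }

 private:
  std::unordered_map<int64_t, std::vector<std::string>> dicts_;
};
```

Applying a batch `{0, false, {"a", "b"}}` followed by `{0, true, {"c"}}` leaves dictionary 0 with three entries, and index 2 resolves to the entry added by the delta.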

      This pull request contains a basic implementation of the DictionaryBuilder with delta dictionary support. The API usage can be seen in the dictionary tests. The basic idea is that the user simply reuses the builder object after calling Finish(Array*) for the first time. Subsequent calls to Append create new entries only for unseen elements and reuse the ids from previous dictionaries for elements that were already seen.
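The reuse pattern described above can be sketched with a toy builder in plain C++. This is not the code from the pull request: `ToyDeltaDictionaryBuilder` and `FinishedChunk` are hypothetical names, and strings in a `std::vector` stand in for Arrow arrays. It only models the described behaviour, in which the first `Finish()` returns the full dictionary and later calls return just the new entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Result of one Finish() call in this toy model.
struct FinishedChunk {
  std::vector<std::string> dictionary;  // full dictionary first, deltas later
  std::vector<int32_t> indices;         // ids into the cumulative dictionary
};

// Hypothetical toy builder; it only mimics the behaviour described in the text.
class ToyDeltaDictionaryBuilder {
 public:
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next global id and remember it as pending.
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_.push_back(value);
    }
    indices_.push_back(it->second);
  }

  // Hands out everything appended since the last Finish(). The value -> id
  // map is kept, so later Append calls reuse ids from earlier dictionaries.
  FinishedChunk Finish() {
    FinishedChunk out{std::move(pending_), std::move(indices_)};
    pending_.clear();  // leave the moved-from vectors in a known empty state
    indices_.clear();
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;  // survives across Finish()
  std::vector<std::string> pending_;              // entries not yet emitted
  std::vector<int32_t> indices_;                  // indices not yet emitted
};
```

Appending "a", "b", "a" and finishing yields the dictionary ["a", "b"] with indices [0, 1, 0]. Appending "b", "c" to the same builder and finishing again yields the delta ["c"] with indices [1, 2].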

      Some considerations:

      1. The API is pretty implicit. An additional flag for Finish that explicitly signals the intent to use the builder for delta dictionary generation might help avoid errors.
      2. Right now the implementation uses an additional "overflow dictionary" to store the seen items. This adds a copy on each Finish call and an additional lookup on each GetItem or Append call. I assume we might get away with returning Array slices from Finish, which would remove the need for the overflow dictionary. If the gist of the PR is approved, I can look into further optimizations.
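The slice idea from the second point could look roughly like the following sketch. It assumes a single cumulative value store, so each `Finish()` only reports the range of entries added since the previous call instead of copying them into a second dictionary. `SlicingDictionaryBuilder` and its `Slice` type are hypothetical and only loosely analogous to Arrow array slices.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical builder that keeps one growing dictionary and exposes deltas
// as (offset, length) ranges instead of copying into an overflow dictionary.
class SlicingDictionaryBuilder {
 public:
  // Range [offset, offset + length) of the cumulative dictionary.
  struct Slice {
    std::size_t offset;
    std::size_t length;
  };

  // Returns the id of `value`, inserting it if it has not been seen before.
  int32_t Append(const std::string& value) {
    auto it = ids_.find(value);  // one lookup covers old and new entries
    if (it != ids_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    ids_.emplace(value, id);
    return id;
  }

  // Reports the entries added since the previous Finish() without copying.
  Slice Finish() {
    const Slice delta{delta_start_, values_.size() - delta_start_};
    delta_start_ = values_.size();
    return delta;
  }

  // Bounds-checked access into the cumulative dictionary.
  const std::string& Value(std::size_t i) const { return values_.at(i); }

 private:
  std::unordered_map<std::string, int32_t> ids_;
  std::vector<std::string> values_;  // cumulative dictionary, append-only
  std::size_t delta_start_ = 0;      // first entry not yet reported
};
```

In this model the first `Finish()` covers the whole dictionary, since nothing has been reported yet, and every later call covers only the new tail.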

      The Writer and Reader extensions would be pretty simple, since the DictionaryBuilder API remains basically the same. 


              alendit Dimitri Vorona