Details
Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Description
We use a hash table to extract unique values and dictionary indices. There may be an opportunity to consolidate common code with the dictionary encoding implemented in parquet-cpp (though the dictionary indices will not be run-length encoded in Arrow):
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h
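A minimal sketch of the hash-table approach, assuming std::string values and int32 indices for illustration (this is not the parquet-cpp or Arrow implementation): each value is looked up in a memo table, unseen values are appended to the dictionary, and every value is replaced by its dictionary index.

// Sketch only: hash-table dictionary encoding of a single array of values.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::pair<std::vector<std::string>, std::vector<int32_t>> DictionaryEncode(
    const std::vector<std::string>& values) {
  std::unordered_map<std::string, int32_t> memo;  // value -> dictionary index
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
  indices.reserve(values.size());
  for (const auto& v : values) {
    auto it = memo.find(v);
    if (it == memo.end()) {
      // First time we see this value: assign it the next dictionary slot.
      int32_t index = static_cast<int32_t>(dictionary.size());
      memo.emplace(v, index);
      dictionary.push_back(v);
      indices.push_back(index);
    } else {
      indices.push_back(it->second);
    }
  }
  return {std::move(dictionary), std::move(indices)};
}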
This functionality also needs to support encoding that is split across multiple record batches: the hash table would be a stateful entity, so we can continue hashing additional chunks of data and, at the end, dictionary-encode multiple arrays that share a single dictionary. A sketch of such a stateful encoder follows.
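A hypothetical sketch of that stateful variant, where the memo table lives inside an encoder object and persists across chunks; the class name and types are assumptions, not an existing Arrow or parquet-cpp API. Each chunk yields its own index vector, and the shared dictionary is materialized once at the end.

// Sketch only: stateful encoder that dictionary-encodes multiple chunks
// against one shared dictionary.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class StringDictionaryEncoder {  // hypothetical name
 public:
  // Encode one chunk (e.g. one record batch); the memo table persists
  // across calls, so indices from all chunks refer to the same dictionary.
  std::vector<int32_t> EncodeChunk(const std::vector<std::string>& chunk) {
    std::vector<int32_t> indices;
    indices.reserve(chunk.size());
    for (const auto& v : chunk) {
      auto it = memo_.find(v);
      if (it == memo_.end()) {
        int32_t index = static_cast<int32_t>(dictionary_.size());
        memo_.emplace(v, index);
        dictionary_.push_back(v);
        indices.push_back(index);
      } else {
        indices.push_back(it->second);
      }
    }
    return indices;
  }

  // The dictionary shared by all chunks encoded so far.
  const std::vector<std::string>& dictionary() const { return dictionary_; }

 private:
  std::unordered_map<std::string, int32_t> memo_;
  std::vector<std::string> dictionary_;
};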