parquet-cpp implemented this optimisation here: https://github.com/apache/parquet-cpp/pull/140/commits/3f10378c5fc56c346ce77bf9e9faf011ead9c5e6
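The basic idea is to add a batched interface to DictDecoder and RleDecoder, and support passing in a dictionary to RleDecoder. Decoding should then be significantly faster: a batched call amortises the per-value call overhead, and a repeated run needs only one dictionary lookup for the whole run.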
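A rough sketch of what the batched, dictionary-aware decode could look like is below. It is not the parquet-cpp code and not an existing interface: the `Run` struct, the `BatchedDictRleDecoder` class and `GetBatch` are hypothetical names, and the runs are assumed to be already parsed rather than bit-packed. It only illustrates where the win comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified run: either `count` copies of one dictionary index
// (repeated run) or `count` explicit indices (literal run). The real encoding
// is bit-packed; this struct stands in for an already-parsed run.
struct Run {
  bool repeated;
  uint32_t count;
  uint32_t index;                  // used when repeated
  std::vector<uint32_t> literals;  // used when !repeated; size() == count
};

// Hypothetical batched decoder. Indices are resolved through the dictionary
// while decoding, so a repeated run costs one lookup plus a fill.
// The caller must keep `runs` and `dict` alive while the decoder is in use.
template <typename T>
class BatchedDictRleDecoder {
 public:
  BatchedDictRleDecoder(const std::vector<Run>& runs, const std::vector<T>& dict)
      : runs_(runs), dict_(dict) {}

  // Writes up to `n` decoded values into `out`; returns the number written.
  size_t GetBatch(T* out, size_t n) {
    size_t written = 0;
    while (written < n && run_idx_ < runs_.size()) {
      const Run& run = runs_[run_idx_];
      size_t take = std::min<size_t>(n - written, run.count - run_pos_);
      if (run.repeated) {
        // One dictionary lookup for the whole run.
        assert(run.index < dict_.size());
        std::fill(out + written, out + written + take, dict_[run.index]);
      } else {
        // Literal run: one lookup per value, but in a tight loop.
        for (size_t i = 0; i < take; ++i) {
          uint32_t idx = run.literals[run_pos_ + i];
          assert(idx < dict_.size());
          out[written + i] = dict_[idx];
        }
      }
      written += take;
      run_pos_ += take;
      if (run_pos_ == run.count) {  // run exhausted, move to the next one
        ++run_idx_;
        run_pos_ = 0;
      }
    }
    return written;
  }

 private:
  const std::vector<Run>& runs_;
  const std::vector<T>& dict_;
  size_t run_idx_ = 0;  // current run
  size_t run_pos_ = 0;  // values already consumed from the current run
};
```

A batch may end in the middle of a run, so the decoder keeps its position between calls; a short return value signals that the input is exhausted.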
We should add a microbenchmark for DictDecoder and update the benchmark for RleDecoder so we can understand the perf.