I spent some time profiling the Parquet scanner with perf top. There are many places where the code is inefficient in ways that are easy to fix. Together, these fixes could add up to a significant performance win for scans.
The generated assembly for the core MaterializeValueBatch() loop shows several obvious inefficiencies:
- Many loads from memory of values that are constant within the loop
- The generated code for bit unpacking and dictionary decoding is inefficient in several ways, e.g. it contains a complicated bounds check
- Hot functions like DictDecoder::Get() are not inlined
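As a rough illustration of the kind of fix these points suggest, here is a minimal sketch. ToyDictDecoder, ToyScanState and MaterializeBatch are hypothetical names made up for this example; they are not the actual scanner classes. The sketch copies loop-invariant values into locals before the loop, so the compiler does not have to assume the byte stores into tuple memory may have changed them, and it does the dictionary lookup directly in the loop body with a single unsigned bounds check, standing in for an inlined getter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a dictionary decoder (not the real DictDecoder).
struct ToyDictDecoder {
  std::vector<int32_t> dict;
};

// Hypothetical per-scan state. Nothing here changes while a batch is
// being materialized.
struct ToyScanState {
  ToyDictDecoder decoder;
  int64_t tuple_byte_size;  // distance in bytes between consecutive tuples
  int64_t slot_offset;      // byte offset of the target slot within a tuple
};

// Decodes n dictionary indices into the slot of n consecutive tuples.
// Returns the number of values written; stops at the first invalid index.
inline int MaterializeBatch(const ToyScanState& state, const uint32_t* indices,
                            int n, uint8_t* tuple_mem) {
  // Hoist loop-invariant values into locals. Without this, stores through
  // the uint8_t pointer may alias 'state', which can force a reload of
  // these fields on every iteration.
  const int32_t* dict = state.decoder.dict.data();
  const size_t dict_size = state.decoder.dict.size();
  const int64_t stride = state.tuple_byte_size;
  uint8_t* slot = tuple_mem + state.slot_offset;

  for (int i = 0; i < n; ++i) {
    const uint32_t idx = indices[i];
    // Single unsigned comparison as the bounds check.
    if (idx >= dict_size) return i;
    std::memcpy(slot, &dict[idx], sizeof(int32_t));
    slot += stride;
  }
  return n;
}
```

Whether the compiler already performs this hoisting in the real loop depends on what it can prove about aliasing, so the generated assembly should be checked before and after any change of this kind.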
On some scans, a lot of time is also spent in InitTuple() calling memset() on just one or two bytes.
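One possible way to avoid the call overhead for such tiny sizes is sketched below. ClearSmallRegion is a hypothetical helper, not existing code, and it assumes the bytes to clear sit at the start of the tuple and that their count is usually 1 or 2. Those sizes are handled with plain byte stores, and anything larger falls back to memset().

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: zero the first num_bytes bytes of a tuple.
// Sizes 1 and 2 are handled with direct stores so that no library call
// is made for them; larger sizes still use memset().
inline void ClearSmallRegion(uint8_t* tuple, int num_bytes) {
  switch (num_bytes) {
    case 0:
      return;
    case 1:
      tuple[0] = 0;
      return;
    case 2:
      tuple[0] = 0;
      tuple[1] = 0;
      return;
    default:
      std::memset(tuple, 0, static_cast<size_t>(num_bytes));
      return;
  }
}
```

If the size is constant for the whole scan, the branch could also be resolved once outside the per-tuple loop, but that would need to be verified against the actual InitTuple() code.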