Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Impala 1.1
None
None
Description
The 1GB Parquet block size restricts the degree of parallelism during a scan. For example, if I have a 1GB file and I'm querying 75% of the columns, then it has to scan roughly 750MB using a single disk. On the other hand, if I'm using Seq/Snappy with a 128MB block size, the scan can be parallelized across blocks and I get the result a lot faster.
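To make the arithmetic concrete, here is a rough sketch of the comparison. It is only a back-of-the-envelope model, not how the scanner actually schedules work: it assumes each block can be scanned by a separate disk, that both files are 1GB, and that the row-oriented Seq file reads every byte while Parquet reads only the selected column fraction. The `scan_plan` helper and its parameters are made up for illustration.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def scan_plan(file_size, block_size, read_fraction):
    """Return (num_blocks, bytes_scanned_per_block) for one file.

    Assumption: one block is the unit of parallelism, so the number of
    blocks is an upper bound on how many disks can work on the file.
    """
    num_blocks = math.ceil(file_size / block_size)
    bytes_to_scan = file_size * read_fraction
    return num_blocks, bytes_to_scan / num_blocks


# Parquet: one 1GB block, 75% of the columns selected.
parquet_blocks, parquet_per_block = scan_plan(1 * GB, 1 * GB, 0.75)

# Seq/Snappy: 128MB blocks, row-oriented so every byte is read.
seq_blocks, seq_per_block = scan_plan(1 * GB, 128 * MB, 1.0)

print(f"Parquet: {parquet_blocks} block(s), {parquet_per_block / MB:.0f} MB each")
print(f"Seq:     {seq_blocks} block(s), {seq_per_block / MB:.0f} MB each")
```

Under these assumptions the Parquet file is a single unit of work of about 768MB (the "750MB" above), while the Seq file splits into 8 units of 128MB each. The Seq scan reads more bytes in total, but the per-disk work is much smaller.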
Nong and I discussed this problem, and a user-configurable block size came to mind. It still requires some more thought.