The current Parquet scanner implementation treats each column in a row group as a separate "scan range". When reading a scan range, Impala issues an fopen RPC to the NameNode, so it issues one RPC per column per row group. The NameNode can process fopen RPCs only at a limited rate, which can become a limiting factor for query performance.
Fundamentally, there is no need to issue an fopen for each column. Impala should issue at most one fopen per row group.
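To make the difference in RPC volume concrete, here is a minimal sketch that counts open requests under the two strategies. It is not Impala code: `FakeNameNode`, `scan_per_column`, `scan_per_row_group`, `count_open_rpcs`, and the file path are hypothetical names used only for illustration, and the stand-in NameNode does nothing except count calls.

```python
class FakeNameNode:
    """Hypothetical stand-in that only counts open requests (not an HDFS API)."""

    def __init__(self):
        self.open_rpcs = 0

    def open(self, path):
        self.open_rpcs += 1
        return object()  # opaque placeholder for a file handle


def scan_per_column(nn, path, num_row_groups, columns):
    """Behavior described above: one open per (row group, column) scan range."""
    for _ in range(num_row_groups):
        for _ in columns:
            nn.open(path)


def scan_per_row_group(nn, path, num_row_groups, columns):
    """Proposed bound: one open per row group, reused for every column in it."""
    for _ in range(num_row_groups):
        handle = nn.open(path)
        for _ in columns:
            _ = handle  # column reads would go through the same handle


def count_open_rpcs(scan, num_row_groups, columns):
    """Run one scan strategy against a fresh counter and return the open count."""
    nn = FakeNameNode()
    scan(nn, "/hypothetical/table/file.parquet", num_row_groups, columns)
    return nn.open_rpcs
```

With 4 row groups and 3 columns, the per-column strategy issues 12 opens while the per-row-group strategy issues 4, so the saving grows linearly with the number of columns scanned.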
The current workaround of using the file handle cache is not practical because of the large memory footprint (about 1 KB) of each cached file handle. A file handle cannot be shared by concurrent readers, so if 10 queries read the same file at the same time, 10 file handles must be cached.
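The cache sizing argument can be written out as a small back-of-the-envelope calculation. This is only a sketch: the function names are hypothetical, and `HANDLE_BYTES` assumes the "1k byte" figure means 1024 bytes per cached handle.

```python
# Assumption: "1k byte" per cached file handle is taken as 1024 bytes.
HANDLE_BYTES = 1024


def cached_handles_needed(concurrent_readers_per_file):
    """Handles are not shareable, so each concurrent reader needs its own.

    `concurrent_readers_per_file` maps a file path to the number of
    readers that have that file open at the same time.
    """
    return sum(concurrent_readers_per_file.values())


def cache_memory_bytes(concurrent_readers_per_file):
    """Approximate memory held by the cached handles."""
    return cached_handles_needed(concurrent_readers_per_file) * HANDLE_BYTES
```

For the example in the text, `cached_handles_needed({"file.parquet": 10})` gives 10 handles for a single file, and the total scales with both the number of files and the query concurrency.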