Description
While working on IMPALA-6964, I noticed that the runtime profile for an HDFS_SCAN_NODE sometimes includes File Formats: PARQUET/NONE:2 and sometimes doesn't, depending on the query. However, looking at the code, any scan of Parquet files should include this line.
I debugged the code and there seems to be a race condition where HdfsScanNodeBase::StopAndFinalizeCounters can be called before HdfsParquetScanner::Close has been called for all the scan ranges. This causes the File Formats issue above: HdfsParquetScanner::Close calls HdfsScanNodeBase::RangeComplete, which updates the shared object file_type_counts_, and StopAndFinalizeCounters reads that object. So StopAndFinalizeCounters can write out the contents of file_type_counts_ before all the scanners have updated it.
StopAndFinalizeCounters can be called in two places: HdfsScanNodeBase::Close and HdfsScanNode::GetNext. It is called in GetNext when GetNextInternal reads enough rows to reach the query-defined limit. So GetNext will call StopAndFinalizeCounters once the limit is reached, but not necessarily after all the scanners have been closed.
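To make the ordering problem concrete, here is a minimal single-threaded sketch of it. This is not Impala's code: ScanNodeModel and SimulateProfile are hypothetical names, the output format is only an approximation, and the guard that turns a second StopAndFinalizeCounters call into a no-op is an assumption about how the real function behaves.

```cpp
#include <map>
#include <mutex>
#include <sstream>
#include <string>

// Hypothetical, simplified stand-in for the scan node (not Impala's class).
class ScanNodeModel {
 public:
  // Models HdfsScanNodeBase::RangeComplete, which a scanner calls from Close.
  void RangeComplete(const std::string& file_type) {
    std::lock_guard<std::mutex> l(lock_);
    ++file_type_counts_[file_type];
  }

  // Models StopAndFinalizeCounters: snapshots the counts into the profile.
  // Assumption: only the first call has any effect.
  void StopAndFinalizeCounters() {
    std::lock_guard<std::mutex> l(lock_);
    if (finalized_) return;
    finalized_ = true;
    std::ostringstream ss;
    for (const auto& entry : file_type_counts_) {
      ss << entry.first << ":" << entry.second << " ";
    }
    file_formats_ = ss.str();
  }

  const std::string& file_formats() const { return file_formats_; }

 private:
  std::mutex lock_;
  bool finalized_ = false;
  std::map<std::string, int> file_type_counts_;
  std::string file_formats_;
};

// Replays the two orderings. If finalize_before_close is true, the limit
// path in GetNext finalizes the counters before any scanner has closed.
std::string SimulateProfile(bool finalize_before_close) {
  ScanNodeModel node;
  if (finalize_before_close) node.StopAndFinalizeCounters();
  node.RangeComplete("PARQUET/NONE");  // first scanner closes
  node.RangeComplete("PARQUET/NONE");  // second scanner closes
  node.StopAndFinalizeCounters();      // the call from Close
  return node.file_formats();
}
```

In this model, SimulateProfile(true) returns an empty string because the snapshot is taken while file_type_counts_ is still empty, while SimulateProfile(false) returns the PARQUET/NONE count of 2. The mutex makes each individual update safe, but it does nothing to enforce the ordering between the last RangeComplete and the snapshot, which is the actual problem.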
I'm able to reproduce this locally using the following queries:
select * from functional_parquet.lineitem_sixblocks limit 10
The runtime profile does not include File Formats
select * from functional_parquet.lineitem_sixblocks order by l_orderkey limit 10
The runtime profile does include File Formats
I tried simply removing the call to StopAndFinalizeCounters from GetNext, but that doesn't seem to work. It actually caused several other runtime profile messages to be dropped (I'm not entirely sure why).