Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
1.7.0
None
None
Description
The ParquetGroupScan stores a list of ReadEntryWithPath in the 'entries' field as well as a hash set of file names in the 'fileSet' field.
Both fields hold essentially the same set of file names, so we should consolidate them into a single entity. Beyond simplifying the code, this would remove a real performance cost: when a ParquetGroupScan is serialized and sent as part of a JSON plan fragment, the overhead is quite high if the number of files is large (tens of thousands or more).
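
One possible shape for the consolidation is sketched below. It is an illustration only, not the actual Drill code: PathEntry and ConsolidatedScan are hypothetical stand-ins for ReadEntryWithPath and ParquetGroupScan, and the sketch assumes that 'entries' remains the single stored copy while the file-name set is derived from it on demand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for ReadEntryWithPath; only the path matters here.
final class PathEntry {
    private final String path;

    PathEntry(String path) {
        this.path = path;
    }

    String getPath() {
        return path;
    }
}

// Hypothetical consolidated scan: 'entries' is the only stored copy of the paths.
final class ConsolidatedScan {
    private final List<PathEntry> entries;

    // Derived lazily from 'entries'; an assumption of this sketch is that
    // this cache would be excluded from serialization.
    private Set<String> fileSet;

    ConsolidatedScan(List<String> paths) {
        List<PathEntry> list = new ArrayList<>();
        for (String p : paths) {
            list.add(new PathEntry(p));
        }
        this.entries = Collections.unmodifiableList(list);
    }

    List<PathEntry> getEntries() {
        return entries;
    }

    // Builds the set of distinct file names on first use and caches it.
    Set<String> getFileSet() {
        if (fileSet == null) {
            Set<String> names = new LinkedHashSet<>();
            for (PathEntry e : entries) {
                names.add(e.getPath());
            }
            fileSet = Collections.unmodifiableSet(names);
        }
        return fileSet;
    }
}
```

With a layout like this, only 'entries' would need to be written into the plan fragment, so each path would appear once in the JSON instead of twice. If the class is serialized with Jackson, the derived accessor could be excluded with an annotation such as @JsonIgnore; whether that fits the existing ParquetGroupScan constructors is not something this sketch establishes.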