Currently the unit of parallelism is the number of parquet files being read.
For example, if we run a query against a Parquet table that consists of 8 partitions then we will attempt to run 8 async tasks in parallel and if there is a single Parquet file then we will only try and run 1 async task so this does not scale well. Also, if there are hundreds or thousands of Parquet files then we will try and process them all concurrently which also doesn't scale well.
These are the options for improving this situation:
- Use Parquet row groups as the unit of partitioning and divide the number of row groups by the desired level of concurrency (defaulting to number of cores)
- Keep file as the unit of partitions and add a RepartitionExec into the plan if there are fewer partitions (files) than cores and in the case where there are more files than cores, split the files up into lists so that each partition is a list of files rather than a single file. Each partition task will process one file at a time.