Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-10995

[Rust] [DataFusion] Improve parallelism when reading Parquet files

    XMLWordPrintableJSON

Details

    Description

      Currently the unit of parallelism is the number of parquet files being read.

      For example, if we run a query against a Parquet table that consists of 8 partitions then we will attempt to run 8 async tasks in parallel and if there is a single Parquet file then we will only try and run 1 async task so this does not scale well. Also, if there are hundreds or thousands of Parquet files then we will try and process them all concurrently which also doesn't scale well.

      These are the options for improving this situation:

       

      1. Use Parquet row groups as the unit of partitioning and divide the number of row groups by the desired level of concurrency (defaulting to number of cores)
      2. Keep file as the unit of partitions and add a RepartitionExec into the plan if there are fewer partitions (files) than cores and in the case where there are more files than cores, split the files up into lists so that each partition is a list of files rather than a single file. Each partition task will process one file at a time.

       

       

      Attachments

        Issue Links

          Activity

            People

              andygrove Andy Grove
              andygrove Andy Grove
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 20m
                  2h 20m