Spark / SPARK-21056

InMemoryFileIndex.listLeafFiles should create at most one Spark job when listing files in parallel

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.1, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Given a partitioned file relation (e.g. Parquet) laid out as:

      root/a=../b=../c=..
      

      InMemoryFileIndex.listLeafFiles runs numberOfPartitions(a) × numberOfPartitions(b) Spark jobs sequentially to list the leaf files when both numberOfPartitions(a) and numberOfPartitions(b) are below spark.sql.sources.parallelPartitionDiscovery.threshold and numberOfPartitions(c) is above it. In that case the a and b levels are listed on the driver, and each b directory starts its own parallel listing job for its c partitions.

      Because these jobs run sequentially, the per-job overhead dominates, and the file listing can become significantly slower than listing the files directly from the driver.

      I propose that InMemoryFileIndex.listLeafFiles launch at most one Spark job for listing leaf files.
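
To make the job count described above concrete, here is a minimal Python sketch of a simplified model of the recursive listing. It is not Spark's implementation: the function name count_listing_jobs, the assumption that every directory at a given depth has the same number of children, and the example partition counts are all hypothetical, and 32 is used only as an example threshold value.

```python
from typing import Sequence


def count_listing_jobs(fanouts: Sequence[int], threshold: int) -> int:
    """Count Spark jobs launched by a simplified model of recursive listing.

    fanouts[i] is the number of child directories under each directory at
    depth i (an illustrative assumption: uniform fan-out per level).
    """

    def visit(level: int) -> int:
        # Number of jobs launched while listing ONE directory at this depth.
        if level == len(fanouts):
            return 0  # leaf directory: it contains only files
        children = fanouts[level]
        if children > threshold:
            # Too many children to list on the driver: one parallel job
            # lists everything below this directory.
            return 1
        # Few children: list them on the driver and recurse into each one.
        return children * visit(level + 1)

    return visit(0)


# Hypothetical layout root/a=../b=../c=.. with 10, 20 and 100 partitions.
jobs = count_listing_jobs([10, 20, 100], threshold=32)
print(jobs)  # prints 200, i.e. 10 * 20 sequential jobs
```

In this model the proposal amounts to making the result never exceed 1, for example by deciding once, at the top level, whether the whole tree is listed by a single job or on the driver.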

    People

      Assignee: Unassigned
      Reporter: Bertrand Bossy (bbossy)
      Votes: 6
      Watchers: 8