Spark / SPARK-28853

Support conf to organize filePartitions by file path


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Dynamically writing data to HDFS may generate a lot of small files, so we sometimes need to merge them. When reading these files and writing them out again, it would be helpful if the RDD partitions of the read were formed according to the partitions on HDFS.

      Currently, in FileSourceScanExec.createNonBucketedReadRDD, after splitting the files, Spark sorts them by file size, which may scatter the files of one data partition across several read partitions. It would be a great help to support sorting by file path here.
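
The effect described above can be illustrated with a small, self-contained sketch. This is not Spark's code: it is a simplified Python model of a greedy packing step that runs after sorting, and the names `SplitFile`, `pack`, `by_size_desc`, `by_path`, the byte limits, and the date-named directories are all hypothetical, chosen only for illustration. It compares a size-descending sort with a path sort on the same four files.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class SplitFile:
    """Hypothetical stand-in for one file split: its path and length in bytes."""
    path: str
    length: int


def pack(files: List[SplitFile], max_split_bytes: int, open_cost_bytes: int,
         sort_key: Callable[[SplitFile], object]) -> List[List[SplitFile]]:
    """Greedily pack sorted files into read partitions (simplified model)."""
    partitions: List[List[SplitFile]] = []
    current: List[SplitFile] = []
    current_size = 0
    for f in sorted(files, key=sort_key):
        # Close the current partition if this file would overflow it.
        if current and current_size + f.length > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        # Each file also counts an assumed fixed "open cost".
        current_size += f.length + open_cost_bytes
        current.append(f)
    if current:
        partitions.append(current)
    return partitions


def directories(partition: List[SplitFile]) -> Set[str]:
    """Parent directories touched by one read partition."""
    return {posixpath.dirname(f.path) for f in partition}


def by_size_desc(f: SplitFile) -> int:
    return -f.length  # largest files first


def by_path(f: SplitFile) -> str:
    return f.path  # files from the same directory stay adjacent


# Illustrative input: two directories, each with one large and one small file.
files = [
    SplitFile("/warehouse/t/dt=2019-08-01/part-0", 40),
    SplitFile("/warehouse/t/dt=2019-08-01/part-1", 10),
    SplitFile("/warehouse/t/dt=2019-08-02/part-0", 40),
    SplitFile("/warehouse/t/dt=2019-08-02/part-1", 10),
]

size_sorted = pack(files, max_split_bytes=60, open_cost_bytes=4, sort_key=by_size_desc)
path_sorted = pack(files, max_split_bytes=60, open_cost_bytes=4, sort_key=by_path)

for name, parts in (("by size", size_sorted), ("by path", path_sorted)):
    print(name, [sorted(directories(p)) for p in parts])
```

In this toy input, the size-sorted packing produces a read partition that mixes files from both directories, while the path-sorted packing keeps each read partition within one directory, which is the behaviour the request asks to make configurable.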

            People

              Assignee: Unassigned
              Reporter: ZhangYao