Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
Description
When dynamically writing data to HDFS, Spark may generate a lot of small files, so sometimes we need to merge those files. When reading these files and writing them out again, it would be helpful if the partitions of the read-file RDD followed the partition layout of the files on HDFS.
Currently, in FileSourceScanExec.createNonBucketedReadRDD, after splitting the files Spark sorts the splits by file size, which can scatter the partition distribution of the data files. It would be a great help to support sorting by file path here.
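To illustrate the difference, here is a minimal sketch (not Spark's actual implementation; `FileSplit`, `sortBySize`, `sortByPath`, and `pack` are hypothetical names) of how ordering splits by path instead of by size keeps files from the same HDFS directory together when they are packed into read partitions:

```scala
// Hypothetical model of a file split: a path and its length in bytes.
case class FileSplit(path: String, length: Long)

// Behavior described in the issue: sort splits by size, descending, before packing.
def sortBySize(splits: Seq[FileSplit]): Seq[FileSplit] =
  splits.sortBy(-_.length)

// Proposed alternative: sort by path so co-located files stay adjacent.
def sortByPath(splits: Seq[FileSplit]): Seq[FileSplit] =
  splits.sortBy(_.path)

// Simple sequential packing into partitions of at most maxBytes each
// (a simplified stand-in for Spark's bin-packing of splits).
def pack(splits: Seq[FileSplit], maxBytes: Long): Seq[Seq[FileSplit]] = {
  val partitions = scala.collection.mutable.ArrayBuffer(
    scala.collection.mutable.ArrayBuffer.empty[FileSplit])
  var current = 0L
  for (s <- splits) {
    // Start a new partition when the current one would overflow.
    if (current + s.length > maxBytes && partitions.last.nonEmpty) {
      partitions += scala.collection.mutable.ArrayBuffer.empty[FileSplit]
      current = 0L
    }
    partitions.last += s
    current += s.length
  }
  partitions.map(_.toSeq).toSeq
}
```

With path ordering, splits belonging to the same Hive-style partition directory (e.g. `dt=2020-01-01/`) tend to land in the same read partition, so a merge-and-rewrite job preserves the original file layout instead of mixing files from different directories.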
Attachments
Issue Links
- links to