Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 2.1.2, 2.2.0
Description
This is a bug in binaryFiles: even though we pass it the number of partitions, binaryFiles ignores it.
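For example (a hypothetical reproduction, not from the original report, assuming a directory of many small binary files), the minPartitions argument has no visible effect on the resulting partition count:

// Sketch of the symptom: minPartitions is accepted by the API
// but has no effect on the resulting RDD's partitioning.
val rdd = sc.binaryFiles("/data/images", minPartitions = 64)
// One might expect at least 64 partitions; instead the partition count
// is derived purely from total data size and default parallelism.
println(rdd.getNumPartitions)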
The bug was introduced in Spark 2.1 (relative to Spark 2.0). In the file PortableDataStream.scala, the argument "minPartitions" is no longer used (with the push to master on 11/7/6):
/** Allow minPartitions set by end-user in order to keep compatibility with old
 *  Hadoop API which is set through setMaxSplitSize */
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = sc.defaultParallelism
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}
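To see why minPartitions is ignored, the split size can be traced through with concrete numbers (a hypothetical walk-through, assuming the default config values of 128 MB for the max partition bytes and 4 MB for the file open cost):

// Hypothetical walk-through of the 2.1 formula with assumed defaults.
val defaultMaxSplitBytes = 128L * 1024 * 1024  // config.FILES_MAX_PARTITION_BYTES
val openCostInBytes = 4L * 1024 * 1024         // config.FILES_OPEN_COST_IN_BYTES
val defaultParallelism = 8
// Ten small files of 16 bytes each:
val totalBytes = (1 to 10).map(_ => 16L + openCostInBytes).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
// No term in this computation involves the user's minPartitions:
// the split size depends only on file sizes and default parallelism.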
The code previously, in version 2.0, was:
def setMinPartitions(context: JobContext, minPartitions: Int) {
  val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
  val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
  super.setMaxSplitSize(maxSplitSize)
}
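Under the 2.0 formula, the user's value directly determined the split size. With hypothetical numbers (1 GB of input and minPartitions = 8):

// Hypothetical numbers plugged into the 2.0 formula:
val totalLen = 1024L * 1024 * 1024  // 1 GB of input
val minPartitions = 8
val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
// maxSplitSize = 134217728 (128 MB), so the input is cut into roughly
// 8 splits, honoring the user's requested minimum.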
The new code is very smart, but it ignores what the user passes in and derives the split size from the data size alone, which is effectively a breaking change.
In our specific case this was a problem: we initially read in just the file names, and the data only becomes very large later, when the images themselves are read in. In that situation the new code does not handle the partitioning well.
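As a possible workaround (our suggestion, not from the fix itself), the partitioning can be forced after the read with an explicit repartition, at the cost of a shuffle:

// Hypothetical workaround: since the minPartitions argument is ignored,
// repartition explicitly before the expensive image-loading step.
val files = sc.binaryFiles("/data/images").repartition(64)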
I'm not sure whether it can be easily fixed, because I don't understand the full context of the change in Spark, but at the very least the unused parameter should be removed to avoid confusion.