Spark / SPARK-16575

partition calculation mismatch with sc.binaryFiles


Details

    Description

      sc.binaryFiles always creates an RDD with only 2 partitions, regardless of the number of input files.

      Steps to reproduce (tested on Databricks Community Edition):
      1. Create an RDD using sc.binaryFiles. In this example, the airlines folder contains 1922 files.
      Ex:

      val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")

      2. Check the number of partitions of the above RDD:

      • binaryRDD.partitions.size = 2 (the expected value is greater than 2).

      3. If the RDD is created from the same path using sc.textFile instead, the number of partitions is 1921.

      4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1, so this is a regression. (A consolidated reproduction sketch follows this list.)
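
      For convenience, here is a minimal reproduction sketch in Scala, assuming a running SparkContext named sc and the Databricks airlines sample dataset used above:

      // Compare partition counts for the same input path.
      val path = "/databricks-datasets/airlines/*"

      // On Spark 1.6.2 / 2.0.0 this prints 2, even though the folder holds 1922 files.
      val binaryRDD = sc.binaryFiles(path)
      println(s"binaryFiles partitions: ${binaryRDD.partitions.length}")

      // The same data read with textFile yields 1921 partitions.
      val textRDD = sc.textFile(path)
      println(s"textFile partitions: ${textRDD.partitions.length}")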

      For an explanation with screenshots, see:
      http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
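
      A possible interim workaround (an untested sketch, not a confirmed fix): sc.binaryFiles accepts an optional minPartitions argument, which defaults to sc.defaultMinPartitions (itself capped at 2). Passing an explicit hint may force a larger partition count until the underlying calculation is corrected:

      // Sketch only: request roughly one partition per input file via the
      // minPartitions hint instead of relying on the (capped) default.
      val hintedRDD = sc.binaryFiles("/databricks-datasets/airlines/*", minPartitions = 1922)
      println(hintedRDD.partitions.length)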


          People

            Assignee: fidato13 (Tarun Kumar)
            Reporter: suhashm (Suhas)
            Votes: 0
            Watchers: 5
