Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.6.1, 1.6.2
- Fix Version/s: None
Description
sc.binaryFiles always creates an RDD with only 2 partitions.
Steps to reproduce (tested on Databricks Community Edition):
1. Create an RDD using sc.binaryFiles. In this example, the airlines folder contains 1922 files.
Ex:
val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
2. Check the number of partitions of the above RDD:
- binaryRDD.partitions.size = 2 (the expected value is more than 2).
3. If the RDD is created using sc.textFile instead, the number of partitions is 1921.
4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1; see the repro sketch after these steps.
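A minimal sketch of the comparison, assuming a spark-shell session (so sc is already defined) and the same Databricks dataset path used above; the counts in the comments are the values observed in the steps:

// Same glob as in the steps above
val path = "/databricks-datasets/airlines/*"

// sc.binaryFiles on Spark 1.6.1/1.6.2 yields only 2 partitions (1921 on 1.5.1)
val binaryRDD = sc.binaryFiles(path)
println(s"binaryFiles partitions: ${binaryRDD.partitions.size}")

// sc.textFile on the same path yields 1921 partitions
val textRDD = sc.textFile(path)
println(s"textFile partitions: ${textRDD.partitions.size}")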
For an explanation with screenshots, see the link below:
http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html