Spark / SPARK-16575

partition calculation mismatch with sc.binaryFiles


Details

    Description

      sc.binaryFiles always creates an RDD with only 2 partitions, regardless of the number of input files.

      Steps to reproduce (tested on Databricks Community Edition):
      1. Create an RDD using sc.binaryFiles. In this example, the airlines folder contains 1922 files.
      Ex:

      val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")

      2. Check the number of partitions of the above RDD:

      • binaryRDD.partitions.size = 2 (the expected value is greater than 2).

      3. If the RDD is created from the same path using sc.textFile instead, the number of partitions is 1921.

      4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1, so this is a regression. (A consolidated reproduction sketch follows this list.)
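
      For convenience, here is a minimal reproduction sketch in Scala, assuming a running SparkContext named sc and the Databricks airlines sample dataset used above:

      // Compare partition counts for the same input path.
      val path = "/databricks-datasets/airlines/*"

      // On Spark 1.6.2 / 2.0.0 this prints 2, even though the folder holds 1922 files.
      val binaryRDD = sc.binaryFiles(path)
      println(s"binaryFiles partitions: ${binaryRDD.partitions.length}")

      // The same data read with textFile yields 1921 partitions.
      val textRDD = sc.textFile(path)
      println(s"textFile partitions: ${textRDD.partitions.length}")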

      For an explanation with screenshots, see:
      http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
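
      A possible interim workaround (an untested sketch, not a confirmed fix): sc.binaryFiles accepts an optional minPartitions argument, which defaults to sc.defaultMinPartitions (itself capped at 2). Passing an explicit hint may force a larger partition count until the underlying calculation is corrected:

      // Sketch only: request roughly one partition per input file via the
      // minPartitions hint instead of relying on the (capped) default.
      val hintedRDD = sc.binaryFiles("/databricks-datasets/airlines/*", minPartitions = 1922)
      println(hintedRDD.partitions.length)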


          People

            Assignee: fidato13 (Tarun Kumar)
            Reporter: suhashm (Suhas)
            Votes: 0
            Watchers: 5
