[SPARK-22240] S3 CSV number of partitions incorrectly computed


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
    • Important

    Description

      When reading CSV out of S3 using the S3A protocol, Spark 2.2.0 does not compute the number of partitions correctly.

      With Spark 2.2.0, I get only one partition when loading a 14GB file:

      scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
      input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
      
      scala> input.rdd.getNumPartitions
      res2: Int = 1
      

      While in Spark 2.0.2 I had:

      scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
      input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
      
      scala> input.rdd.getNumPartitions
      res2: Int = 115
      

      This introduces an obvious performance regression in Spark 2.2.0. Perhaps there is a property that should be set so that the number of partitions is computed correctly.
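
      If such a property exists, the usual candidate is spark.sql.files.maxPartitionBytes, which governs split sizing for file-based sources in Spark 2.x. A minimal sketch of lowering it, assuming it even applies to a multiLine read (which is exactly what looks broken here):

      scala> // Default is 134217728 bytes (128 MB); a smaller value should
      scala> // normally yield more, smaller partitions for a 14GB input.
      scala> spark.conf.set("spark.sql.files.maxPartitionBytes", "67108864")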

      I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2; that is not relevant to the comparison here.
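
      A multiLine read has to treat each file as a single record stream, so the file is plausibly no longer splittable, which would explain the single partition. As a stopgap, repartitioning immediately after the load restores parallelism for downstream stages (a sketch, not a confirmed fix; 115 simply mirrors the 2.0.2 count):

      scala> // Stopgap sketch: force parallelism after the single-partition load.
      scala> val repartitioned = input.repartition(115)
      scala> repartitioned.rdd.getNumPartitions // 115 by construction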


People

    • Assignee: Unassigned
    • Reporter: Arthur Baudry (artb)
    • Votes: 1
    • Watchers: 4
