Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.2.0
- Fix Version/s: None
- Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
- Flags: Important
Description
Reading CSV out of S3 using the S3A protocol does not compute the number of partitions correctly in Spark 2.2.0.
With Spark 2.2.0 I get only one partition when loading a 14GB file:
scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]

scala> input.rdd.getNumPartitions
res2: Int = 1
While in Spark 2.0.2 I had:
scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]

scala> input.rdd.getNumPartitions
res2: Int = 115
This introduces an obvious performance problem in Spark 2.2.0. Perhaps there is a property that should be set so that the number of partitions is computed correctly.
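In the meantime, a possible workaround is to force a shuffle after the load. This is a minimal, untested sketch; the target partition count (defaultParallelism) is my choice, not anything from the runs above:

// Workaround sketch: repartition the single-partition DataFrame so that
// downstream stages run in parallel. The partition count is an assumption.
val input = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("multiLine", "true")
  .load("s3a://<s3_path>")

val repartitioned = input.repartition(spark.sparkContext.defaultParallelism)
repartitioned.rdd.getNumPartitions  // defaultParallelism instead of 1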
I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2; that difference is not relevant here.
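That said, to isolate whether the option plays any role in 2.2.0, one could compare against a load without it. This is a hedged sketch using the same placeholder path; the expected partition count is an assumption, and the comparison is only valid if no field actually contains embedded newlines:

// Sketch: without multiLine, the CSV source should be able to split the
// file, so a 14GB input should yield many partitions rather than one.
val singleLine = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .load("s3a://<s3_path>")

singleLine.rdd.getNumPartitions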
Issue Links
- is broken by:
  HADOOP-14943 Add common getFileBlockLocations() emulation for object stores, including S3A (Patch Available)