[SPARK-22240] S3 CSV number of partitions incorrectly computed


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
    • Important

    Description

      When reading CSV out of S3 using the S3A protocol, Spark 2.2.0 does not compute the number of partitions correctly.

      With Spark 2.2.0, I get only one partition when loading a 14GB file:

      scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
      input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
      
      scala> input.rdd.getNumPartitions
      res2: Int = 1
      

      While in Spark 2.0.2 I had:

      scala> val input = spark.read.format("csv").option("header", "true").option("delimiter", "|").option("multiLine", "true").load("s3a://<s3_path>")
      input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: string ... 36 more fields]
      
      scala> input.rdd.getNumPartitions
      res2: Int = 115
      

      This introduces an obvious performance regression in Spark 2.2.0. Perhaps there is a property that should be set so that the number of partitions is computed correctly.
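
      If such a property exists, the usual candidate is spark.sql.files.maxPartitionBytes, which governs split sizing for file-based sources in Spark 2.x. A minimal sketch of lowering it, assuming it even applies to a multiLine read (which is exactly what looks broken here):

      scala> // Default is 134217728 bytes (128 MB); a smaller value should
      scala> // normally yield more, smaller partitions for a 14GB input.
      scala> spark.conf.set("spark.sql.files.maxPartitionBytes", "67108864")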

      I'm aware that .option("multiLine", "true") is not supported in Spark 2.0.2; that is not relevant to the comparison here.
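
      A multiLine read has to treat each file as a single record stream, so the file is plausibly no longer splittable, which would explain the single partition. As a stopgap, repartitioning immediately after the load restores parallelism for downstream stages (a sketch, not a confirmed fix; 115 simply mirrors the 2.0.2 count):

      scala> // Stopgap sketch: force parallelism after the single-partition load.
      scala> val repartitioned = input.repartition(115)
      scala> repartitioned.rdd.getNumPartitions // 115 by construction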


People

    • Assignee: Unassigned
    • Reporter: Arthur Baudry (artb)
    • Votes: 1
    • Watchers: 4
