Hadoop Common / HADOOP-17789

S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Works for Me
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755

With Hadoop 3.3.1, the input data size reported by Spark was almost double, and the read runtime increased by around 20%, compared to Spark with Hadoop 3.2.0 using exactly the same resources and configuration. This also happens with other jobs that were not affected by the readFully error described in the linked issue.

I hit the exact same issue with Hadoop 3.2.0 when I used the workaround fs.s3a.readahead.range = 1G.
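For reference, the workaround mentioned above can be expressed as a spark-defaults.conf fragment. The property name is the standard S3A option; the 1G value is the one used in this report, not a recommendation:

```properties
# Workaround from this report: enlarge the S3A readahead range
# (the default is 64K). Spark passes spark.hadoop.* keys through
# to the Hadoop configuration.
spark.hadoop.fs.s3a.readahead.range  1G
```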

Further details below:

 

      | Hadoop Version | Actual size of the files (SQL tab) | Reported size of the files (Stages) | Time to complete the stage | fs.s3a.readahead.range |
      | Hadoop 3.2.0   | 29.3 GiB                           | 29.3 GiB                            | 23 min                     | 64K                    |
      | Hadoop 3.3.1   | 29.3 GiB                           | 58.7 GiB                            | 27 min                     | 64K                    |
      | Hadoop 3.2.0   | 29.3 GiB                           | 58.7 GiB                            | ~27 min                    | 1G                     |
      • Shuffle Write is the same (95.9 GiB) for all three cases above

I was expecting read performance with Hadoop 3.3.1 to improve (or at least match 3.2.0). Please suggest how to approach and resolve this.

I used the default s3a configuration along with the settings below, on an EKS cluster:

      spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
      spark.hadoop.fs.s3a.committer.name: magic
      spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
      spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"
      • I did not use spark.hadoop.fs.s3a.experimental.input.fadvise=random
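For completeness, enabling the random-access input policy (which, as noted above, was not used here) would be one more spark-defaults.conf line; whether it helps depends on the access pattern, so this is a sketch rather than a recommendation:

```properties
# Not used in this report: the "random" fadvise policy disables
# large sequential readahead, which suits seek-heavy reads but
# can hurt full sequential scans such as CSV.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
```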

As already mentioned, I used the same Spark, the same amount of resources, and the same configuration; the only change is Hadoop 3.2.0 to Hadoop 3.3.1 (built with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").

Attachments

        1. storediag.log (29 kB) — Arghya Saha


People

        Assignee: Unassigned
        Reporter: Arghya Saha (arghya18)
        Votes: 0
        Watchers: 4
