Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Works for Me
- 3.3.1
- None
- None
Description
This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755
The input data size reported by Spark (Hadoop 3.3.1) was almost double, and the read runtime increased by around 20%, compared to Spark (Hadoop 3.2.0) with exactly the same resources and the same configuration. This is also happening with other jobs that were not impacted by the read fully error described in that issue.
I saw exactly the same issue when using the workaround fs.s3a.readahead.range = 1G with Hadoop 3.2.0.
Further details are below:
Hadoop version | Actual size of the files (in SQL tab) | Reported size of the files (in Stages) | Time to complete the stage | fs.s3a.readahead.range |
Hadoop 3.2.0 | 29.3 GiB | 29.3 GiB | 23 min | 64K |
Hadoop 3.3.1 | 29.3 GiB | 58.7 GiB | 27 min | 64K |
Hadoop 3.2.0 | 29.3 GiB | 58.7 GiB | ~27 min | 1G |
- Shuffle write is the same (95.9 GiB) in all three cases above.
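As a quick check on the figures in the table above, the following minimal Python sketch computes the ratio of reported to actual input size and the stage slowdown relative to the Hadoop 3.2.0 / 64K baseline. The numbers are copied from the table; the `summarize` helper and the dictionary keys are illustrative names, not part of Spark or Hadoop.

```python
# Figures copied from the table above. The third run's time is
# approximate in the source ("~27 min").
runs = [
    {"hadoop": "3.2.0", "readahead": "64K", "actual_gib": 29.3, "reported_gib": 29.3, "minutes": 23},
    {"hadoop": "3.3.1", "readahead": "64K", "actual_gib": 29.3, "reported_gib": 58.7, "minutes": 27},
    {"hadoop": "3.2.0", "readahead": "1G", "actual_gib": 29.3, "reported_gib": 58.7, "minutes": 27},
]
baseline = runs[0]


def summarize(run, baseline):
    """Return (reported/actual size ratio, percent slowdown vs. baseline)."""
    ratio = run["reported_gib"] / run["actual_gib"]
    slowdown = (run["minutes"] - baseline["minutes"]) / baseline["minutes"] * 100
    return ratio, slowdown


for run in runs:
    ratio, slowdown = summarize(run, baseline)
    print(
        f"Hadoop {run['hadoop']} (readahead {run['readahead']}): "
        f"reported/actual = {ratio:.2f}x, stage time change = {slowdown:+.1f}%"
    )
```

On these figures the reported size is roughly twice the actual size in both affected runs, and the stage takes about 17% longer than the baseline.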
I was expecting read performance with Hadoop 3.3.1 to improve, or at least match 3.2.0. Please suggest how to approach and resolve this.
I used the default S3A configuration plus the settings below, running on an EKS cluster:
spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
spark.hadoop.fs.s3a.committer.name: magic
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"
- I did not use
spark.hadoop.fs.s3a.experimental.input.fadvise=random
As already mentioned, I used the same Spark version, the same amount of resources, and the same configuration. The only change is Hadoop 3.2.0 to Hadoop 3.3.1 (Spark was built using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").
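To make the compared runs easier to reproduce, here is a minimal Python sketch that turns the settings listed above into `spark-submit --conf` arguments and varies only `fs.s3a.readahead.range` between runs. The helper names `to_submit_args` and `with_readahead` are hypothetical, and passing the readahead value through the `spark.hadoop.` prefix is an assumption that mirrors how the other S3A keys are set in this report.

```python
# Settings copied from the description above.
S3A_CONF = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.downgrade.syncable.exceptions": "true",
}


def with_readahead(conf, value):
    """Return a copy of conf with the S3A readahead range set (e.g. "64K" or "1G")."""
    # Assumption: the key is passed via the spark.hadoop. prefix, like the others.
    return {**conf, "spark.hadoop.fs.s3a.readahead.range": value}


def to_submit_args(conf):
    """Flatten a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# One argument list per readahead value from the table.
for readahead in ("64K", "1G"):
    print(" ".join(to_submit_args(with_readahead(S3A_CONF, readahead))))
```

Keeping the shared settings in one dictionary and overriding a single key per run makes it explicit that the readahead range is the only configuration difference between the cases.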
Attachments
Issue Links
- Discovered while testing: HADOOP-17755 EOF reached error reading ORC file on S3A (Resolved)
- relates to: HADOOP-17774 bytesRead FS statistic showing twice the correct value in S3A (Resolved)