Hadoop Common / HADOOP-17789

S3 CSV read performance with Spark with Hadoop 3.3.1 is slower than older Hadoop


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Works for Me
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

This issue is a continuation of https://issues.apache.org/jira/browse/HADOOP-17755

With Hadoop 3.3.1, the input data size reported by Spark was almost double, and the read runtime increased by around 20%, compared to Spark with Hadoop 3.2.0 using exactly the same resources and configuration. This also happens with other jobs that were not affected by the readFully error described in the linked issue.

I hit the exact same issue with Hadoop 3.2.0 when I used the workaround fs.s3a.readahead.range = 1G.
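For reference, the workaround mentioned above can be expressed as a spark-defaults.conf fragment. The property name is the standard S3A option; the 1G value is the one used in this report, not a recommendation:

```properties
# Workaround from this report: enlarge the S3A readahead range
# (the default is 64K). Spark passes spark.hadoop.* keys through
# to the Hadoop configuration.
spark.hadoop.fs.s3a.readahead.range  1G
```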

Further details below:

 

      | Hadoop Version | Actual size of the files (SQL tab) | Reported size of the files (Stages) | Time to complete the stage | fs.s3a.readahead.range |
      | Hadoop 3.2.0   | 29.3 GiB                           | 29.3 GiB                            | 23 min                     | 64K                    |
      | Hadoop 3.3.1   | 29.3 GiB                           | 58.7 GiB                            | 27 min                     | 64K                    |
      | Hadoop 3.2.0   | 29.3 GiB                           | 58.7 GiB                            | ~27 min                    | 1G                     |
      • Shuffle Write is the same (95.9 GiB) for all three cases above

I was expecting read performance with Hadoop 3.3.1 to improve (or at least match 3.2.0). Please suggest how to approach and resolve this.

I used the default s3a configuration along with the settings below, on an EKS cluster:

      spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
      spark.hadoop.fs.s3a.committer.name: magic
      spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
      spark.hadoop.fs.s3a.downgrade.syncable.exceptions: "true"
      • I did not use spark.hadoop.fs.s3a.experimental.input.fadvise=random
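For completeness, enabling the random-access input policy (which, as noted above, was not used here) would be one more spark-defaults.conf line; whether it helps depends on the access pattern, so this is a sketch rather than a recommendation:

```properties
# Not used in this report: the "random" fadvise policy disables
# large sequential readahead, which suits seek-heavy reads but
# can hurt full sequential scans such as CSV.
spark.hadoop.fs.s3a.experimental.input.fadvise  random
```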

As already mentioned, I used the same Spark, the same amount of resources, and the same configuration; the only change is Hadoop 3.2.0 to Hadoop 3.3.1 (built with Spark using ./dev/make-distribution.sh --name spark-patched --pip -Pkubernetes -Phive -Phive-thriftserver -Dhadoop.version="3.3.1").

Attachments

        1. storediag.log (29 kB) — Arghya Saha


People

        Assignee: Unassigned
        Reporter: Arghya Saha (arghya18)
        Votes: 0
        Watchers: 4
