Spark / SPARK-16736

Remove redundant FileSystem status check calls from the Spark codebase


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      The Hadoop FileSystem.exists() and FileSystem.isDirectory() calls are wrappers around FileSystem.getFileStatus(). Each getFileStatus() call puts load on an HDFS NameNode (NN) and is very slow against object stores.

      1. If these calls are followed by any getFileStatus() call on the same path, they can be eliminated by merging the calls and moving the catch of FileNotFoundException out of exists() and into the Spark code.
      2. Any sequence of exists() + delete() can be optimised by removing the exists() check and relying on FileSystem.delete() being a no-op when the destination path is not present. That is a tested requirement of all Hadoop-compatible filesystems and object stores.
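
      The two rewrites above can be sketched as follows. This is only an illustration, not the patch itself: CountingFs is a hypothetical in-memory stand-in for the few FileSystem calls discussed here (the real API takes Path arguments and delete() also takes a recursive flag), and StatusCheckSketch and its method names are invented for the example. The stand-in counts status lookups so the saving is visible.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the Hadoop FileSystem calls discussed above. */
class CountingFs {
    /** Minimal stand-in for a file status: only records whether the path is a directory. */
    static final class Status {
        final boolean directory;
        Status(boolean directory) { this.directory = directory; }
    }

    private final Map<String, Status> entries = new HashMap<>();
    int statusCalls = 0; // number of getFileStatus() lookups made so far

    void put(String path, boolean directory) { entries.put(path, new Status(directory)); }

    /** Like FileSystem.getFileStatus(): throws FileNotFoundException if the path is absent. */
    Status getFileStatus(String path) throws FileNotFoundException {
        statusCalls++;
        Status s = entries.get(path);
        if (s == null) throw new FileNotFoundException(path);
        return s;
    }

    /** Like FileSystem.exists(): a wrapper around getFileStatus(). */
    boolean exists(String path) {
        try { getFileStatus(path); return true; }
        catch (FileNotFoundException e) { return false; }
    }

    /** Like FileSystem.delete(): returns false, without failing, when the path is absent. */
    boolean delete(String path) { return entries.remove(path) != null; }
}

class StatusCheckSketch {
    // Before: exists() then getFileStatus() costs two status lookups for a present path.
    static boolean isDirBefore(CountingFs fs, String path) {
        if (!fs.exists(path)) return false;          // first lookup
        try {
            return fs.getFileStatus(path).directory; // second lookup of the same path
        } catch (FileNotFoundException e) {
            return false;                            // path vanished between the two calls
        }
    }

    // After: one lookup; the FileNotFoundException is handled by the caller instead.
    static boolean isDirAfter(CountingFs fs, String path) {
        try { return fs.getFileStatus(path).directory; }
        catch (FileNotFoundException e) { return false; }
    }

    // Before: an exists() probe guards the delete.
    static void deleteBefore(CountingFs fs, String path) {
        if (fs.exists(path)) fs.delete(path);
    }

    // After: delete() alone; a missing path is simply a no-op.
    static void deleteAfter(CountingFs fs, String path) { fs.delete(path); }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data", true);
        isDirBefore(fs, "/data");
        System.out.println("lookups with exists + getFileStatus: " + fs.statusCalls);
        fs.statusCalls = 0;
        isDirAfter(fs, "/data");
        System.out.println("lookups with getFileStatus only: " + fs.statusCalls);
    }
}
```

      Against a real filesystem each saved lookup is an RPC to the NN or an HTTP request to an object store, which is where the benefit comes from; the stand-in only counts them.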

              People

              • Assignee:
                Steve Loughran (stevel@apache.org)
              • Reporter:
                Steve Loughran (stevel@apache.org)
              • Votes:
                0
              • Watchers:
                3
