  1. Spark
  2. SPARK-11176 Umbrella ticket for wholeTextFiles bugs
  3. SPARK-4414

SparkContext.wholeTextFiles Doesn't work with S3 Buckets


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.1.0, 1.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      SparkContext.wholeTextFiles fails to read files that SparkContext.textFile can read. General steps to reproduce are below; my specific case, which follows the same steps, is in the git repo linked further down.

      Steps to reproduce.
      1. Create an Amazon S3 bucket containing multiple files and make it public
      2. Attempt to read bucket with
      sc.wholeTextFiles("s3n://mybucket/myfile.txt")
      3. Spark returns the following error, even if the file exists.
      Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
      at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
      at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
      4. Change the call to
      sc.textFile("s3n://mybucket/myfile.txt")
      With this call there is no error message, and the application should run fine.
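
      The steps above can be sketched as a small standalone application. This is only an illustration under assumptions: the object name WholeTextFilesRepro and the local master setting are hypothetical, the bucket path is the placeholder from the steps, and any S3 credentials the s3n filesystem needs are assumed to be configured already in the Hadoop configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro harness; not taken from the linked repository.
object WholeTextFilesRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WholeTextFilesRepro")
      .setMaster("local[*]") // assumption: run locally for the repro
    val sc = new SparkContext(conf)

    // Placeholder path from the steps above.
    val path = "s3n://mybucket/myfile.txt"

    // Reported to work: one record per line of the file.
    val lines = sc.textFile(path)
    println(s"textFile line count: ${lines.count()}")

    // Reported to fail on 1.1.0 and 1.2.0 with
    // java.io.FileNotFoundException: File does not exist: /myfile.txt
    val files = sc.wholeTextFiles(path)
    println(s"wholeTextFiles file count: ${files.count()}")

    sc.stop()
  }
}
```

      If the behavior matches the report, the first count is printed and the second action throws the FileNotFoundException shown in step 3.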

      There is also a question about this on Stack Overflow:
      http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist

      This is a link to the repository and the relevant lines of code. The uncommented call doesn't work; the commented call works as expected:
      https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19

      It would be easy to work around this by using textFile with a multi-file argument, but wholeTextFiles should work correctly for files in S3 buckets as well.
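
      A minimal sketch of that workaround follows. It relies on textFile accepting a comma-separated list of paths. The object and helper names (TextFileWorkaround, multiFileArg, readAllLines, readWholeFiles) are hypothetical. readWholeFiles is only a rough stand-in for wholeTextFiles: it collects each file to the driver, so it is assumed that every file is small, and the joined content may differ from the original in trailing newlines.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helpers; names are illustrative only.
object TextFileWorkaround {
  // Builds the comma-separated multi-file argument accepted by textFile.
  def multiFileArg(paths: Seq[String]): String = paths.mkString(",")

  // Reads the lines of all files into one RDD; file boundaries are lost.
  def readAllLines(sc: SparkContext, paths: Seq[String]): RDD[String] =
    sc.textFile(multiFileArg(paths))

  // Approximates wholeTextFiles with (path, content) pairs.
  // Each file is collected to the driver, so this suits small files only.
  def readWholeFiles(sc: SparkContext, paths: Seq[String]): RDD[(String, String)] = {
    val pairs = paths.map(p => (p, sc.textFile(p).collect().mkString("\n")))
    sc.parallelize(pairs)
  }
}
```

      For example, TextFileWorkaround.readAllLines(sc, Seq("s3n://mybucket/file1.txt", "s3n://mybucket/file2.txt")) reads both placeholder files in a single call.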


            People

              Assignee: Josh Rosen (joshrosen)
              Reporter: Pedro Rodriguez (pedrorodriguez)
              Votes: 4
              Watchers: 11
