Details
- Type: Sub-task
- Status: Resolved
- Priority: Critical
- Resolution: Won't Fix
- Affects Version/s: 1.1.0, 1.2.0
- Fix Version/s: None
- Component/s: None
Description
SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case follows in a linked git repo.
Steps to reproduce:
1. Create an Amazon S3 bucket, make it public, and add multiple files
2. Attempt to read bucket with
sc.wholeTextFiles("s3n://mybucket/myfile.txt")
3. Spark throws the following exception, even though the file exists:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
4. Change the call to
sc.textFile("s3n://mybucket/myfile.txt")
and there is no error; the application runs fine.
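The exception message hints at what goes wrong: only the path component "/myfile.txt" of the s3n:// URI reaches the file-status check, so it is resolved against the default (HDFS) filesystem instead of S3. A minimal sketch of how the scheme and bucket get dropped, using the hypothetical bucket/file names from the steps above:

```scala
import java.net.URI

object PathComponent {
  def main(args: Array[String]): Unit = {
    val full = new URI("s3n://mybucket/myfile.txt")
    // Taking only the path component loses the scheme ("s3n") and the
    // bucket ("mybucket"); what remains matches the path in the error above.
    val pathOnly = full.getPath
    println(pathOnly) // prints "/myfile.txt"
  }
}
```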
There is a question on StackOverflow as well on this:
http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
This is a link to the relevant lines of code in the repo. The uncommented call doesn't work; the commented call works as expected:
https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
It would be easy to fall back to textFile with a multi-file argument, but wholeTextFiles should work correctly for S3 bucket files as well.
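One way to realize that fallback: textFile accepts a comma-separated list of paths, so per-file calls can be collapsed into a single read. This sketch only builds the argument string (the bucket and file names are hypothetical), and note that the filename-to-contents pairing that wholeTextFiles provides is lost:

```scala
object MultiFilePaths {
  def main(args: Array[String]): Unit = {
    // Hypothetical file list; in a real job these would be the S3 objects.
    val files = Seq("s3n://mybucket/a.txt", "s3n://mybucket/b.txt")
    // textFile accepts a comma-separated list of paths, so one call reads
    // all files: sc.textFile(files.mkString(","))
    val multiPath = files.mkString(",")
    println(multiPath) // prints "s3n://mybucket/a.txt,s3n://mybucket/b.txt"
  }
}
```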
Issue Links
- is related to SPARK-5250: EOFException when reading gzipped files from S3 with wholeTextFiles (Resolved)