[CRUNCH-47] Inputs and outputs can't use non-default Hadoop FileSystem - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3.0
Fix Version/s: 0.3.0
Component/s: IO
Labels: None
Environment: Elastic MapReduce Hadoop 1.0.3

Description

I'm getting the following exception when trying to use Crunch with Elastic MapReduce, where input and output files use the Native S3 FileSystem and intermediate files use HDFS. HDFS is configured as the default file system:

Exception in thread "main" java.lang.IllegalArgumentException: This file system object (hdfs://10.114.37.65:9000) does not support access to the request path 's3n://test-bucket/test/Input.avro' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:767)
at org.apache.crunch.io.SourceTargetHelper.getPathSize(SourceTargetHelper.java:44)

It looks like Crunch has a number of calls to FileSystem.get(Configuration) that assume the default configured file system, so they fail when an input or output lives on S3.
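
To illustrate why the default file system rejects the S3 path, here is a minimal sketch that models the scheme/authority check using only java.net.URI. It is a simplified model, not Hadoop's actual checkPath code, and the class and method names (FsPathCheck, servedBy) are hypothetical. In Hadoop itself, the exception message above points at FileSystem.get(uri, conf), and Path.getFileSystem(Configuration) similarly resolves the file system from the path's own URI.

```java
import java.net.URI;

// Simplified model of a file system's path check; NOT Hadoop's real implementation.
// FsPathCheck and servedBy are hypothetical names used only for illustration.
final class FsPathCheck {

    /** Returns true if a file system rooted at fsUri could serve the given path. */
    static boolean servedBy(URI fsUri, URI path) {
        if (path.getScheme() == null) {
            // Unqualified paths are resolved against the file system itself.
            return true;
        }
        if (!path.getScheme().equalsIgnoreCase(fsUri.getScheme())) {
            // Different scheme (e.g. s3n vs hdfs): wrong file system.
            return false;
        }
        String pathAuthority = path.getAuthority();
        return pathAuthority == null
                || pathAuthority.equalsIgnoreCase(fsUri.getAuthority());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://10.114.37.65:9000");
        URI input = URI.create("s3n://test-bucket/test/Input.avro");

        // The default (HDFS) file system cannot serve the S3 input path.
        System.out.println(servedBy(defaultFs, input));
        // A file system resolved from the path's own URI can.
        System.out.println(servedBy(URI.create("s3n://test-bucket"), input));
    }
}
```

The point of the sketch is that the file system should be chosen per path rather than once from the configuration's default.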

Also, CrunchJob.handleMultiPaths() calls FileSystem.rename(), which works only if the source and destination use the same file system. This breaks the final upload of the output files from HDFS to S3.
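
One possible way to handle the rename problem is to pick the move strategy from the two URIs: rename when both sides share a scheme and authority, otherwise copy and then delete the source. The sketch below shows only that decision, using java.net.URI. The names (MoveStrategyChooser, MoveStrategy, choose) and the example paths are hypothetical, and this is not necessarily what the attached patch does. In Hadoop, the copy branch could be served by FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf).

```java
import java.net.URI;

// Hypothetical helper that decides how to move a file between two locations.
final class MoveStrategyChooser {

    enum MoveStrategy { RENAME, COPY_THEN_DELETE }

    /** Chooses RENAME only when source and destination are on the same file system. */
    static MoveStrategy choose(URI src, URI dst) {
        boolean sameScheme = equalsIgnoreCase(src.getScheme(), dst.getScheme());
        boolean sameAuthority = equalsIgnoreCase(src.getAuthority(), dst.getAuthority());
        return (sameScheme && sameAuthority)
                ? MoveStrategy.RENAME
                : MoveStrategy.COPY_THEN_DELETE;
    }

    // Null-safe, case-insensitive comparison of URI components.
    private static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        URI intermediate = URI.create("hdfs://10.114.37.65:9000/tmp/crunch/part-0");
        URI output = URI.create("s3n://test-bucket/test/out/part-0");
        System.out.println(choose(intermediate, output));
    }
}
```

Comparing scheme and authority is a coarse test; it is enough to separate HDFS from S3 in the setup described here, but other deployments may need a stricter check.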

Attachments

multiple-file-systems.patch (10 kB), attached by Shawn Smith, 14/Aug/12 22:36

People

Assignee: Unassigned
Reporter: Shawn Smith
Votes: 0
Watchers: 2

Dates

Created: 14/Aug/12 22:29
Updated: 17/Sep/12 06:41
Resolved: 14/Aug/12 23:14