Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 0.18.1
- Fix Version/s: None
- Component/s: None
Description
I run Hadoop 0.18.1 on Amazon EC2, with S3 as the backing store.
When starting jobs, I sometimes get the following failure, which causes the job to be abandoned:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:222)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy4.retrieveBlock(Unknown Source)
at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:214)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1212)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:177)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
at org.apache.hadoop.ipc.Client.call(Client.java:715)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy5.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
The stack trace suggests that copying the job file fails because the S3 filesystem cannot find all of the expected block objects at the moment it needs them.
Since S3 is an eventually consistent store and does not always provide an up-to-date view of the data it holds, this execution path should probably be strengthened: at a minimum, retry the failed operation, or wait for the expected block object if it hasn't appeared yet.
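The trace shows the store call already passes through Hadoop's RetryInvocationHandler, so the existing retry policy presumably does not cover this particular failure mode. A minimal sketch of the suggested hardening in plain Java follows; BlockFetcher, retrieveBlockOnce, and the attempt/delay values are hypothetical illustrations, not Hadoop APIs:

import java.io.File;
import java.io.IOException;

// Sketch: retry a block retrieval a few times with backoff before
// giving up, on the assumption that the missing block is an
// eventual-consistency artifact that will appear shortly.
public class RetryingBlockReader {

  /** Hypothetical single-attempt fetch, e.g. a call into the S3 store. */
  public interface BlockFetcher {
    File retrieveBlockOnce(long blockId, long byteRangeStart) throws IOException;
  }

  public static File retrieveBlockWithRetries(BlockFetcher fetcher,
      long blockId, long byteRangeStart) throws IOException {
    final int maxAttempts = 5;                // illustrative limit
    long delayMs = 200;                       // initial backoff
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fetcher.retrieveBlockOnce(blockId, byteRangeStart);
      } catch (IOException e) {               // block not visible yet?
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(delayMs);            // wait for S3 to converge
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for block");
          }
          delayMs *= 2;                       // exponential backoff
        }
      }
    }
    throw last;                               // all attempts exhausted
  }
}

Bounding the retries keeps a genuinely missing block from hanging job submission forever, while the backoff gives S3's view of the bucket time to catch up with a recent write.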
Issue Links
- is related to HADOOP-9577, "Actual data loss using s3n (against US Standard region)" (Resolved)