Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 0.18.1
- Fix Version/s: None
- Component/s: None
Description
I run Hadoop 0.18.1 on Amazon EC2, with S3 as the backing store.
When starting jobs, I sometimes get the following failure, which causes the job to be abandoned:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:222)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy4.retrieveBlock(Unknown Source)
at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:214)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1212)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:177)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
at org.apache.hadoop.ipc.Client.call(Client.java:715)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy5.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
The stack trace suggests that copying the job file fails because the S3 filesystem cannot find all of the expected block objects at the moment it needs them.
Since S3 is an eventually consistent store and does not always provide an up-to-date view of the data it holds, this execution path should probably be strengthened: at a minimum, retry the failed operation, or wait for the expected block object if it hasn't appeared yet.
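The trace shows the store call already passes through Hadoop's RetryInvocationHandler, so the existing retry policy presumably does not cover this particular failure mode. A minimal sketch of the suggested hardening in plain Java follows; BlockFetcher, retrieveBlockOnce, and the attempt/delay values are hypothetical illustrations, not Hadoop APIs:

import java.io.File;
import java.io.IOException;

// Sketch: retry a block retrieval a few times with backoff before
// giving up, on the assumption that the missing block is an
// eventual-consistency artifact that will appear shortly.
public class RetryingBlockReader {

  /** Hypothetical single-attempt fetch, e.g. a call into the S3 store. */
  public interface BlockFetcher {
    File retrieveBlockOnce(long blockId, long byteRangeStart) throws IOException;
  }

  public static File retrieveBlockWithRetries(BlockFetcher fetcher,
      long blockId, long byteRangeStart) throws IOException {
    final int maxAttempts = 5;                // illustrative limit
    long delayMs = 200;                       // initial backoff
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return fetcher.retrieveBlockOnce(blockId, byteRangeStart);
      } catch (IOException e) {               // block not visible yet?
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(delayMs);            // wait for S3 to converge
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for block");
          }
          delayMs *= 2;                       // exponential backoff
        }
      }
    }
    throw last;                               // all attempts exhausted
  }
}

Bounding the retries keeps a genuinely missing block from hanging job submission forever, while the backoff gives S3's view of the bucket time to catch up with a recent write.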
Issue Links
- is related to HADOOP-9577, "Actual data loss using s3n (against US Standard region)" (Resolved)