Hadoop Common / HADOOP-4637

Unhandled failures starting jobs with S3 as backing store


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 0.18.1
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      I run Hadoop 0.18.1 on Amazon EC2, with S3 as the backing store.

      When starting jobs, I sometimes get the following failure, which causes the job to be abandoned:

      org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
      at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:222)
      at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
      at $Proxy4.retrieveBlock(Unknown Source)
      at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
      at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
      at java.io.DataInputStream.read(DataInputStream.java:83)
      at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
      at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:214)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
      at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1212)
      at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
      at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:177)
      at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1783)
      at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
      at org.apache.hadoop.ipc.Client.call(Client.java:715)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
      at org.apache.hadoop.mapred.$Proxy5.submitJob(Unknown Source)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)

      The stack trace suggests that copying the job file fails because the Hadoop S3 filesystem can't find all of the expected block objects when it needs them.

      Since S3 is an eventually consistent store and does not always provide an up-to-date view of the data it holds, this execution path should probably be strengthened: at a minimum, retry the failed operation, or wait for the expected block object if it hasn't become visible yet.
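
      A minimal sketch of that second option, in the spirit of the org.apache.hadoop.io.retry machinery already visible in the stack trace. The names here (S3BlockSource, retrieveBlockWithRetry) are hypothetical illustrations, not part of the Hadoop API:

          import java.io.IOException;
          import java.io.InputStream;

          public class EventualConsistencySketch {

              // Stand-in for the store; returns null if the key is not (yet) visible.
              interface S3BlockSource {
                  InputStream get(String blockKey) throws IOException;
              }

              // Instead of dereferencing a possibly-null stream (the likely NPE in
              // Jets3tFileSystemStore.retrieveBlock), treat a missing object as a
              // transient condition and retry with a fixed delay.
              static InputStream retrieveBlockWithRetry(S3BlockSource store,
                                                        String blockKey,
                                                        int maxAttempts,
                                                        long sleepMillis)
                      throws IOException, InterruptedException {
                  for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                      InputStream in = store.get(blockKey);
                      if (in != null) {
                          return in;                 // block is visible; done
                      }
                      if (attempt < maxAttempts) {
                          Thread.sleep(sleepMillis); // wait for S3 to catch up
                      }
                  }
                  throw new IOException("Block " + blockKey + " not visible after "
                      + maxAttempts + " attempts");
              }
          }

      In the real code path, such a policy could presumably be attached through the RetryInvocationHandler/RetryPolicies layer that already wraps retrieveBlock, rather than through an ad-hoc loop.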

      People

        Assignee: Unassigned
        Reporter: Robert
        Votes: 0
        Watchers: 4
