Hadoop Common / HADOOP-12033

Reducer task failure with java.lang.NoClassDefFoundError: Ljava/lang/InternalError at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      We have noticed intermittent reducer task failures with the below exception:

      Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
      	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
      	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
      	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
      	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      Caused by: java.lang.NoClassDefFoundError: Ljava/lang/InternalError
      	at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
      	at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
      	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
      	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
      	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
      	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:534)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
      Caused by: java.lang.ClassNotFoundException: Ljava.lang.InternalError
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	... 9 more
      

      Usually, the reduce task succeeds on retry.

      Some of the symptoms are similar to HADOOP-8423, but that fix is already included (this cluster is on Hadoop 2.6).

        Issue Links

          Activity

          zxu zhihai xu added a comment -

          This looks like the Hadoop native library was not loaded successfully.
          Did you see this warning message?
          LOG.warn("Unable to load native-hadoop library for your platform... " +
          "using builtin-java classes where applicable");
          You need to configure LD_LIBRARY_PATH correctly in your environment.
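          For example, here is a quick way to verify from a JVM whether the native library actually loaded (a minimal sketch; NativeCodeLoader is the real Hadoop class, but the wrapper class and output here are just for illustration):

            import org.apache.hadoop.util.NativeCodeLoader;

            public class NativeCheck {
              public static void main(String[] args) {
                // True only if libhadoop was found and loaded from java.library.path
                // (typically populated from LD_LIBRARY_PATH or JAVA_LIBRARY_PATH).
                System.out.println("native hadoop loaded: "
                    + NativeCodeLoader.isNativeCodeLoaded());
                if (NativeCodeLoader.isNativeCodeLoaded()) {
                  // True only if libhadoop was built with Snappy support compiled in.
                  System.out.println("snappy supported: "
                      + NativeCodeLoader.buildSupportsSnappy());
                }
              }
            }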

          ivanmi Ivan Mitic added a comment -

          Thanks for responding zhihai xu. The reducer task would succeed on retry, so I assumed it's not an environment problem. Below is the task syslog:

          2015-05-21 18:33:10,773 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
          2015-05-21 18:33:10,976 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
          2015-05-21 18:33:10,976 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
          2015-05-21 18:33:10,991 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
          2015-05-21 18:33:10,991 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1432143397187_0004, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@5df3ade7)
          2015-05-21 18:33:11,132 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: RM_DELEGATION_TOKEN, Service: 100.76.156.98:9010, Ident: (owner=btbig2, renewer=mr token, realUser=hdp, issueDate=1432225097662, maxDate=1432829897662, sequenceNumber=2, masterKeyId=2)
          2015-05-21 18:33:11,351 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
          2015-05-21 18:33:12,335 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 500ms before retrying again. Got null now.
          2015-05-21 18:33:13,804 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 1000ms before retrying again. Got null now.
          2015-05-21 18:33:16,308 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: c:/apps/temp/hdfs/nm-local-dir/usercache/btbig2/appcache/application_1432143397187_0004
          2015-05-21 18:33:17,199 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
          2015-05-21 18:33:17,402 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2-azure-file-system.properties
          2015-05-21 18:33:17,418 INFO [main] org.apache.hadoop.metrics2.sink.WindowsAzureETWSink: Init starting.
          2015-05-21 18:33:17,418 INFO [main] org.apache.hadoop.metrics2.sink.WindowsAzureETWSink: Successfully loaded native library. LibraryName = EtwLogger
          2015-05-21 18:33:17,418 INFO [main] org.apache.hadoop.metrics2.sink.WindowsAzureETWSink: Init completed. Native library loaded and ETW handle obtained.
          2015-05-21 18:33:17,418 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink azurefs2 started
          2015-05-21 18:33:17,433 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
          2015-05-21 18:33:17,433 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: azure-file-system metrics system started
          2015-05-21 18:33:17,699 INFO [main] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
          2015-05-21 18:33:17,714 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@36c76ec3
          2015-05-21 18:33:17,746 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5c7b1796
          2015-05-21 18:33:17,793 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=741710208, maxSingleShuffleLimit=185427552, mergeThreshold=489528768, ioSortFactor=100, memToMemMergeOutputsThreshold=100
          2015-05-21 18:33:17,793 INFO [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: attempt_1432143397187_0004_r_001735_0 Thread started: EventFetcher for fetching Map Completion Events
          2015-05-21 18:33:19,187 INFO [fetcher#30] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning workernode165.btbig2.c2.internal.cloudapp.net:13562 with 1 to fetcher#30
          2015-05-21 18:33:19,187 INFO [fetcher#30] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to workernode165.btbig2.c2.internal.cloudapp.net:13562 to fetcher#30
          2015-05-21 18:33:19,187 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning workernode279.btbig2.c2.internal.cloudapp.net:13562 with 1 to fetcher#1
          2015-05-21 18:33:19,187 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of 1 to workernode279.btbig2.c2.internal.cloudapp.net:13562 to fetcher#1
          (fetch logs removed)
          2015-05-21 19:25:08,983 INFO [fetcher#9] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning workernode133.btbig2.c2.internal.cloudapp.net:13562 with 88 to fetcher#9
          2015-05-21 19:25:08,983 INFO [fetcher#9] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 20 of 88 to workernode133.btbig2.c2.internal.cloudapp.net:13562 to fetcher#9
          2015-05-21 19:25:08,983 INFO [fetcher#9] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1432143397187_0004&reduce=1735&map=attempt_1432143397187_0004_m_006276_0,attempt_1432143397187_0004_m_006308_0,attempt_1432143397187_0004_m_006355_0,attempt_1432143397187_0004_m_006349_0,attempt_1432143397187_0004_m_006360_0,attempt_1432143397187_0004_m_006368_0,attempt_1432143397187_0004_m_006658_0,attempt_1432143397187_0004_m_008329_0,attempt_1432143397187_0004_m_008443_0,attempt_1432143397187_0004_m_008448_0,attempt_1432143397187_0004_m_008423_0,attempt_1432143397187_0004_m_008441_0,attempt_1432143397187_0004_m_008588_0,attempt_1432143397187_0004_m_008393_0,attempt_1432143397187_0004_m_010397_0,attempt_1432143397187_0004_m_010486_0,attempt_1432143397187_0004_m_010459_0,attempt_1432143397187_0004_m_010522_0,attempt_1432143397187_0004_m_010537_0,attempt_1432143397187_0004_m_010548_0 sent hash and received reply
          2015-05-21 19:25:08,999 INFO [fetcher#9] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#9 about to shuffle output of map attempt_1432143397187_0004_m_006276_0 decomp: 307061 len: 69824 to MEMORY
          2015-05-21 19:25:08,999 INFO [fetcher#9] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: workernode133.btbig2.c2.internal.cloudapp.net:13562 freed by fetcher#9 in 23ms
          2015-05-21 19:25:08,999 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
          	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
          	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
          	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:415)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
          	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
          Caused by: java.lang.NoClassDefFoundError: Ljava/lang/InternalError
          	at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
          	at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
          	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
          	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
          	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
          	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:534)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
          	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
          Caused by: java.lang.ClassNotFoundException: Ljava.lang.InternalError
          	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
          	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
          	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
          	... 9 more
          
          2015-05-21 19:25:08,999 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
          
          
          zxu zhihai xu added a comment -

          Is it possible that some early failure, such as a ClassNotFoundException, an ExceptionInInitializerError (indicating a failure in the static initialization block), or an incompatible version of the class found at runtime, caused this exception?

          ivanmi Ivan Mitic added a comment -

          If I had to guess (and I can only guess at this time), I'd say this is something similar to the root cause of HADOOP-8423, where in the case of a transient error (e.g. a networking error) someone's state gets out of sync, resulting in a task failure.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          If the problem turns out to be in MR, please move this to the MapReduce JIRA project.

          ivanmi Ivan Mitic added a comment -

          If the problem turns out to be in MR, please move this to the MapReduce JIRA project

          Sounds good Vinod. I placed it under Hadoop based on my best guess.

          zxu zhihai xu added a comment -

          Hi Ivan Mitic, I looked at the hadoop-snappy library source code; it looks like the exception java.lang.NoClassDefFoundError: Ljava/lang/InternalError comes from the following code in SnappyDecompressor.c:

            if (ret == SNAPPY_BUFFER_TOO_SMALL){
              THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Buffer length is too small.");
            } else if (ret == SNAPPY_INVALID_INPUT){
              THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Input is invalid.");
            } else if (ret != SNAPPY_OK){
              THROW(env, "Ljava/lang/InternalError", "Could not decompress data.");
            }
          

          Also, based on the HBase issue HBASE-9644, this issue may be caused by corrupted map output data being fed to the SnappyDecompressor, as the sketch below illustrates.
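          To show the general pattern with the JDK's built-in zlib codec rather than the Snappy native library (a minimal sketch, not Hadoop code): decompressors signal corrupted input by raising an error, which is the code path corrupted map output would hit.

            import java.util.zip.DataFormatException;
            import java.util.zip.Inflater;

            public class CorruptInputDemo {
              public static void main(String[] args) {
                // Bytes that are not a valid compressed stream, standing in for a
                // map output corrupted in transit.
                byte[] garbage = {0x12, 0x34, 0x56, 0x78, (byte) 0x9a};
                Inflater inflater = new Inflater();
                inflater.setInput(garbage);
                try {
                  inflater.inflate(new byte[64]);
                  System.out.println("unexpectedly decompressed");
                } catch (DataFormatException e) {
                  // zlib reports invalid input via an exception; the Snappy native
                  // code analogously returns SNAPPY_INVALID_INPUT and then tries
                  // to throw InternalError via the THROW macro shown below.
                  System.out.println("invalid input detected: " + e.getMessage());
                } finally {
                  inflater.end();
                }
              }
            }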

          I also found a bug in the above code in SnappyDecompressor.c. We should change it to:

            if (ret == SNAPPY_BUFFER_TOO_SMALL){
              THROW(env, "java/lang/InternalError", "Could not decompress data. Buffer length is too small.");
            } else if (ret == SNAPPY_INVALID_INPUT){
              THROW(env, "java/lang/InternalError", "Could not decompress data. Input is invalid.");
            } else if (ret != SNAPPY_OK){
              THROW(env, "java/lang/InternalError", "Could not decompress data.");
            }
          

          I think SnappyDecompressor really wants to throw a java.lang.InternalError, but due to this bug it throws java.lang.NoClassDefFoundError/ClassNotFoundException instead.

          THROW is defined in org_apache_hadoop.h:

          #define THROW(env, exception_name, message) \
            { \
          	jclass ecls = (*env)->FindClass(env, exception_name); \
          	if (ecls) { \
          	  (*env)->ThrowNew(env, ecls, message); \
          	  (*env)->DeleteLocalRef(env, ecls); \
          	} \
            }
          

          Based on the above code, you can see that the correct parameter to pass to FindClass is "java/lang/InternalError" instead of "Ljava/lang/InternalError": JNI's FindClass expects a fully qualified class name with slashes, not field-descriptor syntax, so the lookup fails and the JVM raises NoClassDefFoundError with the bad name.
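          The name-format distinction is easy to reproduce from the Java side, since class-lookup APIs likewise take a class name rather than a descriptor (a minimal sketch, not Hadoop code; note Class.forName uses dots where JNI FindClass uses slashes):

            public class FindClassNameDemo {
              public static void main(String[] args) throws Exception {
                // Correct: a fully qualified class name resolves normally.
                Class<?> ok = Class.forName("java.lang.InternalError");
                System.out.println("loaded: " + ok.getName());

                try {
                  // Wrong: "L..." is descriptor syntax (a real descriptor would
                  // even end with ';'). No class has this name, so the lookup
                  // fails -- the same failure FindClass hits with
                  // "Ljava/lang/InternalError", which the JVM surfaces as the
                  // NoClassDefFoundError/ClassNotFoundException pair seen in the
                  // stack trace above.
                  Class.forName("Ljava.lang.InternalError");
                } catch (ClassNotFoundException e) {
                  System.out.println("lookup failed as expected: " + e.getMessage());
                }
              }
            }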
          Also, a java.lang.InternalError exception will be handled correctly in Fetcher.java by the following code:

                // The codec for lz0,lz4,snappy,bz2,etc. throw java.lang.InternalError
                // on decompression failures. Catching and re-throwing as IOException
                // to allow fetch failure logic to be processed
                try {
                  // Go!
                  LOG.info("fetcher#" + id + " about to shuffle output of map "
                      + mapOutput.getMapId() + " decomp: " + decompressedLength
                      + " len: " + compressedLength + " to " + mapOutput.getDescription());
                  mapOutput.shuffle(host, is, compressedLength, decompressedLength,
                      metrics, reporter);
                } catch (java.lang.InternalError e) {
                  LOG.warn("Failed to shuffle for fetcher#"+id, e);
                  throw new IOException(e);
                }
          

          So if SnappyDecompressor throws a java.lang.InternalError, the reduce task won't fail, and the map task may be rerun on another node after too many fetch failures. The sketch below shows why the misnamed exception escapes that handler instead.
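          Concretely, catch (java.lang.InternalError e) matches only InternalError and its subclasses, so a NoClassDefFoundError sails past it and kills the whole reduce attempt (a minimal sketch of the control flow; shuffleWithBrokenThrow is a hypothetical stand-in for mapOutput.shuffle failing through the broken THROW):

            public class CatchMismatchDemo {
              // Hypothetical stand-in for the shuffle call failing the way the
              // broken native code does: it raises NoClassDefFoundError, which is
              // a LinkageError, not an InternalError.
              static void shuffleWithBrokenThrow() {
                throw new NoClassDefFoundError("Ljava/lang/InternalError");
              }

              public static void main(String[] args) {
                try {
                  try {
                    shuffleWithBrokenThrow();
                  } catch (java.lang.InternalError e) {
                    // The intended path in Fetcher.java: wrap as IOException so
                    // the fetch-failure logic can retry the map output. Never
                    // reached with the broken class name.
                    System.out.println("would rethrow as IOException: " + e);
                  }
                } catch (Throwable t) {
                  // Instead the Error propagates and the task attempt fails,
                  // matching the ShuffleError in the syslog above.
                  System.out.println("escaped the catch: " + t);
                }
              }
            }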

          jhalfpenny Jim Halfpenny added a comment -

          Patch for HADOOP-12033. Modified SnappyDecompressor.c to contain the correct Java class names for thrown errors.

          zxu zhihai xu added a comment -

          Thanks Jim Halfpenny for the patch. It looks like the issue is fixed by HADOOP-8151 on trunk, but HADOOP-8151 is not merged into branch-2:
          https://github.com/apache/hadoop/blob/branch-2/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/snappy/SnappyDecompressor.c#L121
          Hi Ivan Mitic, did the branch you use have the fix for HADOOP-8151?
          Hi Vinod Kumar Vavilapalli, it looks like HADOOP-8151 is also not merged into branch-2.7.0.
          Can we add HADOOP-8151 to the 2.7.1 release?
          Thanks

          ivanmi Ivan Mitic added a comment -

          Thanks zhihai xu, Jim Halfpenny! Very nice catch!

          Hi Ivan Mitic, Did the branch you use have the fix for HADOOP-8151?

          That fix is not included; the user was on Hadoop 2.6.

          djp Junping Du added a comment -

          It looks like the fix here is a subset of HADOOP-8151. Shall we resolve this JIRA as a duplicate and reopen HADOOP-8151 to port its patch to branch-2?

          zxu zhihai xu added a comment -

          Junping Du, yes, your suggestion LGTM.

          djp Junping Du added a comment -

          Thanks Zhihai Xu for confirming this. I have already committed/merged HADOOP-8151 to branch-2, so I am resolving this JIRA as a duplicate.

          zxu zhihai xu added a comment -

          Thanks Junping Du for committing HADOOP-8151 to branch-2!

          ivanmi Ivan Mitic added a comment -

          Thanks Junping Du and zhihai xu!


            People

            • Assignee: Unassigned
            • Reporter: ivanmi Ivan Mitic
            • Votes: 0
            • Watchers: 8

            Dates

            • Created:
            • Updated:
            • Resolved: