  1. Hadoop HDFS
  2. HDFS-12606

When using native decoder, DFSStripedStream#close crashes JVM after being called multiple times.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-beta1
    • Fix Version/s: 3.0.0
    • Component/s: erasure-coding
    • Labels: None

    Description

      When running NNBench on an RS(6,3) directory, the JVM crashes with "double free or corruption":

      08:16:29 Running NNBENCH.
      08:16:29 WARNING: Use "yarn jar" to launch YARN applications.
      08:16:31 NameNode Benchmark 0.4
      08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Inputs: 
      08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Test Operation: create_write
      08:16:31 17/10/04 08:16:31 INFO hdfs.NNBench: Start time: 2017-10-04 08:18:31,16
      :
      :
      08:18:54 *** Error in `/usr/java/jdk1.8.0_144/bin/java': double free or corruption (out): 0x00007ffb55dbfab0 ***
      08:18:54 ======= Backtrace: =========
      08:18:54 /lib64/libc.so.6(+0x7c619)[0x7ffb5b85f619]
      08:18:54 [0x7ffb45017774]
      08:18:54 ======= Memory map: ========
      08:18:54 00400000-00401000 r-xp 00000000 ca:01 276832134 /usr/java/jdk1.8.0_144/bin/java
      08:18:54 00600000-00601000 rw-p 00000000 ca:01 276832134 /usr/java/jdk1.8.0_144/bin/java
      08:18:54 0173e000-01f91000 rw-p 00000000 00:00 0 [heap]
      08:18:54 603600000-614700000 rw-p 00000000 00:00 0 
      08:18:54 614700000-72bd00000 ---p 00000000 00:00 0 
      08:18:54 72bd00000-73a500000 rw-p 00000000 00:00 0 
      08:18:54 73a500000-7c0000000 ---p 00000000 00:00 0 
      08:18:54 7c0000000-7c0400000 rw-p 00000000 00:00 0 
      08:18:54 7c0400000-800000000 ---p 00000000 00:00 0 
      08:18:54 7ffb20174000-7ffb208ab000 rw-p 00000000 00:00 0 
      08:18:54 7ffb208ab000-7ffb20975000 ---p 00000000 00:00 0 
      08:18:54 7ffb20975000-7ffb20b75000 rw-p 00000000 00:00 0 
      08:18:54 7ffb20b75000-7ffb20d75000 rw-p 00000000 00:00 0 
      08:18:54 7ffb20d75000-7ffb20d8a000 r-xp 00000000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
      08:18:54 7ffb20d8a000-7ffb20f89000 ---p 00015000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
      08:18:54 7ffb20f89000-7ffb20f8a000 r--p 00014000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
      08:18:54 7ffb20f8a000-7ffb20f8b000 rw-p 00015000 ca:01 209866 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
      08:18:54 7ffb20f8b000-7ffb20fbd000 r-xp 00000000 ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
      08:18:54 7ffb20fbd000-7ffb211bc000 ---p 00032000 ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
      08:18:54 7ffb211bc000-7ffb211c2000 rw-p 00031000 ca:01 553654092 /usr/java/jdk1.8.0_144/jre/lib/amd64/libsunec.so
      :
      :
      08:18:54 7ffb5c3fb000-7ffb5c3fc000 r--p 00000000 00:00 0 
      08:18:54 7ffb5c3fc000-7ffb5c3fd000 rw-p 00000000 00:00 0 
      08:18:54 7ffb5c3fd000-7ffb5c3fe000 r--p 00021000 ca:01 637266 /usr/lib64/ld-2.17.so
      08:18:54 7ffb5c3fe000-7ffb5c3ff000 rw-p 00022000 ca:01 637266 /usr/lib64/ld-2.17.so
      08:18:54 7ffb5c3ff000-7ffb5c400000 rw-p 00000000 00:00 0 
      08:18:54 7ffdf8767000-7ffdf8788000 rw-p 00000000 00:00 0 [stack]
      08:18:54 7ffdf878b000-7ffdf878d000 r-xp 00000000 00:00 0 [vdso]
      08:18:54 ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
      

      It happens on both jdk1.8.0_144 and jdk1.8.0_121 in our environments.

      We strongly suspect the native code used in erasure coding; in particular, ISA-L is not thread safe: https://01.org/sites/default/files/documentation/isa-l_open_src_2.10.pdf

      Attachments

        1. HDFS-12606.00.patch
          2 kB
          Lei (Eddy) Xu

          Activity

            eddyxu Lei (Eddy) Xu added a comment -

            I saw a single static global variable, isaLoader, in isal_loader.h that provides access to the ISA-L library. One simple way is to make NativeRSRawDecoder, but it would limit the parallelism of encoding/decoding on the DN. Kai, Sammi, could you share your thoughts on this?

            drankye Kai Zheng added a comment -

            Thanks for the ping, Eddy. By design, we can have multiple coder instances for concurrent coding tasks, and no global static variable should block this, barring bugs. We guard the ISA-L code in Java rather than relying on its thread model. We can investigate when we are back in the office next Monday.

            eddyxu Lei (Eddy) Xu added a comment -

            Found out that this was due to DFSStripedInputStream#close being called more than once.

            Thread 105026: (state = IN_NATIVE)
             - org.apache.hadoop.io.erasurecode.rawcoder.NativeRSRawDecoder.destroyImpl() @bci=0 (Interpreted frame)
             - org.apache.hadoop.io.erasurecode.rawcoder.NativeRSRawDecoder.release() @bci=1, line=50 (Interpreted frame)
             - org.apache.hadoop.hdfs.DFSStripedInputStream.close() @bci=56, line=191 (Interpreted frame)
             - java.io.FilterInputStream.close() @bci=4, line=181 (Interpreted frame)
             - java.io.FilterInputStream.close() @bci=4, line=181 (Interpreted frame)
             - org.apache.hadoop.hdfs.NNBench.analyzeResults() @bci=458, line=352 (Interpreted frame)
             - org.apache.hadoop.hdfs.NNBench.run(java.lang.String[]) @bci=34, line=608 (Interpreted frame)
             - org.apache.hadoop.util.ToolRunner.run(org.apache.hadoop.conf.Configuration, org.apache.hadoop.util.Tool, java.lang.String[]) @bci=61, line=76 (Interpreted frame)
             - org.apache.hadoop.util.ToolRunner.run(org.apache.hadoop.util.Tool, java.lang.String[]) @bci=8, line=90 (Interpreted frame)
             - org.apache.hadoop.hdfs.NNBench.main(java.lang.String[]) @bci=8, line=580 (Interpreted frame)
             - sun.reflect.NativeMethodAccessorImpl.invoke0(java.lang.reflect.Method, java.lang.Object, java.lang.Object[]) @bci=0 (Interpreted frame)
             - sun.reflect.NativeMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=100, line=62 (Interpreted frame)
             - sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=6, line=43 (Interpreted frame)
             - java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object[]) @bci=56, line=498 (Interpreted frame)
             - org.apache.hadoop.util.RunJar.run(java.lang.String[]) @bci=453, line=239 (Interpreted frame)
             - org.apache.hadoop.util.RunJar.main(java.lang.String[]) @bci=8, line=153 (Interpreted frame)
            

            close() should be idempotent to be safe.
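
As an illustration of what an idempotent close() can look like, here is a minimal sketch. It is not the code from the patch: `FakeNativeDecoder` and `StripedStreamSketch` are hypothetical stand-ins, and the counter only simulates the native free that happens in NativeRSRawDecoder. The idea is to release the decoder once and then drop the reference so later calls become no-ops.

```java
import java.io.Closeable;

class IdempotentCloseSketch {

  /** Hypothetical stand-in for a native-backed decoder; counts release() calls. */
  static class FakeNativeDecoder {
    int releaseCalls = 0;

    void release() {
      // In a real native coder this would free a C struct, so a second
      // call would be a double free. Here we only count the calls.
      releaseCalls++;
    }
  }

  /** Hypothetical stream that owns a decoder and releases it at most once. */
  static class StripedStreamSketch implements Closeable {
    private FakeNativeDecoder decoder;

    StripedStreamSketch(FakeNativeDecoder decoder) {
      this.decoder = decoder;
    }

    @Override
    public synchronized void close() {
      if (decoder != null) {
        decoder.release();
        decoder = null; // later close() calls see null and do nothing
      }
    }
  }

  public static void main(String[] args) {
    FakeNativeDecoder decoder = new FakeNativeDecoder();
    StripedStreamSketch stream = new StripedStreamSketch(decoder);
    stream.close();
    stream.close(); // second call is a no-op
    System.out.println("release() calls: " + decoder.releaseCalls);
  }
}
```

Making close() synchronized is an extra assumption in this sketch; it keeps two threads from both passing the null check at the same time.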

            eddyxu Lei (Eddy) Xu added a comment -

            Added a test that reproduces the crash.

            Without the fix, running mvn test -Pnative -Dtest=TestDFSStripedInputStream can crash the JVM.
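
A regression check for this kind of bug needs some way to trigger a repeated close(). The sketch below is not the test from the patch and does not claim to show how NNBench ends up closing the stream twice; it only demonstrates one common way a double close arises in plain Java. `CountingStream` and `closeThroughWrapper` are hypothetical names. When a stream and a FilterInputStream wrapper around it are both declared as try-with-resources resources, the wrapper's close() forwards to the underlying stream, which is then closed again as its own resource.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class DoubleCloseDemo {

  /** Hypothetical stand-in for an input stream; records close() calls. */
  static class CountingStream extends InputStream {
    int closeCalls = 0;

    @Override
    public int read() {
      return -1; // always at end of stream
    }

    @Override
    public void close() {
      closeCalls++;
    }
  }

  /** Returns how many times the underlying stream was closed. */
  static int closeThroughWrapper() {
    CountingStream raw = new CountingStream();
    try (InputStream in = raw;
         DataInputStream wrapped = new DataInputStream(in)) {
      wrapped.read();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Resources are closed in reverse order: wrapped.close() forwards to
    // raw.close(), then raw is closed again as its own resource.
    return raw.closeCalls;
  }

  public static void main(String[] args) {
    System.out.println("close() calls on the underlying stream: " + closeThroughWrapper());
  }
}
```

With a stream whose close() frees native memory, the second of those two calls is the dangerous one unless close() is idempotent.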

            andrew.wang Andrew Wang added a comment -

            Nice find Eddy. Does the output stream also suffer from the same issue? I don't see a null check wrapping encoder.release() in DFSStripedOutputStream#closeImpl. Maybe we should fix the output stream at HDFS-12612 since it's a very similar issue.

            Related to this patch, would it be better if we did this null check/set in the native code? This makes the API simpler to use. We should also add some javadoc about the idempotency (or non-idempotency) of these APIs.

            Also, what's the behavior of release() for the Java coder implementation?

            eddyxu Lei (Eddy) Xu added a comment -

            Hi, andrew.wang

            Maybe we should fix the output stream at HDFS-12612 since it's a very similar issue.

            HDFS-12612 involves IOEs from the streams, so I will take a closer look at the difference between it and this issue. I will work on HDFS-12612 separately.

            Related to this patch, would it be better if we did this null check/set in the native code? This makes the API simpler to use.

            Good point. I feel that hardening the native code would be a separate effort. However, it is still good practice to set decoder to null in close(), as is done for other fields in the same close(), so that the fix stays independent of the EC coder implementation. I will file a follow-on JIRA to take care of the native code part.

            what's the behavior of release() for the Java coder implementation?

            Do you mean the non-native coders? release() is only implemented in NativeRSRawDecoder and NativeXORRawDecoder, where it frees the memory of the coder struct in C.

            andrew.wang Andrew Wang added a comment -

            Hi Eddy,

            Good point. I feel that hardening the native code would be a separate effort. However, it is still good practice to set decoder to null in close(), as is done for other fields in the same close(), so that the fix stays independent of the EC coder implementation. I will file a follow-on JIRA to take care of the native code part.

            Sure, the belt-and-suspenders approach to safety

            Do you mean the non-native coders? release() is only implemented in NativeRSRawDecoder and NativeXORRawDecoder, where it frees the memory of the coder struct in C.

            Yea, so the empty inherited implementation is idempotent.

            I'm +1 on this change, as long as in the follow-on we also update the javadoc to clarify the idempotency of these APIs and have appropriate unit tests.
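
To make the distinction in this exchange concrete, the following sketch contrasts an empty inherited release(), which is trivially idempotent, with a native-backed override that guards its own handle. None of this is Hadoop code: `BaseCoder`, `NativeBackedCoder`, the `handle` field, and the `frees` counter are hypothetical, and the guard is only one way the follow-on hardening of the coder itself could look.

```java
class ReleaseSketch {

  /** Hypothetical base coder: release() holds nothing, so repeats are harmless. */
  static class BaseCoder {
    void release() {
      // no resources to free
    }
  }

  /** Hypothetical native-backed coder; the long stands in for a C pointer. */
  static class NativeBackedCoder extends BaseCoder {
    private long handle;
    int frees = 0; // counts simulated native frees

    NativeBackedCoder(long handle) {
      this.handle = handle;
    }

    @Override
    synchronized void release() {
      if (handle == 0) {
        return; // already released, nothing to free
      }
      frees++;    // stands in for the native call that frees the struct
      handle = 0; // mark as released so a repeat call is a no-op
    }
  }

  public static void main(String[] args) {
    BaseCoder plain = new BaseCoder();
    plain.release();
    plain.release(); // harmless: the base implementation is empty

    NativeBackedCoder nativeCoder = new NativeBackedCoder(0x1234L);
    nativeCoder.release();
    nativeCoder.release(); // guarded: no second free
    System.out.println("simulated frees: " + nativeCoder.frees);
  }
}
```

Guarding inside the coder and nulling the field in the stream's close() are complementary, which is the belt-and-suspenders arrangement described above.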

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote  Subsystem     Runtime   Comment
            0     reexec        0m 18s    Docker mode activated.
                  Prechecks
            +1    @author       0m 0s     The patch does not contain any @author tags.
            +1    test4tests    0m 0s     The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            0     mvndep        0m 8s     Maven dependency ordering for branch
            +1    mvninstall    15m 9s    trunk passed
            +1    compile       1m 43s    trunk passed
            +1    checkstyle    0m 43s    trunk passed
            +1    mvnsite       1m 41s    trunk passed
            +1    shadedclient  11m 33s   branch has no errors when building and testing our client artifacts.
            +1    findbugs      3m 32s    trunk passed
            +1    javadoc       1m 10s    trunk passed
                  Patch Compile Tests
            0     mvndep        0m 8s     Maven dependency ordering for patch
            +1    mvninstall    1m 35s    the patch passed
            +1    compile       1m 42s    the patch passed
            +1    javac         1m 42s    the patch passed
            +1    checkstyle    0m 41s    the patch passed
            +1    mvnsite       1m 40s    the patch passed
            +1    whitespace    0m 0s     The patch has no whitespace issues.
            +1    shadedclient  9m 33s    patch has no errors when building and testing our client artifacts.
            +1    findbugs      3m 27s    the patch passed
            +1    javadoc       0m 59s    the patch passed
                  Other Tests
            +1    unit          1m 13s    hadoop-hdfs-client in the patch passed.
            -1    unit          91m 57s   hadoop-hdfs in the patch failed.
            +1    asflicense    0m 24s    The patch does not generate ASF License warnings.
                                148m 15s



            Reason              Tests
            Failed junit tests  hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication



            Subsystem       Report/Notes
            Docker          Image:yetus/hadoop:71bbb86
            JIRA Issue      HDFS-12606
            JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12890830/HDFS-12606.00.patch
            Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname           Linux b00ee514a997 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
            Build tool      maven
            Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision    trunk / 2b08a1f
            Default Java    1.8.0_144
            findbugs        v3.1.0-RC1
            unit            https://builds.apache.org/job/PreCommit-HDFS-Build/21577/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results    https://builds.apache.org/job/PreCommit-HDFS-Build/21577/testReport/
            modules         C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
            Console output  https://builds.apache.org/job/PreCommit-HDFS-Build/21577/console
            Powered by      Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            eddyxu Lei (Eddy) Xu added a comment -

            Thanks for the reviews, andrew.wang

            Committed to trunk and branch-3.0

            hudson Hudson added a comment -

            SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13052 (See https://builds.apache.org/job/Hadoop-trunk-Commit/13052/)
            HDFS-12606. When using native decoder, DFSStripedStream.close crashes (lei: rev 46644319e1b3295ddbc7597c060956bf46487d11)

            • (edit) hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java
            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSStripedInputStream.java

            People

              Assignee: eddyxu Lei (Eddy) Xu
              Reporter: eddyxu Lei (Eddy) Xu