Hadoop HDFS / HDFS-12832

INode.getFullPathName may throw ArrayIndexOutOfBoundsException, leading to NameNode exit

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.4, 3.0.0-beta1
    • Fix Version/s: 2.8.3, 2.7.5, 3.1.0, 2.10.0, 2.9.1, 3.0.1
    • Component/s: namenode
    • Labels: None

      Description

      INode.java
      public String getFullPathName() {
        // Get the full path name of this inode.
        if (isRoot()) {
          return Path.SEPARATOR;
        }
        // compute size of needed bytes for the path
        int idx = 0;
        for (INode inode = this; inode != null; inode = inode.getParent()) {
          // add component + delimiter (if not tail component)
          idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
        }
        byte[] path = new byte[idx];
        for (INode inode = this; inode != null; inode = inode.getParent()) {
          if (inode != this) {
            path[--idx] = Path.SEPARATOR_CHAR;
          }
          byte[] name = inode.getLocalNameBytes();
          idx -= name.length;
          System.arraycopy(name, 0, path, idx, name.length);
        }
        return DFSUtil.bytes2String(path);
      }
      

      We hit an ArrayIndexOutOfBoundsException at System.arraycopy(name, 0, path, idx, name.length) while the ReplicationMonitor was working, and the NameNode exited.

      The two loops are not synchronized against concurrent namespace changes: the path's length can change between the sizing pass and the copy pass.
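
      To see concretely how the race produces the exception, here is a minimal, self-contained sketch (the Node class and fullPath() method are hypothetical stand-ins for INode and getFullPathName(), not NameNode code). If a rename lengthens a path component between the sizing pass and the copy pass, idx goes negative and System.arraycopy throws:

      PathRaceDemo.java
      public class PathRaceDemo {
        static class Node {
          volatile byte[] name;
          final Node parent;
          Node(String name, Node parent) {
            this.name = name.getBytes();
            this.parent = parent;
          }
        }

        // Same two-pass shape as INode.getFullPathName().
        static String fullPath(Node leaf, Runnable betweenPasses) {
          int idx = 0;
          for (Node n = leaf; n != null; n = n.parent) {
            idx += n.name.length + (n != leaf ? 1 : 0);
          }
          byte[] path = new byte[idx];
          betweenPasses.run(); // simulate a racing rename at this point
          for (Node n = leaf; n != null; n = n.parent) {
            if (n != leaf) {
              path[--idx] = '/';
            }
            byte[] name = n.name;
            idx -= name.length; // negative if a component grew after sizing
            System.arraycopy(name, 0, path, idx, name.length); // throws here
          }
          return new String(path);
        }

        public static void main(String[] args) {
          Node root = new Node("", null);
          Node dir = new Node("dir", root);
          Node file = new Node("file", dir);
          // Lengthening "dir" between the passes gives arraycopy a negative
          // destination offset, i.e. an ArrayIndexOutOfBoundsException.
          fullPath(file, () -> dir.name = "dir-renamed-longer".getBytes());
        }
      }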

      Attachments

      1. HDFS-12832-branch-2.7.002.patch
        3 kB
        Konstantin Shvachko
      2. HDFS-12832-branch-2.002.patch
        3 kB
        Konstantin Shvachko
      3. HDFS-12832.002.patch
        5 kB
        Konstantin Shvachko
      4. exception.log
        1 kB
        DENG FEI
      5. HDFS-12832-trunk-001.patch
        1 kB
        DENG FEI

          Activity

          djp Junping Du added a comment -

          Merge to branch-2.8.3 as well.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13288 (See https://builds.apache.org/job/Hadoop-trunk-Commit/13288/)
          HDFS-12832. INode.getFullPathName may throw (shv: rev d331762f24b3f22f609366740c9c4f449edc61ac)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/ErasureCodingWork.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/ReplicationWork.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockReconstructionWork.java
          shv Konstantin Shvachko added a comment -

          Just committed to the following branches:

             5c37a0b..9d406e5  branch-2 -> branch-2
             0da13b9..7252e18  branch-2.7 -> branch-2.7
             2976d04..3219b1b  branch-2.8 -> branch-2.8
             83c4075..3f9f005  branch-2.9 -> branch-2.9
             a4f1e30..4cbd5ea  branch-3.0 -> branch-3.0
             30941d9..d331762  trunk -> trunk
          

          LMK if I missed any. Thanks everybody for reviews and comments.

          djp Junping Du added a comment -

          Thanks Konstantin Shvachko for the patch and Zhe Zhang for review. I think we also need this fix in branch-2.8 (branch-2.8.3) and branch-2.9. Adding 2.8.3, 2.9.1 and 3.0.0 in target version.

          zhz Zhe Zhang added a comment -

          +1 on branch-2 and branch-2.7 patches as well. They are pretty clean fixes to avoid race conditions on updating and reading bc's members.

          zhz Zhe Zhang added a comment -

          Thanks Konstantin Shvachko for the fix. Trunk patch LGTM, +1. I'll finish reviewing branch patches tomorrow.

          shahrs87 Rushabh S Shah added a comment -

          I think if the name is calculated it must be inside the lock, otherwise it's a bug.

          Agreed. +1 non-binding.

          xkrogen Erik Krogen added a comment -

          I think it is also relevant to point out that HDFS-10674 made that comment somewhat less true, as it makes the #getName() code path less expensive. Certainly still worth considering, but hopefully the impact is mitigated. I agree with Konstantin that getting the safe fix in to unblock releases is the right move, followed by a follow-on JIRA to measure the performance implications of removing srcPath from the API and, depending on the results, deprecating the current API.

          shv Konstantin Shvachko added a comment -

          Hi Rushabh S Shah, this is what I called a "dirty" fix; see my earlier comment and the discussion afterwards. If you know the story behind the comment about costly filename calculation, please let us know. It's pretty old; I cannot see beyond the project merge point.
          I think if the name is calculated it must be inside the lock, otherwise it's a bug. We need to unblock upcoming releases, which depend on this issue, and the safest way to fix this is by not changing any logic in block placement. Even though standard Hadoop placement policies do not use the file name, we do not know what other policies are out there. Also, we cannot remove APIs before they are deprecated.

          shahrs87 Rushabh S Shah added a comment -

          Thanks DENG FEI for reporting the issue and Konstantin Shvachko for the patch.

           // choose replication targets: NOT HOLDING THE GLOBAL LOCK
          // It is costly to extract the filename for which chooseTargets is called,
          // so for now we pass in the Inode itself.
          

          If we are not going to use the srcPath anywhere in BlockPlacementPolicy#chooseTarget, then why even compute bc#getName()?
          The above comment was specifically made to move this computation outside the lock; now we are moving it inside the lock, and for a value that is never going to be used.
          IMO we should set the srcPath to null and add a warning in the code referring to this jira, so future programmers would know the background on why it's set to null.
          In trunk, we can deprecate BlockPlacementPolicy.chooseTargets() with srcPath parameter as you mentioned in HDFS-12856.

          xkrogen Erik Krogen added a comment -

          Hey Konstantin Shvachko, changes look good to me and thank you for filing the two follow-on JIRAs to address these rare but potentially disastrous race conditions. The lack of additional unit test seems very reasonable to me. I wonder if there is somewhere we should add developer documentation to indicate that this type of usage pattern is not safe?

          shv Konstantin Shvachko added a comment -

          TestUnbuffer is tracked under HDFS-12815. Other tests succeeded locally.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 18m 47s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
                branch-2.7 Compile Tests
          +1 mvninstall 8m 29s branch-2.7 passed
          +1 compile 1m 5s branch-2.7 passed
          +1 checkstyle 0m 27s branch-2.7 passed
          +1 mvnsite 1m 6s branch-2.7 passed
          +1 findbugs 3m 5s branch-2.7 passed
          +1 javadoc 1m 48s branch-2.7 passed
                Patch Compile Tests
          +1 mvninstall 0m 55s the patch passed
          +1 compile 1m 2s the patch passed
          +1 javac 1m 2s the patch passed
          +1 checkstyle 0m 23s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 188 unchanged - 1 fixed = 188 total (was 189)
          +1 mvnsite 0m 58s the patch passed
          -1 whitespace 0m 0s The patch has 60 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 findbugs 3m 8s the patch passed
          +1 javadoc 1m 43s the patch passed
                Other Tests
          -1 unit 76m 32s hadoop-hdfs in the patch failed.
          -1 asflicense 0m 50s The patch generated 7 ASF License warnings.
          121m 59s



          Reason Tests
          Unreaped Processes hadoop-hdfs:17
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeInitStorage
            hadoop.hdfs.server.datanode.TestRefreshNamenodes
            hadoop.hdfs.server.namenode.TestNamenodeRetryCache
          Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestHSync
            org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics
            org.apache.hadoop.hdfs.server.datanode.TestReadOnlySharedStorage
            org.apache.hadoop.hdfs.TestPread
            org.apache.hadoop.hdfs.TestDecommission
            org.apache.hadoop.hdfs.TestDFSAddressConfig
            org.apache.hadoop.hdfs.TestDFSUpgrade
            org.apache.hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations
            org.apache.hadoop.hdfs.server.datanode.TestIncrementalBlockReports
            org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache
            org.apache.hadoop.hdfs.TestDFSRollback
            org.apache.hadoop.hdfs.server.datanode.TestBlockScanner
            org.apache.hadoop.hdfs.server.datanode.TestCachingStrategy
            org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            org.apache.hadoop.hdfs.TestAbandonBlock



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:67e87c9
          JIRA Issue HDFS-12832
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12898948/HDFS-12832-branch-2.7.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
          uname Linux 48182a918b6e 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision branch-2.7 / 0da13b9
          maven version: Apache Maven 3.0.5
          Default Java 1.7.0_151
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/22172/artifact/out/whitespace-eol.txt
          Unreaped Processes Log https://builds.apache.org/job/PreCommit-HDFS-Build/22172/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-reaper.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/22172/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/22172/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/22172/artifact/out/patch-asflicense-problems.txt
          Max. process+thread count 3747 (vs. ulimit of 5000)
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/22172/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 24m 42s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
                branch-2.7 Compile Tests
          +1 mvninstall 5m 34s branch-2.7 passed
          +1 compile 1m 1s branch-2.7 passed
          +1 checkstyle 0m 23s branch-2.7 passed
          +1 mvnsite 0m 57s branch-2.7 passed
          +1 findbugs 2m 37s branch-2.7 passed
          +1 javadoc 1m 48s branch-2.7 passed
                Patch Compile Tests
          +1 mvninstall 0m 53s the patch passed
          +1 compile 1m 1s the patch passed
          +1 javac 1m 1s the patch passed
          +1 checkstyle 0m 23s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 188 unchanged - 1 fixed = 188 total (was 189)
          +1 mvnsite 0m 56s the patch passed
          -1 whitespace 0m 0s The patch has 60 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          +1 findbugs 2m 55s the patch passed
          +1 javadoc 1m 47s the patch passed
                Other Tests
          -1 unit 80m 19s hadoop-hdfs in the patch failed.
          -1 asflicense 0m 56s The patch generated 1 ASF License warnings.
          127m 58s



          Reason Tests
          Unreaped Processes hadoop-hdfs:19
          Failed junit tests hadoop.hdfs.TestClientReportBadBlock
            hadoop.hdfs.web.TestWebHdfsTimeouts
            hadoop.hdfs.web.TestWebHdfsTokens
          Timed out junit tests org.apache.hadoop.hdfs.TestSetrepDecreasing
            org.apache.hadoop.hdfs.TestFileAppend4
            org.apache.hadoop.hdfs.TestRollingUpgradeDowngrade
            org.apache.hadoop.hdfs.TestReadWhileWriting
            org.apache.hadoop.hdfs.TestFileCorruption
            org.apache.hadoop.hdfs.TestLease
            org.apache.hadoop.hdfs.TestHDFSServerPorts
            org.apache.hadoop.hdfs.TestDFSUpgrade
            org.apache.hadoop.hdfs.web.TestWebHDFS
            org.apache.hadoop.hdfs.TestAppendSnapshotTruncate
            org.apache.hadoop.hdfs.TestRollingUpgradeRollback
            org.apache.hadoop.hdfs.TestFSOutputSummer
            org.apache.hadoop.hdfs.TestMiniDFSCluster
            org.apache.hadoop.hdfs.TestBlockReaderFactory
            org.apache.hadoop.hdfs.TestHFlush
            org.apache.hadoop.hdfs.TestEncryptedTransfer
            org.apache.hadoop.hdfs.TestDFSShell
            org.apache.hadoop.hdfs.TestDataTransferProtocol
            org.apache.hadoop.hdfs.TestHDFSTrash



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:17213a0
          JIRA Issue HDFS-12832
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12898948/HDFS-12832-branch-2.7.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
          uname Linux 5676fa1f6fe3 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision branch-2.7 / 0da13b9
          maven version: Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T16:41:47+00:00)
          Default Java 1.7.0_151
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/22171/artifact/out/whitespace-eol.txt
          Unreaped Processes Log https://builds.apache.org/job/PreCommit-HDFS-Build/22171/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-reaper.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/22171/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/22171/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/22171/artifact/out/patch-asflicense-problems.txt
          Max. process+thread count 4920 (vs. ulimit of 5000)
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/22171/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 10m 52s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
                trunk Compile Tests
          +1 mvninstall 19m 24s trunk passed
          +1 compile 1m 8s trunk passed
          +1 checkstyle 0m 41s trunk passed
          +1 mvnsite 1m 5s trunk passed
          +1 shadedclient 11m 36s branch has no errors when building and testing our client artifacts.
          +1 findbugs 1m 50s trunk passed
          +1 javadoc 0m 49s trunk passed
                Patch Compile Tests
          +1 mvninstall 0m 55s the patch passed
          +1 compile 0m 49s the patch passed
          +1 javac 0m 49s the patch passed
          +1 checkstyle 0m 36s the patch passed
          +1 mvnsite 0m 53s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 shadedclient 10m 51s patch has no errors when building and testing our client artifacts.
          +1 findbugs 1m 56s the patch passed
          +1 javadoc 0m 47s the patch passed
                Other Tests
          -1 unit 91m 7s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 24s The patch does not generate ASF License warnings.
          155m 20s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
            hadoop.fs.TestUnbuffer
            hadoop.hdfs.qjournal.server.TestJournalNodeSync



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-12832
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12898935/HDFS-12832.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
          uname Linux 6f585d9ef76d 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / 738d1a2
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_151
          findbugs v3.1.0-RC1
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/22170/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/22170/testReport/
          Max. process+thread count 3882 (vs. ulimit of 5000)
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/22170/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          shv Konstantin Shvachko added a comment -

          Also filed HDFS-12856 to further address Erik Krogen's comment above, which remains valid after fixing the ArrayIndexOutOfBoundsException addressed in this jira.

          shv Konstantin Shvachko added a comment -

          branch-2.7 patch. A few more changes, as I reorganized members of ReplicationWork to match branch-2.

          shv Konstantin Shvachko added a comment -

          Branch-2 patch is different from trunk and 3.0, but works for 2.8.

          shv Konstantin Shvachko added a comment -

          Filed HDFS-12855.

          shv Konstantin Shvachko added a comment - edited

          The 002 patch fixes the issue of calling getFullPathName() without holding the lock. The fix is to compute getFullPathName() inside the BlockReconstructionWork constructor, which is called under the lock, just as explained earlier.
          I did not write a unit test, as we would have to perform major surgery on FSNamesystem in order to get exactly that ArrayIndexOutOfBoundsException.
          Instead I looked through all calls of getFullPathName(), making sure each is done under a lock. Found one place where it is not - in Fsck. Will file a jira for that shortly.
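
          The shape of that fix, as a hedged sketch (the stub interface and class names below are illustrative stand-ins, not the exact committed code): everything chooseTargets() needs is snapshotted in the constructor, which runs under the lock, so the unlocked phase never touches the mutable INode tree.

          interface BlockCollection {         // stand-in for the real interface
            String getName();                 // resolves to INode.getFullPathName()
            byte getStoragePolicyID();
          }

          class ReconstructionWorkSketch {
            private final String srcPath;        // captured while the lock is held
            private final byte storagePolicyID;  // captured while the lock is held

            // The caller constructs this object under the global FSNamesystem
            // lock, so the INode traversal inside getName() sees a stable tree.
            ReconstructionWorkSketch(BlockCollection bc) {
              this.srcPath = bc.getName();
              this.storagePolicyID = bc.getStoragePolicyID();
            }

            // Runs later WITHOUT the lock; reads only the immutable snapshot.
            void chooseTargets() {
              System.out.println("choosing targets for " + srcPath
                  + " with policy " + storagePolicyID);
            }
          }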

          vinayrpet Vinayakumar B added a comment -

          A more accurate approach is to remove ReplicationWork.bc and replace it with two fields, srcPath and storagePolicyID, which are the only things needed from bc and which can be computed inside ReplicationWork() from bc. The latter is safe since it is constructed under the lock. That way we avoid breaking the private interface BlockPlacementPolicy, if it matters.

          +1

          shv Konstantin Shvachko added a comment - edited

          The funny thing is that in BlockPlacementPolicyDefault.chooseTarget(String srcPath, ...) the srcPath parameter is completely redundant, as it is not used at all.

          I'm wondering if there are any policies outside that take file name into account while choosing a target for a block?

          So a dirty fix would be to pass src = null and avoid calling bc.getName() completely:

              private void chooseTargets(BlockPlacementPolicy blockplacement,
                  BlockStoragePolicySuite storagePolicySuite,
                  Set<Node> excludedNodes) {
                try {
          -       targets = blockplacement.chooseTarget(bc.getName(),
          +       targets = blockplacement.chooseTarget(null,
                      additionalReplRequired, srcNode, liveReplicaStorages, false,
                      excludedNodes, block.getNumBytes(),
                      storagePolicySuite.getPolicy(bc.getStoragePolicyID()));
          

          A more accurate approach is to remove ReplicationWork.bc and replace it with two fields, srcPath and storagePolicyID, which are the only things needed from bc and which can be computed inside ReplicationWork() from bc. The latter is safe since it is constructed under the lock. That way we avoid breaking the private interface BlockPlacementPolicy, if it matters.

          Deng FEI DENG FEI added a comment -

          Erik Krogen
          I have uploaded the stack trace; it happened at ReplicationWork#chooseTargets() indeed.
          And you are right: INode#getPathComponents() has the same problem with concurrent moves, not only renames.

          shv Konstantin Shvachko added a comment -

          DENG FEI could you please post the actual exception here if you have it. It would be good to see a stack trace.

          xkrogen Erik Krogen added a comment - edited

          Thanks for reporting this and for working on a patch, DENG FEI! Actually, the new method you are using, INode#getPathComponents(), is subject to the same race condition. Generally, INode is not meant to be a concurrent data structure, as far as I can tell. I believe the issue is actually that ReplicationWork#chooseTargets() is being called without a lock:

          BlockManager.ReplicationWork
                // choose replication targets: NOT HOLDING THE GLOBAL LOCK
                // It is costly to extract the filename for which chooseTargets is called,
                // so for now we pass in the block collection itself.
                rw.chooseTargets(blockplacement, storagePolicySuite, excludedNodes);
          

          Within chooseTargets(), various methods on INode/BlockCollection, DatanodeDescriptor, DatanodeStorageInfo, and Block are called, which it seems should not be allowed outside of the lock.

          DENG FEI, do you have a stack trace available to confirm that this is the same code path which caused your exception? This is the code path that was taken to trigger the issue for us.
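
          For call sites that cannot rely on a constructor running under the lock, the traversal can be guarded explicitly. FSNamesystem does expose readLock()/readUnlock(), but the helper below is an illustrative sketch, not code from the patch:

          // Hypothetical helper: guard the INode traversal with the read lock.
          static String fullPathUnderLock(FSNamesystem ns, BlockCollection bc) {
            ns.readLock();
            try {
              return bc.getName(); // safe: the tree cannot change while locked
            } finally {
              ns.readUnlock();
            }
          }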

          Deng FEI DENG FEI added a comment - edited

          Uploaded the first patch for review. Thanks.


            People

            • Assignee: Konstantin Shvachko
            • Reporter: DENG FEI
            • Votes: 0
            • Watchers: 13
