Hadoop HDFS / HDFS-13359

DataXceiver hung due to the lock in FsDatasetImpl#getBlockInputStream

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1
    • Fix Version/s: 3.3.0, 3.2.1, 3.1.3
    • Component/s: datanode
    • Labels: None

    Description

      DataXceiver threads hang on the lock taken in
      FsDatasetImpl#getBlockInputStream (stack trace attached).

        @Override // FsDatasetSpi
        public InputStream getBlockInputStream(ExtendedBlock b,
            long seekOffset) throws IOException {
      
          ReplicaInfo info;
          synchronized(this) {
            info = volumeMap.get(b.getBlockPoolId(), b.getLocalBlock());
          }
          ...
        }
      

      The synchronized(this) lock used here is expensive. There is already an AutoCloseableLock defined for ReplicaMap, so we can use that instead.
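
The proposed change can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: SketchAutoCloseableLock is a simplified stand-in for the role Hadoop's AutoCloseableLock plays, and SketchDataset is a hypothetical miniature of FsDatasetImpl that uses a plain string-keyed map in place of ReplicaMap. Only the field name datasetLock comes from this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Simplified stand-in (an assumption) for an auto-closeable lock wrapper. */
class SketchAutoCloseableLock implements AutoCloseable {
  private final ReentrantLock lock = new ReentrantLock();

  /** Blocks until the lock is held, then returns this for try-with-resources. */
  SketchAutoCloseableLock acquire() {
    lock.lock();
    return this;
  }

  /** Releases the lock; called automatically at the end of the try block. */
  @Override
  public void close() {
    lock.unlock();
  }

  boolean isHeldByCurrentThread() {
    return lock.isHeldByCurrentThread();
  }
}

/** Hypothetical miniature of the dataset: one map guarded by one shared lock. */
class SketchDataset {
  private final SketchAutoCloseableLock datasetLock = new SketchAutoCloseableLock();
  private final Map<String, String> volumeMap = new HashMap<>();

  void addReplica(String blockId, String replicaInfo) {
    try (SketchAutoCloseableLock l = datasetLock.acquire()) {
      volumeMap.put(blockId, replicaInfo);
    }
  }

  /** Looks up a replica under datasetLock instead of synchronized(this). */
  String lookupReplica(String blockId) {
    String info;
    try (SketchAutoCloseableLock l = datasetLock.acquire()) {
      info = volumeMap.get(blockId);
    }
    // The lock is already released here, so slow work on info does not hold it.
    return info;
  }

  boolean lockHeldByCurrentThread() {
    return datasetLock.isHeldByCurrentThread();
  }
}
```

The lookup holds the lock only for the map access, and try-with-resources guarantees the release even if the lookup throws.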

      Attachments

        1. stack.jpg
          249 kB
          Yiqun Lin
        2. HDFS-13359.001.patch
          0.8 kB
          Yiqun Lin

          Activity

            linyiqun Yiqun Lin added a comment - - edited

            Patch attached. Just using datasetLock to replace the synchronized lock.

            genericqa genericqa added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 21s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
                  trunk Compile Tests
            +1 mvninstall 22m 47s trunk passed
            +1 compile 0m 49s trunk passed
            +1 checkstyle 0m 46s trunk passed
            +1 mvnsite 0m 56s trunk passed
            +1 shadedclient 10m 53s branch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 40s trunk passed
            +1 javadoc 0m 46s trunk passed
                  Patch Compile Tests
            +1 mvninstall 0m 52s the patch passed
            +1 compile 0m 46s the patch passed
            +1 javac 0m 46s the patch passed
            +1 checkstyle 0m 42s the patch passed
            +1 mvnsite 0m 51s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 9m 35s patch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 47s the patch passed
            +1 javadoc 0m 44s the patch passed
                  Other Tests
            -1 unit 104m 33s hadoop-hdfs in the patch failed.
            +1 asflicense 0m 20s The patch does not generate ASF License warnings.
            158m 56s



            Reason Tests
            Failed junit tests hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
              hadoop.hdfs.web.TestWebHdfsTimeouts
              hadoop.hdfs.server.namenode.ha.TestEditLogTailer
              hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b
            JIRA Issue HDFS-13359
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12916572/HDFS-13359.001.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 5458367062db 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / a71656c
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_151
            findbugs v3.1.0-RC1
            unit https://builds.apache.org/job/PreCommit-HDFS-Build/23698/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
            Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/23698/testReport/
            Max. process+thread count 3153 (vs. ulimit of 10000)
            modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/23698/console
            Powered by Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.


            weichiu Wei-Chiu Chuang added a comment -

            Hi linyiqun, thanks for the patch!

            Could you shed a little more light on why changing from an object lock to a ReentrantLock improves locking? Is it because it is a fair lock?

            Thank you
            linyiqun Yiqun Lin added a comment - - edited

            Thanks weichiu for the comment.

            "Could you shed a little more light on why changing from an object lock to a ReentrantLock improves locking? Is it because it is a fair lock?"

            As lock contention increases, ReentrantLock will perform better than a synchronized lock. By default, ReentrantLock uses a non-fair strategy, the same as synchronized, so fairness is not the real reason for this change.
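
The fairness point above can be checked directly with the standard java.util.concurrent API. A small sketch (the class name FairnessSketch is made up for illustration):

```java
import java.util.concurrent.locks.ReentrantLock;

class FairnessSketch {
  /** The no-argument constructor creates a non-fair lock. */
  static boolean defaultIsFair() {
    return new ReentrantLock().isFair();
  }

  /** Fairness has to be requested explicitly through the boolean constructor. */
  static boolean explicitFairIsFair() {
    return new ReentrantLock(true).isFair();
  }

  public static void main(String[] args) {
    System.out.println("default lock fair?  " + defaultIsFair());
    System.out.println("explicit lock fair? " + explicitFairIsFair());
  }
}
```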


            weichiu Wei-Chiu Chuang added a comment -

            Thanks. I'm not so sure about the performance of ReentrantLock.

            When HDFS-10682 introduced the AutoCloseable ReentrantLock, the stated motivation was:

            "Doing so will make it easier to measure lock statistics like lock held time and warn about potential lock contention due to slow disk operations."

            Do you have a reference to a performance measurement between ReentrantLock and object lock? Just curious and would like to learn more about it.

            Thank you!
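
The lock statistics mentioned in the comment above (lock held time, warnings about long holds) are straightforward to collect once the lock is an explicit object. The sketch below is one possible way to do it and is not Hadoop's implementation: TimedLockSketch, its threshold parameter, and the warning text are all hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical auto-closeable lock that records how long it was held. */
class TimedLockSketch implements AutoCloseable {
  private final ReentrantLock lock = new ReentrantLock();
  private final long warnThresholdNanos;
  private long acquiredAtNanos;        // written only while holding the lock
  private volatile long lastHeldNanos;
  private volatile int longHoldCount;  // incremented only while holding the lock

  TimedLockSketch(long warnThresholdMillis) {
    this.warnThresholdNanos = warnThresholdMillis * 1_000_000L;
  }

  TimedLockSketch acquire() {
    lock.lock();
    // Start the clock only on the outermost acquisition of a reentrant hold.
    if (lock.getHoldCount() == 1) {
      acquiredAtNanos = System.nanoTime();
    }
    return this;
  }

  @Override
  public void close() {
    // Measure only when the outermost hold is about to be released.
    if (lock.getHoldCount() == 1) {
      long held = System.nanoTime() - acquiredAtNanos;
      lastHeldNanos = held;
      if (held > warnThresholdNanos) {
        longHoldCount++;
        System.err.println("Lock held for " + (held / 1_000_000L) + " ms");
      }
    }
    lock.unlock();
  }

  int getLongHoldCount() {
    return longHoldCount;
  }

  long getLastHeldNanos() {
    return lastHeldNanos;
  }

  /** Holds the lock across a 20 ms sleep with a 5 ms warning threshold. */
  static int demo() {
    TimedLockSketch timed = new TimedLockSketch(5);
    try (TimedLockSketch l = timed.acquire()) {
      Thread.sleep(20); // stands in for a slow disk operation
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return timed.getLongHoldCount();
  }
}
```

A synchronized block offers no comparable hook, because acquisition and release are not method calls that can be wrapped.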
            linyiqun Yiqun Lin added a comment - - edited

            weichiu, thanks for the reference to HDFS-10682.

            "Do you have a reference to a performance measurement between ReentrantLock and object lock? Just curious and would like to learn more about it."

            See this link: https://www.ibm.com/developerworks/java/library/j-jtp10264/index.html

            "The ReentrantLock class, which implements Lock, has the same concurrency and memory semantics as synchronized, but also adds features like lock polling, timed lock waits, and interruptible lock waits. Additionally, it offers far better performance under heavy contention."
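
The three features named in the quoted passage correspond to tryLock(), tryLock(timeout, unit), and lockInterruptibly() on java.util.concurrent.locks.ReentrantLock. A small sketch of each, tried against a lock that another thread is holding (LockFeaturesSketch and the thread roles are made up for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class LockFeaturesSketch {
  /** Returns {polled, timed, waiterInterrupted, afterRelease}. */
  static boolean[] probe() {
    ReentrantLock lock = new ReentrantLock();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    CountDownLatch waiterReady = new CountDownLatch(1);
    AtomicBoolean waiterInterrupted = new AtomicBoolean(false);

    // Holds the lock until the main thread allows it to let go.
    Thread holder = new Thread(() -> {
      lock.lock();
      try {
        held.countDown();
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.unlock();
      }
    });

    // Waits for the lock, but can be interrupted out of the wait.
    Thread waiter = new Thread(() -> {
      waiterReady.countDown();
      try {
        lock.lockInterruptibly();
        lock.unlock(); // reached only if the lock was actually acquired
      } catch (InterruptedException e) {
        waiterInterrupted.set(true);
      }
    });

    try {
      holder.start();
      held.await();

      boolean polled = lock.tryLock();                          // lock polling
      boolean timed = lock.tryLock(50, TimeUnit.MILLISECONDS);  // timed lock wait

      waiter.start();
      waiterReady.await();
      waiter.interrupt();                                       // interruptible lock wait
      waiter.join();

      release.countDown();
      holder.join();

      boolean afterRelease = lock.tryLock();
      if (afterRelease) {
        lock.unlock();
      }
      return new boolean[] {polled, timed, waiterInterrupted.get(), afterRelease};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

None of these options exist for a thread blocked on entry to a synchronized block, which waits until the monitor becomes available.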

            sunilg Sunil G added a comment -

            Bulk update: moved all 3.2.0 non-blocker issues, please move back if it is a blocker.

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 0s Docker mode activated.
            -1 patch 0m 6s HDFS-13359 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



            Subsystem Report/Notes
            JIRA Issue HDFS-13359
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/25614/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.


            weichiu Wei-Chiu Chuang added a comment -

            I would love to improve datanode lock contention, especially in the context of dense DataNodes.
            That said, it would be really nice to have a performance benchmark to compare the performance before/after the change.
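
As a rough starting point for the kind of comparison requested above, the sketch below runs the same guarded increment under synchronized and under ReentrantLock. It is a naive harness with hypothetical names (LockBenchSketch, run) and no warm-up or statistical treatment, so its timings are indicative at best. A dedicated benchmarking framework, or a measurement on a real DataNode, would be needed before drawing conclusions.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockBenchSketch {
  /** Runs `threads` workers doing `iterations` guarded increments; returns the final count. */
  static long run(boolean useReentrantLock, int threads, int iterations) {
    final Object monitor = new Object();
    final ReentrantLock lock = new ReentrantLock();
    final long[] counter = new long[1];

    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      workers[t] = new Thread(() -> {
        for (int i = 0; i < iterations; i++) {
          if (useReentrantLock) {
            lock.lock();
            try {
              counter[0]++;
            } finally {
              lock.unlock();
            }
          } else {
            synchronized (monitor) {
              counter[0]++;
            }
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      try {
        w.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return counter[0];
  }

  public static void main(String[] args) {
    int threads = 8;
    int iterations = 1_000_000;
    for (boolean useReentrantLock : new boolean[] {false, true}) {
      long start = System.nanoTime();
      long total = run(useReentrantLock, threads, iterations);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
      System.out.println((useReentrantLock ? "ReentrantLock" : "synchronized ")
          + " total=" + total + " elapsedMs=" + elapsedMs);
    }
  }
}
```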
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 0s Docker mode activated.
            -1 patch 0m 6s HDFS-13359 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



            Subsystem Report/Notes
            JIRA Issue HDFS-13359
            Console output https://builds.apache.org/job/PreCommit-HDFS-Build/27027/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.


            weichiu Wei-Chiu Chuang added a comment -

            Patch still applies.
            +1 I think this is a good improvement regardless. Didn't mean to stall the patch.

            weichiu Wei-Chiu Chuang added a comment -

            Thanks linyiqun!
            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17081 (See https://builds.apache.org/job/Hadoop-trunk-Commit/17081/)
            HDFS-13359. DataXceiver hung due to the lock in (weichiu: rev 8a77a224c734bea0eb490f30c908748458c190c3)

            • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
            qinyuren qinyuren added a comment -

            Hi linyiqun, we have a similar problem with version 3.1, but we are not sure why the synchronized lock causes DataXceiver to hang. We hope to get your answer.

            Thank you

            People

              Assignee: linyiqun Yiqun Lin
              Reporter: linyiqun Yiqun Lin
              Votes: 0
              Watchers: 10