  1. Hadoop HDFS
  2. HDFS-9046

Any Error during BPOfferService run can lead to a missing DN.

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved

    Description

      The cluster is in HA mode and each DN has only one block pool.

      The issue is that after an HA switchover, one DN is missing from the current active NN.
      Upon analysis I found the following exception during the BPOfferService run (logged from BPServiceActor.run()):

      2015-08-21 09:02:11,190 | WARN  | DataNode: [[[DISK]file:/srv/BigData/hadoop/data5/dn/ [DISK]file:/srv/BigData/hadoop/data4/dn/]]  heartbeating to 160-149-0-114/160.149.0.114:25000 | Unexpected exception in block pool Block pool BP-284203724-160.149.0.114-1438774011693 (Datanode Uuid 15ce1dd7-227f-4fd2-9682-091aa6bc2b89) service to 160-149-0-114/160.149.0.114:25000 | BPServiceActor.java:830
      java.lang.OutOfMemoryError: unable to create new native thread
                      at java.lang.Thread.start0(Native Method)
                      at java.lang.Thread.start(Thread.java:714)
                      at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
                      at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
                      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:172)
                      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:221)
                      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:1887)
                      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:669)
                      at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:616)
                      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:856)
                      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:671)
                      at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:822)
                      at java.lang.Thread.run(Thread.java:745)
      

      After this, the affected BPOfferService stays down for the rest of the DN's run time,
      and the corresponding NN no longer has the details of this DN.

      Similar issues are discussed in the following JIRAs:
      https://issues.apache.org/jira/browse/HDFS-2882
      https://issues.apache.org/jira/browse/HDFS-7714

      Can we retry in this case too, with a larger interval, instead of shutting down this BPOfferService?
      Since these exceptions can occur randomly in a DN, I think it is not good to keep the DN running while some NN does not have its info.
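
      The retry idea above can be sketched roughly as follows. This is only a minimal, self-contained illustration and not the actual BPServiceActor code or the attached patch: the class RetryingServiceLoop, the Work interface, and both interval parameters are hypothetical names introduced here. It assumes that one round of work can simply be attempted again after a longer pause when it fails with any Throwable, including an Error such as the OutOfMemoryError in the log above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only: this is NOT the actual BPServiceActor code.
class RetryingServiceLoop {

    /** One round of work, standing in for a heartbeat plus command processing. */
    interface Work {
        void runOnce() throws Exception;
    }

    private final long normalIntervalMs; // pause after a successful round
    private final long errorIntervalMs;  // longer pause after a failed round
    private int failures = 0;

    RetryingServiceLoop(long normalIntervalMs, long errorIntervalMs) {
        this.normalIntervalMs = normalIntervalMs;
        this.errorIntervalMs = errorIntervalMs;
    }

    int failures() {
        return failures;
    }

    /** Runs up to maxRounds rounds and returns how many of them succeeded. */
    int run(Work work, int maxRounds) {
        int succeeded = 0;
        for (int round = 0; round < maxRounds; round++) {
            long sleepMs = normalIntervalMs;
            try {
                work.runOnce();
                succeeded++;
            } catch (InterruptedException ie) {
                // An interrupt is a request to stop, so do not retry.
                Thread.currentThread().interrupt();
                break;
            } catch (Throwable t) {
                // Throwable also covers Errors such as OutOfMemoryError.
                // Instead of ending the loop, count the failure and back off.
                failures++;
                sleepMs = errorIntervalMs;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        RetryingServiceLoop loop = new RetryingServiceLoop(1, 5);
        // The second round simulates the failure seen in the log.
        int ok = loop.run(() -> {
            if (calls.incrementAndGet() == 2) {
                throw new OutOfMemoryError("unable to create new native thread (simulated)");
            }
        }, 4);
        System.out.println("succeeded=" + ok + " failures=" + loop.failures());
    }
}
```

      Whether it is safe to keep running after an OutOfMemoryError depends on what state the failed round left behind, so a real change would probably need to limit which errors are retried and how often.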

      Attachments

        1. HDFS-9046_3.patch
          8 kB
          nijel
        2. HDFS-9046_2.patch
          8 kB
          nijel
        3. HDFS-9046_1.patch
          3 kB
          nijel

            People

              nijel nijel
              nijel nijel
              Votes: 0
              Watchers: 10
