HBase
  HBASE-10

HRegionServer hangs upon exit due to DFSClient Exception

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: regionserver
    • Labels: None
    • Environment: CentOS 5

      Description

      Several HRegionServers hang around indefinitely well after the HMaster has exited. This was triggered by executing $HBASE_HOME/bin/stop-hbase.sh. The HMaster exits fine, but here is what happens on one of the HRegionServers:

      2008-01-02 18:54:01,907 INFO org.apache.hadoop.hbase.HRegionServer: Got regionserver stop message
      2008-01-02 18:54:01,907 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020 closing leases
      2008-01-02 18:54:01,907 INFO org.apache.hadoop.hbase.Leases$LeaseMonitor: regionserver/0.0.0.0:60020.leaseChecker exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020 closed leases
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: Stopping server on 60020
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 60020: exiting
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 60020: exiting
      2008-01-02 18:54:01,909 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60020
      2008-01-02 18:54:01,908 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 60020: exiting
      2008-01-02 18:54:01,909 INFO org.apache.hadoop.hbase.HRegionServer: Stopping infoServer
      2008-01-02 18:54:01,909 DEBUG org.mortbay.util.Container: Stopping org.mortbay.jetty.Server@62c09554
      2008-01-02 18:54:01,909 DEBUG org.mortbay.util.ThreadedServer: closing ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
      2008-01-02 18:54:01,909 DEBUG org.mortbay.util.ThreadedServer: IGNORED
      java.net.SocketException: Socket closed
      at java.net.PlainSocketImpl.socketAccept(Native Method)
      at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
      at java.net.ServerSocket.implAccept(ServerSocket.java:453)
      at java.net.ServerSocket.accept(ServerSocket.java:421)
      at org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432)
      at org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)
      2008-01-02 18:54:01,910 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
      2008-01-02 18:54:01,910 DEBUG org.mortbay.util.ThreadedServer: Self connect to close listener /127.0.0.1:60030
      2008-01-02 18:54:01,911 DEBUG org.mortbay.util.ThreadedServer: problem stopping acceptor /127.0.0.1:
      2008-01-02 18:54:01,911 DEBUG org.mortbay.util.ThreadedServer: problem stopping acceptor /127.0.0.1:
      java.net.ConnectException: Connection refused
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
      at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
      at java.net.Socket.connect(Socket.java:519)
      at java.net.Socket.connect(Socket.java:469)
      at java.net.Socket.<init>(Socket.java:366)
      at java.net.Socket.<init>(Socket.java:209)
      at org.mortbay.util.ThreadedServer$Acceptor.forceStop(ThreadedServer.java:682)
      at org.mortbay.util.ThreadedServer.stop(ThreadedServer.java:557)
      at org.mortbay.http.SocketListener.stop(SocketListener.java:211)
      at org.mortbay.http.HttpServer.doStop(HttpServer.java:781)
      at org.mortbay.util.Container.stop(Container.java:154)
      at org.apache.hadoop.hbase.util.InfoServer.stop(InfoServer.java:237)
      at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:835)
      at java.lang.Thread.run(Thread.java:619)
      2008-01-02 18:54:01,911 INFO org.mortbay.http.SocketListener: Stopped SocketListener on 0.0.0.0:60030
      2008-01-02 18:54:01,912 DEBUG org.mortbay.util.Container: Stopping HttpContext[/static,/static]
      2008-01-02 18:54:01,912 DEBUG org.mortbay.http.handler.AbstractHttpHandler: Stopped org.mortbay.http.handler.ResourceHandler in HttpContext[/static,/static]
      2008-01-02 18:54:02,039 INFO org.mortbay.util.Container: Stopped HttpContext[/static,/static]
      2008-01-02 18:54:02,039 DEBUG org.mortbay.util.Container: Stopping HttpContext[/logs,/logs]
      2008-01-02 18:54:02,039 DEBUG org.mortbay.http.handler.AbstractHttpHandler: Stopped org.mortbay.http.handler.ResourceHandler in HttpContext[/logs,/logs]
      2008-01-02 18:54:02,154 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs]
      2008-01-02 18:54:02,154 DEBUG org.mortbay.util.Container: Stopping WebApplicationContext[/,/]
      2008-01-02 18:54:02,154 DEBUG org.mortbay.util.Container: Stopping org.mortbay.jetty.servlet.WebApplicationHandler@7ec5495e
      2008-01-02 18:54:02,155 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.servlet.WebApplicationHandler@7ec5495e
      2008-01-02 18:54:02,277 DEBUG org.mortbay.jetty.servlet.AbstractSessionManager: Session scavenger exited
      2008-01-02 18:54:02,278 DEBUG org.mortbay.util.Container: remove component: org.mortbay.jetty.servlet.WebApplicationHandler@7ec5495e
      2008-01-02 18:54:02,278 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/]
      2008-01-02 18:54:02,278 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.Server@62c09554
      2008-01-02 18:54:02,278 DEBUG org.apache.hadoop.hbase.HRegionServer: closing region spider_pages,10_131455761,1198140179439
      2008-01-02 18:54:02,278 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.cacheFlusher exiting
      2008-01-02 18:54:02,278 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.compactor exiting
      2008-01-02 18:54:02,278 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.splitter exiting
      2008-01-02 18:54:02,279 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_131455761,1198140179439/search (1501227429/search)
      2008-01-02 18:54:02,279 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_131455761,1198140179439/profile (1501227429/profile)
      2008-01-02 18:54:02,279 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_131455761,1198140179439/meta (1501227429/meta)
      2008-01-02 18:54:02,279 INFO org.apache.hadoop.hbase.HRegion: closed spider_pages,10_131455761,1198140179439
      2008-01-02 18:54:02,279 DEBUG org.apache.hadoop.hbase.HRegionServer: closing region spider_pages,10_486594261,1198319654267
      2008-01-02 18:54:02,280 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_486594261,1198319654267/search (364081590/search)
      2008-01-02 18:54:02,280 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_486594261,1198319654267/profile (364081590/profile)
      2008-01-02 18:54:02,280 DEBUG org.apache.hadoop.hbase.HStore: closed spider_pages,10_486594261,1198319654267/meta (364081590/meta)
      2008-01-02 18:54:02,280 INFO org.apache.hadoop.hbase.HRegion: closed spider_pages,10_486594261,1198319654267
      ...
      ... this closing of regions goes on for a while
      ...
      ... the following continues until a kill -9
      ...
      2008-01-02 20:39:20,552 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
      2008-01-02 20:40:23,556 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
      2008-01-02 20:41:26,560 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
      2008-01-02 20:42:29,566 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
      2008-01-02 20:43:32,571 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
      ...

      1. patch.txt
        15 kB
        Jim Kellerman

        Issue Links

          Activity

          Jim Kellerman made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]
          Jim Kellerman made changes -
          Status: Patch Available [ 10002 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Jim Kellerman added a comment -

          Addressed comments. Committed.

          Bryan Duxbury added a comment -

          We could use a comment in Flusher explaining why we need the removeFromQueue option in flushRegion. Otherwise, tests pass, +1.

          Jim Kellerman made changes -
          Status: Open [ 1 ] → Patch Available [ 10002 ]
          Jim Kellerman added a comment -

          Passes local tests - Please review.

          Jim Kellerman made changes -
          Attachment: patch.txt [ 12380431 ]
          Jim Kellerman added a comment -

          Doesn't interrupt worker threads unless necessary.

          Jim Kellerman made changes -
          Link: This issue is related to HBASE-500 [ HBASE-500 ]
          Jim Kellerman added a comment (edited) -

          This issue may be related to HBASE-500, although by the time it happens here, all the region server threads have already exited. There is no error from the compactor, but if it was interrupted while in the DFSClient, it could possibly have exited even though the DFSClient was hosed.

          Jim Kellerman made changes -
          Link: This issue is related to HBASE-500 [ HBASE-500 ]
          Jim Kellerman added a comment -

          Proposal:

          The only time any of the flusher, compactor or log roller threads needs to be interrupted is if it is dormant waiting for something to do. Interrupting one of these threads at any other time will probably produce undesirable results like the DFSClient issue. If the thread is not dormant, it will finish what it is doing, go to the top of the loop and exit because the stop flag is set.

          So instead of using synchronized, use a ReentrantLock. The interruptor will use tryLock; if it fails, it knows the thread is working and does not need to be interrupted, because the thread will check the stop flag and exit when it finishes what it is doing.

          If the tryLock succeeds, it interrupts the thread to wake it from its sleep.

          Nothing else needs to be done because the join at the end of HRegionServer.run will take care of coordinating the region server exit.
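          A minimal sketch of the tryLock idea (hypothetical class and names, not the attached patch):

          import java.util.concurrent.locks.ReentrantLock;

          // Hypothetical worker modeled on the proposal above; the real
          // flusher/compactor/log roller code differs in detail.
          class Worker extends Thread {
            private final ReentrantLock lock = new ReentrantLock();
            private volatile boolean stopRequested = false;

            public void run() {
              while (!stopRequested) {
                lock.lock();                // held while doing real work
                try {
                  doWork();                 // flush, compact, roll log, ...
                } finally {
                  lock.unlock();
                }
                try {
                  Thread.sleep(1000);       // dormant, waiting for work
                } catch (InterruptedException e) {
                  // woken by stopWorker(); loop re-checks stopRequested
                }
              }
            }

            // Called from the region server shutdown path.
            void stopWorker() {
              stopRequested = true;
              if (lock.tryLock()) {         // lock free: thread is dormant
                try {
                  interrupt();              // wake it from its sleep
                } finally {
                  lock.unlock();
                }
              }
              // If tryLock fails the thread is mid-work; it will see the
              // stop flag at the top of its loop and exit on its own.
            }

            private void doWork() { /* ... */ }
          }

          With that in place, the join at the end of HRegionServer.run coordinates the actual exit, as noted above.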

          Jim Kellerman added a comment -

          It would be nice if when bugs are entered, the environment field was filled in with at least the hadoop version number.

          For this issue, the hadoop version was 0.16.1 or earlier. This can be determined from the message:

          2008-01-02 20:39:20,552 INFO org.apache.hadoop.fs.DFSClient: Could not obtain block blk_5124700261538503923 from any node: java.io.IOException: No live nodes contain current block
          

          Note how DFSClient misidentifies itself as 'org.apache.hadoop.fs.DFSClient'. DFSClient has always lived in org.apache.hadoop.dfs at least for any release since Q4 2007. The reason it misidentifies itself is due to the way it gets its logger in 0.16.1 and earlier:

          public static final Log LOG = LogFactory.getLog("org.apache.hadoop.fs.DFSClient");
          

          In 0.16.2, this was changed to:

          public static final Log LOG = LogFactory.getLog(DFSClient.class);
          

          The message itself originates from DFSClient$DFSInputStream.chooseDataNode. This method will try 3 times to get the block. It is called from both DFSClient$DFSInputStream.blockSeekTo and DFSClient$DFSInputStream.fetchBlockByteRange.

          blockSeekTo does not do any retries; however, it is called by DFSClient$DFSInputStream.read, which will make a total of 3 attempts at calling blockSeekTo, so a single call to read will generate 9 messages if the block cannot be found.

          fetchBlockByteRange will retry calling chooseDataNode for each known block location. If the replication factor is 3, then it would normally call chooseDataNode 3 times. But suppose the namenode was trying to replicate the block. This could result in the number of possible block locations being greater than three. So at a minimum, chooseDataNode will generate 9 messages when called from fetchBlockByteRange, and will add another 3 messages for each additional entry in the block location list.

          If multiple threads were doing reads, the number of messages generated could be seemingly endless.
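          To make the arithmetic concrete, here is a simplified, hypothetical illustration of the retry nesting (not the actual DFSClient source):

          // Hypothetical illustration only: three chooseDataNode attempts,
          // each logging once, nested inside three read-level retries.
          static final int CHOOSE_DATANODE_ATTEMPTS = 3;
          static final int READ_RETRIES = 3;

          static int messagesForOneFailedRead() {
            int messages = 0;
            for (int readTry = 0; readTry < READ_RETRIES; readTry++) {
              for (int attempt = 0; attempt < CHOOSE_DATANODE_ATTEMPTS; attempt++) {
                messages++;  // "Could not obtain block ... from any node"
              }
            }
            return messages;  // 3 * 3 = 9 log lines for a single failed read
          }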

          stack added a comment -

          I'm fine with this; hadoop has moved on from when this issue was reported.

          I think a timer on HDFS calls on our way out should be a 0.2 issue, not 0.1.

          Jim Kellerman made changes -
          Assignee: Jim Kellerman [ jimk ]
          Jim Kellerman added a comment -

          > stack - 15/Jan/08 12:43 PM
          > What would have fixed it?

          Possibly a new release of hadoop. I don't recall which version of hadoop was out there when this was reported, but I recall that 0.16.1 had a regression in DFSClient.

          > We've seen where when a prob. in hdfs, things like close don't return. Should we put the clean up
          > of hdfs resources into a timer so we'll return eventually?

          Yes, probably. For 0.1.x, or just 0.2?

          Bryan Duxbury made changes -
          Link: This issue is related to HBASE-46 [ HBASE-46 ]
          Bryan Duxbury made changes -
          Fix Version/s: 0.2.0 [ 12312955 ]
          Bryan Duxbury added a comment -

          Is this just a matter of catching and ignoring (or at least logging) DFS errors when we're shutting down? I realize it might have to be on a case-by-case basis, but it seems to make sense. The last thing we want is a braindead region server hanging around messing things up.
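          A rough sketch of that idea (hypothetical helper; HRegion and LOG are assumed from the surrounding region server class, and this is not the committed patch):

          // Hypothetical sketch: log and ignore DFS errors while shutting
          // down instead of letting one bad close abort the exit path.
          private void closeRegionQuietly(final HRegion region) {
            try {
              region.close();
            } catch (java.io.IOException e) {
              LOG.warn("Ignoring DFS error closing region during shutdown", e);
            }
          }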

          Bryan Duxbury made changes -
          Component/s: regionserver [ 12312139 ]
          Owen O'Malley made changes -
          Affects Version/s: 0.15.0 [ 12312565 ]
          Project: Hadoop Core [ 12310240 ] → Hadoop HBase [ 12310753 ]
          Component/s: contrib/hbase [ 12311752 ]
          Key: HADOOP-2526 → HBASE-10
          stack added a comment -

          What would have fixed it? Looks like we get stuck trying to close out hdfs connections. We've seen that when there is a problem in hdfs, things like close don't return. Should we put the cleanup of hdfs resources into a timer so we'll return eventually?
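          One way such a timer could look (hypothetical helper, not existing HBase code): run the hdfs cleanup in its own thread and give up after a deadline so the region server can still exit.

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          // Hypothetical sketch: bound an hdfs cleanup call so a hung
          // DFSClient cannot keep the region server from exiting.
          static void closeWithTimeout(final org.apache.hadoop.fs.FileSystem fs,
              long seconds) {
            ExecutorService exec = Executors.newSingleThreadExecutor();
            Future<?> f = exec.submit(new Runnable() {
              public void run() {
                try {
                  fs.close();
                } catch (java.io.IOException e) {
                  // log and ignore; we are exiting anyway
                }
              }
            });
            try {
              f.get(seconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
              f.cancel(true);  // give up on the close and keep shutting down
            } catch (Exception e) {
              // interrupted or failed; keep shutting down regardless
            } finally {
              exec.shutdownNow();
            }
          }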

          Jim Kellerman added a comment -

          Is this still an issue? Has it occurred since reported?

          Chris Kline created issue -

            People

            • Assignee: Jim Kellerman
            • Reporter: Chris Kline
            • Votes: 0
            • Watchers: 0
