HBASE-1052

Stopping a HRegionServer with unflushed cache causes data loss from org.apache.hadoop.hbase.DroppedSnapshotException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.18.0, 0.18.1
    • Fix Version/s: 0.19.0
    • Component/s: regionserver
    • Labels:
      None

      Description

      1. Start an HBase cluster.
      2. Create a table t1: create 't1', {NAME => 'f1'}
      3. Put a cell in the table: put 't1', 'r1', 'f1:', 'value'
      4. Scan it and see that it's fine.
      5. Stop the HRegionServer hosting the t1 region: hbase/bin/hbase-daemon.sh stop regionserver
      6. Watch the region being reassigned from the original HRegionServer.
      7. Scan the t1 table again. It's empty now.

      If the cache is flushed between step 4 and step 5 (e.g. by an HBase cluster restart), no data is lost. However, if you stop a region server with a dirty cache, you will lose some data.

      HRegionServer log after issuing hbase-daemon.sh stop regionserver:

      2008-12-09 06:37:46,873 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020: exiting
      2008-12-09 06:37:46,873 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 60020: exiting
      2008-12-09 06:37:46,873 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 60020: exiting
      2008-12-09 06:37:46,873 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 60020: exiting
      2008-12-09 06:37:46,874 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
      2008-12-09 06:37:46,874 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
      2008-12-09 06:37:46,874 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
      2008-12-09 06:37:46,886 INFO org.mortbay.http.SocketListener: Stopped SocketListener on 0.0.0.0:60030
      2008-12-09 06:37:46,948 INFO org.mortbay.util.Container: Stopped HttpContext[/static,/static]
      2008-12-09 06:37:47,007 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs]
      2008-12-09 06:37:47,007 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.servlet.WebApplicationHandler@60ded0f0
      2008-12-09 06:37:47,094 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/]
      2008-12-09 06:37:47,094 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.Server@6490832e
      2008-12-09 06:37:47,094 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: closing region t1,,1228833363456
      2008-12-09 06:37:47,094 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Compactions and cache flushes disabled for region t1,,1228833363456
      2008-12-09 06:37:47,094 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Scanners disabled for region t1,,1228833363456
      2008-12-09 06:37:47,094 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more active scanners for region t1,,1228833363456
      2008-12-09 06:37:47,095 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region t1,,1228833363456
      2008-12-09 06:37:47,095 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region t1,,1228833363456
      2008-12-09 06:37:47,095 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memcache flush for region t1,,1228833363456. Current region memcache size 18.0
      2008-12-09 06:37:47,095 INFO org.apache.hadoop.hbase.regionserver.Flusher: regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher exiting
      2008-12-09 06:37:47,096 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
      2008-12-09 06:37:47,096 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver/0:0:0:0:0:0:0:0:60020.compactor exiting
      2008-12-09 06:37:47,099 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: error closing region t1,,1228833363456
      org.apache.hadoop.hbase.DroppedSnapshotException: region: t1,,1228833363456
      at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1071)
      at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:619)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.closeAllRegions(HRegionServer.java:951)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:459)
      at java.lang.Thread.run(Thread.java:619)
      Caused by: java.io.IOException: Filesystem closed
      at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
      at org.apache.hadoop.dfs.DFSClient.getFileInfo(DFSClient.java:564)
      at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:390)
      at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
      at org.apache.hadoop.hbase.regionserver.HStoreFile.<init>(HStoreFile.java:152)
      at org.apache.hadoop.hbase.regionserver.HStore.internalFlushCache(HStore.java:599)
      at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:577)
      at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1058)
      ... 4 more
      2008-12-09 06:37:47,100 DEBUG org.apache.hadoop.hbase.regionserver.HLog: closing log writer in hdfs://h1:54310/hbase/log_10.131.237.51_1228833326838_60020
      2008-12-09 06:37:47,101 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
      java.io.IOException: Filesystem closed
      at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
      at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2689)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2655)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
      at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
      at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:962)
      at org.apache.hadoop.hbase.regionserver.HLog.close(HLog.java:349)
      at org.apache.hadoop.hbase.regionserver.HLog.closeAndDelete(HLog.java:333)
      at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:461)
      at java.lang.Thread.run(Thread.java:619)
      2008-12-09 06:37:47,102 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: telling master that region server is shutting down at: 10.131.237.51:60020
      2008-12-09 06:37:47,104 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at: 10.131.237.51:60020
      2008-12-09 06:37:47,882 INFO org.apache.hadoop.hbase.Leases: regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closing leases
      2008-12-09 06:37:47,882 INFO org.apache.hadoop.hbase.Leases: regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker closed leases
      2008-12-09 06:37:54,919 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
      2008-12-09 06:37:54,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/0:0:0:0:0:0:0:0:60020 exiting
      2008-12-09 06:37:54,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete

        Activity

        Jim Kellerman added a comment -

        This cannot be fixed in the hbase 0.18.x branch because it depends on features from hadoop-0.19. It has been fixed in trunk. See HBASE-728

        stack added a comment -

        If I'm reading this properly, Cosmin is shutting the table down 'nicely'; it ain't crashing out. You'd think the commit log would be tidily closed and on redeploy it should have been replayed by the new HRS? If so, this issue looks like it has nought to do w/ appends and might exist in TRUNK?

        Jim Kellerman added a comment -

        Reopening. Cache flush did not complete before FileSystem was closed.

        Jim Kellerman added a comment -

        Stack is correct. I missed the "nice" shutdown part, my bad.

        Looks like there is a timing issue during shutdown where the FileSystem gets closed before all cache flushes complete.
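
        The ordering problem described above can be illustrated with a small, self-contained sketch. This is not HBase or Hadoop code: ToyFileSystem, ToyRegion, and shutdown are hypothetical stand-ins invented for illustration, and the sketch only assumes what the log shows, namely that a flush attempted after the file system is closed fails with "Filesystem closed" and the in-memory edit never reaches disk.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not HBase or Hadoop classes.
class ShutdownOrderSketch {

    /** Toy file system: writes fail once close() has been called. */
    static class ToyFileSystem {
        private final List<String> persisted = new ArrayList<>();
        private boolean open = true;

        void write(String cell) throws IOException {
            if (!open) {
                throw new IOException("Filesystem closed");
            }
            persisted.add(cell);
        }

        void close() { open = false; }

        List<String> persisted() { return persisted; }
    }

    /** Toy region: edits stay in memory until flush() writes them out. */
    static class ToyRegion {
        private final List<String> memcache = new ArrayList<>();
        private final ToyFileSystem fs;

        ToyRegion(ToyFileSystem fs) { this.fs = fs; }

        void put(String cell) { memcache.add(cell); }

        void flush() throws IOException {
            for (String cell : memcache) {
                fs.write(cell);
            }
            memcache.clear();
        }
    }

    /** Returns true if the unflushed edit survived the shutdown sequence. */
    static boolean shutdown(boolean closeFileSystemFirst) {
        ToyFileSystem fs = new ToyFileSystem();
        ToyRegion region = new ToyRegion(fs);
        region.put("r1/f1:=value");
        try {
            if (closeFileSystemFirst) {
                fs.close();       // file system closed before the flush: the flush fails
                region.flush();
            } else {
                region.flush();   // flush completes while the file system is still open
                fs.close();
            }
        } catch (IOException e) {
            System.out.println("flush failed: " + e.getMessage());
        }
        return fs.persisted().contains("r1/f1:=value");
    }

    public static void main(String[] args) {
        System.out.println("close first -> data kept: " + shutdown(true));
        System.out.println("flush first -> data kept: " + shutdown(false));
    }
}
```

        In this toy model the edit survives only when the flush finishes before the file system is closed, which is the ordering a fix for this issue would need to guarantee.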

        Jim Kellerman added a comment -

        Tested and committed to branch and trunk

        Jim Kellerman added a comment -

        Reopening. This fix makes tests (TestRegionRebalancing) fail when there are multiple region servers and one is stopped "nicely".

        stack added a comment -

        The broken TRB was fixed by HBASE-1067

        Jim Kellerman added a comment -

        Yes it was. I hadn't picked up that patch. My bad.


          People

          • Assignee: Jim Kellerman
          • Reporter: Cosmin Lehene