Bookkeeper
  1. Bookkeeper
  2. BOOKKEEPER-212

Bookie stops responding when creating and deleting many ledgers

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.0
    • Component/s: None
    • Labels:
      None

      Description

      I have written down a short app to try to reproduce one problematic case reported on the user list. The app does the following:

      1. It creates initially a number of ledgers, say 2000;
      2. Once it reaches 2000, for each new ledger it creates, it deletes the one at the head of the list;
      3. Before closing the ledger, it adds 5 entries of 1k, just to generate some traffic for any given ledger.

      What I tried to achieve is to have thousands of active ledgers and delete new ledgers as I create new ones. I'll post a link to my test code later.

      At some point, one bookie stops responding. The bookie seems to be up, but it is not responsive. Looking at the logs, this is what I see:

      2012-04-06 12:22:05,765 - INFO  [SyncThread:LedgerCacheImpl@682] - Ledger 1726 is evicted from file info cache.
      2012-04-06 12:22:05,769 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1727 is evicted from file info cache.
      2012-04-06 12:22:05,772 - INFO  [SyncThread:LedgerCacheImpl@682] - Ledger 1728 is evicted from file info cache.
      2012-04-06 12:22:05,780 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1729 is evicted from file info cache.
      2012-04-06 12:22:05,787 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1730 is evicted from file info cache.
      2012-04-06 12:22:05,794 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1731 is evicted from file info cache.
      2012-04-06 12:22:05,801 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1732 is evicted from file info cache.
      2012-04-06 12:22:05,807 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1733 is evicted from file info cache.
      2012-04-06 12:22:05,822 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1734 is evicted from file info cache.
      2012-04-06 12:22:05,828 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1735 is evicted from file info cache.
      2012-04-06 12:22:05,842 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1736 is evicted from file info cache.
      2012-04-06 12:22:05,851 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1737 is evicted from file info cache.
      2012-04-06 12:22:05,856 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1738 is evicted from file info cache.
      2012-04-06 12:22:05,864 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1739 is evicted from file info cache.
      2012-04-06 12:22:05,874 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1740 is evicted from file info cache.
      2012-04-06 12:22:05,885 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1741 is evicted from file info cache.
      2012-04-06 12:22:05,894 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1742 is evicted from file info cache.
      2012-04-06 12:22:05,902 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1743 is evicted from file info cache.
      2012-04-06 12:22:05,987 - INFO  [GarbageCollectorThread:LedgerCacheImpl@682] - Ledger 1744 is evicted from file info cache.
      2012-04-06 12:22:05,987 - ERROR [GarbageCollectorThread:GarbageCollectorThread$1@244] - Exception when deleting the ledger index file on the Bookie: 
      java.io.IOException: /home/fpj/bk/current/1/b/10b.idx not found
              at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:118)
              at org.apache.bookkeeper.bookie.FileInfo.close(FileInfo.java:194) 
              at org.apache.bookkeeper.bookie.LedgerCacheImpl.deleteLedger(LedgerCacheImpl.java:619) 
              at org.apache.bookkeeper.bookie.GarbageCollectorThread$1.gc(GarbageCollectorThread.java:242)
              at org.apache.bookkeeper.meta.AbstractZkLedgerManager.doGc(AbstractZkLedgerManager.java:274)
              at org.apache.bookkeeper.meta.FlatLedgerManager.garbageCollectLedgers(FlatLedgerManager.java:168)
              at org.apache.bookkeeper.bookie.GarbageCollectorThread.doGcLedgers(GarbageCollectorThread.java:237)
              at org.apache.bookkeeper.bookie.GarbageCollectorThread.run(GarbageCollectorThread.java:206) 
      2012-04-06 12:22:05,987 - INFO  [GarbageCollectorThread:LedgerCacheImpl@682] - Ledger 1745 is evicted from file info cache.
      

      There are lots of exceptions like that, but otherwise I don't see anything that could make the bookie unresponsive. I'll upload the logs as well.

      1. bookkeeper-server.log
        596 kB
        Flavio Junqueira
      2. BK-212.diff
        3 kB
        Sijie Guo

        Activity

        Hide
        Hudson added a comment -

        Integrated in bookkeeper-trunk #449 (See https://builds.apache.org/job/bookkeeper-trunk/449/)
        BOOKKEEPER-212: Bookie stops responding when creating and deleting many ledgers (sijie via fpj) (Revision 1311177)

        Result = UNSTABLE
        fpj :
        Files :

        • /zookeeper/bookkeeper/trunk/CHANGES.txt
        • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/LedgerCacheImpl.java
        • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/test/LedgerCreateDeleteTest.java
        Show
        Hudson added a comment - Integrated in bookkeeper-trunk #449 (See https://builds.apache.org/job/bookkeeper-trunk/449/ ) BOOKKEEPER-212 : Bookie stops responding when creating and deleting many ledgers (sijie via fpj) (Revision 1311177) Result = UNSTABLE fpj : Files : /zookeeper/bookkeeper/trunk/CHANGES.txt /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/LedgerCacheImpl.java /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/test/LedgerCreateDeleteTest.java
        Hide
        Flavio Junqueira added a comment -

        +1, it looks good to me. I have also run a version with this patch against my simple test case and I don't see the issue any more. Good job, Sijie!

        Committed revision 1311177.

        Show
        Flavio Junqueira added a comment - +1, it looks good to me. I have also run a version with this patch against my simple test case and I don't see the issue any more. Good job, Sijie! Committed revision 1311177.
        Hide
        Sijie Guo added a comment -

        attach a patch to fix this issue, including a test case.

        Show
        Sijie Guo added a comment - attach a patch to fix this issue, including a test case.
        Hide
        Sijie Guo added a comment -

        did some investigation on this issue, I found that it is caused due to ledger cache deletes index file first then closes it.

                FileInfo fi = null;
                try {
                    fi = getFileInfo(ledgerId, null);
                    fi.delete();
                    fi.close(false);
                } finally {
                    // should release use count
                    // otherwise the file channel would not be closed.
                    if (null != fi) {
                        fi.release();
                    }   
                }
        

        in common case, the above code is OK. since we use useCount, we can ensure file existed if the useCount is not zero. but if the file info is evicted from ledger cache before, when ledger cache wants to delete such file info, it tries to get file info again but doesn't open the file channel, so fi#delete() will delete index file immediately. when we tried to close it again, it would fail due to FileNotFound exception.

        fail to removed deleted ledger file info will exhaust openFileLimit, so following creating ledgers would fail because we don't have buffer to open new ledger file info.

        the fix is quite simply, just need to change the order of fi#delete and fi#close. I would attach a test case and a patch later.

        Show
        Sijie Guo added a comment - did some investigation on this issue, I found that it is caused due to ledger cache deletes index file first then closes it. FileInfo fi = null ; try { fi = getFileInfo(ledgerId, null ); fi.delete(); fi.close( false ); } finally { // should release use count // otherwise the file channel would not be closed. if ( null != fi) { fi.release(); } } in common case, the above code is OK. since we use useCount , we can ensure file existed if the useCount is not zero. but if the file info is evicted from ledger cache before, when ledger cache wants to delete such file info, it tries to get file info again but doesn't open the file channel, so fi#delete() will delete index file immediately. when we tried to close it again, it would fail due to FileNotFound exception. fail to removed deleted ledger file info will exhaust openFileLimit, so following creating ledgers would fail because we don't have buffer to open new ledger file info. the fix is quite simply, just need to change the order of fi#delete and fi#close. I would attach a test case and a patch later.
        Hide
        Sijie Guo added a comment -

        @Flavio

        checkOpen is introduced in BOOKKEEPER-137. The case that masterKey is null and the index file is not existed, which means we don't have the master key to create index file, so we have to throw an exception.

        the exception you found is a weird case, may be introduced by BOOKKEEPER-190, where we forced the index creation when a ledger is evicted from ledger cache.

        from the exception stack, it seems that the exception is thrown when deleting ledger 3. but ledger 3 was evicted from ledger cache before, and its ledger index file should be forced to create when the eviction happened. it is so strange why the index file didn't exist.

        Show
        Sijie Guo added a comment - @Flavio checkOpen is introduced in BOOKKEEPER-137 . The case that masterKey is null and the index file is not existed, which means we don't have the master key to create index file, so we have to throw an exception. the exception you found is a weird case, may be introduced by BOOKKEEPER-190 , where we forced the index creation when a ledger is evicted from ledger cache. from the exception stack, it seems that the exception is thrown when deleting ledger 3. but ledger 3 was evicted from ledger cache before, and its ledger index file should be forced to create when the eviction happened. it is so strange why the index file didn't exist.
        Hide
        Flavio Junqueira added a comment -

        I investigated a little further and it looks like the culprit of this exception:

        2012-04-07 15:25:47,325 - ERROR [GarbageCollectorThread:FileInfo@76] - MasterKey is null
        2012-04-07 15:25:47,328 - INFO  [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1131 is evicted from file info cache. 2012-04-07 15:25:47,329 - ERROR [GarbageCollectorThread:GarbageCollectorThread$1@244] - Exception when deleting the ledger index file on the Bookie: 
        java.io.IOException: *path*/current/0/3/3.idx not found
                at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:120)
                at org.apache.bookkeeper.bookie.FileInfo.close(FileInfo.java:196)
                at org.apache.bookkeeper.bookie.LedgerCacheImpl.deleteLedger(LedgerCacheImpl.java:619)
                at org.apache.bookkeeper.bookie.GarbageCollectorThread$1.gc(GarbageCollectorThread.java:242)
                at org.apache.bookkeeper.meta.AbstractZkLedgerManager.doGc(AbstractZkLedgerManager.java:274)
                at org.apache.bookkeeper.meta.FlatLedgerManager.garbageCollectLedgers(FlatLedgerManager.java:168)
                at org.apache.bookkeeper.bookie.GarbageCollectorThread.doGcLedgers(GarbageCollectorThread.java:237)
                at org.apache.bookkeeper.bookie.GarbageCollectorThread.run(GarbageCollectorThread.java:206)
        

        is this call in LedgerCacheImpl.deleteLedger():

        fi = getFileInfo(ledgerId, null);
        

        Because the masterKey is set to null, when the call path reaches checkOpen, it ends up throwing an IOException:

                 if (masterKey == null && !exists) {
                    throw new IOException(lf + " not found");
                }
        

        There are two things I don't understand here:

        1. Why do we set the masterKey to null in the call to getFileInfo in deleteLedger?
        2. Why do we check if masterKey is null in checkOpen and throw an exception if it is (along with the file not existing)?
        Show
        Flavio Junqueira added a comment - I investigated a little further and it looks like the culprit of this exception: 2012-04-07 15:25:47,325 - ERROR [GarbageCollectorThread:FileInfo@76] - MasterKey is null 2012-04-07 15:25:47,328 - INFO [NIOServerFactory-3181:LedgerCacheImpl@682] - Ledger 1131 is evicted from file info cache. 2012-04-07 15:25:47,329 - ERROR [GarbageCollectorThread:GarbageCollectorThread$1@244] - Exception when deleting the ledger index file on the Bookie: java.io.IOException: *path*/current/0/3/3.idx not found at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:120) at org.apache.bookkeeper.bookie.FileInfo.close(FileInfo.java:196) at org.apache.bookkeeper.bookie.LedgerCacheImpl.deleteLedger(LedgerCacheImpl.java:619) at org.apache.bookkeeper.bookie.GarbageCollectorThread$1.gc(GarbageCollectorThread.java:242) at org.apache.bookkeeper.meta.AbstractZkLedgerManager.doGc(AbstractZkLedgerManager.java:274) at org.apache.bookkeeper.meta.FlatLedgerManager.garbageCollectLedgers(FlatLedgerManager.java:168) at org.apache.bookkeeper.bookie.GarbageCollectorThread.doGcLedgers(GarbageCollectorThread.java:237) at org.apache.bookkeeper.bookie.GarbageCollectorThread.run(GarbageCollectorThread.java:206) is this call in LedgerCacheImpl.deleteLedger(): fi = getFileInfo(ledgerId, null); Because the masterKey is set to null, when the call path reaches checkOpen, it ends up throwing an IOException: if (masterKey == null && !exists) { throw new IOException(lf + " not found"); } There are two things I don't understand here: Why do we set the masterKey to null in the call to getFileInfo in deleteLedger? Why do we check if masterKey is null in checkOpen and throw an exception if it is (along with the file not existing)?
        Hide
        Flavio Junqueira added a comment -

        I'm attaching the log file of the bookie that became unresponsive. In this run, the app stopped with these error messages:

        2012-04-06 12:22:07,565 - ERROR - [pool-3-thread-1:PerChannelBookieClient@550] - Add for ledger: 2382, entry: 0 failed on bookie: /IP2:3181 with code: 101
        2012-04-06 12:22:07,566 - WARN  - [pool-3-thread-1:PendingAddOp@146] - Write did not succeed: 2382, 0
        2012-04-06 12:22:07,572 - ERROR - [pool-3-thread-1:LedgerHandle@659] - Could not get additional bookie to remake ensemble, closing ledger: 2382
        org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException
        	at org.apache.bookkeeper.client.BKException.create(BKException.java:58)
        	at org.apache.bookkeeper.client.LedgerHandle.addEntry(LedgerHandle.java:410)
        	at org.apache.bookkeeper.client.LedgerHandle.addEntry(LedgerHandle.java:387)
        	at org.apache.bookkeeper.loader.LedgerLoader.main(LedgerLoader.java:38)
        2012-04-06 12:22:12,588 - INFO  - [Hashed wheel timer #1:PerChannelBookieClient@447] - Disconnected from bookie: /IP2:3181
        2012-04-06 12:22:12,600 - INFO  - [Hashed wheel timer #2:PerChannelBookieClient@447] - Disconnected from bookie: /IP1:3181
        2012-04-06 12:22:12,627 - INFO  - [Hashed wheel timer #3:PerChannelBookieClient@447] - Disconnected from bookie: /IP3:3181
        
        Show
        Flavio Junqueira added a comment - I'm attaching the log file of the bookie that became unresponsive. In this run, the app stopped with these error messages: 2012-04-06 12:22:07,565 - ERROR - [pool-3-thread-1:PerChannelBookieClient@550] - Add for ledger: 2382, entry: 0 failed on bookie: /IP2:3181 with code: 101 2012-04-06 12:22:07,566 - WARN - [pool-3-thread-1:PendingAddOp@146] - Write did not succeed: 2382, 0 2012-04-06 12:22:07,572 - ERROR - [pool-3-thread-1:LedgerHandle@659] - Could not get additional bookie to remake ensemble, closing ledger: 2382 org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException at org.apache.bookkeeper.client.BKException.create(BKException.java:58) at org.apache.bookkeeper.client.LedgerHandle.addEntry(LedgerHandle.java:410) at org.apache.bookkeeper.client.LedgerHandle.addEntry(LedgerHandle.java:387) at org.apache.bookkeeper.loader.LedgerLoader.main(LedgerLoader.java:38) 2012-04-06 12:22:12,588 - INFO - [Hashed wheel timer #1:PerChannelBookieClient@447] - Disconnected from bookie: /IP2:3181 2012-04-06 12:22:12,600 - INFO - [Hashed wheel timer #2:PerChannelBookieClient@447] - Disconnected from bookie: /IP1:3181 2012-04-06 12:22:12,627 - INFO - [Hashed wheel timer #3:PerChannelBookieClient@447] - Disconnected from bookie: /IP3:3181
        Hide
        Flavio Junqueira added a comment -

        Here is the git repository for the test code: https://github.com/fpj/bk-ledger-load

        Show
        Flavio Junqueira added a comment - Here is the git repository for the test code: https://github.com/fpj/bk-ledger-load

          People

          • Assignee:
            Sijie Guo
            Reporter:
            Flavio Junqueira
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development