Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.90.0
    • Fix Version/s: 0.90.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In testing a RS failure under heavy increment workload I ran into an OOME when the master was splitting the logs.

      In this test case, I have exactly 136 bytes per log entry in all the logs, and the logs are all around 66-74MB). With a batch size of 3 logs, this means the master is loading about 500K-600K edits per log file. Each edit ends up creating 3 byte[] objects, the references for which are each 8 bytes of RAM, so we have 160 (136+8*3) bytes per edit used by the byte[]. For each edit we also allocate a bunch of other objects: one HLog$Entry, one WALEdit, one ArrayList, one LinkedList$Entry, one HLogKey, and one KeyValue. Overall this works out to 400 bytes of overhead per edit. So, with the default settings on this fairly average workload, the 1.5M log entries takes about 770MB of RAM. Since I had a few log files that were a bit larger (around 90MB) it exceeded 1GB of RAM and I got an OOME.

      For one, the 400 bytes per edit overhead is pretty bad, and we could probably be a lot more efficient. For two, we should actually account this rather than simply having a configurable "batch size" in the master.

      I think this is a blocker because I'm running with fairly default configs here and just killing one RS made the cluster fall over due to master OOME.

      1. sizes.png
        81 kB
        Todd Lipcon
      2. hbase-3323.txt
        27 kB
        Todd Lipcon
      3. hbase-3323.txt
        60 kB
        Todd Lipcon
      4. hbase-3323.txt
        61 kB
        Todd Lipcon
      5. hbase-3323.6.txt
        71 kB
        stack
      6. hbase-3323.5.txt
        71 kB
        Todd Lipcon
      7. hbase-3323.4.txt
        71 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #1697 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1697/)

          Show
          Hudson added a comment - Integrated in HBase-TRUNK #1697 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1697/ )
          Hide
          Jean-Daniel Cryans added a comment -

          I think the patch is missing this, at least adding it back in HLogSplitter fixes TestHLogSplitting:

          } catch (EOFException eof) {
            // truncated files are expected if a RS crashes (see HBASE-2643)
            LOG.info("EOF from hlog " + logPath + ".  continuing");
            processedLogs.add(logPath);
          
          Show
          Jean-Daniel Cryans added a comment - I think the patch is missing this, at least adding it back in HLogSplitter fixes TestHLogSplitting: } catch (EOFException eof) { // truncated files are expected if a RS crashes (see HBASE-2643) LOG.info( "EOF from hlog " + logPath + ". continuing" ); processedLogs.add(logPath);
          Hide
          stack added a comment -

          Here is what I committed. Its Todd's patch updated so it'd apply to branch and trunk.

          Show
          stack added a comment - Here is what I committed. Its Todd's patch updated so it'd apply to branch and trunk.
          Hide
          stack added a comment -

          Ok. I've overwritten the logs from above. I've tested more since and haven't run into the issue. Will keep at it. Will open new issue if I run into it. I'd say go ahead still w/ commit of this.

          Show
          stack added a comment - Ok. I've overwritten the logs from above. I've tested more since and haven't run into the issue. Will keep at it. Will open new issue if I run into it. I'd say go ahead still w/ commit of this.
          Hide
          Todd Lipcon added a comment -

          Strange... can you grep the NN logs for that directory and try to figure it out? Usually that kind of error happens if someone deletes the file while you're writing it.

          Show
          Todd Lipcon added a comment - Strange... can you grep the NN logs for that directory and try to figure it out? Usually that kind of error happens if someone deletes the file while you're writing it.
          Hide
          stack added a comment -

          I did some more testing and came across the following after split was done and HLogSplitter was moving to close all files:

          2010-12-16 01:07:20,394 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
          2010-12-16 01:07:20,640 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
          2010-12-16 01:07:20,650 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close log at hdfs://sv2borg180:10000/hbase/TestTable/02495f8b7cb6404cb7ea0521cc183d56/recovered.edits/0000000000000012854
          org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/TestTable/02495f8b7cb6404cb7ea0521cc183d56/recovered.edits/0000000000000012854 File does not exist. [Lease.  Holder: DFSClient_hb_m_sv2borg180:         
          60000_1292460569453, pendingcreates: 178]
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1418)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1409)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1464)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1452)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:471)
              at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
              at org.apache.hadoop.ipc.Client.call(Client.java:740)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
              at $Proxy5.complete(Unknown Source)
              at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
              at $Proxy5.complete(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3457)
              at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3381)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
              at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
              at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:966)
              at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.close(SequenceFileLogWriter.java:138)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.closeStreams(HLogSplitter.java:756)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWritingAndClose(HLogSplitter.java:741)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:291)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:186)
              at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:194)
              at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:96)
              at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
          2010-12-16 01:07:20,685 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://sv2borg180:10000/hbase/TestTable/03dbd921c75876d6bc3f86c10201fa93/recovered.edits/0000000000000014196 (wrote 12 edits in 470ms)
          2010-12-16 01:07:20,719 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://sv2borg180:10000/hbase/TestTable/04f81e343d032d43946393636b2b4d2d/recovered.edits/0000000000000012928 (wrote 15 edits in 390ms)
          2010-12-16 01:07:20,725 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close log at hdfs://sv2borg180:10000/hbase/TestTable/06d137bd176e2604761243d396c11b3a/recovered.edits/0000000000000012945
          org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/TestTable/06d137bd176e2604761243d396c11b3a/recovered.edits/0000000000000012945 File does not exist. [Lease.  Holder: DFSClient_hb_m_sv2borg180:         
          60000_1292460569453, pendingcreates: 176]
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1418)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1409)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1464)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1452)
          ...
          

          It then closes a bunch and fails one or two more until its done with all.

          Eventually the split 'completes' and we start assigning out regions:

          2010-12-16 01:07:32,581 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 186 region(s) that sv2borg185,60020,1292460570976 was carrying (skipping 1 regions(s) that are already in transition)
          
          Show
          stack added a comment - I did some more testing and came across the following after split was done and HLogSplitter was moving to close all files: 2010-12-16 01:07:20,394 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish 2010-12-16 01:07:20,640 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished 2010-12-16 01:07:20,650 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close log at hdfs: //sv2borg180:10000/hbase/TestTable/02495f8b7cb6404cb7ea0521cc183d56/recovered.edits/0000000000000012854 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/TestTable/02495f8b7cb6404cb7ea0521cc183d56/recovered.edits/0000000000000012854 File does not exist. [Lease. Holder: DFSClient_hb_m_sv2borg180: 60000_1292460569453, pendingcreates: 178] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1418) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1409) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1452) at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:471) at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy5.complete(Unknown Source) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy5.complete(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3457) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3381) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:966) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.close(SequenceFileLogWriter.java:138) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.closeStreams(HLogSplitter.java:756) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWritingAndClose(HLogSplitter.java:741) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:291) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:186) at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:194) at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:96) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang. Thread .run( Thread .java:662) 2010-12-16 01:07:20,685 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs: //sv2borg180:10000/hbase/TestTable/03dbd921c75876d6bc3f86c10201fa93/recovered.edits/0000000000000014196 (wrote 12 edits in 470ms) 2010-12-16 01:07:20,719 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs: //sv2borg180:10000/hbase/TestTable/04f81e343d032d43946393636b2b4d2d/recovered.edits/0000000000000012928 (wrote 15 edits in 390ms) 2010-12-16 01:07:20,725 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close log at hdfs: //sv2borg180:10000/hbase/TestTable/06d137bd176e2604761243d396c11b3a/recovered.edits/0000000000000012945 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/TestTable/06d137bd176e2604761243d396c11b3a/recovered.edits/0000000000000012945 File does not exist. [Lease. Holder: DFSClient_hb_m_sv2borg180: 60000_1292460569453, pendingcreates: 176] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1418) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1409) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1464) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1452) ... It then closes a bunch and fails one or two more until its done with all. Eventually the split 'completes' and we start assigning out regions: 2010-12-16 01:07:32,581 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 186 region(s) that sv2borg185,60020,1292460570976 was carrying (skipping 1 regions(s) that are already in transition)
          Hide
          stack added a comment -

          I'm +1 on committing this patch. I tried v5 on my little test cluster here killing RSs a few times. It does the right thing as best as I can tell verified by rowcount of table subsequently to ensure all regions online after the killing.

          Show
          stack added a comment - I'm +1 on committing this patch. I tried v5 on my little test cluster here killing RSs a few times. It does the right thing as best as I can tell verified by rowcount of table subsequently to ensure all regions online after the killing.
          Hide
          Todd Lipcon added a comment -

          Is stuff protected rather than private for the subclassers - the transactional hbasers?

          Yep, I made it protected since inheritors will need to access this stuff. Hopefully we can get rid of the subclasses in favor of log coprocessors some day.

          Minor.... the below javadoc is now stale..

          ...Its javadoc of params no longer present on this method.

          Mistype '+ "An HLogSplitter instance may only be used one");'

          Extremely minor comment, the below formatting will be destroyed when rendered by javadoc:

          Fixed.

          So, it looks like we keep the order in which edits were written across the split process as best as I can tell. We just append to the Entry List in RegionEntryBuffer. That looks right.

          Added verification to the testThreading test that makes sure the edits come in the right order

          You iterate logWriters, a synchronized Map, a couple of times. Is this safe at the time of iteration?

          It was safe in the current usage, but I added synchronization on this map for getOutputCounts just in case someone starts to call it in a different context.

          You keep the old format for naming edit files? Naming them for the sequenceid of their first edit, it seems (you use getRegionSplitEditsPath - not in the patch).

          Yep, left that as-is.

          Show
          Todd Lipcon added a comment - Is stuff protected rather than private for the subclassers - the transactional hbasers? Yep, I made it protected since inheritors will need to access this stuff. Hopefully we can get rid of the subclasses in favor of log coprocessors some day. Minor.... the below javadoc is now stale.. ...Its javadoc of params no longer present on this method. Mistype '+ "An HLogSplitter instance may only be used one");' Extremely minor comment, the below formatting will be destroyed when rendered by javadoc: Fixed. So, it looks like we keep the order in which edits were written across the split process as best as I can tell. We just append to the Entry List in RegionEntryBuffer. That looks right. Added verification to the testThreading test that makes sure the edits come in the right order You iterate logWriters, a synchronized Map, a couple of times. Is this safe at the time of iteration? It was safe in the current usage, but I added synchronization on this map for getOutputCounts just in case someone starts to call it in a different context. You keep the old format for naming edit files? Naming them for the sequenceid of their first edit, it seems (you use getRegionSplitEditsPath - not in the patch). Yep, left that as-is.
          Hide
          stack added a comment -

          I have a bunch of praise/+1s for changes made in this patch – the refactoring of where params are passed to HLogSplitter, the new javadoc on the intern'ing methods, the overall less moving parts – but will say no more on that other than the improvements are great.

          Whats this for:

          +  protected final Path rootDir;
          +  protected final Path srcDir;
          +  protected final Path oldLogDir;
          +  protected final FileSystem fs;
          +  protected final Configuration conf;
          

          Is stuff protected rather than private for the subclassers – the transactional hbasers?

          Minor.... the below javadoc is now stale..

              * Create a new HLogSplitter using the given {@link Configuration} and the
              * <code>hbase.hlog.splitter.impl</code> property to derived the instance
              * class to use.
          

          Address on commit?

          Same for this stuff:

          +   * @param oldLogDir  directory where processed (split) logs will be archived to
          +   * @param fs FileSystem
          +   * @param conf Configuration
          +   * @throws IOException will throw if corrupted hlogs aren't tolerated
          ...
          

          Its javadoc of params no longer present on this method.

          Mistype '+ "An HLogSplitter instance may only be used one");'

          Extremely minor comment, the below formatting will be destroyed when rendered by javadoc:

          +   * through the logs to be split. For each log, we:
          +   *   - Recover it (take and drop HDFS lease) to ensure no other process can write
          +   *   - Read each edit (see {@link #parseHLog}
          +   *   - Mark as "processed" or "corrupt" depending on outcome
          

          (... but good documentation).

          It would be sweeter if this were a percentage of heap rather than hard MB number --> maxHeapUsage .... but no biggie. Can do in later issue.

          So, it looks like we keep the order in which edits were written across the split process as best as I can tell. We just append to the Entry List in RegionEntryBuffer. That looks right.

          (Reading this patch makes me reconsider asserts)

          You iterate logWriters, a synchronized Map, a couple of times. Is this safe at the time of iteration?

          You keep the old format for naming edit files? Naming them for the sequenceid of their first edit, it seems (you use getRegionSplitEditsPath – not in the patch).

          On the below

          +    if (scopes != null) {
          +      ret += ClassSize.TREEMAP;
          +      ret += ClassSize.align(scopes.size() * ClassSize.MAP_ENTRY);
          +      // TODO this isn't quite right, need help here
          +    }
          

          ... maybe Jon can help – but its fine for the moment I'd say.

          Show
          stack added a comment - I have a bunch of praise/+1s for changes made in this patch – the refactoring of where params are passed to HLogSplitter, the new javadoc on the intern'ing methods, the overall less moving parts – but will say no more on that other than the improvements are great. Whats this for: + protected final Path rootDir; + protected final Path srcDir; + protected final Path oldLogDir; + protected final FileSystem fs; + protected final Configuration conf; Is stuff protected rather than private for the subclassers – the transactional hbasers? Minor.... the below javadoc is now stale.. * Create a new HLogSplitter using the given {@link Configuration} and the * <code>hbase.hlog.splitter.impl</code> property to derived the instance * class to use. Address on commit? Same for this stuff: + * @param oldLogDir directory where processed (split) logs will be archived to + * @param fs FileSystem + * @param conf Configuration + * @ throws IOException will throw if corrupted hlogs aren't tolerated ... Its javadoc of params no longer present on this method. Mistype '+ "An HLogSplitter instance may only be used one");' Extremely minor comment, the below formatting will be destroyed when rendered by javadoc: + * through the logs to be split. For each log, we: + * - Recover it (take and drop HDFS lease) to ensure no other process can write + * - Read each edit (see {@link #parseHLog} + * - Mark as "processed" or "corrupt" depending on outcome (... but good documentation). It would be sweeter if this were a percentage of heap rather than hard MB number --> maxHeapUsage .... but no biggie. Can do in later issue. So, it looks like we keep the order in which edits were written across the split process as best as I can tell. We just append to the Entry List in RegionEntryBuffer. That looks right. (Reading this patch makes me reconsider asserts) You iterate logWriters, a synchronized Map, a couple of times. Is this safe at the time of iteration? You keep the old format for naming edit files? Naming them for the sequenceid of their first edit, it seems (you use getRegionSplitEditsPath – not in the patch). On the below + if (scopes != null ) { + ret += ClassSize.TREEMAP; + ret += ClassSize.align(scopes.size() * ClassSize.MAP_ENTRY); + // TODO this isn't quite right, need help here + } ... maybe Jon can help – but its fine for the moment I'd say.
          Hide
          Todd Lipcon added a comment -

          Here's a new version including some new unit tests that set up a mock log reader and mock writers so as to test a lot of edits going through the splitter without actually taking any IO. This reliably reproduced the split bug Stack found - fixed that now.

          Didn't get a chance to do a true benchmark yet.

          Show
          Todd Lipcon added a comment - Here's a new version including some new unit tests that set up a mock log reader and mock writers so as to test a lot of edits going through the splitter without actually taking any IO. This reliably reproduced the split bug Stack found - fixed that now. Didn't get a chance to do a true benchmark yet.
          Hide
          Todd Lipcon added a comment -

          It seems in my eagerness to fix up the deadlock I ended up undersynchronizing in the latest patch. Oops! I'm going to write some standalone slower tests that actually split larger amounts of data, so we can get proper benchmarks and do a better job of finding these races.

          Will address your other comments in the next version as well.

          Show
          Todd Lipcon added a comment - It seems in my eagerness to fix up the deadlock I ended up undersynchronizing in the latest patch. Oops! I'm going to write some standalone slower tests that actually split larger amounts of data, so we can get proper benchmarks and do a better job of finding these races. Will address your other comments in the next version as well.
          Hide
          stack added a comment -

          So, on restart, the split completed (this is ten servers whose logs were to be split; one split set was of 33 logs... all in a 1G Master heap).

          The logging is profuse but I'm grand w/ that.

          The below looks like a version of HBASE-2471... is that so? If so, good stuff.

          2010-12-11 18:35:43,243 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's directory doesn't exist: hdfs://sv2borg180:10000/hbase/TestTable/1d4a17311a6bc9f7b34f121bf121f42b. It is very likely that it was already split so it's safe to discard those edits.
          

          On the patch, why flip parameter order – i.e. moving conf from end of list to start? People want to know!

          Whats this about?

          +  public void internTableName(byte []tablename) {
          +    assert Bytes.equals(tablename, this.tablename);
          +    this.tablename = tablename;
          +  }
          

          We could call this method mutiple times? And it'd be an error if tablename was different on an invocation? The method is public w/o doc. It has to be public because called from different package?

          Sorry... still not done w/ review. Back later.

          Show
          stack added a comment - So, on restart, the split completed (this is ten servers whose logs were to be split; one split set was of 33 logs... all in a 1G Master heap). The logging is profuse but I'm grand w/ that. The below looks like a version of HBASE-2471 ... is that so? If so, good stuff. 2010-12-11 18:35:43,243 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: This region's directory doesn't exist: hdfs: //sv2borg180:10000/hbase/TestTable/1d4a17311a6bc9f7b34f121bf121f42b. It is very likely that it was already split so it's safe to discard those edits. On the patch, why flip parameter order – i.e. moving conf from end of list to start? People want to know! Whats this about? + public void internTableName( byte []tablename) { + assert Bytes.equals(tablename, this .tablename); + this .tablename = tablename; + } We could call this method mutiple times? And it'd be an error if tablename was different on an invocation? The method is public w/o doc. It has to be public because called from different package? Sorry... still not done w/ review. Back later.
          Hide
          stack added a comment -

          I think I pulled the right patch – the latest one – 62820 bytes in size (Todd, add a version to your patches?). I was trying it and got this on the very end

          2010-12-11 18:39:00,914 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg180:10000/hbase/.logs/sv2borg188,60020,1291841481545/sv2borg188%3A60020.1291993339759
          2010-12-11 18:39:00,915 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Error in log splitting write thread
          java.util.ConcurrentModificationException
                  at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761)
                  at java.util.LinkedList$ListItr.next(LinkedList.java:696)
                  at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.writeBuffer(HLogSplitter.java:669)
                  at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.doRun(HLogSplitter.java:649)
                  at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.run(HLogSplitter.java:621)
          
          Show
          stack added a comment - I think I pulled the right patch – the latest one – 62820 bytes in size (Todd, add a version to your patches?). I was trying it and got this on the very end 2010-12-11 18:39:00,914 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs: //sv2borg180:10000/hbase/.logs/sv2borg188,60020,1291841481545/sv2borg188%3A60020.1291993339759 2010-12-11 18:39:00,915 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Error in log splitting write thread java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761) at java.util.LinkedList$ListItr.next(LinkedList.java:696) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.writeBuffer(HLogSplitter.java:669) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.doRun(HLogSplitter.java:649) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.run(HLogSplitter.java:621)
          Hide
          Todd Lipcon added a comment -

          Now certified deadlock free by jcarder

          Show
          Todd Lipcon added a comment - Now certified deadlock free by jcarder
          Hide
          Todd Lipcon added a comment -

          er, there's a deadlock in this version of the patch but general structure should be good enough to review at this point

          Show
          Todd Lipcon added a comment - er, there's a deadlock in this version of the patch but general structure should be good enough to review at this point
          Hide
          Todd Lipcon added a comment -

          Here's an updated patch, refactored some stuff a bit.

          Also fixed up one of the test cases which was flaky in the old version - it used the presence of region directories to determine whether the split had started running, but those were created by generateHLogs, so it was only winning the race sometimes.

          Show
          Todd Lipcon added a comment - Here's an updated patch, refactored some stuff a bit. Also fixed up one of the test cases which was flaky in the old version - it used the presence of region directories to determine whether the split had started running, but those were created by generateHLogs, so it was only winning the race sometimes.
          Hide
          stack added a comment -

          I'm part way through a review but have to leave. So far it looks like less moving parts and cleaner overall. Will finish up review tomorrow. We could set the number of splits to 1 and ship the RC with that but at the moment, going by the other issues that need fixing, its looking like next week before new RC and that might be time to test this redo of splits.

          Show
          stack added a comment - I'm part way through a review but have to leave. So far it looks like less moving parts and cleaner overall. Will finish up review tomorrow. We could set the number of splits to 1 and ship the RC with that but at the moment, going by the other issues that need fixing, its looking like next week before new RC and that might be time to test this redo of splits.
          Hide
          Todd Lipcon added a comment -

          Oh, I should also note that you probably want to turn off DEBUG logs or comment them out before testing this on any real log. It's quite noisy at the moment. Also, it seems like the 128M buffer I hardcoded is plenty fine - the writers stayed completely ahead of the reader in the test I ran.

          We should make a standalone benchmark for this algorithm too.

          Show
          Todd Lipcon added a comment - Oh, I should also note that you probably want to turn off DEBUG logs or comment them out before testing this on any real log. It's quite noisy at the moment. Also, it seems like the 128M buffer I hardcoded is plenty fine - the writers stayed completely ahead of the reader in the test I ran. We should make a standalone benchmark for this algorithm too.
          Hide
          Todd Lipcon added a comment -

          Here's a patch which basically redoes the way log splitting happens. It needs to be commented up and I want to rename some things, but the basic architecture is this:

          • Main thread reads logs in order and writes into a structure called EntrySink (I want to rename this to EntryBuffer or sometihng)
          • EntrySink maintains some kind of approximate heap size (I don't think I calculated it quite right, but c'est la vie) and also takes care of managing a RegionEntryBuffer for each region key.
            • The RegionEntryBuffer just has a LinkedList of Entries right now, but it does size accounting, and I think we could change these to a fancier data structure for more efficient memory usage (eg a linked list of 10000-entry arrays)
          • If the main thread tries to append into the EntrySink but the heap usage has hit a max threshold, it waits.

          Meanwhile, there are N threads called WriterThread-n which do the following in a loop:

          • poll the EntrySink to grab a RegionEntryBuffer
            • The EntrySink returns the one with the most outstanding edits (hope is to write larger sequential chunks if possible)
            • The EntrySink also keeps track of which regions already have some thread working on them, so we don't end up with out-of-order appends
          • The EntrySink then drains the RegionEntryBuffer into the "OutputSink" which maintains the map from region key to WriterAndPath (bug in patch uploaded: this map needs to be synchronizedMap)
          • Once the buffer is drained, it notifies the EntrySink that the memory is no longer in use (hence unblocking the producer thread)

          In summary, it's a fairly standard producer-consumer pattern with some trickery to make a separate queue per region so as not to reorder edits.

          As a non-scientific test I patched this into my cluster which was getting the OOME on master startup, and it not only started up fine, the log splits ran about 50% faster than they did before!

          Known bug: the "log N of M" always says "log 1 of M"

          Thoughts?

          Show
          Todd Lipcon added a comment - Here's a patch which basically redoes the way log splitting happens. It needs to be commented up and I want to rename some things, but the basic architecture is this: Main thread reads logs in order and writes into a structure called EntrySink (I want to rename this to EntryBuffer or sometihng) EntrySink maintains some kind of approximate heap size (I don't think I calculated it quite right, but c'est la vie) and also takes care of managing a RegionEntryBuffer for each region key. The RegionEntryBuffer just has a LinkedList of Entries right now, but it does size accounting, and I think we could change these to a fancier data structure for more efficient memory usage (eg a linked list of 10000-entry arrays) If the main thread tries to append into the EntrySink but the heap usage has hit a max threshold, it waits. Meanwhile, there are N threads called WriterThread-n which do the following in a loop: poll the EntrySink to grab a RegionEntryBuffer The EntrySink returns the one with the most outstanding edits (hope is to write larger sequential chunks if possible) The EntrySink also keeps track of which regions already have some thread working on them, so we don't end up with out-of-order appends The EntrySink then drains the RegionEntryBuffer into the "OutputSink" which maintains the map from region key to WriterAndPath (bug in patch uploaded: this map needs to be synchronizedMap) Once the buffer is drained, it notifies the EntrySink that the memory is no longer in use (hence unblocking the producer thread) In summary, it's a fairly standard producer-consumer pattern with some trickery to make a separate queue per region so as not to reorder edits. As a non-scientific test I patched this into my cluster which was getting the OOME on master startup, and it not only started up fine, the log splits ran about 50% faster than they did before! Known bug: the "log N of M" always says "log 1 of M" Thoughts?
          Hide
          stack added a comment -

          The second option seems viable to me Cosmin. HBASE-2727 made it so we could pick up multiple files from recovered.edits directory. The files we write are named fro the seqid of the first edit so we can pick them up in order.

          Show
          stack added a comment - The second option seems viable to me Cosmin. HBASE-2727 made it so we could pick up multiple files from recovered.edits directory. The files we write are named fro the seqid of the first edit so we can pick them up in order.
          Hide
          stack added a comment -

          Cosmin, on the 'One simple, partial solution', I played with changing the LinkedList from Entry to LinkedList of WALEdits and changing the Map Key to be a data structure of encodedRegionName plus tablename (I used HLogKey rather than add a new data structure and then made a special Comparator that equated HLogKeys that had same encodedRegionName). All was going nicely till it came to dump out the per region recovered.edit files. At this point I need the Entry. A WALEdit will not do. So it seems. We need seqid at least (This could change when KV has seqid – HBASE-2856 'TestAcidGuarantee broken on trunk').

          Show
          stack added a comment - Cosmin, on the 'One simple, partial solution', I played with changing the LinkedList from Entry to LinkedList of WALEdits and changing the Map Key to be a data structure of encodedRegionName plus tablename (I used HLogKey rather than add a new data structure and then made a special Comparator that equated HLogKeys that had same encodedRegionName). All was going nicely till it came to dump out the per region recovered.edit files. At this point I need the Entry. A WALEdit will not do. So it seems. We need seqid at least (This could change when KV has seqid – HBASE-2856 'TestAcidGuarantee broken on trunk').
          Hide
          Todd Lipcon added a comment -

          I'm going to take a leap and assume I still remember how to code and work on this one tonight.

          Show
          Todd Lipcon added a comment - I'm going to take a leap and assume I still remember how to code and work on this one tonight.
          Hide
          Todd Lipcon added a comment -

          Hi Cosmin. I agree that the log split process can be fixed up a bit to use a smaller amount of memory. Fixing the data structure to get rid of all those extra objects is one thing that's pretty straightfowrard like you mentioned. As for how to change the split process itself, I think following a more typical producer/consumer model makes more sense. For example, something like this:

          buffers = map<region name, arraylist<edit>>
          for each log:
            for each edit:
              buf = map.get(region) [inserting a new arraylist if necessary]
              buf.add(edit)
              if buf length > some number of bytes:
                while workqueue length > some threshold: wait
                workqueue.add(buf)
                buffers.remove(region)
          

          then we have a set of threads which pull chunks of edits off the work queue and write into the appropriate file (where the file handles are kept open).

          Show
          Todd Lipcon added a comment - Hi Cosmin. I agree that the log split process can be fixed up a bit to use a smaller amount of memory. Fixing the data structure to get rid of all those extra objects is one thing that's pretty straightfowrard like you mentioned. As for how to change the split process itself, I think following a more typical producer/consumer model makes more sense. For example, something like this: buffers = map<region name, arraylist<edit>> for each log: for each edit: buf = map.get(region) [inserting a new arraylist if necessary] buf.add(edit) if buf length > some number of bytes: while workqueue length > some threshold: wait workqueue.add(buf) buffers.remove(region) then we have a set of threads which pull chunks of edits off the work queue and write into the appropriate file (where the file handles are kept open).
          Hide
          Cosmin Lehene added a comment -

          Talking with JD, we figured HBASE-1364 might be a solution as well.

          Show
          Cosmin Lehene added a comment - Talking with JD, we figured HBASE-1364 might be a solution as well.
          Hide
          Cosmin Lehene added a comment -

          Here's the object distribution tlipcon mentioned:

          The values of this map contain the 1.5M+ edits (in Entry objects) tlipcon mentioned
          
          Map<byte[], LinkedList<Entry>> editsByRegion
                |                  |
                |                  |
                |                  |
          (encodedRegionName)      |
                .                  |
                .                  |
                .                  |
                .                  |
                .                  |
                .                  | 
                .                  --- WalEdit edit
                .                  |      |
                .                  |      |
                .                  |      |
                .                  |      --- ArrayList<KeyValue> kvs
                .                  |                      |
                .                  |                      |
                .                  |                      |
                .                  |                      --- byte[] bytes
                .                  |                          
                .                  |
                .  ----------------------------------------------------------
                .  |               |                                        |
                .  |               |                                        |
                .  |               --- HLogKey key                          |
                .  |                    |                                   |
                .  |                    |                                   |
                .  |                    |                                   |
                .  |                    |                                   |
                . .| . . . . . . . . . .--- byte[] encodedRegionName        |
                   |                    |                                   |
                   |                    |                                   |
                   |                    |                                   |
                   |                    --- byte[] tableName                |
                   |                                                        |
                   |                                                        |
                   | this is useless as we could have this in the map key   |
                   ----------------------------------------------------------
          
          

          The splitLog workflow loads all the edits in a map indexed by region, and then uses a thread pool to write them to per region directories.

          As you can see from this diagram, each edit duplicates the tableName and the encodedRegionName (hence the 2 extra byte[]).

          One simple, partial solution:
          We can reduce the memory footprint by putting the tableName in the map key with the encodedRegionName (it's free). This would leave us with a LinkedList of WalEdit objects (ArrayList + KeyValue + the actual info: byte[]). Of course this could be further compressed, but it might not be worth it (WalEdit has a replication scope as well IIUC).
          This is a partial solution since we still don't solve the case when we have too much data in the HLogs.

          A second solution/suggestion:

          We can change the split process a bit. Let me explain how HLogs are organized and how we split (please correct me if I'm wrong):

          Context:

          • Eeach region server has one HLog directory in HDFS (under /hbase/.logs)
          • In each HRegionServer corresponding directory there's a bunch of HLog files.
          • There's a strict order of the HLog files within a region's dir and edits inside are ordered as well.
          • We read all the files in memory first because we need all the edits for a particular region and to respect the order of the edits.
          • Only after everything is read, we use a thread pool to distribute the log entries per regions.

          Suggestion:
          We could read the files in parallel, and instead of writing a single file in the HRegion corresponding directory, we write one file for each HLog. This should keep all the edits in strict order. Then HRegionServer could safely load them in the same order and apply edits.

          While we read the files in parallel we don't have to read the entire content in memory: we can just read and write to the corresponding destination file. This should solve the memory footprint problem.

          I haven't spent too much time analyzing the second option; it might have been discussed in the past, so if I'm missing something let me know.

          Cosmin

          Show
          Cosmin Lehene added a comment - Here's the object distribution tlipcon mentioned: The values of this map contain the 1.5M+ edits (in Entry objects) tlipcon mentioned Map< byte [], LinkedList<Entry>> editsByRegion | | | | | | (encodedRegionName) | . | . | . | . | . | . | . --- WalEdit edit . | | . | | . | | . | --- ArrayList<KeyValue> kvs . | | . | | . | | . | --- byte [] bytes . | . | . ---------------------------------------------------------- . | | | . | | | . | --- HLogKey key | . | | | . | | | . | | | . | | | . .| . . . . . . . . . .--- byte [] encodedRegionName | | | | | | | | | | | --- byte [] tableName | | | | | | this is useless as we could have this in the map key | ---------------------------------------------------------- The splitLog workflow loads all the edits in a map indexed by region, and then uses a thread pool to write them to per region directories. As you can see from this diagram, each edit duplicates the tableName and the encodedRegionName (hence the 2 extra byte[]). One simple, partial solution: We can reduce the memory footprint by putting the tableName in the map key with the encodedRegionName (it's free). This would leave us with a LinkedList of WalEdit objects (ArrayList + KeyValue + the actual info: byte[]). Of course this could be further compressed, but it might not be worth it (WalEdit has a replication scope as well IIUC). This is a partial solution since we still don't solve the case when we have too much data in the HLogs. A second solution/suggestion: We can change the split process a bit. Let me explain how HLogs are organized and how we split (please correct me if I'm wrong): Context: Eeach region server has one HLog directory in HDFS (under /hbase/.logs) In each HRegionServer corresponding directory there's a bunch of HLog files. There's a strict order of the HLog files within a region's dir and edits inside are ordered as well. We read all the files in memory first because we need all the edits for a particular region and to respect the order of the edits. Only after everything is read, we use a thread pool to distribute the log entries per regions. Suggestion: We could read the files in parallel, and instead of writing a single file in the HRegion corresponding directory, we write one file for each HLog. This should keep all the edits in strict order. Then HRegionServer could safely load them in the same order and apply edits. While we read the files in parallel we don't have to read the entire content in memory: we can just read and write to the corresponding destination file. This should solve the memory footprint problem. I haven't spent too much time analyzing the second option; it might have been discussed in the past, so if I'm missing something let me know. Cosmin

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development