|
Looks like these calls in the MiniDFS ctor
235 UnixUserGroupInformation information = UnixUserGroupInformation.login( conf ); 262 NameNode.format(conf); 272 nameNode = NameNode.createNameNode(args, conf); 2008-07-08 00:29:40,034 INFO fs.FSNamesystem (FSNamesystem.java:initialize(286)) - Finished loading FSImage in 12461 msecs suggests maybe FSNamesystem is where to start optimising? well, the only thing in there that would be slow is
this.dir.loadFSImage(getNamespaceDirs(conf), startOpt); and the startopt appears to be FORMAT in most miniDFS operations, so its time to format a test filesystem that is causing the delays. Possibilities:
The other thing to look at is shutdown time...how long does it take for everything to go away? +1
This is killing me in Pig as well. It seems to be something to do with Jetty which usually takes 15 seconds or more to start up 08/07/08 22:05:48 INFO fs.FSNamesystem: Registered FSNamesystemStatusMBean 08/07/08 22:06:03 INFO util.Credential: Checking Resource aliases 08/07/08 22:06:03 INFO http.HttpServer: Version Jetty/5.1.x 08/07/08 22:07:53 INFO mapred.MiniMRCluster: mapred.local.dir is /tmp/hadoop-root/mapred/local/2_0 08/07/08 22:08:08 INFO http.HttpServer: Version Jetty/5.1.x 08/07/08 22:08:08 INFO util.Container: Started HttpContext[/static,/static] A quick search here points a bit at entropy related https delays
http://docs.codehaus.org/display/JETTY/Connectors+slow+to+startup This is backed up by email from Greg Wilkins If the machine is newly started or other things have consumed that entropy, A simple threaddump during this pause should answer this. If it is that- there are several options - including not using Following up on Greg's suggestion, I dont think it is jetty. Whenever I kill -QUIT a test case I see my machine stuck in Inet6addressImpl.getHostByAddr()
I've seen this on 3 different test cases so far: different tests, all in getHostName(). This is on my machine on a managed network, with IPv6 as well as IPv4. [junit] Running org.apache.hadoop.hdfs.TestAbandonBlock TestCLI is taking its time for a different reason; there's a sleep in teardown() Can someone on a different machine If it is networking, it may behave differently for different users, and machines with IPv6 suffer more. Assuming this is the cause (and that same stack trace comes up, again and again), this is what the code is trying to do
DataNode.startDataNode() InetSocketAddress ipcAddr = NetUtils.createSocketAddr( // 1 (1) get socket address from the dfs.datanode.ipc.address, which defaults to "0.0.0.0:50020" Inside NetUtils.createSocketAddr, the configuration string is parsed and the (hostname,port) values extracted. This hostname is then turned into a new address. 1. If there is a static hostname -> hostname' mapping that is used if (getStaticResolution(hostname) != null) { hostname = getStaticResolution(hostname); } 2. else the OS/JVM does the work, to work out the address Somehow this is picking up an IPv6 address Later, when ipcAddr.getHostName(); is called (in (2)), An attempt to rDNS this address is made. Unless your site is running IPv6 DNS, this isnt going to succeed, but you are going to take a 15-30s hit every time an attempt is made. I'm going to see how to remove IPv6 from this machine, which has 1 real and two virtual interfaces as well as loopback, to see if this will make the problem go away, or at least make some mild improvements.... eth0 Link encap:Ethernet HWaddr 00:1C:C4:17:CC:46 lo Link encap:Local Loopback vmnet1 Link encap:Ethernet HWaddr 00:50:56:C0:00:01 vmnet8 Link encap:Ethernet HWaddr 00:50:56:C0:00:08 Turning off IPv6 doesnt help, just changes the stack trace:
[junit] BEA JRockit(R) R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64 Its the determining of the local hostname from the ipaddress that the DFS is listening on that usually appears in the stack trace in a startup. Its not that is is particularly slow, a few tens of seconds, but that being run on every testcase, it brings the system to its knees. its also why the CPU load during testing is slow: they are not CPU bound, but driven by network delays. changing the ipc node from 0:0:0:0 (all interfaces) to localhost in src/test/hadoop-site.xml appears to help:
<property> We'll see how long it takes tonight. This address is not read from the conf, but rather specfied explicitly in MiniDFSCluster.java. I replaced 0.0.0.0 with 127.0.0.1, and TestDFSMkdirs dropped from 20 seconds to 5 seconds.
Here's a patch which makes that replacement throughout test code.
Doug Cutting made changes - 28/Jul/08 08:44 PM
Here's a new version that changes a few more tests from 0.0.0.0 to 127.0.0.1 and also shortens Thread.sleep's in a couple of long-running tests.
Doug Cutting made changes - 28/Jul/08 09:55 PM
This works on my machine for the tests I've tried. Let's see if the full set runs on Hudson, and, if so, how much faster it runs there.
Doug Cutting made changes - 28/Jul/08 09:57 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387059/HADOOP-3694.patch against trunk revision 679930. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 33 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2963/testReport/ This message is automatically generated. 2 tests failed.
org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testCheckpoint java.io.FileNotFoundException: http://localhost:41372/getimage?putimage=1&port=41391&machine=140.211.11.106&token=-16:32382157:0:1217291770000:1217291764363 and org.apache.hadoop.hdfs.TestInjectionForSimulatedStorage.testInjection java.io.IOException: Timedout while waiting for all blocks to be replicated for /replication-test-file
Doug Cutting made changes - 29/Jul/08 10:49 PM
Here's a less ambitious version which passes all tests locally.
Doug Cutting made changes - 29/Jul/08 10:50 PM
Doug Cutting made changes - 29/Jul/08 10:50 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387146/HADOOP-3694.patch against trunk revision 680823. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 27 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2978/testReport/ This message is automatically generated. One test is still failing (sigh)
org.apache.hadoop.hdfs.TestFileCreation.testClientTriggeredLeaseRecovery I do want to apply this patch, but as I've got my own changes to MiniDFS, don't want to mix things up. One thing that I can see from looking at the patch is that the constant 0:0:0:0 is being replaced everywhere with another string constant. How about it gets moved to a class TestConstants or similar, so that henceforth there is a single place to change the hostname value? The failed test last night (TestFileCreation#testClientTriggeredLeaseRecovery) is unrelated. The problem there is that it uses Thread.interrupt() to stop threads that are writing to HDFS, but that's not reliable, since DFSClient catches and ignores InterruptedException in many places (which is a bug). So this test fails sporadically.
Doug Cutting made changes - 30/Jul/08 06:47 PM
Here's an even more conservative patch, which only updates MiniDFSCluster, and also includes a fix to make TestFileCreation more reliable.
Doug Cutting made changes - 30/Jul/08 06:55 PM
Doug Cutting made changes - 30/Jul/08 06:55 PM
> The failed test last night (TestFileCreation#testClientTriggeredLeaseRecovery) is unrelated. The problem there is that it uses Thread.interrupt() to stop threads that are writing to HDFS, but that's not reliable, since DFSClient catches and ignores InterruptedException in many places (which is a bug). So this test fails sporadically.
+1 on adding a boolean for stopping the SlowWriter threads. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387219/HADOOP-3694.patch against trunk revision 680823. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2986/testReport/ This message is automatically generated. The overall test runtime was over 10% faster than most other recent successful runs (1:25 from ~1:36). Not huge, but something... Should we commit this?
Most of the difference because of TestDataBlockScanner.testBlockCorruptionRecoveryPolicy somehow taking nearly 10 minutes less. We should fix this test too.
+1 for committing none the less. I'm +1 for a commit. we can look for more speedups later, but when testing, every minute helps.
Doug Cutting made changes - 31/Jul/08 06:18 PM
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
Nigel Daley made changes - 20/Nov/08 11:38 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
+1 Decreasing unit test runtime increases developer productivity!