Issue Details (XML | Word | Printable)

Key: HADOOP-4744
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Jothi Padmanabhan
Reporter: Amareshwari Sriramadasu
Votes: 0
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Wrong resolution of hostname and port

Created: 01/Dec/08 10:28 AM   Updated: 14/Nov/09 02:12 AM
Return to search
Component/s: None
Affects Version/s: 0.20.0
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 4744.exception.patch 2009-02-23 04:35 PM Devaraj Das 0.8 kB
Text File Licensed for inclusion in ASF works 4744.patch 2009-02-18 06:31 PM Devaraj Das 1.0 kB
Text File Licensed for inclusion in ASF works 4744.reopened.1.patch 2009-05-07 12:30 PM Devaraj Das 2 kB
Text File Licensed for inclusion in ASF works 4744.reopened.patch 2009-05-06 04:21 PM Devaraj Das 2 kB
Text File Licensed for inclusion in ASF works hadoop-4744-new.patch 2009-02-27 11:05 AM Jothi Padmanabhan 4 kB
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 02/Mar/09 10:37 AM


 Description  « Hide
I noticed the following for one of the hosts in a cluster:

1. machines.jsp page resolves the http address as just "http://hostname" (which doesn't work). It doesnt put the port number for the host. Even if I add the port number manually in the URI, the task tracker page does not come up.
2. All the tasks(both maps and reduces) which ran on the machine ran successfully. But tasklogs cannot be viewed, because port-number is not resolved. ( same problem as in (1)).
3. The reducers waiting for maps ran on that machine fail with connection failed errors saying the hostname is 'null'.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Amareshwari Sriramadasu added a comment - 01/Dec/08 10:33 AM
The log for the reducer waiting for map on the machine:
2008-12-01 10:00:19,501 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200812010806_0337_r_000000_1: Failed fetch #3 from attempt_200812010806_0337_m_002787_0
2008-12-01 10:00:19,501 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1 adding host null to penalty box, next contact in 16 seconds
2008-12-01 10:00:20,503 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1: Got 1 map-outputs from previous failures
2008-12-01 10:00:35,510 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1 Need another 2 map output(s) where 0 is already in progress
2008-12-01 10:00:35,511 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1: Got 0 new map-outputs
2008-12-01 10:00:35,511 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2008-12-01 10:00:35,516 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200812010806_0337_r_000000_1 copy failed: attempt_200812010806_0337_m_002787_0 from null
2008-12-01 10:00:35,518 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:152)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:788)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:729)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:654)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1312)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1250)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1163)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1100)

Jothi Padmanabhan added a comment - 01/Dec/08 10:46 AM
I observed this behavior too, so this is not a one-off issue.

The code to obtain host from the MapCompletionEvent is

URI u = URI.create(event.getTaskTrackerHttp());
            String host = u.getHost();

For some reason, host is returned as null.


Devaraj Das added a comment - 02/Dec/08 06:06 PM
Marking this issue a blocker for 0.20.

Devaraj Das added a comment - 11/Dec/08 06:00 PM
Observed that this happens whenever the HttpServer.getPort() (that internally calls webServer.getConnectors()[0].getLocalPort()) returns a -1. In the TaskTracker logs for this particular node, it seems jetty could successfully bind to a valid port. The question is why would jetty return a -1 when it is queried. Could it be a race condition in jetty to do with storing the port number in a particular field. Or, does jetty bind to the port asynchronously (putting it in another way, is webServer.start a blocking call) ? Note that we are passing 0 as the port# to bind to, which means it should bind to an OS assigned port...

David Yu added a comment - 12/Dec/08 08:56 AM
It means that the ServerSocketChannel has not yet been opened. (assuming that you're using jetty's nio connector)
The call should be done after Connector.open() or perhaps Connector.start() has been called.

Cheers,
dyu


David Yu added a comment - 12/Dec/08 08:59 AM
It seems you're calling the wrong method if you wanted the port to which the connector listens.
Use connector.getPort() instead of connector.getLocalPort()

Cheers


Greg Wilkins added a comment - 12/Dec/08 09:25 AM
Just to clarify the dfference between getPort() and getLocalPort().
getPort() returns the configured port, which may be 0 if there is no proference.
getLocalPort get's the actual bound port, which may be randomly allocated.

connector.open() should do the binding and allocate the port, which getLocalPort() can
return.

The socket is not put into async mode until after the bind, so there should be no race.


Devaraj Das added a comment - 12/Dec/08 07:38 PM
Just to clarify things. The sequeunce of calls that are made are (details in src/core/org/apache/hadoop/http/HttpServer.java) listed below. Please see if something is missing or is incorrect in the sequence below.
Server webServer = new org.mortbay.jetty.Server();

    Connector listener = new SelectChannelConnector();

    listener.setLowResourceMaxIdleTime(10000);
    listener.setAcceptQueueSize(128);
    listener.setResolveNames(false);
    listener.setUseDirectBuffers(false);
    listener.setHost(bindAddress);
    listener.setPort(port); //THE PORT PASSED IS 0

    webServer.addConnector(listener);
   //other webServer initialization follows...
   //...

   webServer.start();

   port = webServer.getConnectors()[0].getLocalPort(); //CAN THIS RETURN -1

Jothi Padmanabhan added a comment - 02/Jan/09 04:59 AM
I printed the port number returned by getPort (which internally calls getLocalPort) in TaskTracker.java after we do the server.start. It did print -1. A brief perusal of org.mortbay.jetty.Server.java shows that connectors.start is called from inside server.start. David/Greg, could you confirm if this is correct? Is there some timing issue involved somewhere where the port is still not set before the Server.start is returned?

David Yu added a comment - 02/Jan/09 08:35 AM
I've tried the snippet that Devaraj Das posted ..did not ever get -1.
Jothi, right before the println on the port, can you do:
System.err.println("open: " + ((ServerSocketChannel)webServer.getConnectors()[0].getConnection()).isOpen());

More like ... was the socket channel open when it returned -1?


Devaraj Das added a comment - 02/Jan/09 08:45 AM
David, I forgot to mention that this happens on only a few machines on our big clusters. It is not easily reproducible.

Devaraj Das added a comment - 18/Feb/09 04:36 PM
I have a workaround patch that just invokes System.exit at the TT when it discovers that the port returned is -1. Since we see the problem on only a very few trackers on startup (like 1 in 500), I propose that we live with this workaround until jetty provides the actual fix. This problem is tracked in http://jira.codehaus.org/browse/JETTY-748 as well. I will submit the workaround soon.

Devaraj Das added a comment - 18/Feb/09 06:31 PM
Attaching the workaround patch.

Hadoop QA added a comment - 19/Feb/09 12:50 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12400434/4744.patch
against trunk revision 745268.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3878/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3878/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3878/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3878/console

This message is automatically generated.


Arun C Murthy added a comment - 23/Feb/09 04:29 AM
+1

Vinod K V added a comment - 23/Feb/09 10:35 AM

`ant test` passed locally. `ant test-patch` passed everything but findBugs. Here's the warning:

Dm                       org.apache.hadoop.http.HttpServer.getPort() invokes System.exit(...), which shuts down the entire virtual machine
                                              In class org.apache.hadoop.http.HttpServer
                                              In method org.apache.hadoop.http.HttpServer.getPort()
                                              At HttpServer.java:[line 350]

And the reason stated is :

DM_EXIT: Method invokes System.exit(...)
Invoking System.exit shuts down the entire Java virtual machine. This
   should only been done when it is appropriate. Such calls make it
   hard or impossible for your code to be invoked by other code.
   Consider throwing a RuntimeException instead

Devaraj Das added a comment - 23/Feb/09 11:55 AM
I just committed the workaround patch to the 0.20 and 0.21 branches but I am leaving the issue open for now (with the hope that we will get a fix from the jetty folks). I am ignoring the warning to do with invoking System.exit.

David Yu added a comment - 23/Feb/09 12:07 PM
Its pretty hard to fix a bug (not even 100% sure this is a jetty bug) that cannot be replicated.
If you could provide a small sample webapp that could possibly produce the same error, then than would speed things up.

Cheers


Steve Loughran added a comment - 23/Feb/09 12:33 PM
It would be better to throw an exception than exit. Those of us who trap all System.exit calls from inside Hadoop will only intercept and discard the call from the security manager. The calling code could recognise this event and do the shutdown itself.

Hudson added a comment - 23/Feb/09 03:17 PM
Integrated in Hadoop-trunk #763 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/)
. Workaround for jetty6 returning -1 when getLocalPort is invoked on the connector. The workaround patch takes the most conservative approach of killing the server process whenever this is true. Contributed by Devaraj Das.

Devaraj Das added a comment - 23/Feb/09 04:35 PM
Ok, here is a patch that throws an exception when there is a problem in the start of jetty server. The earlier patch can be reverted and this patch can be committed instead.

Devaraj Das added a comment - 23/Feb/09 04:37 PM
David, we see this problem on one or two machines in a 2000 node cluster. So I guess if you run the code snippet i had commented with earlier in a tight loop, you might hit the problem..

Greg Wilkins added a comment - 26/Feb/09 08:31 AM

A couple of comments on the patches I've seen for this.

Firstly, I still don't understand how this is happening, unless there is a race inside the JVM socket layer. Jetty's code here is
synchronous, so it is not a memory caching issue and in your code it is the one thread that does the bind and then the
getLocalPort.

But I'll believe you that it is happening. So a couple of comments about the patch.

It is possible to call open on a connector(listener) before start, so that it does the bind early. So rather than
starting the whole server, detecting failure and then trying again, I would make your init code call open directly
on the listener.

You appear to be trying to search for a free port with:
listener.setPort(listener.getLocalPort() + 1);
but getLocalPort() is returning -1, so will always set 0.

But setting 0 as the port is probably want to want to do in the first place. The semantics of setting 0 is that
the operating system will pick a free port for you, so you probably want to start with that.

If you still get -1 as the return, then I would suspect a race in the JVM library. So rather than stopping and relooping,
it might be worthwhile to sleep for a little bit and try getLocalPort again. Only if it then returns -1 should you
call close on the connector.

So in psuedo code I would write your start method like:

<pre>
start()
{
listener.open();
int port = listener.getLocalPort();
if (port<0)
{
Thread.sleep(100);
while (port<0)

{ if (retries++>limit) throw new NastyException() listener.close(); listener.open(); Thread.sleep(100); port = listener.getLocalPort(); }

}

server.start();
</pre>

Ie, you are just cycling the connector/listener until you get a port, rather than cycling the
whole server.


Devaraj Das added a comment - 27/Feb/09 05:48 AM
I am assigning this to Jothi since I am going to be out for the next couple of days...

Jothi Padmanabhan added a comment - 27/Feb/09 11:05 AM
Attaching a patch with changes suggested

Greg Wilkins added a comment - 27/Feb/09 11:17 AM
Jothi,

latest patch looks about right.

However, it would be really interesting to include some logging in it to gather a bit more information about the -1 return.

Specifically, I'm interested if the pause after an initial -1 return is sufficient to get a non -1 port. If so, then there is
definitely a race or memory barrier problem within the JVM. If you you have to close and reopen to be
successful, then something else is going on.

So I'd be hopeful this code will work around the issue, but as it is a kind of ugly thing to do, it would be good to
find out more.

cheers


Hadoop QA added a comment - 01/Mar/09 02:06 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401100/hadoop-4744-new.patch
against trunk revision 748861.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/22/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/22/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/22/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/22/console

This message is automatically generated.


Jothi Padmanabhan added a comment - 02/Mar/09 05:36 AM
Test failures org.apache.hadoop.http.TestServletFilter.testServletFilter and org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestStartAtOffset.testStartAtOffset are unrelated to this patch.

Jothi Padmanabhan added a comment - 02/Mar/09 08:33 AM
Had on an offline discussion with Greg – we can get information on which call returned non -1 port from the number of debug messages already present in the patch.

Hemanth Yamijala added a comment - 02/Mar/09 10:37 AM
I committed the latest patch to trunk and the 0.20 branch.

Hudson added a comment - 13/Mar/09 03:05 PM

Devaraj Das added a comment - 06/May/09 04:21 PM
Looks like the jetty bounce code is not working as expected. Specifically, this is what is observed:
1) In HttpServer.java::start(), after the call to listener.open(), the listener.getLocalPort() returns a seemingly right port number
2) After the above, in TaskTracker.java, the call to HttpServer.getPort() returns a -1. (HttpServer.getPort() invokes listener.getLocalPort() indirectly)

When this happens, we manually checked whether jetty actually bound to the port that it claimed in (1), and we saw that jetty didn't bind to any port at all! So it is a complete mystery how the call getLocalPort could return two different values at two different times.

This patch tries to work around the problem by shutting down the TaskTracker process if and when (2) happens.


Steve Loughran added a comment - 07/May/09 10:20 AM
Could the new patch throw an exception instead? As the hadoop-4744-new.patch did? That means that however the TT is executed and that included under junit the failure is caught at the right layer of code, instead of having methods acrosss the codebase deciding when to exit the JVM.

Devaraj Das added a comment - 07/May/09 12:30 PM
This addresses Steve's concern.

Steve Loughran added a comment - 07/May/09 01:19 PM
looks good to me, +1

Devaraj Das added a comment - 08/May/09 05:32 AM
I just committed this patch (on the 0.20 branch also).

Hudson added a comment - 08/May/09 07:55 PM
Integrated in Hadoop-trunk #830 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/830/)
. Attaching another fix to the jetty port issue. The TaskTracker kills itself if it ever discovers that the port to which jetty is actuallybound is invalid (-1). Contributed by Devaraj Das.