  1. Hadoop Common
  2. HADOOP-11574

Uber-JIRA: improve Hadoop network resilience & diagnostics

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: net

    Description

      Improve Hadoop's resilience to bad network conditions/problems, including

      • improving recognition of problem states
      • improving diagnostics
      • better handling of IPv6 addresses, even if the protocol is unsupported.
      • better client-side behaviour when there are connectivity problems (i.e. while you can spin on some errors, DNS failures are not among them).
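
The last point can be illustrated with a minimal sketch of a client-side retry decision. The class `ConnectFailurePolicy` and the method `isWorthRetrying` are hypothetical names, not existing Hadoop code, and the choice of which exceptions count as transient is an assumption; the only rule taken from the description is that a DNS failure should not be spun on.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only; not part of Hadoop. */
public class ConnectFailurePolicy {

  /** Returns true only for failures that a later attempt might plausibly get past. */
  public static boolean isWorthRetrying(IOException e) {
    if (e instanceof UnknownHostException) {
      // The name did not resolve. This is usually a configuration or DNS
      // problem, so spinning on it only delays the error report.
      return false;
    }
    // Assumption: a refused connection or a timeout may be transient,
    // for example while a service is still starting up.
    return e instanceof ConnectException || e instanceof SocketTimeoutException;
  }

  public static void main(String[] args) {
    System.out.println(isWorthRetrying(new ConnectException("Connection refused")));   // true
    System.out.println(isWorthRetrying(new UnknownHostException("no-such-host")));     // false
  }
}
```

Anything not explicitly listed is treated as non-retryable here, which is a deliberately conservative default for a sketch; a real policy would need to be agreed per exception type.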

          Activity

            stevel@apache.org Steve Loughran added a comment -

            Link to HADOOP-11582; TestDNS failing with some IPv6 problems

            stevel@apache.org Steve Loughran added a comment -

            Stack

            org.apache.hadoop.ipc.RemoteException: stevel.local/192.168.1.12:1026: java.lang.IllegalArgumentException: getLength on uninitialized RpcWrapper
            	at org.apache.hadoop.ipc.ProtobufRpcEngine$RpcResponseWrapper.getLength(ProtobufRpcEngine.java:488)
            	at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2312)
            	at org.apache.hadoop.ipc.Server.access$2500(Server.java:135)
            	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2084)
            
            cmccabe Colin McCabe added a comment -

            Thanks for filing this, Steve. What's the proposal? Perhaps we could have a central function in Hadoop for DNS resolution and have it check for IPv6 (and give a better error message if so)?
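
A central resolution function of the kind suggested here might look roughly like the following sketch. `ResolverSketch`, `resolve` and `requireIPv4` are hypothetical names, and rejecting IPv6-only hosts outright is an assumption about the desired behaviour; only the `java.net` calls are real APIs.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single entry point for hostname resolution; not existing Hadoop code. */
public class ResolverSketch {

  /** Keeps the IPv4 addresses and fails with an explanatory message if none remain. */
  public static List<InetAddress> requireIPv4(String host, InetAddress[] resolved)
      throws UnknownHostException {
    List<InetAddress> ipv4 = new ArrayList<>();
    int ipv6Count = 0;
    for (InetAddress addr : resolved) {
      if (addr instanceof Inet6Address) {
        ipv6Count++;
      } else {
        ipv4.add(addr);
      }
    }
    if (ipv4.isEmpty()) {
      String reason = ipv6Count > 0
          ? "resolved only to " + ipv6Count + " IPv6 address(es), which are not supported"
          : "resolved to no addresses";
      throw new UnknownHostException("Host '" + host + "' " + reason);
    }
    return ipv4;
  }

  /** Resolves a hostname through the JDK resolver and applies the IPv6 check. */
  public static List<InetAddress> resolve(String host) throws UnknownHostException {
    return requireIPv4(host, InetAddress.getAllByName(host));
  }
}
```

Splitting the check from the lookup keeps the IPv6 logic testable without a live DNS server: `requireIPv4` can be fed addresses built with `InetAddress.getByAddress`.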


            cdouglas Christopher Douglas added a comment -

            Would multiple NICs be in scope?

            stevel@apache.org Steve Loughran added a comment -

            If you look at the open JIRA list, especially related to net and ipc, there's a vast collection of minor JIRAs, related to

            1. diagnostics and failure handling (exception swallowing, error text)
            2. reacting to IPv6 addresses. Even without IPv6 support, the code should not be surprised to see them, and fail meaningfully.
            3. messy teardown/cleanup, including some potential blocks & deadlocks.

            None of these is as significant as, say, multi-NIC support, but together they'd make for a network client that is more resilient to config problems & slightly easier to debug when things are playing up. I think it's the fact that they are so minor that means nobody ever sits down to fix them. Together they'd be good.

            This JIRA can simply act as a place to aggregate/link those outstanding issues under the common theme of resilience and diagnostics. Things like performance & checksums would be separate bits of work altogether.

            stevel@apache.org Steve Loughran added a comment -

            +Chris, HADOOP-8198 covers NICs

            cmccabe Colin McCabe added a comment -

            Thanks again, Steve, great to see some activity on this.

            I think from a pragmatic point of view, we might want to limit this JIRA to just better network error diagnostics. Expanding the scope to cover HADOOP-8198 might make it harder to complete. (Of course, if you've got the bandwidth to implement a full multi-NIC solution for Hadoop, that would be great.) But it really seems like HADOOP-8198 is a big enough JIRA that it should have its own set of subtasks, rather than being a subtask of this JIRA.

            Another important thing to point out here is that we have a lot of people using multi-NIC in Hadoop via interface bonding. Basically you can make two hardware ethernet cards (or onboard ports, etc) look like one by loading the Linux ethernet bonding driver. And then no Java code changes are needed. Of course this doesn't cover all the multi-NIC cases, but it does help explain why multi-NIC hasn't been much of a pain point for us (and hasn't been completed).

            stevel@apache.org Steve Loughran added a comment -

            Colin, I'd like to keep this simple and primarily run through the outstanding JIRAs for short diagnostics fixes. An example is HADOOP-9844, which improves the transport of exception string values over IPC: two lines, high value.

            There's enough out there (suspiciously, a lot by me) that they could be reviewed and committed fast without worrying about major bits of work. A lot of them aren't easy to test, though that's not an excuse to avoid tests out of laziness.

            stevel@apache.org Steve Loughran added a comment -

            Thinking some more, how about some simple sentences to scope this?

            Ops:

            "I can diagnose problems in my cluster without needing to know Java stack traces & without an IDE to hand."

            Dev:

            "If my application fails due to some problem on a remote machine, enough information is returned to me that I know where to begin fixing it."
            rchiang Ray Chiang added a comment -

            I like the user-centric definitions above. Then for each type of error, such as:

            • DNS/UnknownHostException
            • RPC/RemoteException
            • SecurityException

            We can see where it's deficient in the context of each user.
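
As a rough illustration of checking each error type against the two user statements, the sketch below turns a failure into a single line that names the destination and a first thing to check. `FailureDiagnostics` and `describe` are hypothetical names and the message wording is an assumption. Hadoop's `RemoteException` is not referenced so that the example stays self-contained; it would fall through to the generic branch.

```java
import java.net.ConnectException;
import java.net.UnknownHostException;

/** Hypothetical formatter for illustration only; not part of Hadoop. */
public class FailureDiagnostics {

  /** Builds a one-line, operator-readable summary of a failed call to host:port. */
  public static String describe(Throwable t, String host, int port) {
    String target = host + ":" + port;
    if (t instanceof UnknownHostException) {
      return "DNS: cannot resolve host '" + host
          + "'; check the configured hostname and the DNS or hosts-file entries";
    }
    if (t instanceof ConnectException) {
      return "Network: connection to " + target
          + " failed; check that the service is running and the port is reachable";
    }
    if (t instanceof SecurityException) {
      return "Security: access denied when contacting " + target + ": " + t.getMessage();
    }
    // Generic fallback: always name the destination and the exception type.
    return t.getClass().getSimpleName() + " when contacting " + target + ": " + t.getMessage();
  }

  public static void main(String[] args) {
    // "nn1" and 8020 are placeholder values for the example.
    System.out.println(describe(new UnknownHostException("nn1"), "nn1", 8020));
  }
}
```

Keeping the summary to one line per failure is one way to respect the concern about log volume, with the full stack trace left to a debug-level log.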

            As with most of our log messages, I might worry a bit about finding the right balance of giving notification and filling the logs too much.


            People

              Assignee: Unassigned
              Reporter: stevel@apache.org Steve Loughran
              Votes: 0
              Watchers: 13

                Time Tracking

                  Original Estimate: 0.5h
                  Remaining Estimate: 0.5h
                  Time Spent: Not Specified