Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.2
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: None

      Description

      We need some mechanism for RPC calls that hit exceptions to automatically retry under certain circumstances. In particular, calls to RPCs often end up being wrapped in retry loops to handle timeouts. We should be able to make a retrying proxy that calls the RPC and retries in some circumstances.
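
      For illustration, one of those hand-wrapped retry loops might look roughly like the sketch below. SomeProtocol, getStatus(), the five-attempt limit and the five-second delay are all hypothetical placeholders, not Hadoop code; the point is only that every caller currently has to write this boilerplate itself.

      import java.io.IOException;
      import java.net.SocketTimeoutException;

      // Hypothetical example of a hand-rolled retry wrapper around an RPC call.
      public class ManualRetryExample {
        interface SomeProtocol {
          String getStatus() throws IOException;
        }

        static String getStatusWithRetry(SomeProtocol proxy)
            throws IOException, InterruptedException {
          int attempts = 0;
          while (true) {
            try {
              return proxy.getStatus();     // success: return the result
            } catch (SocketTimeoutException e) {
              if (++attempts >= 5) {
                throw e;                    // give up after five attempts
              }
              Thread.sleep(5000);           // wait 5 seconds, then retry
            }
          }
        }
      }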

      1. HADOOP-601-v1.patch
        5 kB
        Johan Oskarsson

        Activity

        Owen O'Malley added a comment -

        Ok, I propose that we add annotations to the rpc protocol classes:

        public interface FooBar extends VersionedProtocol {

        @RpcReliability(@Try) /* single try, no exception for failure */
        void progress();

        @RpcReliability(@TryNotify) /* single try, exception for failure */
        void setStatus(String msg);

        @RpcReliability(@Forever) /* keep trying forever */
        void done();

        @RpcReliability(@Timed(30,5)) /* try for 30 seconds waiting 5 seconds between attempts and then throw */
        void doWork();

        /* Set a default policy with some explicit handlers for specific exceptions.
           Matches with remote exceptions will also be handled. */
        @RpcReliability(value=@Timed(30,5),
            handlers={@ErrorHandler(err=SocketTimeoutException.class, response=@Forever),
                      @ErrorHandler(err=NullPointerException.class, response=@TryNotify)})
        void complexWork();

        }

        Thoughts?
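
        One wrinkle to note for whoever implements this: a Java annotation member must have a single fixed type, so a value that can be either @Try or @Timed(30,5) is not directly expressible, and the policies would most likely collapse into an enum plus numeric members. A hypothetical sketch of what the declarations could look like (all names and defaults here are illustrative, not from any patch):

        import java.lang.annotation.Retention;
        import java.lang.annotation.RetentionPolicy;

        // Hypothetical declarations only.
        enum RetryPolicy { TRY_NOTIFY, FOREVER, TIMED }

        @Retention(RetentionPolicy.RUNTIME)
        @interface ErrorHandler {
          Class<? extends Throwable> err();
          RetryPolicy response();
        }

        @Retention(RetentionPolicy.RUNTIME)
        @interface RpcReliability {
          RetryPolicy value() default RetryPolicy.TRY_NOTIFY;
          int timedTrySecs() default 30;    // total time budget, TIMED only
          int timedDelaySecs() default 5;   // pause between attempts, TIMED only
          ErrorHandler[] handlers() default {};
        }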

        Doug Cutting added a comment -

        This looks good to me. +1

        For comparison, the proxy approach might look something like:

        XxxProtocol proxy = new RetryProxy(XxxProtocol.class, Try.FOREVER);

        The weakness of this approach would be (a) the client would need to know how to build an appropriate proxy for the protocol, and (b) specifying different reliability contracts for different methods would be awkward. By contrast, your annotation-based approach keeps the reliability requirements with the protocol and per-method.
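
        To make weakness (b) concrete, per-method reliability with a client-built proxy would mean something like the fragment below. RetryProxy, Try and XxxProtocol are the names from the comment above; the map-based factory method, Try.ONCE and rawProxy are assumptions for illustration only.

        // Hypothetical fragment: the caller assembles per-method policies by
        // method name, far from the protocol definition, and must know to do so.
        java.util.Map<String, Try> perMethod = new java.util.HashMap<String, Try>();
        perMethod.put("progress", Try.ONCE);     // single attempt for progress()
        perMethod.put("done", Try.FOREVER);      // retry done() forever
        XxxProtocol proxy = RetryProxy.create(XxxProtocol.class, rawProxy, perMethod);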

        Doug Cutting added a comment -

        Note: VersionedProtocol.getProtocolVersion() should retry at least a few times.

        Johan Oskarsson added a comment -

        Any news on this one?

        Whenever the jobtracker is busy, our JobClient.submitJob calls fail.
        This in turn means a whole chain of jobs that depends on that submission fails; it's a big issue when it comes to using Hadoop in production.

        Johan Oskarsson added a comment -

        Created a draft patch to solve this issue more or less as suggested by Owen.

        I left out the Try option (try once, no exception) since it's not possible to return null if the proxied method has a primitive return value.

        This patch just contains the annotations and retry bits; I'm going to leave it up to someone with more intimate knowledge of Hadoop to decide which retry policy goes where.

        Suggestions and corrections welcome.

        Example usage:
        @RpcReliability(RetryPolicy.FOREVER)
        public void doSomething() throws IOException;

        @RpcReliability(RetryPolicy.TRY_NOTIFY)
        public void doSomething() throws IOException;

        @RpcReliability(value=RetryPolicy.TIMED, timedDelaySecs=5, timedTrySecs=30)
        public int doSomething() throws IOException;

        @RpcReliability(value=RetryPolicy.TIMED, timedDelaySecs=5, timedTrySecs=30,
            handlers={@ErrorHandler(err=SocketTimeoutException.class, response=RetryPolicy.FOREVER),
                      @ErrorHandler(err=NullPointerException.class, response=RetryPolicy.TRY_NOTIFY)})
        public int doSomething() throws IOException;
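
        For concreteness, the retry side of such a scheme could be a dynamic proxy whose invocation handler looks up the method's annotation, roughly as sketched below. This is an illustration only, not the contents of HADOOP-601-v1.patch; it assumes an @RpcReliability annotation and RetryPolicy enum shaped like the usage examples above.

        import java.lang.reflect.InvocationHandler;
        import java.lang.reflect.InvocationTargetException;
        import java.lang.reflect.Method;
        import java.lang.reflect.Proxy;

        // Illustrative sketch: retry RPC calls according to the method's
        // @RpcReliability annotation.
        public class RetryInvocationHandler implements InvocationHandler {
          private final Object target;

          private RetryInvocationHandler(Object target) {
            this.target = target;
          }

          @SuppressWarnings("unchecked")
          public static <T> T wrap(Class<T> iface, T target) {
            return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[] { iface }, new RetryInvocationHandler(target));
          }

          public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            RpcReliability policy = method.getAnnotation(RpcReliability.class);
            long deadline = System.currentTimeMillis()
                + (policy == null ? 0L : policy.timedTrySecs() * 1000L);
            while (true) {
              try {
                return method.invoke(target, args);
              } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (policy == null || policy.value() == RetryPolicy.TRY_NOTIFY) {
                  throw cause;                                  // single try: rethrow
                }
                if (policy.value() == RetryPolicy.TIMED
                    && System.currentTimeMillis() >= deadline) {
                  throw cause;                                  // time budget exhausted
                }
                Thread.sleep(policy.timedDelaySecs() * 1000L);  // back off, then retry
              }
            }
          }
        }

        A client could then wrap its RPC proxy once, along the lines of RetryInvocationHandler.wrap(SomeProtocol.class, rawProxy), instead of repeating retry loops at every call site.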

        Johan Oskarsson added a comment -

        granted licence

        Tom White added a comment -

        I created a generic mechanism for this as a part of HADOOP-997. (It doesn't have the annotations piece, but it would be easy enough to add.)

        Johan Oskarsson added a comment -

        Looks much better!
        Hope to see that in a future version.

        Tom White added a comment -

        Thanks Johan. It would be good to add the annotations handling from your patch to the mechanism in HADOOP-997. This is probably best done after HADOOP-997 is committed, as one of a series of follow-up Jiras to introduce retries in various parts of Hadoop. Note that if you use annotations, you can't get parameters from the configuration file (unless anyone knows of a way to do this?).

        Also, I've put the retry classes in a new package org.apache.hadoop.io.retry. Do people feel this is OK? It doesn't feel quite right as it's not exclusively for IO retries.
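
        For reference, usage of the org.apache.hadoop.io.retry classes ends up looking roughly like the fragment below (sketched from the class names mentioned in this thread; exact signatures in the HADOOP-997 patch may differ, and rawProxy and XxxProtocol stand in for a real protocol and its plain RPC proxy):

        import org.apache.hadoop.io.retry.RetryPolicies;
        import org.apache.hadoop.io.retry.RetryPolicy;
        import org.apache.hadoop.io.retry.RetryProxy;

        // Wrap an existing proxy so that every call retries forever on failure.
        RetryPolicy policy = RetryPolicies.RETRY_FOREVER;
        XxxProtocol retrying = (XxxProtocol) RetryProxy.create(XxxProtocol.class, rawProxy, policy);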

        Doug Cutting added a comment -

        This is mostly included in HADOOP-997.


          People

          • Assignee: Owen O'Malley
          • Reporter: Owen O'Malley
          • Votes: 1
          • Watchers: 2
