Hadoop Common
HADOOP-1111

Job completion notification to a job configured URL

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.2
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None
    • Environment: all

      Description

      Currently clients have to poll the JobTracker to find out whether a job has completed.

      When invoking Hadoop from other systems, it is desirable to have a notification mechanism on job completion.

      The notification approach simplifies the client's wait for completion and removes load from the JobTracker, as polling can be avoided.

      Proposed solution:

      When the JobTracker processes the completion of a job (success or failure), if the job configuration has a jobEnd.notificationUrl property it will make an HTTP GET request to the specified URL.

      The jobEnd.notificationUrl property may include two variables, '${jobId}' and '${jobStatus}'. If they are present, they will be replaced with the job ID and the status of the job before the URL is invoked.
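As a sketch, the variable substitution described above could look like this (the class and method names here are illustrative, not the patch's actual code):

```java
// Illustrative sketch of the ${jobId}/${jobStatus} substitution; the class
// and method names are hypothetical, not the identifiers used in the patch.
public class NotificationUrl {
  public static String expand(String url, String jobId, String jobStatus) {
    // Replace the two supported variables where present; a URL without
    // them is invoked unchanged.
    return url.replace("${jobId}", jobId).replace("${jobStatus}", jobStatus);
  }
}
```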

      Two additional properties, 'jobEnd.retries' and 'jobEnd.retryInterval', will control the retry behavior.

      To avoid delaying JobTracker processing while doing notifications, a producer-consumer queue will be used to queue up job notifications upon completion.

      A daemon thread will consume job notifications from this queue and make the URL invocations.

      On notification failure, the job notification is queued up again on the notification queue.

      The queue will be a java.util.concurrent.DelayQueue. This makes job notifications (on retries) available on the consumer side only when the retry interval has elapsed.

      The changes will be done in the JobTracker and in the LocalJobRunner.
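The producer-consumer queue described above could be sketched roughly as follows; the class, method, and field names are illustrative assumptions, not the identifiers used in the actual patch:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed notification queue; names are
// illustrative, not the patch's actual code.
public class JobEndNotifier {
  public static class Notification implements Delayed {
    final String url;
    int retriesLeft;
    long readyAt;  // absolute time (ms) at which take()/poll() may return this

    Notification(String url, int retries, long delayMs) {
      this.url = url;
      this.retriesLeft = retries;
      this.readyAt = System.currentTimeMillis() + delayMs;
    }

    public long getDelay(TimeUnit unit) {
      return unit.convert(readyAt - System.currentTimeMillis(),
                          TimeUnit.MILLISECONDS);
    }

    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                          other.getDelay(TimeUnit.MILLISECONDS));
    }
  }

  static final DelayQueue<Notification> queue = new DelayQueue<Notification>();

  // Producer side: called when the JobTracker finishes processing a job.
  public static void enqueue(String url, int retries) {
    queue.put(new Notification(url, retries, 0));  // available immediately
  }

  // Consumer side: a daemon thread takes notifications and invokes the URL;
  // on failure the notification is re-queued with a non-zero delay, so it
  // only becomes visible to the consumer once the retry interval is up.
  public static void requeueAfterFailure(Notification n, long retryIntervalMs) {
    if (n.retriesLeft-- > 0) {
      n.readyAt = System.currentTimeMillis() + retryIntervalMs;
      queue.put(n);
    }
  }
}
```

Note the post-decrement: the remaining-retries comparison happens before the counter is decremented, which matters once retries reach zero.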

      1. patch-1111.txt
        23 kB
        Alejandro Abdelnur
      2. patch-1111.txt
        23 kB
        Alejandro Abdelnur
      3. patch-1111.txt
        23 kB
        Alejandro Abdelnur
      4. patch-1111.txt
        21 kB
        Alejandro Abdelnur
      5. patch-1111.txt
        21 kB
        Alejandro Abdelnur
      6. patch-1111.txt
        11 kB
        Ruchir
      7. patch-1111.txt
        11 kB
        Ruchir
      8. patch-1111.txt
        10 kB
        Alejandro Abdelnur
      9. patch-1111.txt
        10 kB
        Alejandro Abdelnur
      10. patch-1111.txt
        10 kB
        Alejandro Abdelnur
      11. patch-1111.txt
        11 kB
        Alejandro Abdelnur

        Activity

        Hadoop QA added a comment -

        Integrated in Hadoop-Nightly #63 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/63/)
        Tom White added a comment -

        I've just committed this. Thanks Alejandro!

        Tom White added a comment -

        On 18/04/07, Nigel Daley <ndaley@yahoo-inc.com> wrote:
        > All the tests get test.build.data from a system property. It's set
        > by ant when the tests are run.

        Sorry - my mistake.

        Tom White added a comment -

        I think this should get test.build.data from the Hadoop Configuration object rather than from system properties.

        Alejandro Abdelnur added a comment -

        changes based on Doug feedback

        Hadoop QA added a comment -

        +1 http://issues.apache.org/jira/secure/attachment/12355663/patch-1111.txt applied and successfully tested against trunk revision r529763.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/59/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/59/console
        Doug Cutting added a comment -

        This looks good to me, except the unit tests write in the connected directory rather than to the value of the test.build.data system property. Also, I'd rather have something like "localhost" in hadoop-default.xml rather than a real hostname like "foo.com".

        Alejandro Abdelnur added a comment -

        I've run patch on Linux and OS X and it works fine.

        Hudson is failing, but the testReport page has not been created, so it is not possible to see what the problem is.

        Hadoop QA added a comment -

        -1, build or testing failed

        2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355659/patch-1111.txt against trunk revision r529432.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/56/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/56/console

        Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

        Alejandro Abdelnur added a comment -

        Made the HTTP server port that receives notifications dynamically assigned (by setting it to zero).

        Alejandro Abdelnur added a comment -

        The notification webserver had a hardcoded host; made it dynamically chosen.

        Alejandro Abdelnur added a comment -

        Testcases now use the build/test directory for DFS and testing/ for local. Coded to run on DFS.

        Alejandro Abdelnur added a comment -

        Patch tested at revision 529472.

        Alejandro Abdelnur added a comment -

        I'm working on the unit tests; they were running on my box (Mac OS X) but failing on Linux.

        Found the reason: they were writing to '/testing/' and on Linux I don't have write permissions to '/'.

        Tom White added a comment -

        Alejandro, the new unit tests are failing (on Hudson and when I run them on my machine) - could you take a look please?

        Hadoop QA added a comment -

        -1, build or testing failed

        2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355421/patch-1111.txt against trunk revision r527711.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/38/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/38/console

        Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

        Alejandro Abdelnur added a comment -

        Patch tested at revision 527913.

        Alejandro Abdelnur added a comment -

        Patch was not in sync with SVN head.

        Alejandro Abdelnur added a comment -

        Patch running at revision 527898.

        Hadoop QA added a comment -

        -1, build or testing failed

        2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355221/patch-1111.txt against trunk revision r527711.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/37/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/37/console

        Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

        Alejandro Abdelnur added a comment -

        Latest uploaded patch contains testcases, as requested.

        Alejandro Abdelnur added a comment -

        Added testcases for job notification, both local and cluster modes.

        Introduced a HadoopTestCase class that simplifies running testcases in cluster/local dfs/local-fs.

        Doug Cutting added a comment -

        I'd rather wait to commit this until we have some unit tests for it.

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12353859/patch-1111.txt applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/523072. Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
        Alejandro Abdelnur added a comment -

        Fixed 2 nits:

        The thread wait for the local runner case was not being done within a synchronized block; added one. Also changed the wait() call to sleep(), as the thread is just sleeping.

        The retry-attempts decrement was happening before the comparison for remaining retries.

        Still have to get the testcases to work; having problems getting them up and running. (Looking at the existing test cases, a base TestCase class is missing that takes care of running local/cluster and local-fs/DFS Hadoop by configuration; working on it, will be another patch.)

        Tom White added a comment -

        Great - thanks. There are some notes on writing tests here: http://wiki.apache.org/lucene-hadoop/HowToContribute.

        Alejandro Abdelnur added a comment -

        OK, I will put together an example.

        BTW, any special thing to do to create a testcase?

        How do you do a test for local testing and for cluster testing?

        Just found out a couple of nits in the latest patch, I'll make the necessary corrections and I'll resubmit a fixed patch.

        Tom White added a comment -

        OK - sounds like the current approach is an acceptable solution. It would be nice if an example could be added so people can see how to use it in practice.

        Owen O'Malley added a comment -

        The architecture of Google's map/reduce is pretty irrelevant. Hadoop's architecture is currently REST (client always initiates connections) both in HDFS and map/reduce. My feeling is that using http from the JobTracker to the client-defined url is ok here, even though it is violating the architecture. I wouldn't like to see a RPC server being launched in the client code or the framework depending on "downward" calls without a very serious discussion.

        Alejandro Abdelnur added a comment -

        Tom, my initial idea was to define a public interface as you suggest and it would be up to the job to provide an implementation.

        But it was pointed out to me that that would mean user code running on the JobTracker, which is not desirable, as poorly written code could slow down, halt, or crash the JobTracker.

        Tom White added a comment -

        In HDFS, by design NameNodes never initiate connections. However, I'm not sure the same design principle applies to JobTracker. For a start, the MapReduce paper says (section 3.1, point 7)

        "When all map tasks and reduce tasks have been completed, the master wakes up the user program."

        which implies the master initiates a connection. What do others think?

        However, perhaps we should question the use of HTTP - would IPC and a callback interface not be both more consistent with the rest of MapReduce in Hadoop, and a nicer API for the user?

        Arun C Murthy added a comment -

        At the risk of sounding prudish, I have some reservations about this feature - inasmuch as the architecture as it exists today only has one-way communication, i.e. initiated by the tasktrackers/job-clients. I'd be wary of letting the JobTracker initiate connections...

        David Bowen added a comment -

        OK, I withdraw my objections. The retry logic makes perfect sense. I looked at the HTTP spec again, and the reason that GET is supposed to be "safe" (as well as "idempotent") is so that user agents can warn the user that they are doing something that might be dangerous. So, for programmatic use, this really isn't an issue.

        +1.

        Alejandro Abdelnur added a comment -

        On the DelayQueue, you are right, there is no delay when the JobEndStatusInfo is scheduled for delivery. But when a retry has to happen the use of the DelayQueue becomes obvious: after a failure, the retry interval is computed on the JobEndStatusInfo and it is requeued; this time the delay will not be zero.

        On POST vs GET, I meant heavier in the sense that you have to send/receive a payload. IMO sending the jobid & status as a form-encoded payload is a little strange. I get the point of GET versus POST and changing state, but according to the HTTP spec GET calls have to be idempotent, which would mean (my interpretation of it) that if state is changed it should be changed only once.

        David Bowen added a comment -

        I was questioning the use of DelayQueue because there isn't any delay. The delayTime in a JobEndStatusInfo is always its creation time, so it is ready as soon as it is queued.

        I don't think that POST is necessarily any more heavyweight than GET. It does mean that you'd need to switch from doing text substitution for the parameters to just passing the parameters as HTTP parameters, but that seems simple enough. I don't know if there is a precedent for using GET in Hadoop to change server state, but if not it would seem to be a bad idea to set such a precedent in violation of the convention I cited earlier that is defined in the HTTP spec.

        My mistake on the thread termination - I misread the code. It looks fine.

        Alejandro Abdelnur added a comment -

        Answering David's comments:

        The rationale for using DelayQueue is that the element becomes available to the consumer not when it is added to the queue but when the delay time is over. For the consumer thread this means that it does not have to peek at the head element of the queue to see if it is time to process the element (the notification); if an element is available, it is time to process it.

        On using GET instead of POST, the idea was to make this notification as lightweight as possible for both the JobTracker and the receiver of the notification. It is just a ping; if the receiver is interested in more data about the ended job, it can use the JobClient to gather detailed info about the run.

        On the notification thread ending: if running is set to false (shutdown condition) then the 'Thread has ended unexpectedly' message is not printed. Otherwise it is.

        On making 'running' volatile: yes, I missed that one, will do. (Change made in new patch.)

        On using a singleton instead of static members: yes, I agree with you. I'm just following the style of JobTracker (the start and stop methods are static). Plus it was simpler, with fewer changes to the JobTracker (no instance variable, no getter method to expose it, etc.). I think we should consider (not now) adding hooks in key sections of the JobTracker (job start, job end, job kill, job task start, job task end, etc.) to enable adding this kind of logic without having to modify the JobTracker and other classes.

        Answering Tom's comments:

        On retries for job notification: at first I thought about making it without retries. When Ruchir came up with the implementation using DelayQueue I thought it was simple enough and would provide significant value (notification robustness) if desired (the default number of retries is 0).

        On using the HttpClient retry mechanism: like the retry policies, it is meant to be synchronous. In this case, while the notification retries would not tie up the JobTracker, as they run in a side thread, they would tie up other notifications; to avoid that we introduced the DelayQueue.

        On the local runner notification: we could use the same mechanism, but then we would have to put logic in place to wait for the notifications to be delivered before the local runner ends; I thought this would be simpler. The other possibility is not to have notifications when using the local runner (as it is local, the executor is in process). IMO, the benefit of having it in the local runner is that it enables (simplifies) integration testing.

        On the infinite loop in the local runner notification: yes, thanks for the catch; the loop condition should be 'while (notification.configureForRetiry())'. (Change made in latest patch.)

        Tom White added a comment -

        Alejandro - I don't think you're missing anything. The Hadoop retry framework is currently only appropriate for synchronous retries, whereas this patch uses asynchronous retries to avoid tying up the JobTracker.

        Stepping back a moment, though, I wonder why we need retries for job notification - could we start out simple and add them later? If not, could we look at the retry mechanisms that HttpClient provides (http://jakarta.apache.org/commons/httpclient/exception-handling.html). Or perhaps synchronous retries (using the Hadoop retry framework) from a LinkedBlockingQueue (as David suggests) would do.

        Why is there a special case for local runner notification? Could it not use the same mechanism as the more general case? (Also, I think the code in localRunnerNotification may go into an infinite loop if exceptions keep occurring, since I can't see where the number of retries is decremented.)

        David Bowen added a comment -

        I don't see the rationale for using a DelayQueue here, since no delay is required. A LinkedBlockingQueue would seem more appropriate.

        Is there some reason for using GET rather than POST?

        Looks like the notification thread can only end with the error message that it is ending unexpectedly. Right?

        I think the running flag should be declared volatile.

        Stylistically, I think a singleton class is generally preferable to a class with only static members. But others may disagree.
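The volatile point above can be illustrated with a minimal sketch (class and method names here are hypothetical, not the patch's actual code); without 'volatile', the consumer thread might never observe another thread's write to 'running' and would loop forever:

```java
// Hypothetical sketch of the notification thread's shutdown flag;
// names are illustrative, not the patch's actual code.
public class NotifierThread extends Thread {
  private volatile boolean running = true;

  public void run() {
    while (running) {
      // take the next notification from the queue and invoke its URL;
      // here we just idle briefly to stand in for that blocking call
      try { Thread.sleep(10); } catch (InterruptedException e) { /* re-check running */ }
    }
    // clean exit: no "thread has ended unexpectedly" message is warranted here
  }

  public void shutdown() {
    running = false;  // visible to run() because the flag is volatile
    interrupt();      // unblock the thread if it is waiting
  }

  public boolean isRunning() { return running; }
}
```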

        Devaraj Das added a comment -

        Sorry, I didn't look at the package thoroughly enough. I now see that it can be used in the non-RPC cases also.

        Devaraj Das added a comment -

        By the way, isn't the io.retry package meant for Hadoop's RPC framework? In the patch, a simple HTTP communication is initiated from the JT to the (URL of the) client that submitted the job earlier, and in the implementation of this communication there is no RPC involved. So I don't see how io.retry can be used easily here?

        Alejandro Abdelnur added a comment -

        Patch with indentation of 2 and max line len of 80.

        Tom,

        The problem with the Hadoop retry framework (if I understand correctly how it works) is that the policy goes to sleep synchronously within the thread from which it is invoked.

        It could be used for the LocalJobRunner but for the JobTracker it would block it while both sending the notification and waiting for the retries.

        With the producer-consumer queue being consumed by a side thread no blocking happens in the JobTracker job handling logic.

        What am I missing?

        A

        Tom White added a comment -

        I think the retry code could be simplified by using Hadoop's retry mechanism (http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/package-summary.html).

        Also, indentation should be 2 spaces not 4.

        Would you be able to resubmit the patch with these changes? Thanks.

        Alejandro Abdelnur added a comment -

        The patch implements the proposed functionality
        (thanks to Ruchir for implementing it).

        David Bowen added a comment -

        I don't think that you should use HTTP GET for this, because GETs should not change the state of the server. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.1.1. POST or possibly PUT would be appropriate instead.

        Rather than having a syntax for embedding jobid etc in the url, these could just be passed as HTTP parameters.
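The alternative David suggests could look roughly like this; the endpoint, parameter names, and class are assumptions for illustration, not part of the patch:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of a POST-based notification carrying the job id and status as
// form-encoded parameters instead of substituting them into the URL.
// Endpoint and parameter names are hypothetical.
public class PostNotifier {
  static String formBody(String jobId, String jobStatus) {
    // values are assumed URL-safe here; real code would URL-encode them
    return "jobId=" + jobId + "&jobStatus=" + jobStatus;
  }

  static void notify(String endpoint, String jobId, String jobStatus) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(formBody(jobId, jobStatus).getBytes(StandardCharsets.UTF_8));
    }
    conn.getResponseCode();  // fire the request; caller may check for 2xx
  }
}
```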


          People

          • Assignee: Unassigned
          • Reporter: Alejandro Abdelnur
          • Votes: 0
          • Watchers: 1