Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.5.0
    • Component/s: bsp core
    • Labels:

      Description

      As described in http://wiki.apache.org/hama/GroomServerFaultTolerance
      BSPTask should periodically ping its parent 'GroomServer' for their health status.

      1. If Tasks are unable to ping their parent 'GroomServer', it should be killed themselves.
      2. And, if GroomServer does not receive ping from the childs, GroomServer should check whether that child is running.

      You don't need to implement recovery logic in this issue.

      1. HAMA-498.patch
        42 kB
        Suraj Menon
      2. HAMA-498-v2.patch
        23 kB
        Suraj Menon
      3. netstatcheck.txt
        128 kB
        Suraj Menon
      4. HAMA-498-v3.patch
        42 kB
        Suraj Menon
      5. HAMA-498-445.patch
        53 kB
        Suraj Menon
      6. HAMA-498-testfix.patch
        7 kB
        Suraj Menon

        Activity

        Hide
        Suraj Menon added a comment -

        Have 2 questions -
        1. What reasonable value for period are we looking at for ping here? I am currently setting it at 1 sec. Is it too high or low?
        2. BSPPeerChild waits for the completion of the task. Would we be getting rid of this once we have this feature? If not, how is pinging helping the cause? Say the main logic of BSPTask(or BSPTaskRunner) hangs, but the pinging thread in BSPTask thread is active. The current code excerpt looks like this -

        private static class PingGroomServer implements Runnable{

        private BSPPeerProtocol pingRPC;
        private TaskAttemptID taskId;

        public PingGroomServer(BSPPeerProtocol umbilical, TaskAttemptID id)

        { pingRPC = umbilical; taskId = id; }

        @Override
        public void run() {

        try

        { LOG.debug("Pinging at time " + Calendar.getInstance().toString()); pingRPC.ping(taskId); }

        catch (IOException e)

        { LOG.error( new StringBuilder("IOException pinging GroomServer from task - ") .append(taskId), e); //System.exit(1); }

        catch (Exception e)

        { LOG.error( new StringBuilder("Exception pinging GroomServer from task - ") .append(taskId), e); //System.exit(1); }

        }
        }
        ...// body of BSPTask ..

        this.pingService = Executors.newScheduledThreadPool(1);

        private void startPingingGroom(BSPJob job, BSPPeerProtocol umbilical){
        LOG.debug("Scheduling ping service");
        long pingPeriod = job.getConf().getLong(Constants.GROOM_PING_PERIOD,
        Constants.DEFAULT_GROOM_PING_PERIOD)/2;
        LOG.debug("Scheduling with fixed delay for bsp task" + taskId);
        try{
        if(pingPeriod > 0)

        { pingService.scheduleWithFixedDelay( new PingGroomServer(umbilical, taskId), 0, pingPeriod,TimeUnit.MILLISECONDS); }

        }
        catch(Exception e)

        { LOG.error("Error scheduling ping service", e); }

        LOG.debug("Scheduled ping service");
        }

        private void stopPingingGroom(){
        if(pingService != null)

        { pingService.shutdownNow(); }

        }

        Show
        Suraj Menon added a comment - Have 2 questions - 1. What reasonable value for period are we looking at for ping here? I am currently setting it at 1 sec. Is it too high or low? 2. BSPPeerChild waits for the completion of the task. Would we be getting rid of this once we have this feature? If not, how is pinging helping the cause? Say the main logic of BSPTask(or BSPTaskRunner) hangs, but the pinging thread in BSPTask thread is active. The current code excerpt looks like this - private static class PingGroomServer implements Runnable{ private BSPPeerProtocol pingRPC; private TaskAttemptID taskId; public PingGroomServer(BSPPeerProtocol umbilical, TaskAttemptID id) { pingRPC = umbilical; taskId = id; } @Override public void run() { try { LOG.debug("Pinging at time " + Calendar.getInstance().toString()); pingRPC.ping(taskId); } catch (IOException e) { LOG.error( new StringBuilder("IOException pinging GroomServer from task - ") .append(taskId), e); //System.exit(1); } catch (Exception e) { LOG.error( new StringBuilder("Exception pinging GroomServer from task - ") .append(taskId), e); //System.exit(1); } } } ...// body of BSPTask .. this.pingService = Executors.newScheduledThreadPool(1); private void startPingingGroom(BSPJob job, BSPPeerProtocol umbilical){ LOG.debug("Scheduling ping service"); long pingPeriod = job.getConf().getLong(Constants.GROOM_PING_PERIOD, Constants.DEFAULT_GROOM_PING_PERIOD)/2; LOG.debug("Scheduling with fixed delay for bsp task" + taskId); try{ if(pingPeriod > 0) { pingService.scheduleWithFixedDelay( new PingGroomServer(umbilical, taskId), 0, pingPeriod,TimeUnit.MILLISECONDS); } } catch(Exception e) { LOG.error("Error scheduling ping service", e); } LOG.debug("Scheduled ping service"); } private void stopPingingGroom(){ if(pingService != null) { pingService.shutdownNow(); } }
        Hide
        Suraj Menon added a comment -

        That does not look neat . The code excerpt could be found here - http://pastebin.com/TxGtqJqw

        Show
        Suraj Menon added a comment - That does not look neat . The code excerpt could be found here - http://pastebin.com/TxGtqJqw
        Hide
        Thomas Jungblut added a comment -

        You can use the

        { noformat }

        tags

         
        Hello World!
        
        Show
        Thomas Jungblut added a comment - You can use the { noformat } tags Hello World!
        Hide
        Suraj Menon added a comment -

        Hi, as I am testing my changes, I found that sometimes on failure of BSPTask, the BSPMaster steps in and kills the failed task.
        I see that when doing so, it just sets the state of the task in variable "runningTasks" (that contains a list of TaskInProgress objects) as KILLED. It does not purge the task from the list. Isn't it supposed to purge the task from the runningTasks? Currently, I am purging tasks when I find out of contact BPSTasks.
        Please Refer: Added to GroomServer.java

          private class BSPTasksMonitor extends Thread{
        
            private List<TaskInProgress> outOfContactTasks =
                new ArrayList<GroomServer.TaskInProgress>(
                    conf.getInt(Constants.MAX_TASKS_PER_GROOM, 3));
        
            private BSPTasksMonitor(){
        
              outOfContactTasks =
                  new ArrayList<GroomServer.TaskInProgress>(
                      conf.getInt(Constants.MAX_TASKS_PER_GROOM, 3));
            }
        
        
            public void run(){
        
              getObliviousTasks(outOfContactTasks);
        
              if(outOfContactTasks.size() > 0){
                LOG.debug("Got " + outOfContactTasks.size() + " oblivious tasks");
              }
        
              Iterator<TaskInProgress> taskIter = outOfContactTasks.iterator();
        
              while(taskIter.hasNext()){
                TaskInProgress tip = taskIter.next();
                try{
                  LOG.debug("Purging task " + tip);
                  purgeTask(tip, true);
                }
                catch(Exception e){
                  LOG.error(new StringBuilder(
                      "Error while removing a timed-out task - ")
                      .append(tip.toString()) , e);
        
                }
              }
              outOfContactTasks.clear();
        
            }
          }
        
        
          private synchronized void getObliviousTasks(
              List<TaskInProgress> outOfContactTasks){
        
            if(runningTasks == null){
              LOG.debug("returning null");
              return;
            }
        
            long currentTime = Calendar.getInstance().getTimeInMillis();
            long monitorPeriod = conf.getLong(Constants.GROOM_PING_PERIOD,
                Constants.DEFAULT_GROOM_PING_PERIOD);
        
            for (Map.Entry<TaskAttemptID, TaskInProgress> entry : runningTasks
                .entrySet()) {
              TaskInProgress tip = entry.getValue();
              
              // Task is out of contact if it has not pinged since more than 
              // monitorPeriod. A task is given a leeway of 10 times monitorPeriod
              // to get started.
              if ( tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING) &&
                  ( ((tip.lastPingedTimestamp == 0 &&
                    ((currentTime - tip.startTime) > 10*monitorPeriod)) ||
                    ((tip.lastPingedTimestamp > 0) &&
                      (currentTime - tip.lastPingedTimestamp) > monitorPeriod)))){
        
                LOG.info("adding purge task: " + tip.getTask().getTaskID());
        
                outOfContactTasks.add(tip);
                
              }
        
            }
        
          }
        
              
        
        Show
        Suraj Menon added a comment - Hi, as I am testing my changes, I found that sometimes on failure of BSPTask, the BSPMaster steps in and kills the failed task. I see that when doing so, it just sets the state of the task in variable "runningTasks" (that contains a list of TaskInProgress objects) as KILLED. It does not purge the task from the list. Isn't it supposed to purge the task from the runningTasks? Currently, I am purging tasks when I find out of contact BPSTasks. Please Refer: Added to GroomServer.java private class BSPTasksMonitor extends Thread{ private List<TaskInProgress> outOfContactTasks = new ArrayList<GroomServer.TaskInProgress>( conf.getInt(Constants.MAX_TASKS_PER_GROOM, 3)); private BSPTasksMonitor(){ outOfContactTasks = new ArrayList<GroomServer.TaskInProgress>( conf.getInt(Constants.MAX_TASKS_PER_GROOM, 3)); } public void run(){ getObliviousTasks(outOfContactTasks); if(outOfContactTasks.size() > 0){ LOG.debug("Got " + outOfContactTasks.size() + " oblivious tasks"); } Iterator<TaskInProgress> taskIter = outOfContactTasks.iterator(); while(taskIter.hasNext()){ TaskInProgress tip = taskIter.next(); try{ LOG.debug("Purging task " + tip); purgeTask(tip, true); } catch(Exception e){ LOG.error(new StringBuilder( "Error while removing a timed-out task - ") .append(tip.toString()) , e); } } outOfContactTasks.clear(); } } private synchronized void getObliviousTasks( List<TaskInProgress> outOfContactTasks){ if(runningTasks == null){ LOG.debug("returning null"); return; } long currentTime = Calendar.getInstance().getTimeInMillis(); long monitorPeriod = conf.getLong(Constants.GROOM_PING_PERIOD, Constants.DEFAULT_GROOM_PING_PERIOD); for (Map.Entry<TaskAttemptID, TaskInProgress> entry : runningTasks .entrySet()) { TaskInProgress tip = entry.getValue(); // Task is out of contact if it has not pinged since more than // monitorPeriod. A task is given a leeway of 10 times monitorPeriod // to get started. if ( tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING) && ( ((tip.lastPingedTimestamp == 0 && ((currentTime - tip.startTime) > 10*monitorPeriod)) || ((tip.lastPingedTimestamp > 0) && (currentTime - tip.lastPingedTimestamp) > monitorPeriod)))){ LOG.info("adding purge task: " + tip.getTask().getTaskID()); outOfContactTasks.add(tip); } } }
        Hide
        Edward J. Yoon added a comment -

        +1

        Show
        Edward J. Yoon added a comment - +1
        Hide
        Suraj Menon added a comment - - edited

        While testing with fault injection in different points I found few issues. I am changing the implementation of BSPTask to conform to the documentation of BSP. Failure in bsp function skips cleanup function call.
        Today's code

        private final <KEYIN, VALUEIN, KEYOUT, VALUEOUT, M extends Writable> void runBSP(
              final BSPJob job, BSPPeerImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bspPeer,
              final BytesWritable rawSplit, final BSPPeerProtocol umbilical)
                  throws IOException, SyncException, ClassNotFoundException,
                  InterruptedException {
        
            BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bsp = (BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M>) ReflectionUtils
                .newInstance(job.getConf().getClass("bsp.work.class", BSP.class),
                    job.getConf());
            bsp.setup(bspPeer);
            bsp.bsp(bspPeer);
            bsp.cleanup(bspPeer);
            bspPeer.close();
        }
            
        

        Changed.

        private final <KEYIN, VALUEIN, KEYOUT, VALUEOUT, M extends Writable> void runBSP(
              final BSPJob job, BSPPeerImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bspPeer,
              final BytesWritable rawSplit, final BSPPeerProtocol umbilical)
                  throws IOException, SyncException, ClassNotFoundException,
                  InterruptedException {
        
            BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bsp = (BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M>) ReflectionUtils
                .newInstance(job.getConf().getClass("bsp.work.class", BSP.class),
                    job.getConf());
            bsp.setup(bspPeer);
            try{
              bsp.bsp(bspPeer);
            }
            finally{
              try{
                bsp.cleanup(bspPeer);
              finally{
                // Trusting close to not throw exception should we?
                // Will need to check for exception and rethrow it masking 
                // exception from bspPeer.close.
                bspPeer.close();
              }
            }
        }
        

        We would have to ensure what exception(s) we expose.
        Let me know if you have any comments on it. I shall make necessary changes before I upload the patch.

        Show
        Suraj Menon added a comment - - edited While testing with fault injection in different points I found few issues. I am changing the implementation of BSPTask to conform to the documentation of BSP. Failure in bsp function skips cleanup function call. Today's code private final <KEYIN, VALUEIN, KEYOUT, VALUEOUT, M extends Writable> void runBSP( final BSPJob job, BSPPeerImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bspPeer, final BytesWritable rawSplit, final BSPPeerProtocol umbilical) throws IOException, SyncException, ClassNotFoundException, InterruptedException { BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bsp = (BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M>) ReflectionUtils .newInstance(job.getConf().getClass("bsp.work.class", BSP.class), job.getConf()); bsp.setup(bspPeer); bsp.bsp(bspPeer); bsp.cleanup(bspPeer); bspPeer.close(); } Changed. private final <KEYIN, VALUEIN, KEYOUT, VALUEOUT, M extends Writable> void runBSP( final BSPJob job, BSPPeerImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bspPeer, final BytesWritable rawSplit, final BSPPeerProtocol umbilical) throws IOException, SyncException, ClassNotFoundException, InterruptedException { BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M> bsp = (BSP<KEYIN, VALUEIN, KEYOUT, VALUEOUT, M>) ReflectionUtils .newInstance(job.getConf().getClass("bsp.work.class", BSP.class), job.getConf()); bsp.setup(bspPeer); try{ bsp.bsp(bspPeer); } finally{ try{ bsp.cleanup(bspPeer); finally{ // Trusting close to not throw exception should we? // Will need to check for exception and rethrow it masking // exception from bspPeer.close. bspPeer.close(); } } } We would have to ensure what exception(s) we expose. Let me know if you have any comments on it. I shall make necessary changes before I upload the patch.
        Hide
        Thomas Jungblut added a comment -

        Oh crap, I believe it was like this times ago... Thanks for making it right.
        Localrunner does this wrong as well.

        But the YARN runner makes it correctly!

          public void startComputation() throws Exception {
            BSP bspInstance = ReflectionUtils.newInstance(bspClass, conf);
            try {
              bspInstance.setup(peer);
              bspInstance.bsp(peer);
            } catch (Exception e) {
              throw e;
            } finally {
              bspInstance.cleanup(peer);
              peer.close();
            }
          }
        
        

        And that is the way it actually should be (in my opinion).

        Show
        Thomas Jungblut added a comment - Oh crap, I believe it was like this times ago... Thanks for making it right. Localrunner does this wrong as well. But the YARN runner makes it correctly! public void startComputation() throws Exception { BSP bspInstance = ReflectionUtils.newInstance(bspClass, conf); try { bspInstance.setup(peer); bspInstance.bsp(peer); } catch (Exception e) { throw e; } finally { bspInstance.cleanup(peer); peer.close(); } } And that is the way it actually should be (in my opinion).
        Hide
        Suraj Menon added a comment -

        I see that cleanup should be called even if setup fails. In the above code, we have a problem if cleanup throws an exception, we might loose the first exception raised.

        Show
        Suraj Menon added a comment - I see that cleanup should be called even if setup fails. In the above code, we have a problem if cleanup throws an exception, we might loose the first exception raised.
        Hide
        Thomas Jungblut added a comment -

        Yes. Back a few months ago the cleanup hasn't thrown any exception.

        So your solution is much better. Do you mind patching the local runner and the yarn task runner? Would be great if they are consistent.

        Show
        Thomas Jungblut added a comment - Yes. Back a few months ago the cleanup hasn't thrown any exception. So your solution is much better. Do you mind patching the local runner and the yarn task runner? Would be great if they are consistent.
        Hide
        Suraj Menon added a comment -

        Patch for the issue.

        Show
        Suraj Menon added a comment - Patch for the issue.
        Hide
        Suraj Menon added a comment -

        The patch uploaded has few things to be considered:

        • The test cases are passing most of the times. It still depends on scheduling of certains threads.
        • I still see killed tasks is runningTasks at Groomserver.
        • LocalBSPRunner and the YARNRunner has been also changed with the new way of handling errors in BSP jobs.
        Show
        Suraj Menon added a comment - The patch uploaded has few things to be considered: The test cases are passing most of the times. It still depends on scheduling of certains threads. I still see killed tasks is runningTasks at Groomserver. LocalBSPRunner and the YARNRunner has been also changed with the new way of handling errors in BSP jobs.
        Hide
        Edward J. Yoon added a comment -

        By the way, rat-check fails.

        You should add license headers to UtilsForTesting and TestBSPTask classes.

        [INFO] ------------------------------------------------------------------------
        [INFO] Reactor Summary:
        [INFO] 
        [INFO] Apache Hama parent POM ............................ SUCCESS [2.585s]
        [INFO] core .............................................. FAILURE [18.325s]
        [INFO] graph ............................................. SKIPPED
        [INFO] examples .......................................... SKIPPED
        [INFO] yarn .............................................. SKIPPED
        [INFO] dist .............................................. SKIPPED
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD FAILURE
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 21.379s
        [INFO] Finished at: Wed Feb 29 16:30:51 KST 2012
        
        [INFO] Final Memory: 16M/47M
        [INFO] ------------------------------------------------------------------------
        [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.8:check (default) on project hama-core: Too many unapproved licenses: 2 -> [Help 1]
        
        Show
        Edward J. Yoon added a comment - By the way, rat-check fails. You should add license headers to UtilsForTesting and TestBSPTask classes. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Apache Hama parent POM ............................ SUCCESS [2.585s] [INFO] core .............................................. FAILURE [18.325s] [INFO] graph ............................................. SKIPPED [INFO] examples .......................................... SKIPPED [INFO] yarn .............................................. SKIPPED [INFO] dist .............................................. SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 21.379s [INFO] Finished at: Wed Feb 29 16:30:51 KST 2012 [INFO] Final Memory: 16M/47M [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.8:check ( default ) on project hama-core: Too many unapproved licenses: 2 -> [Help 1]
        Hide
        Thomas Jungblut added a comment -

        Can you please tell us which testcases? I think we can solve this with some intelligent locking.

        Show
        Thomas Jungblut added a comment - Can you please tell us which testcases? I think we can solve this with some intelligent locking.
        Hide
        Thomas Jungblut added a comment -

        Oh I see: TestBSPTask

        Ok I have a few comments on it:

        • we can move this testcase to the bsp package, we don't need the ft subpackage
        • "UtilsForTesting.raiseUncaughtException()" is a good idea, but I guess "throw new RuntimeException()" would be enough
        • Instead of using a dummy LocalTestSyncClient you could use LocalBSPRunner.LocalSyncClient.class. (even if you don't need locking)
        • It would be great if you can organize the imports, I see a lot of warnings in Eclipse (use CTRL-SHIFT-o in eclipse)
        • Formatting would be nice as well (CTRL-SHIFT-f)
        • Like Edward told there are some missing license headers

        Otherwise I think this is a good patch Nice work!

        For the scheduling stuff:

        try{
              Thread.sleep(15000);
            }
            catch(Exception e){
              LOG.error("Interrupted the timer for 10 sec.", e);
            }
        

        This causes race conditions, you should better use a completion service that waits until the thread really finishes. For a single thread you can get a Future from the threadpool.

        You can express this with

            ScheduledFuture<?> schedule = this.testBSPTaskService.schedule(
                new TestBSPTaskThreadRunner(job), 0, TimeUnit.MILLISECONDS);
            try {
              schedule.get();
            } catch (InterruptedException e1) {
              e1.printStackTrace();
            } catch (ExecutionException e1) {
              e1.printStackTrace();
            }
        

        You actually used it on the third testcase, so why do you rely on sleep in the first one?

        And we have some TCP cleanup issues, once the testcase failed it causes TIME_WAIT problems

          thomasjungblut@ubuntu:~/workspace/hama-trunk$ sudo netstat -p | grep 40000
        tcp        0      0 localhost.localdo:40000 localhost.localdo:36948 TIME_WAIT   -               
        tcp        0      0 localhost.localdo:40000 localhost.localdo:36947 TIME_WAIT   - 
        

        Seems to me like a cleanup problem..

        Show
        Thomas Jungblut added a comment - Oh I see: TestBSPTask Ok I have a few comments on it: we can move this testcase to the bsp package, we don't need the ft subpackage "UtilsForTesting.raiseUncaughtException()" is a good idea, but I guess "throw new RuntimeException()" would be enough Instead of using a dummy LocalTestSyncClient you could use LocalBSPRunner.LocalSyncClient.class. (even if you don't need locking) It would be great if you can organize the imports, I see a lot of warnings in Eclipse (use CTRL-SHIFT-o in eclipse) Formatting would be nice as well (CTRL-SHIFT-f) Like Edward told there are some missing license headers Otherwise I think this is a good patch Nice work! For the scheduling stuff: try{ Thread.sleep(15000); } catch(Exception e){ LOG.error("Interrupted the timer for 10 sec.", e); } This causes race conditions, you should better use a completion service that waits until the thread really finishes. For a single thread you can get a Future from the threadpool. You can express this with ScheduledFuture<?> schedule = this.testBSPTaskService.schedule( new TestBSPTaskThreadRunner(job), 0, TimeUnit.MILLISECONDS); try { schedule.get(); } catch (InterruptedException e1) { e1.printStackTrace(); } catch (ExecutionException e1) { e1.printStackTrace(); } You actually used it on the third testcase, so why do you rely on sleep in the first one? And we have some TCP cleanup issues, once the testcase failed it causes TIME_WAIT problems thomasjungblut@ubuntu:~/workspace/hama-trunk$ sudo netstat -p | grep 40000 tcp 0 0 localhost.localdo:40000 localhost.localdo:36948 TIME_WAIT - tcp 0 0 localhost.localdo:40000 localhost.localdo:36947 TIME_WAIT - Seems to me like a cleanup problem..
        Hide
        Suraj Menon added a comment -

        Oops! got caught for disabling rat check in my pom.xml. Same goes with my indifference to warnings. Sorry

        There is a reason why I had the Future.get in the last test case and not in the first three. I felt that the whole point of implementation was that there should be a minimum number of pings coming if the task has run for a particular period of time. The problem where the test case could fail is when the task takes too long to start. When such a case happens repeatedly, then I think the test cases should fail and we should reconsider the leeway given to each task to start. For the last test case, I had to ensure that I got the first ping, then kill the RPC connection and then wait for the process to die for its exit status. I had a deterministic sequence of events to wait for.

        I had discussed the enigma of port 40000 with you. I thought I got over it. I am not running BSPMaster for any of these test cases. I shall check and find a fix. This should be happening only for the last test case where server closes the connection before proxy. I shall find a fix.

        I ran into some issues when I used the LocalBSPRunner.LocalSyncClient class. Shall look into it.

        Show
        Suraj Menon added a comment - Oops! got caught for disabling rat check in my pom.xml. Same goes with my indifference to warnings. Sorry There is a reason why I had the Future.get in the last test case and not in the first three. I felt that the whole point of implementation was that there should be a minimum number of pings coming if the task has run for a particular period of time. The problem where the test case could fail is when the task takes too long to start. When such a case happens repeatedly, then I think the test cases should fail and we should reconsider the leeway given to each task to start. For the last test case, I had to ensure that I got the first ping, then kill the RPC connection and then wait for the process to die for its exit status. I had a deterministic sequence of events to wait for. I had discussed the enigma of port 40000 with you. I thought I got over it. I am not running BSPMaster for any of these test cases. I shall check and find a fix. This should be happening only for the last test case where server closes the connection before proxy. I shall find a fix. I ran into some issues when I used the LocalBSPRunner.LocalSyncClient class. Shall look into it.
        Hide
        Thomas Jungblut added a comment -

        Yeah I thought so that the sleep had a meaning^^

        What about using the return value of the callable to determine how many seconds have passed in the method via System.currentTimeMillis() and compare it with the number of pings we received in the period. I guess this seems saver.

        I ran into some issues when I used the LocalBSPRunner.LocalSyncClient class. Shall look into it.

        Let me know what is wrong, but since you're not syncing there shouldn't be a problem.

        I had discussed the enigma of port 40000 with you.

        Yeah, I take a deeper look into it later, but I guess because of the runtime exception the server not closes the socket properly so it blocks during the next execution. Some time after the socket receives a timeout so it frees itself. But that is not a really clean solution.

        Show
        Thomas Jungblut added a comment - Yeah I thought so that the sleep had a meaning^^ What about using the return value of the callable to determine how many seconds have passed in the method via System.currentTimeMillis() and compare it with the number of pings we received in the period. I guess this seems saver. I ran into some issues when I used the LocalBSPRunner.LocalSyncClient class. Shall look into it. Let me know what is wrong, but since you're not syncing there shouldn't be a problem. I had discussed the enigma of port 40000 with you. Yeah, I take a deeper look into it later, but I guess because of the runtime exception the server not closes the socket properly so it blocks during the next execution. Some time after the socket receives a timeout so it frees itself. But that is not a really clean solution.
        Hide
        Thomas Jungblut added a comment -

        I have time again tomorrow (all my exams are over for the next 3 months!) so I can help you make it run
        Have you made any progress between the days?

        Show
        Thomas Jungblut added a comment - I have time again tomorrow (all my exams are over for the next 3 months!) so I can help you make it run Have you made any progress between the days?
        Hide
        Suraj Menon added a comment -

        First of all sorry I couldn't look into these matters yet; got a little busy with work. Good news is that Friday is here. I would be uploading the changes for 498 and 445 soon.I think most of the suggested changes were comparatively minor and test related. It is better that I don't slow you down on your work.

        I forgot to mention in the previous posts that I had fixed the problem of open port 40000 by not sublclassing from HamaTestCase and subclassing from TestCase instead. I want to make sure if that what you saw in netstat is a side-effect of HAMA-498 test cases. I am not starting a BSPMaster instance(usually bound on the port) for the tests. This is the only part of changes I foresee would take some time. Let me use the time difference between Berlin and Washington DC before I get back to you tomorrow.

        Show
        Suraj Menon added a comment - First of all sorry I couldn't look into these matters yet; got a little busy with work. Good news is that Friday is here. I would be uploading the changes for 498 and 445 soon.I think most of the suggested changes were comparatively minor and test related. It is better that I don't slow you down on your work. I forgot to mention in the previous posts that I had fixed the problem of open port 40000 by not sublclassing from HamaTestCase and subclassing from TestCase instead. I want to make sure if that what you saw in netstat is a side-effect of HAMA-498 test cases. I am not starting a BSPMaster instance(usually bound on the port) for the tests. This is the only part of changes I foresee would take some time. Let me use the time difference between Berlin and Washington DC before I get back to you tomorrow.
        Hide
        Suraj Menon added a comment -

        Sorry couldn't get started on the patch sooner. I have also attached the netstat ouput while running the related test cases on Eclipse. Let me know if anything is not as supposed to be.

        Show
        Suraj Menon added a comment - Sorry couldn't get started on the patch sooner. I have also attached the netstat ouput while running the related test cases on Eclipse. Let me know if anything is not as supposed to be.
        Hide
        Suraj Menon added a comment -

        Please disregard HAMA-498-v2.patch. It does not contain diff on new Java class.

        Show
        Suraj Menon added a comment - Please disregard HAMA-498 -v2.patch. It does not contain diff on new Java class.
        Hide
        Suraj Menon added a comment -

        Re-attaching the patch. This contains fixes for both HAMA-498 and HAMA-445.

        Show
        Suraj Menon added a comment - Re-attaching the patch. This contains fixes for both HAMA-498 and HAMA-445 .
        Hide
        Edward J. Yoon added a comment -

        +1 patch looks good and builds are OK.

        I've committed this and additionally changed a typo "hama.messanger.class" to "hama.messenger.class" in hama-default.xml.

        We also need to add new properties to http://wiki.apache.org/hama/GettingStarted/Properties

        Show
        Edward J. Yoon added a comment - +1 patch looks good and builds are OK. I've committed this and additionally changed a typo "hama.messanger.class" to "hama.messenger.class" in hama-default.xml. We also need to add new properties to http://wiki.apache.org/hama/GettingStarted/Properties
        Hide
        Edward J. Yoon added a comment -

        Resolved. Thanks Suraj!

        Show
        Edward J. Yoon added a comment - Resolved. Thanks Suraj!
        Hide
        Suraj Menon added a comment -

        Didn't know every test in a suite was running parallely. This patch should fix the problem. However I have to verify if internally BSPNetUtils.getFreePort is thread safe. It was giving same port number to more than one tests.

        Show
        Suraj Menon added a comment - Didn't know every test in a suite was running parallely. This patch should fix the problem. However I have to verify if internally BSPNetUtils.getFreePort is thread safe. It was giving same port number to more than one tests.
        Hide
        Thomas Jungblut added a comment -

        We could have turned off mavens parallel execution though, getFreePort is not threadsafe, but you can seed it with different start ports so that it becomes "thread safe".
        I take a look later at it when I'm at home.

        Show
        Thomas Jungblut added a comment - We could have turned off mavens parallel execution though, getFreePort is not threadsafe, but you can seed it with different start ports so that it becomes "thread safe". I take a look later at it when I'm at home.
        Hide
        Thomas Jungblut added a comment -

        Done. Thanks for the contribution

        Show
        Thomas Jungblut added a comment - Done. Thanks for the contribution

          People

          • Assignee:
            Suraj Menon
            Reporter:
            Edward J. Yoon
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development