Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: bsp core
    • Labels:
      None

      Description

      With the HAMA-410 patch, the BSPPeer object is constructed in the child process, so we can now remove the limitation on the number of tasks.

      Here's the TODO list:

      1. The number of tasks per groom should be configurable, e.g., via 'bsp.local.tasks.maximum'.
      2. The 'totalTaskCapacity' should be calculated in BSPMaster.getClusterStatus().
      3. When scheduling tasks, consider how to allocate them.
      4. Each BSPPeer should know all peers created in the Hama cluster for its job. They can be listed based on the actions of the GroomServer.
      5. In examples, 'cluster.getGroomServers()' can be changed to 'cluster.getMaxTasks()'.
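
As a rough illustration of items 1 and 2, the sketch below sums a per-groom task limit into a cluster-wide capacity. It is not code from the patch: the class `TaskCapacitySketch`, the plain `Map` standing in for each groom's configuration, and the default of 1 task per groom are all assumptions made for the example. Only the property name 'bsp.local.tasks.maximum' comes from the list above.

```java
import java.util.Map;

/** Hypothetical sketch: sums per-groom task slots into a cluster-wide capacity. */
final class TaskCapacitySketch {
  // Property name taken from the TODO list; the default value is an assumption.
  static final String MAX_TASKS_KEY = "bsp.local.tasks.maximum";
  static final int DEFAULT_MAX_TASKS = 1;

  /** Reads the per-groom limit from a plain key/value map (a stand-in for a real configuration object). */
  static int maxTasksPerGroom(Map<String, String> conf) {
    String value = conf.get(MAX_TASKS_KEY);
    return value == null ? DEFAULT_MAX_TASKS : Integer.parseInt(value.trim());
  }

  /** Total capacity is the sum of every groom's configured maximum. */
  static int totalTaskCapacity(Map<String, Map<String, String>> groomConfs) {
    int total = 0;
    for (Map<String, String> conf : groomConfs.values()) {
      total += maxTasksPerGroom(conf);
    }
    return total;
  }
}
```

With 16 grooms each configured for 3 tasks, `totalTaskCapacity` would return 48.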

      1. HAMA-413_v07.patch
        57 kB
        Edward J. Yoon
      2. HAMA-413_v06.patch
        36 kB
        Edward J. Yoon
      3. HAMA_NEW.patch
        37 kB
        Edward J. Yoon
      4. HAMA-413_v05.patch
        32 kB
        Edward J. Yoon
      5. HAMA_413_v04.patch
        30 kB
        Edward J. Yoon
      6. HAMA-413_v03.patch
        32 kB
        Edward J. Yoon
      7. HAMA-413_v02.patch
        24 kB
        Edward J. Yoon
      8. HAMA-413_v01.patch
        23 kB
        Edward J. Yoon

        Activity

        Edward J. Yoon added a comment -
        
        [INFO] ------------------------------------------------------------------------
        [INFO] Reactor Summary:
        [INFO] 
        [INFO] Apache Hama parent POM ............................ SUCCESS [0.768s]
        [INFO] Apache Hama Core .................................. SUCCESS [36.565s]
        [INFO] Apache Hama Examples .............................. SUCCESS [4.974s]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 42.444s
        [INFO] Finished at: Fri Aug 05 20:15:58 KST 2011
        [INFO] Final Memory: 28M/189M
        

        Unit tests passed, but I still have to test this on a real physical Hama cluster.

        Edward J. Yoon added a comment -

        Below are the results on 16 physical nodes.

        JobClient LOG:
        11/08/23 15:27:57 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9000/tmp/hadoop-root/bsp/system/submit_22he6c
        11/08/23 15:27:58 INFO bsp.BSPJobClient: Running job: job_201108231527_0001
        11/08/23 15:28:01 INFO bsp.BSPJobClient: Current supersteps number: 0
        11/08/23 15:28:22 INFO bsp.BSPJobClient: The total number of supersteps: 0
        java.io.FileNotFoundException: File does not exist: /tmp/pi-example/output
                at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
                at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
                at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
                at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
                at org.apache.hama.examples.PiEstimator.printOutput(PiEstimator.java:109)
                at org.apache.hama.examples.PiEstimator.main(PiEstimator.java:151)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
                at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
                at org.apache.hama.examples.ExampleDriver.main(ExampleDriver.java:37)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hama.util.RunJar.main(RunJar.java:145)
        
        ----
        LOG of node16 groomserver:
        
        2011-08-23 15:28:02,743 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 11/08/23 15:28:02 WARN bsp.GroomServer: Error running child
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 java.lang.NullPointerException
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0         at org.apache.hama.bsp.BSPPeer.send(BSPPeer.java:167)
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0         at org.apache.hama.examples.PiEstimator$MyEstimator.bsp(PiEstimator.java:64)
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:60)
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0         at org.apache.hama.bsp.GroomServer$Child.main(GroomServer.java:875)
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 11/08/23 15:28:02 INFO ipc.Server: Stopping server on 61001
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 11/08/23 15:28:02 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 11/08/23 15:28:02 INFO ipc.Server: Stopping IPC Server listener on 61001
        2011-08-23 15:28:02,744 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000025_0 11/08/23 15:28:02 INFO ipc.Server: Stopping IPC Server Responder
        2011-08-23 15:28:02,764 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: IPC Server Responder: starting
        2011-08-23 15:28:02,764 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: IPC Server listener on 61002: starting
        ...
        
        2011-08-23 15:28:02,951 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: Stopping server on 61002
        2011-08-23 15:28:02,951 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: IPC Server handler 0 on 61002: exiting
        2011-08-23 15:28:02,951 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: Stopping IPC Server listener on 61002
        2011-08-23 15:28:02,951 INFO org.apache.hama.bsp.TaskRunner: attempt_201108231527_0001_000039_0 11/08/23 15:28:02 INFO ipc.Server: Stopping IPC Server Responder
        2011-08-23 15:28:03,306 INFO org.apache.hama.bsp.GroomServer: Lost connection to BSP Master [hnode1/10.33.1.101:40000].  Retrying...
        java.util.ConcurrentModificationException
                at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:373)
                at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:392)
                at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:391)
                at org.apache.hama.bsp.GroomServer.offerService(GroomServer.java:394)
                at org.apache.hama.bsp.GroomServer.run(GroomServer.java:634)
                at java.lang.Thread.run(Thread.java:662)
        
        ChiaHung Lin added a comment -

        The 'File does not exist' error probably comes from the non-sequential read of HDFS, which will require coordination between the client and the BSPPeer in the future.

        Thomas Jungblut added a comment -

        Yes, that could be true.

        We should fix the NullPointerException in the send() method.

        Edward J. Yoon added a comment -

        Currently, GroomServer periodically checks running tasks in an infinite loop to report their status to BSPMaster.

        To avoid the concurrency issue, each child process now reports its status directly.
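
The ConcurrentModificationException in the groom server log above came from iterating a LinkedHashMap in GroomServer.offerService(). The change described here avoids it by moving reporting into the child processes. For comparison only, a common alternative is to iterate over a snapshot taken under a lock. The minimal sketch below shows that technique with a hypothetical `RunningTasksSketch` class; none of its names come from Hama.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry of running tasks; names are illustrative, not Hama's. */
final class RunningTasksSketch {
  private final Map<String, String> tasks = new LinkedHashMap<String, String>();

  synchronized void put(String attemptId, String state) {
    tasks.put(attemptId, state);
  }

  synchronized void remove(String attemptId) {
    tasks.remove(attemptId);
  }

  /** Copies the entries under the lock so callers can iterate without holding it. */
  synchronized List<Map.Entry<String, String>> snapshot() {
    List<Map.Entry<String, String>> copy = new ArrayList<Map.Entry<String, String>>();
    for (Map.Entry<String, String> e : tasks.entrySet()) {
      copy.add(new AbstractMap.SimpleImmutableEntry<String, String>(e));
    }
    return copy;
  }
}
```

A reporting loop can then walk `snapshot()` while other threads add or remove tasks, because the loop never touches the live map's iterator.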

        Edward J. Yoon added a comment -

        I think so, ChiaHung and Thomas.

        Edward J. Yoon added a comment -

        This patch is very dirty but runs well.

        Tested Pi and Serialized Printing on 2 physical nodes, with 2 and 6 tasks.

        11/08/23 21:47:48 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://slave.udanax.org:9000/tmp/hadoop-edward/bsp/system/submit_dxqugl
        11/08/23 21:47:49 INFO bsp.BSPJobClient: Running job: job_201108232147_0001
        11/08/23 21:47:52 INFO bsp.BSPJobClient: Current supersteps number: 0
        11/08/23 21:47:58 INFO bsp.BSPJobClient: Current supersteps number: 2
        11/08/23 21:47:58 INFO bsp.BSPJobClient: The total number of supersteps: 2
        Each task printed the "Hello World" as below:
        Tue Aug 23 21:47:53 KST 2011: Hello BSP from 1 of 2: slave.udanax.org:61000
        Tue Aug 23 21:49:27 KST 2011: Hello BSP from 2 of 2: tweetple.com:61000
        edward@slave:~/workspace/hama-trunk$ core/bin/hama jar ./examples/target/hama-examples-0.4.0-incubating-SNAPSHOT.jar pi
        11/08/23 21:48:07 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://slave.udanax.org:9000/tmp/hadoop-edward/bsp/system/submit_vmkvwb
        11/08/23 21:48:09 INFO bsp.BSPJobClient: Running job: job_201108232147_0002
        11/08/23 21:48:12 INFO bsp.BSPJobClient: Current supersteps number: 0
        11/08/23 21:48:18 INFO bsp.BSPJobClient: Current supersteps number: 1
        11/08/23 21:48:18 INFO bsp.BSPJobClient: The total number of supersteps: 1
        Estimated value of PI is 3.1370000000000005
        
        ----
        
        2011-08-23 21:48:12,515 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:12 INFO bsp.GroomServer: >>>>>> [slave.udanax.org:61000, slave.udanax.org:61001, tweetple.com:61000, tweetple.com:61001, tweetple.com:61002, tweetple.com:61003]
        2011-08-23 21:48:12,743 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:12 DEBUG bsp.BSPPeer: Local send bytes (3.1292)
        2011-08-23 21:48:12,816 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:12 DEBUG bsp.BSPPeer: [slave.udanax.org:61000] enter the enterbarrier: 0
        2011-08-23 21:48:12,824 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:12 DEBUG bsp.BSPPeer: Send bytes (3.1304) to slave.udanax.org:61000
        2011-08-23 21:48:12,824 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:12 DEBUG bsp.BSPPeer: [slave.udanax.org:61001] enter the enterbarrier: 0
        2011-08-23 21:48:16,525 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:16 INFO ipc.Server: Stopping server on 61001
        2011-08-23 21:48:16,525 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:16 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
        2011-08-23 21:48:16,525 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:16 INFO ipc.Server: Stopping IPC Server listener on 61001
        2011-08-23 21:48:16,526 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000003_0 11/08/23 21:48:16 INFO ipc.Server: Stopping IPC Server Responder
        2011-08-23 21:48:17,072 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:17 INFO ipc.Server: Stopping server on 61000
        2011-08-23 21:48:17,072 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:17 INFO ipc.Server: Stopping IPC Server listener on 61000
        2011-08-23 21:48:17,072 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:17 INFO ipc.Server: IPC Server handler 0 on 61000: exiting
        2011-08-23 21:48:17,072 INFO org.apache.hama.bsp.TaskRunner: attempt_201108232147_0002_000001_0 11/08/23 21:48:17 INFO ipc.Server: Stopping IPC Server Responder
        
        
        Edward J. Yoon added a comment -

        Since ClusterStatus.getActiveGroomNames() has changed, the corresponding part of the sssp and pagerank examples should also be fixed.

        Thomas Jungblut added a comment -

        The question is how to change this, since the partitioning depends on the groom names.
        So we have to make the partitioning part of the job submission lifecycle or we have to somehow know how many tasks are spawned and what names they have.

        Let's open a task to fix this. Do you have an idea?

        Edward J. Yoon added a comment -

        Now the ClusterStatus.getActiveGroomNames() method returns the names of the groom servers and their host names, e.g., ["groomd_test.com_50000", "test.com"], ... Previously it returned pairs of 'groom server name' and 'peer name', e.g., ["groomd_test.com_50000", "test.com:61000"], ....

        Everything else is the same as before, so I changed only the part below in PiEstimator.

        PiEstimator:
        
            // Choose one as a master
            for (String hostName : cluster.getActiveGroomNames().values()) {
              conf.set(MASTER_TASK, hostName + ":" + Constants.DEFAULT_PEER_PORT);
              break;
            }
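
Since the map now carries host names rather than peer names, a caller that needs every peer has to derive the names itself. The sketch below is one possible way to do that, assuming peers on a host listen on consecutive ports starting at 61000, as the earlier groom server log suggests. The class `PeerNamesSketch`, the `tasksPerGroom` map, and the constant `BASE_PEER_PORT` (assumed to match Constants.DEFAULT_PEER_PORT) are hypothetical and not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that expands groom host names into per-task peer names. */
final class PeerNamesSketch {
  // Assumption: equals Constants.DEFAULT_PEER_PORT; the logs show ports starting at 61000.
  static final int BASE_PEER_PORT = 61000;

  /**
   * @param groomToHost   groom server name mapped to host name
   * @param tasksPerGroom groom server name mapped to the number of tasks started on it
   * @return peer names of the form host:port, in the iteration order of groomToHost
   */
  static List<String> peerNames(Map<String, String> groomToHost,
                                Map<String, Integer> tasksPerGroom) {
    List<String> peers = new ArrayList<String>();
    for (Map.Entry<String, String> e : groomToHost.entrySet()) {
      Integer count = tasksPerGroom.get(e.getKey());
      int tasks = count == null ? 0 : count;
      for (int i = 0; i < tasks; i++) {
        // One peer per task, on consecutive ports.
        peers.add(e.getValue() + ":" + (BASE_PEER_PORT + i));
      }
    }
    return peers;
  }
}
```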
        
        Thomas Jungblut added a comment -

        Yes, we can keep that, but then the tasks would operate on the same partitioned file, since the files are tagged with the groom's name.
        Do you see the problem?

        Another idea would be to keep this partitioning and just let each task determine how many other tasks are running for this job on this groom, and then apply a modulo (%) by the number of tasks on this groom to each record.
        This is ultra-hacky, but I don't see a non-hacky solution without implementing a whole IO system now...
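
A minimal sketch of the modulo idea, under the assumption that each task knows its own index among the tasks on its groom and the total number of such tasks. The class `GroomLocalSplitSketch` and its parameters are hypothetical; how a task would learn these two numbers is not settled in this discussion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: tasks on one groom share a partition by record index modulo task count. */
final class GroomLocalSplitSketch {
  /**
   * @param partition      all records of the groom's partition, in file order
   * @param localTaskIndex index of this task among the tasks on the groom (0-based)
   * @param tasksOnGroom   number of tasks of this job running on the groom
   * @return the records this task should process
   */
  static <T> List<T> recordsForTask(List<T> partition, int localTaskIndex, int tasksOnGroom) {
    if (tasksOnGroom <= 0 || localTaskIndex < 0 || localTaskIndex >= tasksOnGroom) {
      throw new IllegalArgumentException("invalid task index or task count");
    }
    List<T> mine = new ArrayList<T>();
    for (int i = 0; i < partition.size(); i++) {
      // Record i belongs to the task whose index equals i modulo the task count.
      if (i % tasksOnGroom == localTaskIndex) {
        mine.add(partition.get(i));
      }
    }
    return mine;
  }
}
```

Every task still reads the whole partition file and skips the records that are not its own, which is part of why the approach is a stopgap.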

        Edward J. Yoon added a comment -

        Note to self: currently the GroomServerStatus.countTasks() method always returns zero.

        Edward J. Yoon added a comment -

        I ran Pi on 16 physical nodes, but there's a bug in the scheduler.

        2011-08-24 11:55:48,930 DEBUG org.apache.hama.bsp.GroomServer: Got Response from BSPMaster with 4 actions
        2011-08-24 11:55:48,931 INFO org.apache.hama.bsp.GroomServer: >>>> attempt_201108241154_0002_000001_0, 61000
        2011-08-24 11:55:48,931 INFO org.apache.hama.bsp.GroomServer: >>>> attempt_201108241154_0002_000003_0, 61001
        2011-08-24 11:55:48,931 INFO org.apache.hama.bsp.GroomServer: >>>> attempt_201108241154_0002_000005_0, 61002
        2011-08-24 11:55:48,931 INFO org.apache.hama.bsp.GroomServer: >>>> attempt_201108241154_0002_000007_0, 61003
        2011-08-24 11:55:48,932 INFO org.apache.hama.bsp.GroomServer: xxxx TaskInProgress: org.apache.hama.bsp.BSPJob@4d911540
        
        Edward J. Yoon added a comment -

        Tested on 16 physical nodes.

        2011-08-24 13:22:32,107 INFO org.apache.hama.bsp.BSPMaster: Starting RUNNING
        2011-08-24 13:22:44,252 DEBUG org.apache.hama.bsp.JobInProgress: numBSPTasks: 48
        2011-08-24 13:22:44,254 DEBUG org.apache.hama.bsp.JobInProgress: Job is initialized.
        2011-08-24 13:22:49,546 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000029_0' has finished successfully.
        2011-08-24 13:22:49,546 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000029' has completed.
        2011-08-24 13:22:49,546 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000045_0' has finished successfully.
        2011-08-24 13:22:49,546 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000045' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000037_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000037' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000010_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000010' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000009_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000009' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000018_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000018' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000015_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000015' has completed.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000024_0' has finished successfully.
        2011-08-24 13:22:49,547 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000024' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000030_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000030' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000013_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000013' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000027_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000027' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000000_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000000' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000022_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000022' has completed.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000017_0' has finished successfully.
        2011-08-24 13:22:49,548 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000017' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000006_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000006' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000016_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000016' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000039_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000039' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000023_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000023' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000038_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000038' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000046_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000046' has completed.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000021_0' has finished successfully.
        2011-08-24 13:22:49,549 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000021' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000028_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000028' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000044_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000044' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000014_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000014' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000032_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000032' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000011_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000011' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000042_0' has finished successfully.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000042' has completed.
        2011-08-24 13:22:49,550 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000033_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000033' has completed.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000005_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000005' has completed.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000043_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000043' has completed.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000002_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000002' has completed.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000025_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000025' has completed.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000019_0' has finished successfully.
        2011-08-24 13:22:49,551 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000019' has completed.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000040_0' has finished successfully.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000040' has completed.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000034_0' has finished successfully.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000034' has completed.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000031_0' has finished successfully.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000031' has completed.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000008_0' has finished successfully.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000008' has completed.
        2011-08-24 13:22:49,552 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000020_0' has finished successfully.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000020' has completed.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000003_0' has finished successfully.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000003' has completed.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000036_0' has finished successfully.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000036' has completed.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000035_0' has finished successfully.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000035' has completed.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000026_0' has finished successfully.
        2011-08-24 13:22:49,553 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000026' has completed.
        2011-08-24 13:22:49,554 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000007_0' has finished successfully.
        2011-08-24 13:22:49,554 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000007' has completed.
        2011-08-24 13:22:49,554 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000012_0' has finished successfully.
        2011-08-24 13:22:49,554 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000012' has completed.
        2011-08-24 13:22:49,557 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000047_0' has finished successfully.
        2011-08-24 13:22:49,557 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000047' has completed.
        2011-08-24 13:22:49,557 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000004_0' has finished successfully.
        2011-08-24 13:22:49,557 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000004' has completed.
        2011-08-24 13:22:49,558 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000041_0' has finished successfully.
        2011-08-24 13:22:49,558 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000041' has completed.
        2011-08-24 13:22:49,769 INFO org.apache.hama.bsp.JobInProgress: Taskid 'attempt_201108241322_0001_000001_0' has finished successfully.
        2011-08-24 13:22:49,769 INFO org.apache.hama.bsp.TaskInProgress: Task 'task_201108241322_0001_000001' has completed.
        2011-08-24 13:22:49,771 DEBUG org.apache.hama.bsp.JobInProgress: Job successfully done.
        
        ChiaHung Lin added a comment -

        For the `File does not exist' issue, could we, as a temporary solution, let the clients (PiEstimator, SerializePrinting) use ZooKeeper to check whether the data has been written to HDFS? Alternatively, before setting the taskStatus phase to cleanup, we could check whether the execution has actually finished.

        The problem I see with child processes reporting to the master directly is that the system may end up with three times as many RPC requests (e.g. 6,000 groom servers with 12,000 RPC executions). The master may simply be kept busy handling such trivial requests. This would hurt performance; if I remember correctly, one of the reasons for the birth of MapReduce 2.0 was too many RPCs, including those from tasktrackers and clients.

        Edward J. Yoon added a comment -

        The problem I see with child processes reporting to the master directly is that the system may end up with three times as many RPC requests (e.g. 6,000 groom servers with 12,000 RPC executions). The master may simply be kept busy handling such trivial requests. This would hurt performance; if I remember correctly, one of the reasons for the birth of MapReduce 2.0 was too many RPCs, including those from tasktrackers and clients.

        Agree with you.

        BTW, I can't remember why you replaced the heartbeat with doReport(). If we have to report statuses to BSPMaster periodically anyway, what's the difference?

        Edward J. Yoon added a comment -

        P.S. The `File does not exist' issue was not a file-read problem.

        ChiaHung Lin added a comment -

        Originally, groom servers sent heartbeats to BSPMaster both to request task assignments and to update task status. After HAMA-346, the master proactively schedules tasks, and doReport() is used by the groom server to report task status back to the master; it seems to have been moved to offerService() because of ticket HAMA-298.

        Can you explain the root cause of the `File does not exist' issue in a bit more detail? The last time I observed it, I solved it by setting up ZooKeeper on the client side (e.g. PiEstimator, SerializePrinting) so that the client checks whether BSPPeer has written the data to HDFS, and it worked as expected. If that is not the cause here, we will need to find the main problem behind it.

        Edward J. Yoon added a comment -

        Simply put, there were no files. Each peer reads/writes its own file, named with a sequence number.

              SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, conf,
                  new Path(TMP_OUTPUT + i), LongWritable.class, Text.class,
        
        Edward J. Yoon added a comment -
                // Reports to a BSPMaster
                for (Map.Entry<TaskAttemptID, TaskInProgress> e : runningTasks
                    .entrySet()) {
                  Thread.sleep(REPORT_INTERVAL);
                  TaskInProgress tip = e.getValue();
                  TaskStatus taskStatus = tip.getStatus();
        
                  if (taskStatus.getRunState() == TaskStatus.State.RUNNING) {
                    taskStatus.setProgress(taskStatus.getSuperstepCount());
        
                    if (!tip.runner.isAlive()) {
                      if (taskStatus.getRunState() != TaskStatus.State.FAILED) {
                        taskStatus.setRunState(TaskStatus.State.SUCCEEDED);
                      }
                      taskStatus.setPhase(TaskStatus.Phase.CLEANUP);
                    }
                  }
        
                  doReport(taskStatus);
                }
        
                Thread.sleep(REPORT_INTERVAL);
        

        I noticed that the number of reports in the original code is the same.

        Edward J. Yoon added a comment -

        TODO:

        Reduce the number of reports.
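
        One possible way to cut down on reports, sketched here purely as an illustration and not taken from any HAMA-413 patch, is to send a status only when it differs from the last one sent for that task attempt. ChangeOnlyReporter and shouldReport() are hypothetical names, and the state and superstep count are modelled as plain values rather than Hama's TaskStatus.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hama: remembers what was last sent per task
// attempt and suppresses reports that carry no new information.
class ChangeOnlyReporter {
  private final Map<String, String> lastSent = new HashMap<>();

  /** Returns true if this (state, superstep) pair differs from the last one seen. */
  boolean shouldReport(String attemptId, String runState, long superstepCount) {
    String snapshot = runState + "@" + superstepCount;
    String previous = lastSent.put(attemptId, snapshot);
    return !snapshot.equals(previous);
  }

  public static void main(String[] args) {
    ChangeOnlyReporter reporter = new ChangeOnlyReporter();
    String id = "attempt_201108241322_0001_000015_0";
    System.out.println(reporter.shouldReport(id, "RUNNING", 3));   // first sighting
    System.out.println(reporter.shouldReport(id, "RUNNING", 3));   // unchanged
    System.out.println(reporter.shouldReport(id, "RUNNING", 4));   // superstep advanced
    System.out.println(reporter.shouldReport(id, "SUCCEEDED", 4)); // state changed
  }
}
```

        With this approach the master can no longer tell a quiet task from a dead groom, so some low-rate keep-alive would probably still be needed alongside it.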

        Edward J. Yoon added a comment -

        [NOTE] just found another issue:

        
        2011-08-25 15:50:39,042 INFO org.apache.hama.bsp.TaskRunner: attempt_201108251548_0001_000005_0 11/08/25 15:50:39 INFO bsp.BSPPeer: >>> cnode2.ucloud:61000
        2011-08-25 15:50:39,042 INFO org.apache.hama.bsp.TaskRunner: attempt_201108251548_0001_000005_0 11/08/25 15:50:39 INFO bsp.BSPPeer: 6, 6
        2011-08-25 15:50:49,183 INFO org.apache.hadoop.ipc.Server: Error register incrementSuperstepCount
        java.lang.IllegalArgumentException: Duplicate metricsName:incrementSuperstepCount
        	at org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
        	at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
        	at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
        	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
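
        The trace above is consistent with a check-then-register race: two RPC handlers both find no metric registered for the method name, both try to register one, and the second registration is rejected. That reading is an assumption and has not been checked against the Hadoop source. The sketch below replays such an interleaving in a single thread with a hypothetical Registry class (not Hadoop's MetricsRegistry), and shows a tolerant lookup that falls back to the already registered metric.

```java
import java.util.HashMap;
import java.util.Map;

class DuplicateMetricDemo {
  // Hypothetical stand-in for a metrics registry that rejects duplicate names.
  static final class Registry {
    private final Map<String, Object> metrics = new HashMap<>();

    synchronized Object get(String name) {
      return metrics.get(name);
    }

    synchronized void add(String name, Object metric) {
      if (metrics.containsKey(name)) {
        throw new IllegalArgumentException("Duplicate metricsName:" + name);
      }
      metrics.put(name, metric);
    }
  }

  /** Replays the interleaving in one thread: both handlers look up, then both add. */
  static String replayRace(String name) {
    Registry registry = new Registry();
    boolean handler1Missing = registry.get(name) == null;
    boolean handler2Missing = registry.get(name) == null; // nothing registered yet
    String error = null;
    if (handler1Missing) {
      registry.add(name, new Object()); // handler 1 wins
    }
    if (handler2Missing) {
      try {
        registry.add(name, new Object()); // handler 2 registers too late
      } catch (IllegalArgumentException e) {
        error = e.getMessage();
      }
    }
    return error;
  }

  /** Tolerant variant: on a duplicate, reuse whatever the winner registered. */
  static Object getOrRegister(Registry registry, String name) {
    Object metric = registry.get(name);
    if (metric == null) {
      try {
        metric = new Object();
        registry.add(name, metric);
      } catch (IllegalArgumentException alreadyRegistered) {
        metric = registry.get(name);
      }
    }
    return metric;
  }

  public static void main(String[] args) {
    System.out.println(replayRace("incrementSuperstepCount"));
    Registry registry = new Registry();
    Object first = getOrRegister(registry, "incrementSuperstepCount");
    Object second = getOrRegister(registry, "incrementSuperstepCount");
    System.out.println(first == second);
  }
}
```

        If the race is indeed the cause, the message is noise rather than a failure, which would fit the fact that it is logged at INFO level.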
        
        ChiaHung Lin added a comment - - edited

        Below is what I observed.

        GroomServer periodically checks whether TaskRunner has stopped running (!tip.runner.isAlive()); if so, it sets the phase to cleanup and reports back to BSPMaster. However, TaskRunner's run() may finish immediately if it simply launches another thread that spawns the child process (i.e. BSPPeer); for example, TaskRunner.run() in the HAMA-398 patch:

        public void run() {
        ...
        BspChildRunner bspPeer = new BspChildRunner(bspArgs, workDir); // spawn bsp peer child process
        bspPeer.start();
        ... // start() returns immediately, so within offerService() taskStatus is set to cleanup because runner.isAlive() is false,
            // even though writing data to HDFS may not have finished yet.
        }
        

        In the HAMA-398 v1 patch, calling join(), which in turn makes use of Future.get(), should ideally have the same effect as the original procedure with waitFor().

        public void run() {
        ...
        BspChildRunner bspPeer = new BspChildRunner(bspArgs, workDir); // spawn bsp peer child process
        bspPeer.start();
        bspPeer.join(); // wait for the BSPPeer to finish its execution, including writing data to HDFS
        ...
        }
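
The liveness behavior described above can be reproduced with plain Java threads. This is only a minimal sketch, not the Hama code: TaskRunnerJoinSketch, Runner, and aliveWhileChildWorks are hypothetical names, and a CountDownLatch stands in for the child's unfinished work (such as writing data to HDFS). The real BspChildRunner may behave differently.

```java
import java.util.concurrent.CountDownLatch;

public class TaskRunnerJoinSketch {

  /** Hypothetical stand-in for TaskRunner; not the Hama class. */
  static class Runner extends Thread {
    private final boolean joinChild;
    private final CountDownLatch childMayFinish;

    Runner(boolean joinChild, CountDownLatch childMayFinish) {
      this.joinChild = joinChild;
      this.childMayFinish = childMayFinish;
    }

    @Override
    public void run() {
      // The child blocks until released, simulating work that is still in progress.
      Thread child = new Thread(() -> {
        try {
          childMayFinish.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      child.start();
      if (joinChild) {
        try {
          child.join(); // keep this thread alive until the child is done
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
      // Without join, run() returns here while the child is still working.
    }
  }

  /** Reports runner.isAlive() at a moment when the child has definitely not finished. */
  static boolean aliveWhileChildWorks(boolean joinChild) {
    CountDownLatch childMayFinish = new CountDownLatch(1);
    Runner runner = new Runner(joinChild, childMayFinish);
    runner.start();
    try {
      if (!joinChild) {
        runner.join(); // run() returns on its own, so this does not block for long
      }
      boolean alive = runner.isAlive();
      childMayFinish.countDown(); // let the child finish
      runner.join();
      return alive;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("without join: alive = " + aliveWhileChildWorks(false)); // false
    System.out.println("with join:    alive = " + aliveWhileChildWorks(true));  // true
  }
}
```

In the first case a periodic isAlive() check would see a dead runner while the child is still working, which matches the premature cleanup described above; in the second case the runner stays alive until the child completes.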
        
        Edward J. Yoon added a comment -

        Note,

        When the cluster is initially created, all ZK nodes should be cleared.

        Edward J. Yoon added a comment -

        <chl5011> Do we still plan to let BSPPeer report directly back to BSPMaster? I notice the latest patch (v5) still seems to have BSPPeer update the task status back to BSPMaster (umbilical.updateTaskStatusAndReport(taskid)).

        Nope, I'm still testing other problems.

        Edward J. Yoon added a comment -

        I found why RandBench doesn't work: the job hangs when some tasks have finished.

        Edward J. Yoon added a comment -

        This patch fixes EndOfStreamException and HAMA-407 issues.

        Edward J. Yoon added a comment -

        Attaching my new patch.

        Edward J. Yoon added a comment -

        Running patch through hudson.

        My tests are all done.

        Hudson added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12491998/HAMA-413_v07.patch
        against trunk revision 1162501.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 6 new or modified tests.

        core tests +1. The patch passed core unit tests.

        Changes : http://builds.apache.org/hudson/job/Hama-Patch/354/changes/
        Console output: http://builds.apache.org/hudson/job/Hama-Patch/354/console

        This message is automatically generated.

        Edward J. Yoon added a comment -

        I've committed this.

        If there are any problems or anything that can be improved, let's open a new ticket.

        Hudson added a comment -

        Integrated in Hama-Nightly #296 (See https://builds.apache.org/job/Hama-Nightly/296/)
        examples were missed in commit of HAMA-413
        Commit HAMA-413 patch.

        edwardyoon :
        Files :

        • /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/PiEstimator.java
        • /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/SerializePrinting.java

        edwardyoon :
        Files :

        • /incubator/hama/trunk/core/conf/hama-default.xml
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/Constants.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPMaster.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeer.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/ClusterStatus.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/DispatchTasksDirective.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServerManager.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServerStatus.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/JobInProgress.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/PeerNames.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/SimpleTaskScheduler.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/TaskRunner.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/ipc/BSPPeerProtocol.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/ipc/MasterProtocol.java
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/HamaClusterTestCase.java
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/bsp/TestBSPMasterGroomServer.java
        Hudson added a comment -

        Integrated in Hama-Patch #355 (See https://builds.apache.org/job/Hama-Patch/355/)
        examples were missed in commit of HAMA-413
        Commit HAMA-413 patch.

        edwardyoon : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1162690
        Files :

        • /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/PiEstimator.java
        • /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/SerializePrinting.java

        edwardyoon : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1162689
        Files :

        • /incubator/hama/trunk/core/conf/hama-default.xml
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/Constants.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPMaster.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeer.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/ClusterStatus.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/DispatchTasksDirective.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServerManager.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServerStatus.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/JobInProgress.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/PeerNames.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/SimpleTaskScheduler.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/TaskRunner.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/ipc/BSPPeerProtocol.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/ipc/MasterProtocol.java
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/HamaClusterTestCase.java
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/bsp/TestBSPMasterGroomServer.java

          People

          • Assignee:
            Edward J. Yoon
            Reporter:
            Edward J. Yoon
          • Votes:
            2
            Watchers:
            1


              Development