HAMA-476: Splitter doesn't work correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: bsp core
    • Labels: None

      Description

      • To split a SequenceFile at a user-requested size, there's no way to avoid reading/writing records. I think we have to just use the blockSize (see the sketch below).
      • Unlike MapReduce, we are unable to queue tasks when a job exceeds the cluster capacity (I have no idea how to handle this at the moment).
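      Illustrative sketch only (hypothetical class and method names, not the attached patch) of computing splits purely from the HDFS block size, so that no records have to be read or rewritten:

      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.fs.FileStatus;

      public class BlockSizeSplitter {
        /** Returns (offset, length) pairs aligned to the file's block size. */
        public static List<long[]> computeSplits(FileStatus file) {
          List<long[]> splits = new ArrayList<long[]>();
          // Guard against a zero block size in the sketch.
          long blockSize = Math.max(file.getBlockSize(), 1L);
          long remaining = file.getLen();
          long offset = 0;
          while (remaining > 0) {
            long length = Math.min(blockSize, remaining);
            splits.add(new long[] { offset, length });
            offset += length;
            remaining -= length;
          }
          return splits;
        }
      }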
      Attachments

      1. patch.txt
        5 kB
        Edward J. Yoon
      2. patch_01.txt
        4 kB
        Edward J. Yoon

        Activity

        Hudson added a comment -

        Integrated in Hama-Nightly #416 (See https://builds.apache.org/job/Hama-Nightly/416/)
        HAMA-476 Splitter doesn't work correctly

        edwardyoon :
        Files :

        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPJobClient.java
        • /incubator/hama/trunk/examples/src/main/java/org/apache/hama/examples/ShortestPaths.java
        Edward J. Yoon added a comment -

        Seems to work well. I just closed this.

        Edward J. Yoon added a comment -

        Test passed on my cluster. I'm committing this now.

        Let's consider more efficient re-partitioning in the next step.

        Edward J. Yoon added a comment -

        If the user does not set a task size, the max size will be used.
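        Roughly (illustrative names only, not the exact patch):

        public final class TaskCountDefaults {
          /** A non-positive requestedTasks means the user did not set one. */
          public static int resolve(int requestedTasks, int maxClusterTasks) {
            // Fall back to the cluster's maximum task capacity when unset.
            return requestedTasks > 0 ? requestedTasks : maxClusterTasks;
          }
        }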

        Edward J. Yoon added a comment -

        Here's more optimized code.

        Thomas Jungblut added a comment -

        Cool feature, but I guess we have to repartition the dataset to maxTasks and add log warnings.

        Edward J. Yoon added a comment -

        This patch adds simple logic to derive a proper number of tasks within the max task capacity (a rough sketch of the idea follows the log output below).

        root@Cnode1:/usr/local/src/hama-trunk# core/bin/hama jar examples/target/hama-examples-0.4.0-incubating-SNAPSHOT.jar sssp 3 result /user/root/sssp/sssp-small.seq 4
        12/01/04 16:02:54 INFO bsp.FileInputFormat: Total input paths to process : 1
        12/01/04 16:02:54 INFO bsp.FileInputFormat: Total # of splits: 2
        12/01/04 16:03:03 INFO bsp.FileInputFormat: Total input paths to process : 4
        12/01/04 16:03:03 INFO bsp.FileInputFormat: Total # of splits: 4
        12/01/04 16:03:04 INFO bsp.BSPJobClient: Running job: job_201201041546_0005
        12/01/04 16:03:07 INFO bsp.BSPJobClient: Launched tasks: 3/4
        12/01/04 16:03:10 INFO bsp.BSPJobClient: Launched tasks: 4/4
        12/01/04 16:03:19 INFO bsp.BSPJobClient: Current supersteps number: 23
        12/01/04 16:03:22 INFO bsp.BSPJobClient: Current supersteps number: 44
        12/01/04 16:03:25 INFO bsp.BSPJobClient: Current supersteps number: 84
        12/01/04 16:03:28 INFO bsp.BSPJobClient: Current supersteps number: 104
        12/01/04 16:03:31 INFO bsp.BSPJobClient: Current supersteps number: 125
        12/01/04 16:03:34 INFO bsp.BSPJobClient: Current supersteps number: 147
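
        A minimal sketch of the clamping idea (hypothetical helper, not the actual BSPJobClient change):

        public final class SplitClamp {
          /** Warn and clamp when the requested split count exceeds capacity. */
          public static int clampToCapacity(int requestedSplits, int maxClusterTasks) {
            if (requestedSplits > maxClusterTasks) {
              System.err.println("WARN: " + requestedSplits
                  + " splits requested but the cluster can only run "
                  + maxClusterTasks + " tasks; clamping.");
              return maxClusterTasks;
            }
            return requestedSplits;
          }
        }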
        
        Thomas Jungblut added a comment -

        How do you deal with data-locality? How should this work?

        And remember, setup is for the user.

        Edward J. Yoon added a comment -

        Why not redistribute data among peers in the setup() step?

        Thomas Jungblut added a comment -

        Sure, but I don't see a solution to this without the append release of HDFS.
        Or you can schedule a MapReduce job to partition them.

        If a number of jobs are submitted concurrently,

        This logging is not needed in my opinion; let's move it to the BSPMaster server side, where it isn't buggy like this and is stored correctly.

        Edward J. Yoon added a comment -

        NOTE:

        Currently, the partition() method results in too many open files.
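        For illustration of the problem only (not the actual Hama partition() code): if partitioning keeps one SequenceFile.Writer per partition open until the whole pass ends, the number of simultaneously open files grows with the partition count.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class NaivePartitionWriters {
          public static SequenceFile.Writer[] open(Configuration conf, Path outDir,
              int numPartitions) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer[] writers = new SequenceFile.Writer[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
              // Each writer holds an open HDFS stream until it is closed,
              // so numPartitions file handles stay open at once.
              writers[i] = SequenceFile.createWriter(fs, conf,
                  new Path(outDir, "part-" + i), Text.class, Text.class);
            }
            return writers;
          }
        }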

        Edward J. Yoon added a comment -

        NOTE:

        If a number of jobs are submitted concurrently,

        11/12/13 11:49:09 INFO bsp.FileInputFormat: Total input paths to process : 1
        11/12/13 11:49:09 INFO bsp.FileInputFormat: Total # of splits: 42
        11/12/13 11:52:02 INFO bsp.FileInputFormat: Total input paths to process : 42
        11/12/13 11:52:02 INFO bsp.FileInputFormat: Total # of splits: 42
        11/12/13 11:52:04 INFO bsp.BSPJobClient: Running job: job_201112131021_0003
        11/12/13 11:52:07 INFO bsp.BSPJobClient: Launched tasks: 61/42
        11/12/13 11:52:10 INFO bsp.BSPJobClient: Launched tasks: 84/42
        11/12/13 12:03:55 INFO bsp.BSPJobClient: Launched tasks: 67/42
        11/12/13 12:03:58 INFO bsp.BSPJobClient: Launched tasks: 42/42
        
        ChiaHung Lin added a comment -

        Looks like I misunderstood the original question. The concern is that users may request arbitrary forms of split blocks (not just contiguous ones). So basically we can provide a layer which allows users to compose the blocks they want (including contiguous ones), and on top of it wrapper classes, e.g. for reading/writing records, can serve contiguous and other read/write record requests from users.

        Thomas Jungblut added a comment -

        To split a SequenceFile at a user-requested size, there's no way to avoid reading/writing records. I think we have to just use the blockSize.

        Correct, we have to split via the blocks.

        Unlike MapReduce, we are unable to queue tasks when a job exceeds the cluster capacity (I have no idea how to handle this at the moment).

        There is no idea to have; we simply have to restrict tasks to the cluster capacity. In YARN this issue is even worse, because you don't know the capacity.

        From what I discovered so far, the first one can ideally be achieved by applying a tiling strategy. Then we can provide wrapper classes for users to access data according to the range requested.

        How is this tiling gonna work without rewriting sequence files?

        ChiaHung Lin added a comment - edited

        From what I discovered so far, the first one can ideally be achieved by applying a tiling strategy. Then we can provide wrapper classes for users to access data according to the range requested.


          People

          • Assignee: Edward J. Yoon
          • Reporter: Edward J. Yoon
          • Votes: 0
          • Watchers: 0
