Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: bsp core
    • Labels: None

Description

This issue will handle the input and output system, including a data splitter.

Attachments

      1. HAMA-258_improved.patch
        60 kB
        Thomas Jungblut
      2. HAMA-partitioners_1.patch
        6 kB
        Thomas Jungblut
      3. HAMA-partitioners_2.patch
        52 kB
        Thomas Jungblut
      4. io_v01.patch
        44 kB
        Edward J. Yoon
      5. io_v02.patch
        91 kB
        Edward J. Yoon
      6. io_v03.patch
        97 kB
        Edward J. Yoon
      7. io_v04.patch
        97 kB
        Edward J. Yoon
      8. IONoInput.patch
        14 kB
        Thomas Jungblut

Issue Links

Activity

          udanax Edward J. Yoon added a comment -

          http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201005.mbox/browser

          As mentioned in the thread below, the graph package will be removed. BTW, I see the in/out formatter classes in the graph package; move them to the BSP package, or delete them if they don't fit.

          hyunsik.choi Hyunsik Choi added a comment -

          I suggest that the input system support not only sequential reads but also random reads.

          udanax Edward J. Yoon added a comment -

          I'm copying input/output-related material from the wiki, and changing the affects/fix version from 0.2 to 0.3 as we discussed on the mailing list.

              // Set input/output format classes
              bsp.setInputFormat(MyInputFormat.class); 
              bsp.setOutputFormat(MyOutputFormat.class); 
          
              // Like new MR API, set input/output targets with some concrete classes of both InputFormat
              // and OutputFormat. This method provides more flexible ways to specify targets.
              MyInputFormat.addInputPath(bsp, "/data/data1");
              MyOutputFormat.addOutputPath(bsp, "/result/data1");
              
              // or we could have another type of input/output targets as follows:
              // MyDBInputFormat.addInputTable(bsp, "hbase://localhost/WebData");    
              // MyDBOutputFormat.addOutputTable(bsp, "hbase://localhost/OutputTable");
          
          udanax Edward J. Yoon added a comment -

          Change subject.

          udanax Edward J. Yoon added a comment - edited

              public void bsp(BSPPeerProtocol bspPeer) throws Exception {

                  // Send initial messages to each peer. Rough pseudocode:
                  // File.open() and getSubset() are illustrative, not real APIs.
                  File input = File.open("graph.dat");
                  int i = 0;
                  for (String peerName : cluster.getActiveGroomNames().values()) {
                     // The i-th peer receives the i-th subset of the input.
                     Map<K, V> subSet = input.getSubset(i);
                     for (Map.Entry<K, V> entry : subSet.entrySet()) {
                       ...
                       bspPeer.send(peerName, new BSPMessage(Tag, Data));
                     }
                     i++;
                  }
                  bspPeer.sync();

                  // ... further supersteps ...
              }

          This is a very rough idea. I think we can broadcast initial messages to the peers as in the code above.

          Unlike the MapReduce system, we can assign data to each task flexibly.

          thomas.jungblut Thomas Jungblut added a comment - edited

          Looks good, but we should provide readers and writers as described in the Pregel paper.
          This logic just clutters the code of a BSP class.

          Output:
          We can provide an output collector on each peer.

          I'm currently writing a blog post on partitioning and splitting; I'll post it then. http://codingwiththomas.blogspot.com/2011/04/apache-hama-partitioning.html
          Basically it provides one split per groom using the modulo function vertexId % sizeOfCluster.
          Each split is then written into a SequenceFile that the groom can access (see the sketch below).

          This is a more scalable solution: what if the messages are too large for the groom's RAM? Most people have a small cluster but a lot of data; we should take this into account.
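          A minimal sketch of this partition-and-write idea, assuming hypothetical vertex types and an output directory (nothing here is Hama's actual API): each vertex goes to partition vertexId % sizeOfCluster, and every partition is written to its own SequenceFile that the corresponding groom can read later.

              import java.io.IOException;
              import java.util.Map;

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.io.IntWritable;
              import org.apache.hadoop.io.SequenceFile;
              import org.apache.hadoop.io.Text;

              public class ModuloPartitionWriter {

                /** Writes each vertex to the SequenceFile of the groom that owns it. */
                public static void writePartitions(Configuration conf, Path dir,
                    int sizeOfCluster, Map<Integer, String> vertices) throws IOException {
                  FileSystem fs = FileSystem.get(conf);
                  SequenceFile.Writer[] writers = new SequenceFile.Writer[sizeOfCluster];
                  for (int i = 0; i < sizeOfCluster; i++) {
                    writers[i] = SequenceFile.createWriter(fs, conf,
                        new Path(dir, "part-" + i), IntWritable.class, Text.class);
                  }
                  for (Map.Entry<Integer, String> e : vertices.entrySet()) {
                    // vertexId % sizeOfCluster; assumes non-negative vertex ids
                    int partition = e.getKey() % sizeOfCluster;
                    writers[partition].append(new IntWritable(e.getKey()),
                        new Text(e.getValue()));
                  }
                  for (SequenceFile.Writer writer : writers) {
                    writer.close();
                  }
                }
              }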

          udanax Edward J. Yoon added a comment -
          • Keep in mind the difference between the bsp() function and Pregel's compute() function.

          Agreed, too messy.
          So let's add an input system for auto-partitioning and loading local data.

          The idea mentioned above seems related only to a 'custom partitioning strategy for efficient distributed computing'.

          thomas.jungblut Thomas Jungblut added a comment -

          What is the actual difference between these two functions?
          IMHO compute() is called for each vertex, whereas we use bsp() per task.

          We should provide a Reader class that reads SequenceFiles/text files and HBase tables.

          How should we do the partitioning?

          • Block partitioning like Hadoop - this is not very flexible and depends on the locality of the data
          • Key partitioning (as in my blog: you can just send the message to the groom that contains the vertexID) // this would be better for HBase input, or for SequenceFiles.

          Or like the last two, just with messaging, but this would be slower than writing into an HDFS block.

          How about a simple output system?
          We can provide an output collector in the BSPPeer, and each peer gets its own output file in HDFS.
          If users want to write output, they simply can, without having to code a SequenceFile writer; a rough sketch follows below.
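          As a rough sketch of that collector, assuming a hypothetical getPeerIndex() method, an output directory, and fs/conf already in scope (none of this is the actual Hama API), each peer could open its own part file like this:

              // Each peer writes its own part file; no hand-rolled writer needed.
              Path outFile = new Path(outputDir, String.format("part-%05d", getPeerIndex()));
              SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, outFile,
                  Text.class, Text.class);
              writer.append(new Text("key"), new Text("value")); // collector-style call
              writer.close();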

          udanax Edward J. Yoon added a comment -

          Have you seen these old slides? http://www.slideshare.net/guest20d395b/apache-hama-an-introduction-tobulk-synchronization-parallel-on-hadoop

          See the BFSearch example:
          the process() function ≒ the current bsp() function
          the expand() function ≒ Google Pregel's compute() function

          thomas.jungblut Thomas Jungblut added a comment -

          Thanks, I read them several weeks ago, but never had the context of the Pregel paper.
          This makes the whole of Hama a bit clearer.

          chl501 ChiaHung Lin added a comment -

          While testing the replacement of the code that launches the BSPPeer child via TaskRunner with another thread, I observed a race condition in the SerializePrinting example: printOutput() tries to print data written in the writeLogToFile() stage and may execute before writeLogToFile(), so a FileNotFoundException is thrown. We will probably need to provide an API that encapsulates the coordination between the client and the BSP server program.

          udanax Edward J. Yoon added a comment -

          While testing the replacement of the code that launches the BSPPeer child via TaskRunner with another thread, I observed a race condition in the SerializePrinting example: printOutput() tries to print data written in the writeLogToFile() stage and may execute before writeLogToFile(), so a FileNotFoundException is thrown. We will probably need to provide an API that encapsulates the coordination between the client and the BSP server program.

          Yeah, +1

          udanax Edward J. Yoon added a comment -

          I'm starting to implement this and scheduling it for 0.4.

          thomas.jungblut Thomas Jungblut added a comment -

          +1.
          So in 0.4.0 we will be feature complete, and in 0.5.0 we can focus on more examples and performance improvements.
          Great.

          Please let me know if you need help.

          thomas.jungblut Thomas Jungblut added a comment - edited

          Let's summarize a bit:

          • Input and InputFormat are managed the same way as in Hadoop.
          • Input is sent as initial messages.
          • Coordination between client and BSP.
          • Input has a random read feature.

          Input and InputFormat are managed the same way as in Hadoop,
          & input is sent as initial messages:

          This sounds clear, but how do we split the data? And how is it going to be sent to the tasks as initial messages?

          Coordination between client and BSP:
          Not part of the I/O system, is it? It sounds like a more general synchronization between the BSPClient and our BSPMaster.

          Input has a random read feature:
          This is great, but a lot of work? / An additional feature.

          What are your ideas, Edward?

          chl501 ChiaHung Lin added a comment -

          Coordination between client and BSP

          That issue was originally related to spawning the BSP peer as a separate process, which executes its computation concurrently. Previously we did not wait until its execution finished, so the groom server would report that the BSP peer had finished its task as soon as the process was spawned. In my view this feature is no longer required, unless there are other demands for different purposes.

          chl501 ChiaHung Lin added a comment -

          While working on some tasks, it is time-consuming to repeatedly perform chores such as manually splitting data. So I would like to work on this issue first. Basically it is just to provide a mechanism to split data, and maybe a collector for output.

          udanax Edward J. Yoon added a comment -

          That's great news. I'll help you if I can.

          udanax Edward J. Yoon added a comment -

          This patch adds a basic TextInputFormatter. A RecordReader should still be added; a consumption sketch follows after the snippet below.

              BSPJob bsp = new BSPJob(conf, TestBSP.class);
              // Set the job name
              bsp.setJobName("Read/Write Test");
              bsp.setBspClass(TestBSP.class);
          
              bsp.setInputFormat(TextInputFormat.class);
              FileInputFormat.setInputPaths(bsp,
                  new Path("/user/edward/text/CHANGES.txt"));
          

          This task will be finished before the end of the week.
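          A rough sketch of how the missing RecordReader might be consumed inside bsp(), assuming a Hadoop-style next(key, value) contract; the reader construction is elided and the key/value types are assumptions:

              RecordReader<LongWritable, Text> reader = ...; // obtained from the InputFormat
              LongWritable key = new LongWritable();
              Text value = new Text();
              while (reader.next(key, value)) {
                // process one input record, e.g. turn it into initial messages
              }
              reader.close();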

          chl501 ChiaHung Lin added a comment -

          The patch looks good. In addition to splitting data, we may also need to consider adding features such as random write, because iterative applications seem to require it and HDFS is not designed for that.

          udanax Edward J. Yoon added a comment -

          we may also need to consider adding features such as random write, because iterative applications seem to require it and HDFS is not designed for that.

          You're right, but I'm not sure it's possible. Alternatively, an HBaseInput/OutputFormatter could be implemented, IMO.

          chl501 ChiaHung Lin added a comment -

          The plan is to provide an abstraction layer so that users can access different file systems if needed. For HDFS we probably only need to provide random read, which HDFS supports; for other systems, random write can be implemented in the future. A minimal interface sketch follows below.
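          A minimal sketch of what such an abstraction layer could look like, with purely illustrative names (this is not Hama's API); an HDFS-backed implementation would support read but reject write for now:

              import java.io.IOException;

              /** Illustrative access-mode abstraction over different file systems. */
              public interface RandomAccessStore<K, V> {
                /** Random read; HDFS can support this via seek. */
                V read(K key) throws IOException;

                /** Random write; unsupported on HDFS, possible for other backends later. */
                void write(K key, V value) throws IOException;

                void close() throws IOException;
              }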

          udanax Edward J. Yoon added a comment -

          +1

          udanax Edward J. Yoon added a comment -

          Tested on 2 nodes using 6 tasks. Works well.

          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx key: 13185
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:                that are aliased in HamaAdmin. (samuel via edwardyoon)
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx key: 13255
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:     HAMA-89: Exception Handling of a ParseException (samuel via edwardyoon)
          11/10/31 21:19:11 INFO sync.ZooKeeperSyncClientImpl: 1. At superstep: 0 which task is waiting? attempt_201110312118_0001_000004_0 stat is null? null
          11/10/31 21:19:12 INFO sync.ZooKeeperSyncClientImpl: leaveBarrier() at superstep: 0 taskid:attempt_201110312118_0001_000004_0 lowest: attempt_201110312118_0001_000000_0 highest:attempt_201110312118_0001_000005_0
          11/10/31 21:19:12 INFO sync.ZooKeeperSyncClientImpl: leaveBarrier() znode at superstep:0 taskid:attempt_201110312118_0001_000004_0 exists, so delete it.
          11/10/31 21:19:12 INFO zookeeper.ZooKeeper: Session: 0x13333b17a0d0091 closed
          11/10/31 21:19:12 INFO ipc.Server: Stopping server on 61001
          11/10/31 21:19:12 INFO ipc.Server: IPC Server handler 0 on 61001: exiting
          11/10/31 21:19:12 INFO ipc.Server: Stopping IPC Server listener on 61001
          11/10/31 21:19:12 INFO ipc.Server: Stopping IPC Server Responder
          edward@slave:~/workspace/hama-trunk$ tail -15 core/logs/tasklogs/1/attempt_201110312118_0001_000005_0.log 
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:     HAMA-322: Make sure failed assertions on test threads are reported
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx key: 16428
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:                (Filipe Manana via edwardyoon)  
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx key: 16476
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:     HAMA-316: Renaming and Refactoring methods in BSPPeerInterface (Filipe Manana)
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx key: 16559
          11/10/31 21:19:11 INFO examples.CombineExample$MyBSP: xxxxx value:     HAMA-319: groom servers Map in HeartbeatResponse not correctly serialized
          11/10/31 21:19:11 INFO sync.ZooKeeperSyncClientImpl: 1. At superstep: 0 which task is waiting? attempt_201110312118_0001_000005_0 stat is null? null
          
          udanax Edward J. Yoon added a comment -

          I will:

          1) move the 'CombineExample', 'RandBench', and 'SerializePrinting' examples to the test package.
          2) delete IOTestJob and add a Grep example using TextIn/OutputFormat.
          3) provide a RecordReader as input, so that computing frameworks such as Pregel can be implemented on top of Hama.

          udanax Edward J. Yoon added a comment -

          Added the output collector. Tests passed on my cluster.

          edward@slave:~/workspace/hama-trunk$ /usr/local/src/hadoop-0.20.2/bin/hadoop dfs -ls result1
          Found 6 items
          -rw-r--r--   3 edward supergroup         63 2011-11-01 20:04 /user/edward/result1/part-00000
          -rw-r--r--   3 edward supergroup         18 2011-11-01 20:04 /user/edward/result1/part-00001
          -rw-r--r--   3 edward supergroup          0 2011-11-01 20:04 /user/edward/result1/part-00002
          -rw-r--r--   3 edward supergroup         27 2011-11-01 20:04 /user/edward/result1/part-00003
          -rw-r--r--   3 edward supergroup         18 2011-11-01 20:04 /user/edward/result1/part-00004
          -rw-r--r--   3 edward supergroup         18 2011-11-01 20:04 /user/edward/result1/part-00005
          
          udanax Edward J. Yoon added a comment -

          Builds OK, and the data loading tests all pass.

          LocalJobRunner and the examples still need fixing, but I'm going to commit this patch.

          thomas.jungblut Thomas Jungblut added a comment -

          Please create a new task for the LocalJobRunner; I'll fix it then.

          udanax Edward J. Yoon added a comment -

          Yeah.

          thomas.jungblut Thomas Jungblut added a comment - edited

          This patch adds the missing Apache license headers to the files and moves the new I/O usage into BSPPeerImpl, which therefore changes the API of the BSP class.

          The test cases run without any errors.

          In my opinion we should move all the I/O-related classes into another package; the BSP package is so bloated.

          TODO: we should add custom partitioning.
          BTW, what happens when the number of splits is greater than the cluster capacity?
          A task could then use multiple splits.

          My proposal would be to use the number of tasks the user requests and then assign the splits to the tasks equally. Partitioning should do either block partitioning, or key partitioning over the key's hash code. Afterwards the created files are assigned as file splits to the tasks; a sketch of the assignment follows below.
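          A minimal sketch of that assignment, over a hypothetical list of input splits (illustrative only): the splits are dealt out round-robin across the user's configured number of tasks, so one task may own several splits.

              // Deal the splits out equally; task i receives splits i, i + numTasks, ...
              List<List<InputSplit>> assignment = new ArrayList<List<InputSplit>>();
              for (int i = 0; i < numTasks; i++) {
                assignment.add(new ArrayList<InputSplit>());
              }
              for (int i = 0; i < splits.size(); i++) {
                assignment.get(i % numTasks).add(splits.get(i));
              }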

          udanax Edward J. Yoon added a comment -

          Clearly, it's simpler and cuter. +1

          chl501 ChiaHung Lin added a comment -

          It would be good if our system could express the data layout, rather than how the data is accessed by tasks, so that the system is more flexible than a fixed set of predefined access modes.

          thomas.jungblut Thomas Jungblut added a comment -

          Let's commit this patch and see what we can do about the partitioning.

          It would be good if our system could express the data layout, rather than how the data is accessed by tasks, so that the system is more flexible than a fixed set of predefined access modes.

          Can you give an example, please?

          thomas.jungblut Thomas Jungblut added a comment -

          Committed in r1197057.

          thomas.jungblut Thomas Jungblut added a comment -

          Who is going to get the missing points done?

          • When the user submits no input (inputpath == null), the number of tasks should be used.
          • Customizable partitioning
          • Multiple splits per task
          thomas.jungblut Thomas Jungblut added a comment -

          This patch adds the "no user input = number of tasks" feature.

          [INFO] Reactor Summary:
          [INFO] 
          [INFO] Apache Hama parent POM ............................ SUCCESS [1.972s]
          [INFO] Apache Hama Core .................................. SUCCESS [1:19.977s]
          [INFO] Apache Hama Graph Package ......................... SUCCESS [0.728s]
          [INFO] Apache Hama Examples .............................. SUCCESS [10.904s]
          
          chl501 ChiaHung Lin added a comment -

          The current patch supports a contiguous access mode. In addition, our framework also needs the ability to access the underlying data randomly. This can be achieved by expressing the data layout.

          The idea is borrowed from MPI I/O. A file consists of file type(s), a template/pattern describing the data layout, plus a displacement, an absolute position at which the first file type begins. A file type consists of elementary types, the basic construction unit (e.g. a byte), and holes that define non-accessible areas. Suppose there are 3 processes (or tasks). The first process holds a file type containing 1 elementary type and 5 holes (total size 6 elementary types), with its elementary type sitting at the first position (array[0]). The second holds 2 elementary types and 4 holes, with the elementary types at the 2nd and 3rd positions (array[1] and array[2]). The third holds 3 elementary types and 3 holes, with elementary types at the 4th, 5th, and 6th positions (array[3], array[4], array[5]). Holes occupy the remaining places, marking that data as non-accessible.

          The whole file can thus be expressed as the composition of the three processes/tasks, and each process has a view onto the part it wants to access. A contiguous split can therefore be expressed by stating how many times an elementary type (e.g. a byte) repeats (n × byte). For non-contiguous (random) access, a process can specify the elementary type and the layout (file type) describing the data it wants to access; a sketch of this follows below.

          The benefit, according to the MPI I/O report, is that its design favours the common usage patterns (90%) corresponding to real-world requirements.
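          A small sketch of this view idea, with purely illustrative names: a task's view is its displacement plus a repeating mask over the elementary types, so the three-task example above becomes three masks of length 6.

              /** Illustrative MPI-I/O-style file view: displacement + repeating file type. */
              public class FileView {
                private final long displacement;  // where the first file type begins
                private final boolean[] fileType; // true = elementary type, false = hole

                public FileView(long displacement, boolean[] fileType) {
                  this.displacement = displacement;
                  this.fileType = fileType;
                }

                /** True if this task may access the elementary type at absolute position pos. */
                public boolean isAccessible(long pos) {
                  if (pos < displacement) {
                    return false;
                  }
                  return fileType[(int) ((pos - displacement) % fileType.length)];
                }
              }

              // The example above: task 1 sees array[0], task 2 array[1..2],
              // task 3 array[3..5] of every six-element cycle:
              //   new FileView(0, new boolean[] { true,  false, false, false, false, false });
              //   new FileView(0, new boolean[] { false, true,  true,  false, false, false });
              //   new FileView(0, new boolean[] { false, false, false, true,  true,  true  });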

          thomas.jungblut Thomas Jungblut added a comment -

          BTW: there is no patch, ChiaHung^^

          Did someone review my patch? I'm going to commit it now; Edward fixed half of it because the nightly build failed.

          thomas.jungblut Thomas Jungblut added a comment - edited

          The NoInput patch got committed; I hope the build is fixed again. Thanks, Edward.

          TODO

          • Customizable partitioning
          • Multiple splits per task
          • Random access
          chl501 ChiaHung Lin added a comment -

          The random access feature will probably take longer, so another issue, HAMA-468, has been created to deal with that task.

          thomas.jungblut Thomas Jungblut added a comment -

          Multiple splits per task should actually be easy; we can use the combine file split: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/CombineFileSplit.html

          It will take care of distributing the blocks equally. You can provide a split size per task via "mapred.max.split.size".
          We just have to calculate this split size, but that is trivial: sum(block_sizes) / number of tasks (a sketch follows below).
          Or the user could set this size directly.

          I'm not too sure currently.
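          A sketch of that calculation, assuming fs, inputDir, numTasks and conf are in scope ("mapred.max.split.size" is the real Hadoop property; the rest is illustrative):

              // splitSize = sum(block_sizes) / number of tasks
              long totalBytes = 0;
              for (FileStatus file : fs.listStatus(inputDir)) {
                totalBytes += file.getLen();
              }
              conf.setLong("mapred.max.split.size", totalBytes / numTasks);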

          thomas.jungblut Thomas Jungblut added a comment -

          This patch adds partitioning.

          Please have a look at it.

          There is a test case called "TestPartitioning"; I verified that it works with the local runner (I'm so glad it is running again) and I would be very happy if you could give me feedback on this patch!

          BTW: can we include a text file for this test case?

          And I added a lot of new input formats.

          udanax Edward J. Yoon added a comment -

          BTW: can we include a text file for this test case?

          Yes.
          Just an idea: can we use the CHANGES.txt file?

          thomas.jungblut Thomas Jungblut added a comment -

          Sure, that would be fine.

          I would like to add two things to it:

          • The path where the partitioned data goes should be configurable.
          • The SequenceFile that is generated should have configurable compression.
          thomas.jungblut Thomas Jungblut added a comment -

          Enabled the SequenceFile compression codec and type for partitioned files, and activated the test case. Runs well.

          udanax Edward J. Yoon added a comment -

          +1

          The patch is great, and my test also passed.

          thomas.jungblut Thomas Jungblut added a comment -

          Got committed, thanks for your review.

          thomas.jungblut Thomas Jungblut added a comment -

          Edward and I thought we could put all the I/O stuff into a new package, org.apache.hama.io.

          Is someone volunteering to make some *.sh scripts with the moves and a patch for the package declaration changes?
          And we should speak to the graph database folks; maybe they want to provide an InputFormat for their products. At the least, we should also provide a format for HBase.

          But that would then be a new feature; after the move we can close this epic issue.

          udanax Edward J. Yoon added a comment -

          Is someone volunteering to make some *.sh scripts with the moves and a patch for the package declaration changes?

          I will.

          And we should speak to the graph database folks; maybe they want to provide an InputFormat for their products.

          +1

          chl501 ChiaHung Lin added a comment -

          +1
          It would be great if we can move the related classes to the io package.

          udanax Edward J. Yoon added a comment -

          Fixed.

          Please open a new ticket if there are other suggestions.


People

    • Assignee: udanax Edward J. Yoon
    • Reporter: udanax Edward J. Yoon
    • Votes: 0
    • Watchers: 1
