Details

    • Type: Umbrella Umbrella
    • Status: In Progress
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.1
    • Fix Version/s: None
    • Labels:
      None

      Description

      So I am curious what the plan is for the longterm future of MRUNIT?

      I think currently MRUNIT is useful for just unit testing a single mapper or reducer but currently there is a void for testing more complicated features such as MultipleInputs, MultipleOutputs, a driver class, counters, among other things. I wonder if instead of adding support to the current MRUNIT framework for these extra features it would more useful to add in hooks to the existing LocalJobRunner and MiniMRCluster classes to provide methods to more easily verify file output from text files, sequence files, etc. This would allow MRUNIT to test driver classes, MultipleInputs, MultipleOutputs, etc. MRUNIT would also then test against the real hadoop code instead of an implementation that mimics hadoop which can miss some bugs such as the ReduceDriver that did not reuse the same object until 0.8.0. MRUNIT would also keep up with new map reduce features instead of us having to implement fake versions of them

      I understand that performance would be an issue due to the file I/O but I wonder how fast the LocalJobRunner would be if we wrote a new class that extending FileSystem to allow users to write out fake files to memory and make the LocalJobRunner read from them

        Activity

        Hide
        Brock Noland added a comment -

        I think that is up to us, I agree there is a void in the integration testing space. It would be nice if we had a very easy interface to run drivers and validate the output.

        Show
        Brock Noland added a comment - I think that is up to us, I agree there is a void in the integration testing space. It would be nice if we had a very easy interface to run drivers and validate the output.
        Hide
        Brock Noland added a comment -

        I have been struggling with how to implement multiple input/outputs as I don't see a clear way to implement then in MRUnit today. That would be a good initial goal of an expansion.

        Show
        Brock Noland added a comment - I have been struggling with how to implement multiple input/outputs as I don't see a clear way to implement then in MRUnit today. That would be a good initial goal of an expansion.
        Hide
        Jim Donofrio added a comment -

        Another way to make using mrunit cleaner would be to really take advantage of assertions to cut out the boilerplate code.

        For example you could have

        @MapTest(mapper = IdentityMapper.class)
        public void mapTest()

        { driver.setInput(key, value); driver.addOutput(key, value); }

        I am not too familiar with making up Assertions but we would need to a way to then reference the driver or else all the methods would have to be static which would not be ideal.

        JUnit uses assertions all over their library, we should be able to use them too

        Show
        Jim Donofrio added a comment - Another way to make using mrunit cleaner would be to really take advantage of assertions to cut out the boilerplate code. For example you could have @MapTest(mapper = IdentityMapper.class) public void mapTest() { driver.setInput(key, value); driver.addOutput(key, value); } I am not too familiar with making up Assertions but we would need to a way to then reference the driver or else all the methods would have to be static which would not be ideal. JUnit uses assertions all over their library, we should be able to use them too
        Hide
        Jarek Jarcec Cecho added a comment -

        I quite like Jim's idea. I'm looking forward to simplify usage even more. I could imagine even more simplification to save any necessary typing. For example something like:

        class CoolMapperTest extends MapperTest<CoolMapper.class> {
        public testProperArguments()

        { withInput(new LongWritable(1), new Text("hi")); withOutput(new Text("hi"), new LongWritable(1)); withCounter("MyGroup", "MyCounter", 1); }

        }

        I'm expecting here that user will have one test class per one mapper (reducer, map reduce pair) which is something that I'm already doing in my tests to keep them clean. I'm not sure whether others are doing it as well. Another important note here is that I'm expecting that MapperTest will execute all methods starting with name "test" to set up testing environment (inputs, outputs, counters, ...). However test itself would be executed by MapperTest and user wouldn't have any control over it which might be limiting for some use cases. However those problems might be address by exposing API in MapperTest that can override default behavior.

        Show
        Jarek Jarcec Cecho added a comment - I quite like Jim's idea. I'm looking forward to simplify usage even more. I could imagine even more simplification to save any necessary typing. For example something like: class CoolMapperTest extends MapperTest<CoolMapper.class> { public testProperArguments() { withInput(new LongWritable(1), new Text("hi")); withOutput(new Text("hi"), new LongWritable(1)); withCounter("MyGroup", "MyCounter", 1); } } I'm expecting here that user will have one test class per one mapper (reducer, map reduce pair) which is something that I'm already doing in my tests to keep them clean. I'm not sure whether others are doing it as well. Another important note here is that I'm expecting that MapperTest will execute all methods starting with name "test" to set up testing environment (inputs, outputs, counters, ...). However test itself would be executed by MapperTest and user wouldn't have any control over it which might be limiting for some use cases. However those problems might be address by exposing API in MapperTest that can override default behavior.
        Hide
        Brock Noland added a comment -

        Hi,

        I for eliminating typing as well. I am a huge fan of Mockito. Their literate syntax is very easy to use.

        Show
        Brock Noland added a comment - Hi, I for eliminating typing as well. I am a huge fan of Mockito. Their literate syntax is very easy to use.
        Hide
        Jarek Jarcec Cecho added a comment -

        I'm not familiar with Mockito... yet

        Could you put together some code example how would you imagine using Mackito for writing tests? Just to give an idea what you're thinking about.

        Show
        Jarek Jarcec Cecho added a comment - I'm not familiar with Mockito... yet Could you put together some code example how would you imagine using Mackito for writing tests? Just to give an idea what you're thinking about.
        Hide
        Brock Noland added a comment -

        Here is Mockito: http://gojko.net/2009/10/23/mockito-in-six-easy-examples/

        It basically used static methods with thread locals to give you something very litterate. We, if we chose too, would be emulating it's syntax. For example:

        
        test(mapper)
          .withInput(new LongWritable(1), new Text("hi"))
          .withOutput(new Text("hi"), new LongWritable(1))
          .withCounter("MyGroup", "MyCounter", 1)
          .run();
        
        

        The method called "test" might be a bad choice.

        Show
        Brock Noland added a comment - Here is Mockito: http://gojko.net/2009/10/23/mockito-in-six-easy-examples/ It basically used static methods with thread locals to give you something very litterate. We, if we chose too, would be emulating it's syntax. For example: test(mapper) .withInput( new LongWritable(1), new Text( "hi" )) .withOutput( new Text( "hi" ), new LongWritable(1)) .withCounter( "MyGroup" , "MyCounter" , 1) .run(); The method called "test" might be a bad choice.
        Hide
        Jim Donofrio added a comment -

        I would think we should still be compatible with JUnit due to the good support from IDE's

        I agree that @MapTest(mapper = IdentityMapper.class) should not go on the method but the class. I am not sure we want to extend MapperTest<CoolMapper.class>. JUnit used to do this but went a way from that to do assertion based testing.

        I think the withInput(key, value).withOutput(key, value) is a good idea but it would be ideal to not have to call run() everytime. There also needs to be a withJobConf(conf) or a withJobConfParam("key", value)

        Show
        Jim Donofrio added a comment - I would think we should still be compatible with JUnit due to the good support from IDE's I agree that @MapTest(mapper = IdentityMapper.class) should not go on the method but the class. I am not sure we want to extend MapperTest<CoolMapper.class>. JUnit used to do this but went a way from that to do assertion based testing. I think the withInput(key, value).withOutput(key, value) is a good idea but it would be ideal to not have to call run() everytime. There also needs to be a withJobConf(conf) or a withJobConfParam("key", value)
        Hide
        Brock Noland added a comment -

        I agree JUnit and perhaps TestNG compatibility is a very good thing.

        The driver today can do that builder syntax so all my example above is eliminating is the creation of the driver. Not entirely sure it's worth the effort.

        I am not sure how we'd get rid of calling run? Maybe an annotation could do that, I am not annotation expert by any means.

        Show
        Brock Noland added a comment - I agree JUnit and perhaps TestNG compatibility is a very good thing. The driver today can do that builder syntax so all my example above is eliminating is the creation of the driver. Not entirely sure it's worth the effort. I am not sure how we'd get rid of calling run? Maybe an annotation could do that, I am not annotation expert by any means.
        Hide
        Jim Donofrio added a comment -

        With the builder syntax is it necessary to have withInput and addInput, shouldnt we just have addInput return "this" instead of having all these repetitive methods all over the source code?

        I dont know much about creating annotations either but we may be able to reduce some of the repetitive code with them.

        Show
        Jim Donofrio added a comment - With the builder syntax is it necessary to have withInput and addInput, shouldnt we just have addInput return "this" instead of having all these repetitive methods all over the source code? I dont know much about creating annotations either but we may be able to reduce some of the repetitive code with them.
        Hide
        Brock Noland added a comment -

        Yes I agree, they are redundant, unfortunately it's exposed to the public so I don't think we can remove the method signatures.

        I was thinking about what we should do longterm. MRUNIT-13 has been open for some time. I don't know if that requires a much full featured test framework ala Local Job Runner or if we can fulfill that requirement with MRUnit. That patch attached to that JIRA is an omnibus patch which changes a bunch of mrunit.

        Show
        Brock Noland added a comment - Yes I agree, they are redundant, unfortunately it's exposed to the public so I don't think we can remove the method signatures. I was thinking about what we should do longterm. MRUNIT-13 has been open for some time. I don't know if that requires a much full featured test framework ala Local Job Runner or if we can fulfill that requirement with MRUnit. That patch attached to that JIRA is an omnibus patch which changes a bunch of mrunit.
        Hide
        Jim Donofrio added a comment -

        Yes I agree that patch changes to much. I will have to think if there is a better way to integrate more with what we have now. Is there a reason MRUNIT-12 and MRUNIT-13 are sort of duplicates of each other. Shouldnt we just close MRUNIT-12 and give it a status of Later or something?

        Show
        Jim Donofrio added a comment - Yes I agree that patch changes to much. I will have to think if there is a better way to integrate more with what we have now. Is there a reason MRUNIT-12 and MRUNIT-13 are sort of duplicates of each other. Shouldnt we just close MRUNIT-12 and give it a status of Later or something?
        Hide
        Jim Donofrio added a comment -

        We use an annotation on the class to specify the mapper/reducer classes because we can use arrays which is required for PipelineMapReduceDriver which isnt possible with declaring the classes in the generic type. Users could also easily leave off mapper/reducer classes in the generic type.

        We then need a way to read these annotations. We either make all testclasses extend DriverTest because we can use this.getClass() to get a reference to the actual user class in order to use Reflection to read the annotations at runtime. Or we use all static methods with thread locals and use Thread.currentThread().getStackTrace() to search for the annotations. We may want to use thread locals even if we extend DriverTest because some people may want to use the concurrent JUnit runners which can run multiple methods in parallel.

        Lastly, we use a @Before method in the superclass Driver to cleanup the state and the @After method does the runTest call so an example would be:

        @MapTest(IdentityMapper.class)
        public class IdentityMapperTest {

        @Test
        public void testMap()

        { addConfParam(key, value); addInput(key, value); addInput(key, value); addCounter(key, value); addOutput(key, value); addMultipleOutput(prefix, key, value); }

        }

        This eliminates the driver, the need to specify the mapper/reducer class in every method, the run call. I dont think it could be much simpler than this. We could also allow multiple key, value pairs into a given mapper for MRUNIT-64 or multiple keys into a reducer with a method that took a varargs value.

        We could put all this in a new package structure and deprecate the old api which would fix MRUNIT-76

        Show
        Jim Donofrio added a comment - We use an annotation on the class to specify the mapper/reducer classes because we can use arrays which is required for PipelineMapReduceDriver which isnt possible with declaring the classes in the generic type. Users could also easily leave off mapper/reducer classes in the generic type. We then need a way to read these annotations. We either make all testclasses extend DriverTest because we can use this.getClass() to get a reference to the actual user class in order to use Reflection to read the annotations at runtime. Or we use all static methods with thread locals and use Thread.currentThread().getStackTrace() to search for the annotations. We may want to use thread locals even if we extend DriverTest because some people may want to use the concurrent JUnit runners which can run multiple methods in parallel. Lastly, we use a @Before method in the superclass Driver to cleanup the state and the @After method does the runTest call so an example would be: @MapTest(IdentityMapper.class) public class IdentityMapperTest { @Test public void testMap() { addConfParam(key, value); addInput(key, value); addInput(key, value); addCounter(key, value); addOutput(key, value); addMultipleOutput(prefix, key, value); } } This eliminates the driver, the need to specify the mapper/reducer class in every method, the run call. I dont think it could be much simpler than this. We could also allow multiple key, value pairs into a given mapper for MRUNIT-64 or multiple keys into a reducer with a method that took a varargs value. We could put all this in a new package structure and deprecate the old api which would fix MRUNIT-76
        Hide
        Brock Noland added a comment -

        That looks good. As I understand it we would create a new MRUnit API with new package name. The old MRUnit api would not be changed but deprecated. I would like to see the old api deprecated but supported for bug fixes for a long period, like a couple of years. We haven't seen many bugs so I think that would require little effort.

        Things I'd like to see:
        -MultipleOutputs via MRUNIT-13
        -Multiple input key/value pairs via MRUNIT-64
        -Cleaner API
        -Running a generic Map Reduce driver (via local job runner) and validating the output

        Show
        Brock Noland added a comment - That looks good. As I understand it we would create a new MRUnit API with new package name. The old MRUnit api would not be changed but deprecated. I would like to see the old api deprecated but supported for bug fixes for a long period, like a couple of years. We haven't seen many bugs so I think that would require little effort. Things I'd like to see: -MultipleOutputs via MRUNIT-13 -Multiple input key/value pairs via MRUNIT-64 -Cleaner API -Running a generic Map Reduce driver (via local job runner) and validating the output
        Hide
        Brock Noland added a comment -

        I also think we'd only have to support the o.a.h.mapreduce api since mapred will be going away at some point.

        Show
        Brock Noland added a comment - I also think we'd only have to support the o.a.h.mapreduce api since mapred will be going away at some point.
        Hide
        Jim Donofrio added a comment -

        Ok that sounds good. I will start this work in a branch to prevent effecting regular releases.

        I will convert this JIRA to a Umbrella and put the following as Sub tasks of this JIRA along with other relevant requirements as they come up:
        MRUNIT-13 - Add support for MultipleOutputs
        MRUNIT-64 - Multiple Input Key, Value Pairs should be supported
        Running a generic Map Reduce driver (via local job runner) and validating the output

        Show
        Jim Donofrio added a comment - Ok that sounds good. I will start this work in a branch to prevent effecting regular releases. I will convert this JIRA to a Umbrella and put the following as Sub tasks of this JIRA along with other relevant requirements as they come up: MRUNIT-13 - Add support for MultipleOutputs MRUNIT-64 - Multiple Input Key, Value Pairs should be supported Running a generic Map Reduce driver (via local job runner) and validating the output
        Hide
        Brock Noland added a comment -

        Cool!

        Show
        Brock Noland added a comment - Cool!
        Hide
        Jim Donofrio added a comment -

        What should the new package structure be?

        I noticed the groupId is now org.apache.mrunit, should we use that or is the plan for MRUNIT upon graduation to become a subproject of hadoop so we should keep org.apache.hadoop.mrunit?

        Show
        Jim Donofrio added a comment - What should the new package structure be? I noticed the groupId is now org.apache.mrunit, should we use that or is the plan for MRUNIT upon graduation to become a subproject of hadoop so we should keep org.apache.hadoop.mrunit?
        Hide
        Brock Noland added a comment -

        I think sub projects are out of style for whatever reason. The package structure question is a good one, let's think about that one for a bit? You should be able to start with a new separate package and then refactor it when we decide on a new one. Maybe start with: org.apache.mrunit.mapreduce?

        Show
        Brock Noland added a comment - I think sub projects are out of style for whatever reason. The package structure question is a good one, let's think about that one for a bit? You should be able to start with a new separate package and then refactor it when we decide on a new one. Maybe start with: org.apache.mrunit.mapreduce?
        Hide
        Jim Donofrio added a comment -

        The other question is how do we go about getting this in jenkins?

        Show
        Jim Donofrio added a comment - The other question is how do we go about getting this in jenkins?
        Hide
        Jim Donofrio added a comment -

        I think we should also support the current run() method if users want to test the output in some weird way. If they called run() they would set a flag that would prevent runTests() from running.

        Show
        Jim Donofrio added a comment - I think we should also support the current run() method if users want to test the output in some weird way. If they called run() they would set a flag that would prevent runTests() from running.
        Hide
        Jim Donofrio added a comment -

        I think initially I will just write the new api but have the new api use the existing framework in order to get the new api out sooner. Then later we can change the framework behind the scenes without any effect on the users

        Show
        Jim Donofrio added a comment - I think initially I will just write the new api but have the new api use the existing framework in order to get the new api out sooner. Then later we can change the framework behind the scenes without any effect on the users
        Hide
        Brock Noland added a comment -

        That's a pretty good idea!

        Show
        Brock Noland added a comment - That's a pretty good idea!
        Hide
        Bertrand Dechoux added a comment -

        I do have questions about the annotation based API (assuming there is no need to extend a provided parent test class).

        1) How do you intend to check for input/ouput types? I don't see how that can be done without exposing a context (like the typed driver or a parent class). It wouldn't be enforced anymore?

        2) How do you intend to check for methods which should not be called eg withKeyGroupingComparator during a reducer test? I don't see how that can be done without exposing a context. It wouldn't be enforced anymore?

        3) I like the idea of a in memory FileSystem implementation. I know Cascading 2 has now a local mode which sounds a bit similar. But it might too abstract or too strongly tied to Cascading concepts to be of any use to MRUnit. This feature would be also be a must for pig/hive when you want to run the same query but locally without the cluster latency. So it would be interesting to see if something already exist around Hadoop.

        Show
        Bertrand Dechoux added a comment - I do have questions about the annotation based API (assuming there is no need to extend a provided parent test class). 1) How do you intend to check for input/ouput types? I don't see how that can be done without exposing a context (like the typed driver or a parent class). It wouldn't be enforced anymore? 2) How do you intend to check for methods which should not be called eg withKeyGroupingComparator during a reducer test? I don't see how that can be done without exposing a context. It wouldn't be enforced anymore? 3) I like the idea of a in memory FileSystem implementation. I know Cascading 2 has now a local mode which sounds a bit similar. But it might too abstract or too strongly tied to Cascading concepts to be of any use to MRUnit. This feature would be also be a must for pig/hive when you want to run the same query but locally without the cluster latency. So it would be interesting to see if something already exist around Hadoop.

          People

          • Assignee:
            Jim Donofrio
            Reporter:
            Jim Donofrio
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development