MRUnit
  1. MRUnit
  2. MRUNIT-91

runTest() should optionally ignore output order

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.9.0
    • Labels:

      Description

      Currently MapDriver.runTest() assumes that the order of pairs emitted by the mapper matches the order of the MapDriver.addOutput() calls. However, there are valid mappers that for a given input pair produce output pairs whose order is unspecified for testing purposes. (For example, if the mapper being tested uses a set object for deduplication before emission.) runTest() cannot be used to test these kinds of mappers.

      A workaround is to not use runTest() but instead put the output of run() into a Set and assert that the contents of the set are correct, bypassing MRUnit's validation code.

      A possible improvement would be to add a boolean orderMatters parameter to MapDriver.runTest(), invoking an order-insensitive version of TestDriver.validate() when orderMatters is false and the existing version otherwise.

      For clarity's sake only mappers are discussed in this feature request, but the same applies to reducers as well.

      1. MRUNIT-91.patch
        33 kB
        Dave Beech
      2. MRUNIT-91-1.patch
        33 kB
        Dave Beech
      3. MRUNIT-91-2.patch
        54 kB
        Dave Beech
      4. MRUNIT-91-3.patch
        56 kB
        Dave Beech

        Activity

        Hide
        Brock Noland added a comment -

        Dave, thank you very much for your awesome contribution!

        Show
        Brock Noland added a comment - Dave, thank you very much for your awesome contribution!
        Hide
        Jim Donofrio added a comment -

        Committed in 1308657. Sorry that took too long, we look forward to more contributions in the future from you

        Show
        Jim Donofrio added a comment - Committed in 1308657. Sorry that took too long, we look forward to more contributions in the future from you
        Hide
        Jim Donofrio added a comment -

        Thanks it looks good. I will commit this tonight. It is just taking awhile to manually merge it due to all the changes I made between the revision you started on the the current revision to make null inputs fail fast.

        Show
        Jim Donofrio added a comment - Thanks it looks good. I will commit this tonight. It is just taking awhile to manually merge it due to all the changes I made between the revision you started on the the current revision to make null inputs fail fast.
        Hide
        Dave Beech added a comment -

        Thanks very much for the feedback - I've taken your suggestions and amended the patch. New version here.

        Show
        Dave Beech added a comment - Thanks very much for the feedback - I've taken your suggestions and amended the patch. New version here.
        Hide
        Jim Donofrio added a comment -

        Math.abs is unnecessary on line 272, actualPositionList.size() must be > expectedPositionList.size()

        Why in 2 places do you say Received unexpected output without the position but in one place you give the position, I would say you should give the position, the more info the better. If you add this you should check if orderMatters at the end of the last for loop where you only output unexpected output with the remainer of the actualPositions. I think you could then abstract the logic in that last for loop where you check for unexpected output and the logic in the second to last for loop where you check for unexpected expected output into one method that you pass the relevant lists in along with the string Missing expected output vs Received unexpected output. This will also help to reduce the large amount of code in this single method.

        I think for (int i = 0; ; i++) { would be cleaner as a while loop:
        while (expectedPositionList.size() > i || actualPositionList.size() > i) {
        if (expectedPositionList.size() > i && actualPositionList.size() > i) {

        } else if (expectedPositionList.size() > i) {

        } else {

        }
        }

        Otherwise maybe adding some more comments would be useful given how complicated this method is but overall a great contribution.

        Show
        Jim Donofrio added a comment - Math.abs is unnecessary on line 272, actualPositionList.size() must be > expectedPositionList.size() Why in 2 places do you say Received unexpected output without the position but in one place you give the position, I would say you should give the position, the more info the better. If you add this you should check if orderMatters at the end of the last for loop where you only output unexpected output with the remainer of the actualPositions. I think you could then abstract the logic in that last for loop where you check for unexpected output and the logic in the second to last for loop where you check for unexpected expected output into one method that you pass the relevant lists in along with the string Missing expected output vs Received unexpected output. This will also help to reduce the large amount of code in this single method. I think for (int i = 0; ; i++) { would be cleaner as a while loop: while (expectedPositionList.size() > i || actualPositionList.size() > i) { if (expectedPositionList.size() > i && actualPositionList.size() > i) { } else if (expectedPositionList.size() > i) { } else { } } Otherwise maybe adding some more comments would be useful given how complicated this method is but overall a great contribution.
        Hide
        Jim Donofrio added a comment -

        I agree with Brock in not rejecting the patch due to it making major changes. I had planned in MRUNIT-69 to improve the validate method to be simpler as I have made it more fragile with the class checking code. I am looking at your patch now.

        Show
        Jim Donofrio added a comment - I agree with Brock in not rejecting the patch due to it making major changes. I had planned in MRUNIT-69 to improve the validate method to be simpler as I have made it more fragile with the class checking code. I am looking at your patch now.
        Hide
        Brock Noland added a comment -

        Great! We will adjust the JIRA as opposed to rejecting a patch if it requires more than `minor' changes.

        Show
        Brock Noland added a comment - Great! We will adjust the JIRA as opposed to rejecting a patch if it requires more than `minor' changes.
        Hide
        Dave Beech added a comment -

        Here is a new patch (which passes unit tests this time - apologies for the previous one!)

        I found it quite difficult to get the last test case raised by Jim to pass because of the way validation logic is split between the "validate" and "lookupExpectedValue" methods.

        As a result I've probably changed much more than you'd like for such a minor improvement, so I'll understand if you do not want to accept this patch for that reason. Please let me know your thoughts.

        Show
        Dave Beech added a comment - Here is a new patch (which passes unit tests this time - apologies for the previous one!) I found it quite difficult to get the last test case raised by Jim to pass because of the way validation logic is split between the "validate" and "lookupExpectedValue" methods. As a result I've probably changed much more than you'd like for such a minor improvement, so I'll understand if you do not want to accept this patch for that reason. Please let me know your thoughts.
        Hide
        Dave Beech added a comment -

        Actually - ignore that latest patch altogether please. It's broken.

        Show
        Dave Beech added a comment - Actually - ignore that latest patch altogether please. It's broken.
        Hide
        Dave Beech added a comment -

        Thanks for the clarification Jim - I think our messages crossed each other on the wire just now! I think my latest patch fixes that issue, but I haven't included any extra unit tests. I will submit a third with tests included.

        Show
        Dave Beech added a comment - Thanks for the clarification Jim - I think our messages crossed each other on the wire just now! I think my latest patch fixes that issue, but I haven't included any extra unit tests. I will submit a third with tests included.
        Hide
        Jim Donofrio added a comment -

        Sorry my example wasnt correct but I think you know what I meant. I was trying this on a mapper but a mapper can only take one input.

        The below tests for a reducer or mapreduce driver fail with 0 Error(s) because of a mismatch in the output vs expected output size:

        @Test
        public void testDuplicateOutputOrderInsensitive()

        { thrown .expectAssertionErrorMessage("1 Error(s): (Received unexpected output (foo, bar))"); final ReduceDriver<Text, Text, Text, Text> driver = ReduceDriver .newReduceDriver(new IdentityReducer<Text, Text>()); driver.withInputKey(new Text("foo")).withInputValue(new Text("bar")) .withInputValue(new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); }

        @Test
        public void testDuplicateOutputOrderInsensitive()

        { thrown .expectAssertionErrorMessage("1 Error(s): (Received unexpected output (foo, bar))"); driver2.withMapper(new IdentityMapper<Text, Text>()).withReducer( new IdentityReducer<Text, Text>()); driver2.withInput(new Text("foo"), new Text("bar")) .withInput(new Text("foo"), new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); }
        Show
        Jim Donofrio added a comment - Sorry my example wasnt correct but I think you know what I meant. I was trying this on a mapper but a mapper can only take one input. The below tests for a reducer or mapreduce driver fail with 0 Error(s) because of a mismatch in the output vs expected output size: @Test public void testDuplicateOutputOrderInsensitive() { thrown .expectAssertionErrorMessage("1 Error(s): (Received unexpected output (foo, bar))"); final ReduceDriver<Text, Text, Text, Text> driver = ReduceDriver .newReduceDriver(new IdentityReducer<Text, Text>()); driver.withInputKey(new Text("foo")).withInputValue(new Text("bar")) .withInputValue(new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); } @Test public void testDuplicateOutputOrderInsensitive() { thrown .expectAssertionErrorMessage("1 Error(s): (Received unexpected output (foo, bar))"); driver2.withMapper(new IdentityMapper<Text, Text>()).withReducer( new IdentityReducer<Text, Text>()); driver2.withInput(new Text("foo"), new Text("bar")) .withInput(new Text("foo"), new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); }
        Hide
        Dave Beech added a comment -

        Hi Jim - updated patch here. I pasted in your test into TestMapDriver and, after a little head-scratching, realised that it was passing because only one key value pair is supported at a time, so there was no duplicate input after all.

        The same test run in a MapReduceDriver did fail but gave no error message. So I've added an extra check at the end of "validate" to look for Pairs in the actual output set that have no t been accounted for. This will hopefully solve the issue.

        Show
        Dave Beech added a comment - Hi Jim - updated patch here. I pasted in your test into TestMapDriver and, after a little head-scratching, realised that it was passing because only one key value pair is supported at a time, so there was no duplicate input after all. The same test run in a MapReduceDriver did fail but gave no error message. So I've added an extra check at the end of "validate" to look for Pairs in the actual output set that have no t been accounted for. This will hopefully solve the issue.
        Hide
        Dave Beech added a comment -

        Thanks Jim - good spot. I'll fix and resubmit.

        Show
        Dave Beech added a comment - Thanks Jim - good spot. I'll fix and resubmit.
        Hide
        Jim Donofrio added a comment -

        Overall patch looks good. However there is one problem, the below test with duplicate inputs passes without any error about missing about because you only check if the expected output list contains the input.

        @Test
        public void testDuplicateOutputOrderInsensitive()

        { driver.withInput(new Text("foo"), new Text("bar")) .withInput(new Text("foo"), new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); }
        Show
        Jim Donofrio added a comment - Overall patch looks good. However there is one problem, the below test with duplicate inputs passes without any error about missing about because you only check if the expected output list contains the input. @Test public void testDuplicateOutputOrderInsensitive() { driver.withInput(new Text("foo"), new Text("bar")) .withInput(new Text("foo"), new Text("bar")) .withOutput(new Text("foo"), new Text("bar")).runTest(false); }
        Hide
        Dave Beech added a comment -

        +1 I'd really like to see this included - I've often needed to use the workaround described by the original reporter.

        I've created an initial patch to add in this feature - please review!

        Show
        Dave Beech added a comment - +1 I'd really like to see this included - I've often needed to use the workaround described by the original reporter. I've created an initial patch to add in this feature - please review!
        Hide
        Jeremy Kahn added a comment -

        I hadn't considered map-only jobs, where order does matter.

        I withdraw my suggestion that order-insensitivity be the default, but having an order-insensitive option available would definitely be useful. + 1 to original suggestion from Bill McNeill above (hi, Bill!)

        Show
        Jeremy Kahn added a comment - I hadn't considered map-only jobs, where order does matter. I withdraw my suggestion that order-insensitivity be the default, but having an order-insensitive option available would definitely be useful. + 1 to original suggestion from Bill McNeill above (hi, Bill!)
        Hide
        Brock Noland added a comment -

        Yeah this is an interesting one. So if you run a map only job your outputs are written directly to HDFS in the order you output them. In a mapreduce job they are sorted and then delivered to the reducers. Based on that I don't see a clear line to whether we should care about the output order of the mapper by default.

        My feeling on the default to false (in new api where we could change compatibly) comes from when I first tested a mapper with MapDriver it failed due to ordering and I was surprised. It's very much a gut feeling.

        Show
        Brock Noland added a comment - Yeah this is an interesting one. So if you run a map only job your outputs are written directly to HDFS in the order you output them. In a mapreduce job they are sorted and then delivered to the reducers. Based on that I don't see a clear line to whether we should care about the output order of the mapper by default. My feeling on the default to false (in new api where we could change compatibly) comes from when I first tested a mapper with MapDriver it failed due to ordering and I was surprised. It's very much a gut feeling.
        Hide
        Jim Donofrio added a comment - - edited

        I do not agree with disabling order by default, this has been the convention that users has been used to from the beginning. Whether or not this is a good convention they will continue to make this assumption no matter how well this change is documented. I would make the default to care about order and allow you to turn it off by setting a boolean in runTest.

        I agree that most mappers wont care about order except for mappers in map only jobs but it would be confusing to have runTest default to ignoring order for mappers and caring about order in the rest of the api

        Show
        Jim Donofrio added a comment - - edited I do not agree with disabling order by default, this has been the convention that users has been used to from the beginning. Whether or not this is a good convention they will continue to make this assumption no matter how well this change is documented. I would make the default to care about order and allow you to turn it off by setting a boolean in runTest. I agree that most mappers wont care about order except for mappers in map only jobs but it would be confusing to have runTest default to ignoring order for mappers and caring about order in the rest of the api
        Hide
        Brock Noland added a comment -

        I actually agree with this with that, but changing that could cause users to who depend on the MapDriver checking ordering to have silent test failures. Perhaps the change from caring about ordering from the mapper should be changed via MRUNIT-69.

        Show
        Brock Noland added a comment - I actually agree with this with that, but changing that could cause users to who depend on the MapDriver checking ordering to have silent test failures. Perhaps the change from caring about ordering from the mapper should be changed via MRUNIT-69 .
        Hide
        Jeremy Kahn added a comment -

        In thinking about this further, I think that MapDriver.runTest() should by default ignore ordering differences – the point of a mapper is to emit items that will be sorted before being passed to a reducer.

        Users who really insist on order-matches could invoke .runTest(orderMatters=true); everybody else should just expect the mapper output to have the same Collection output (unordered multi-set) directed by the cumulative calls to .withOutput().

        Show
        Jeremy Kahn added a comment - In thinking about this further, I think that MapDriver.runTest() should by default ignore ordering differences – the point of a mapper is to emit items that will be sorted before being passed to a reducer. Users who really insist on order-matches could invoke .runTest(orderMatters=true); everybody else should just expect the mapper output to have the same Collection output (unordered multi-set) directed by the cumulative calls to .withOutput().
        Hide
        Brock Noland added a comment -

        +1, I think that we should be able to turn that assumption off.

        Show
        Brock Noland added a comment - +1, I think that we should be able to turn that assumption off.

          People

          • Assignee:
            Dave Beech
            Reporter:
            William McNeill
          • Votes:
            3 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development