Crunch / CRUNCH-234

PCollectionGetSizeIT test failure with hadoop-2 profile

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: Core
    • Labels:
      None

      Description

      Running the build with the "hadoop-2" profile produces one test failure.

      Results :

      Failed tests: testGetSizeOfEmptyIntermediatePCollection_MRPipeline(org.apache.crunch.PCollectionGetSizeIT): (..)

      Tests run: 211, Failures: 1, Errors: 0, Skipped: 1

      I haven't dug into the test yet but this is the output from the test run:

      Tests run: 11, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 1.254 sec <<< FAILURE!
      testGetSizeOfEmptyIntermediatePCollection_MRPipeline(org.apache.crunch.PCollectionGetSizeIT) Time elapsed: 0.581 sec <<< FAILURE!
      java.lang.AssertionError:
      Expected: is <0L>
      got: <8L>

      at org.junit.Assert.assertThat(Assert.java:780)
      at org.junit.Assert.assertThat(Assert.java:738)
      at org.apache.crunch.PCollectionGetSizeIT.testGetSizeOfEmptyIntermediatePCollection_MRPipeline(PCollectionGetSizeIT.java:89)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
      at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
      at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
      at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
      at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
      at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:46)
      at org.junit.rules.RunRules.evaluate(RunRules.java:18)
      at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
      at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
      at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
      at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
      at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
      at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
      at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
      at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
      at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
      at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
      at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
      at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
      at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)

      Note that this failure shows up after CRUNCH-233, which was needed to get the build to compile.

        Activity

        Transition            Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Resolved      5d 2h 27m               1                 Josh Wills      08/Jul/13 19:05
        Resolved -> Closed    20d 21h 45m             1                 Josh Wills      29/Jul/13 16:51
        Josh Wills made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Micah Whitacre added a comment -

        +1. Change fixed the build for me locally as well.

        Josh Wills made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Josh Wills added a comment -

        Committed to master.

        Josh Wills made changes -
        Field Original Value New Value
        Attachment CRUNCH-234.patch [ 12591226 ]
        Josh Wills added a comment -

        Here's a patch for this. Verifying that it works in both hadoop1 and hadoop2, and will then commit.

        Josh Wills added a comment -

        Yeah, exactly. But we wouldn't want to do that if, e.g., the directory contained thousands of files. So we could use the content summary as a check, and say that if the # of files is less than X, then check the files directly. But that's still a hack.
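
        The threshold idea above can be sketched roughly as follows. This is not the committed patch: it uses java.nio on a local directory as a stand-in for Hadoop's content summary, and the class name, the method names, the threshold parameter, and the rule that names starting with "_" or "." are not data are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class HybridSize {
    // Local stand-in for an aggregate summary: {regular file count, total bytes}.
    static long[] summarize(Path dir) throws IOException {
        long count = 0L;
        long bytes = 0L;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                if (Files.isRegularFile(p)) {
                    count++;
                    bytes += Files.size(p);
                }
            }
        }
        return new long[] {count, bytes};
    }

    // Careful path: sum only files whose names do not start with '_' or '.'
    // (an assumed convention for marker and checksum files).
    static long inspectFiles(Path dir) throws IOException {
        long total = 0L;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                String name = p.getFileName().toString();
                if (Files.isRegularFile(p) && !name.startsWith("_") && !name.startsWith(".")) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    // maxFilesToInspect plays the role of "X": below it, look at the files
    // one by one; at or above it, trust the aggregate number as-is.
    static long size(Path dir, int maxFilesToInspect) throws IOException {
        long[] summary = summarize(dir);
        if (summary[0] >= maxFilesToInspect) {
            return summary[1];
        }
        return inspectFiles(dir);
    }
}
```

        With a directory holding one empty part file and an 8-byte marker, size(dir, 100) returns 0 while size(dir, 1) falls back to the aggregate and returns 8, which shows why this remains a hack: the answer depends on the threshold.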

        Gabriel Reid added a comment -

        Right you are, of course. I suppose the contents of the directory could also be iterated over if the path is a directory, but it obviously won't be as simple as I was thinking.

        Josh Wills added a comment -

        I hadn't started in on it yet, but I don't think the filter trick will work in this case-- we're getting the content summary on the directory, not on the underlying files.

        Gabriel Reid added a comment -

        Just checking this out too – looks like the PathFilter used in CompositePathIterable could just be applied to globStatus call in SourceTargetHelper and that should do it.

        Josh, I don't know if you're working on a patch for this right now (and if you're already on a better plan just ignore my comment), but if not I don't mind doing it.
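
        For illustration only, the filter idea might look like the sketch below. It swaps in java.nio's DirectoryStream.Filter on a local directory for Hadoop's PathFilter and globStatus, so it is an analogy rather than the Crunch code; the class and method names are hypothetical, and treating names that start with "_" or "." as non-data is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class FilteredSize {
    // Assumed convention: names starting with '_' or '.' are markers or
    // checksums rather than data, so the filter rejects them.
    static final DirectoryStream.Filter<Path> DATA_FILES = p -> {
        String name = p.getFileName().toString();
        return Files.isRegularFile(p) && !name.startsWith("_") && !name.startsWith(".");
    };

    // Sums the lengths of the data files directly inside dir (non-recursive).
    static long dataSize(Path dir) throws IOException {
        long total = 0L;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, DATA_FILES)) {
            for (Path f : files) {
                total += Files.size(f);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("intermediate");
        Files.createFile(dir.resolve("part-r-00000"));      // empty data file
        Files.write(dir.resolve("_SUCCESS"), new byte[8]);  // stand-in for non-data bytes
        System.out.println(dataSize(dir));                  // prints 0
    }
}
```

        The filter works here because the files are listed individually; as the replies note, it does not help when the size is taken from a summary of the directory as a whole.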

        Josh Wills added a comment -

        Yeah, that's precisely the kind of hack I had in mind. I'll try a few things and see what looks least ugly.

        Micah Whitacre added a comment -

        I believe that for sequence files you can actually specify regexes instead of simple paths. So one idea would be, for intermediate state, to specify not a simple path but a regex that excludes any file named _SUCCESS. Kind of a hack, I realize.
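
        A rough sketch of the pattern idea, using a java.nio glob on a local directory rather than a Hadoop input path. The glob syntax shown is java.nio's and is not claimed to match Hadoop's; the class name, the method name, and the choice to exclude every name starting with "_" or "." (not only _SUCCESS) are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PatternInput {
    // java.nio glob: first character is anything except '_' or '.',
    // followed by any run of characters.
    static final String DATA_GLOB = "[!_.]*";

    // Returns the sorted names of regular files in dir that match DATA_GLOB.
    static List<String> dataFileNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, DATA_GLOB)) {
            for (Path p : matches) {
                if (Files.isRegularFile(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

        Given a directory containing part-r-00000, _SUCCESS, and .part-r-00000.crc, dataFileNames returns only part-r-00000.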

        Josh Wills added a comment -

        So here's what's going on: the _SUCCESS file we're adding to the output from the intermediate collection MapReduce job now means that the output directory has a non-zero length associated with it, but only in Hadoop 2, not in Hadoop 1. I don't see a clean way to fix this in the current implementation-- we could just hack it to do a deeper inquiry into the content of the files when the number of entries in the directory is small.

        Josh Wills added a comment -

        Crazy interesting-- will take a look.

        Micah Whitacre created issue -

          People

          • Assignee: Josh Wills
          • Reporter: Micah Whitacre
          • Votes: 0
          • Watchers: 3