Crunch (Retired) / CRUNCH-601

Short PCollections in SparkPipeline get length null.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: Spark
    • Labels: None
    • Environment: Running in local mode on Mac as well as in an Ubuntu 14.04 Docker container

    Description

      I'll attach a file with a test that I would expect to pass but which fails.

      It creates five PCollection<String> instances of lengths 0, 1, 2, 3, and 4, requests their lengths, runs the pipeline, and prints the lengths. Finally, it asserts that all lengths are non-null.

      I would expect it to print lengths 0, 1, 2, 3, 4 and pass.

      What it does is print lengths null, null, null, 3, 4 and fail.

      I think the underlying reason is that getSize() is called on an unmaterialized PCollection, and the code assumes that when the estimate returned by getSize() is 0, the PCollection is guaranteed to be empty. That assumption is false in some cases.
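
      To illustrate the suspected failure mode, here is a toy model in plain Java. It is not Crunch code: the class, the method names, and the "round down to whole kilobytes" estimator are all assumptions made up for this sketch, with element sizes chosen so that the output mirrors the pattern reported above. It only shows why treating a size estimate of 0 as proof of emptiness can leave a length unset.

```java
import java.util.Collections;
import java.util.List;

// Toy model only. Nothing here is taken from the Crunch code base;
// every name and the estimator itself are hypothetical.
final class SizeEstimateSketch {

    // Builds n elements of 400 characters each (an arbitrary size for the demo).
    static List<String> sample(int n) {
        String element = String.join("", Collections.nCopies(400, "x"));
        return Collections.nCopies(n, element);
    }

    // Hypothetical estimator: total characters, rounded DOWN to whole kilobytes.
    // Small but non-empty inputs therefore get an estimate of 0.
    static long estimatedSizeKb(List<String> items) {
        long chars = 0;
        for (String s : items) {
            chars += s.length();
        }
        return chars / 1024;
    }

    // Flawed shortcut: assumes "estimate == 0" means "nothing to count",
    // skips the count, and leaves the result unset (null).
    static Long lengthTrustingEstimate(List<String> items) {
        if (estimatedSizeKb(items) == 0) {
            return null;
        }
        return Long.valueOf(items.size());
    }

    // Safer variant: always counts, so an empty input yields 0 rather than null.
    static Long lengthByCounting(List<String> items) {
        return Long.valueOf(items.size());
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 4; n++) {
            List<String> items = sample(n);
            System.out.println("n=" + n
                + " trustingEstimate=" + lengthTrustingEstimate(items)
                + " byCounting=" + lengthByCounting(items));
        }
    }
}
```

      In this sketch the shortcut yields null for the inputs of 0, 1, and 2 elements and the correct values for 3 and 4, while the counting variant yields 0 through 4. The threshold falls between 2 and 3 only because of the element size picked here; it says nothing about where the real threshold lies.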

      Attachments

        1. CRUNCH-601.patch
          6 kB
          Micah Whitacre
        2. CRUNCH-601b.patch
          7 kB
          Micah Whitacre
        3. CRUNCH-601c.patch
          7 kB
          Micah Whitacre
        4. CRUNCH-601d.patch
          7 kB
          Josh Wills
        5. CRUNCH-601-jw.patch
          4 kB
          Josh Wills
        6. SmallCollectionLengthTest.java
          3 kB
          Mikael Goldmann

          People

            Assignee: Micah Whitacre (mkwhitacre)
            Reporter: Mikael Goldmann (migoldmann)
            Votes: 0
            Watchers: 3
