Crunch / CRUNCH-301

Cogrouping tables where RHS has a Scala tuple value type causes duplicated and missing RHS values

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.9.0, 0.8.2
    • Component/s: Scrunch
    • Labels: None
    • Environment: Hadoop 2

      Description

      Suppose you have three record types, Rec1, Rec2 and Rec3.
      Rec1 references Rec2 via key1, and Rec2 references Rec3 (one-to-many) via key2. If you innerJoin Rec2 and Rec3 to make a PCollection[(Rec2,Rec3)] and then cogroup it against Rec1, the result should surface the n distinct (Rec2, Rec3) tuples applicable to each Rec1. Instead, it surfaces just one of the (Rec2, Rec3) tuples, repeated multiple times.

      This only happens when running with MRPipeline, and not with MemPipeline.

      The attached IsolatedBug.scala is the simplest complete program I could come up with that produces this unexpected result.

      The result that is produced is:

      Rec1(1,tjena) Rec1(1,hello) (Rec2(1,a,0.5),Rec3(a,4)) (Rec2(1,a,0.5),Rec3(a,4)) (Rec2(1,a,0.5),Rec3(a,4)) (Rec2(1,a,0.5),Rec3(a,4))
      Rec1(2,goodbye) (Rec2(2,c,9.9),Rec3(c,6))

      As you can see, there's a single (Rec2, Rec3) tuple repeated many times, instead of showing all the distinct ones. This does not happen if you join against Rec2 on its own.
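
      For comparison, the expected join-then-cogroup semantics can be sketched with plain Scala collections, without Crunch or Hadoop. This is only an in-memory model, not the attached IsolatedBug.scala: the field names, the helper functions innerJoin and cogroup, and the sample values are all assumptions made for illustration. The point is that each key's right-hand side should contain every distinct (Rec2, Rec3) pair exactly once.

```scala
// Hypothetical record shapes: field names are assumptions inferred from the
// printed output in this report, not taken from IsolatedBug.scala.
case class Rec1(key1: Int, text: String)
case class Rec2(key1: Int, key2: String, weight: Double)
case class Rec3(key2: String, count: Int)

object ExpectedCogroup {
  // In-memory model of an inner join: every (left, right) pair sharing a key.
  def innerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))

  // In-memory model of a cogroup: for each key, all left values and all right values.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      k -> ((left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  // Illustrative sample data, made up for this sketch.
  val rec1s = Seq(Rec1(1, "hello"), Rec1(1, "tjena"), Rec1(2, "goodbye"))
  val rec2s = Seq(Rec2(1, "a", 0.5), Rec2(1, "b", 0.7), Rec2(2, "c", 9.9))
  val rec3s = Seq(Rec3("a", 4), Rec3("a", 5), Rec3("b", 1), Rec3("c", 6))

  // Join Rec2 with Rec3 on key2, then re-key each pair by Rec2.key1.
  val pairsByKey1: Seq[(Int, (Rec2, Rec3))] =
    innerJoin(rec2s.map(r => (r.key2, r)), rec3s.map(r => (r.key2, r)))
      .map { case (_, pair) => (pair._1.key1, pair) }

  // Cogroup Rec1 against the joined pairs on key1.
  val grouped: Map[Int, (Seq[Rec1], Seq[(Rec2, Rec3)])] =
    cogroup(rec1s.map(r => (r.key1, r)), pairsByKey1)

  def main(args: Array[String]): Unit =
    grouped.toSeq.sortBy(_._1).foreach { case (k, (r1s, pairs)) =>
      println(s"$k: ${r1s.mkString(" ")} ${pairs.mkString(" ")}")
    }
}
```

      In this model, key 1 carries three different (Rec2, Rec3) pairs and no pair is repeated. That is the shape of result the report expects from MRPipeline, and which it says MemPipeline already produces.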

        Attachments

        1. IsolatedBug.scala
          3 kB
          David Whiting
        2. CRUNCH-301.patch
          9 kB
          Josh Wills
        3. CRUNCH-301b.patch
          9 kB
          Josh Wills

            People

            • Assignee: Unassigned
            • Reporter: David Whiting (davw)
            • Votes: 0
            • Watchers: 4
