Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-2673

Allow Merge join to follow an ORDER statement

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None

      Description

      Currently, we insist that data for a merge join must come from an OrderedLoadFunc.
      We can relax this condition and allow explicit ordering operations to precede a MergeJoin.

      1. PIG-2673_0.patch
        11 kB
        Dmitriy V. Ryaboy
      2. PIG-2673_1_noprefix_now_with_merge.patch
        17 kB
        Dmitriy V. Ryaboy
      3. PIG-2673_1_noprefix.patch
        272 kB
        Dmitriy V. Ryaboy
      4. PIG-2673_1.patch
        273 kB
        Dmitriy V. Ryaboy
      5. PIG-2673.2.patch
        21 kB
        Dmitriy V. Ryaboy

        Issue Links

          Activity

          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Attaching patch that relaxes the sanity checks and updates the documentation.

          Note that while working on this patch I discovered we never enforced a stated condition for the merge join, which is that you cannot call UDFs on the data between a load and a merge join. This is also fixed in this patch. This fix may break people's script in cases when they were using UDFs and unwittingly violating the documented constraint.

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Attaching patch that relaxes the sanity checks and updates the documentation. Note that while working on this patch I discovered we never enforced a stated condition for the merge join, which is that you cannot call UDFs on the data between a load and a merge join. This is also fixed in this patch. This fix may break people's script in cases when they were using UDFs and unwittingly violating the documented constraint.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          PIG-959 relaxes the condition for all non-blocking operators too. May be these too should be done in conjunction.

          Show
          ashutoshc Ashutosh Chauhan added a comment - PIG-959 relaxes the condition for all non-blocking operators too. May be these too should be done in conjunction.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Aah.. blocking.. not non-blocking.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Aah.. blocking.. not non-blocking.
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          I don't think you can relax the conditions around distinct, since that involves hashing the input across reducers?

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - I don't think you can relax the conditions around distinct, since that involves hashing the input across reducers?
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Ashutosh, can you review this? Think you are the person most familiar with the Merge join code.

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Ashutosh, can you review this? Think you are the person most familiar with the Merge join code.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Yeah, will take a look at it.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Yeah, will take a look at it.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          I don't think you can relax the conditions around distinct, since that involves hashing the input across reducers?

          Yeah, thats right. For other blocking operators:

          • Order-by: obv we can relax it.
          • Join/Cogroup: We can relax for map-side joins & cogroups i.e., FR,Merge Join and cogroup, but not the ones having reducers.
          Show
          ashutoshc Ashutosh Chauhan added a comment - I don't think you can relax the conditions around distinct, since that involves hashing the input across reducers? Yeah, thats right. For other blocking operators: Order-by: obv we can relax it. Join/Cogroup: We can relax for map-side joins & cogroups i.e., FR,Merge Join and cogroup, but not the ones having reducers.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          I took a quick look at your patch. For chaining together of MR jobs (i.e., order-by MR job feeding into merge-join MR job), you will need changes in MRCompiler which are outlined in PIG-959 patch. I don't see any of that in your patch. There must be something which I am missing here.

          Show
          ashutoshc Ashutosh Chauhan added a comment - I took a quick look at your patch. For chaining together of MR jobs (i.e., order-by MR job feeding into merge-join MR job), you will need changes in MRCompiler which are outlined in PIG-959 patch. I don't see any of that in your patch. There must be something which I am missing here.
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Patch is incomplete right now.. only fixes the LP blocks, but it does need the MRCompiler changes. Will need some more work. Hackday!

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Patch is incomplete right now.. only fixes the LP blocks, but it does need the MRCompiler changes. Will need some more work. Hackday!
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          This seems to work (verified manually in addition to running tests). Ready for review again

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - This seems to work (verified manually in addition to running tests). Ready for review again
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Seems like you ran git pull after a long while.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Seems like you ran git pull after a long while.
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Now, without git prefixes.

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Now, without git prefixes.
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          sigh.

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - sigh.
          Hide
          julienledem Julien Le Dem added a comment -

          for some reason the review board did not post my review back to the JIRA.
          FYI: https://reviews.apache.org/r/5280/

          Show
          julienledem Julien Le Dem added a comment - for some reason the review board did not post my review back to the JIRA. FYI: https://reviews.apache.org/r/5280/
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Addressed review comments. TestMergeJoin passes (and takes 30 minutes...)

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Addressed review comments. TestMergeJoin passes (and takes 30 minutes...)
          Hide
          jcoveney Jonathan Coveney added a comment -

          LGTM. I'm +1. Julien, you satisfied with the responses?

          Show
          jcoveney Jonathan Coveney added a comment - LGTM. I'm +1. Julien, you satisfied with the responses?
          Hide
          julienledem Julien Le Dem added a comment -

          LGTM.
          +1

          Show
          julienledem Julien Le Dem added a comment - LGTM. +1
          Hide
          dvryaboy Dmitriy V. Ryaboy added a comment -

          Committed to 0.11 (trunk)

          Show
          dvryaboy Dmitriy V. Ryaboy added a comment - Committed to 0.11 (trunk)

            People

            • Assignee:
              dvryaboy Dmitriy V. Ryaboy
              Reporter:
              dvryaboy Dmitriy V. Ryaboy
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development