Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      The PigProcessor needs to be able to handle multiple distinct inputs. These can come in a variety of flavors including multiple "file" inputs (Merge join), multiple shuffle inputs (Hash Join / Co-group), and a mix (Replicated Join).
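To make the requirement concrete, here is a minimal sketch of routing several named inputs to the operators that consume them. It is not the code from the attached patches: the class `MultiInputSketch`, the `Sink` type, and the method `attachInputs` are hypothetical names invented for illustration, and plain `Iterator<String>` stands in for whatever reader a real file or shuffle input would expose.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class MultiInputSketch {

    /** Hypothetical stand-in for a plan operator that consumes one named input. */
    static final class Sink {
        final List<String> received = new ArrayList<>();

        // Drains the input eagerly; a real operator would pull records lazily.
        void attach(Iterator<String> input) {
            while (input.hasNext()) {
                received.add(input.next());
            }
        }
    }

    /**
     * Routes each named input (file or shuffle, the sketch does not care which)
     * to the sink registered under the same name.
     */
    static void attachInputs(Map<String, Iterator<String>> inputs,
                             Map<String, Sink> sinks) {
        for (Map.Entry<String, Iterator<String>> entry : inputs.entrySet()) {
            Sink sink = sinks.get(entry.getKey());
            if (sink == null) {
                throw new IllegalArgumentException(
                        "no operator registered for input " + entry.getKey());
            }
            sink.attach(entry.getValue());
        }
    }
}
```

Keying inputs by name keeps the routing identical whether a vertex has two file inputs, two shuffle inputs, or one of each, which are the three flavors listed above.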

      1. PIG-3527.1.patch
        74 kB
        Mark Wagner
      2. PIG-3527.2.patch
        322 kB
        Mark Wagner
      3. PIG-3527.3.patch
        89 kB
        Cheolsoo Park
      4. PIG-3527.4.patch
        94 kB
        Cheolsoo Park

        Activity

        Cheolsoo Park added a comment -

        Committed PIG-3527.4.patch into tez branch. Thank you Mark!

        Note that I discovered 4 e2e test failures:

        • Checkin_3
        • Join_1
        • Operators_1
        • Operators_5

        I looked at the diff between the Tez and MR runs and found that these are mostly due to the non-deterministic nature of the test queries, e.g. the order of tuples within a group or which 100 tuples a limit selects. I will file a separate JIRA to fix the tests.
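One way to keep tuple ordering from producing spurious diffs is to compare the two outputs as multisets rather than line by line. The sketch below is an assumption about how such a check could look, not how Pig's e2e harness works; `OrderInsensitiveCompare` and `sameRows` are hypothetical names, and each output row is treated as an opaque string.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveCompare {

    /**
     * Returns true when both outputs contain the same rows with the same
     * multiplicities, regardless of the order in which they were emitted.
     */
    static boolean sameRows(List<String> expected, List<String> actual) {
        // Sort copies so the callers' lists are left untouched.
        List<String> a = new ArrayList<>(expected);
        List<String> b = new ArrayList<>(actual);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }
}
```

This only covers differences in ordering. A query whose limit may legitimately pick different tuples on each engine would still fail such a check and would need the query itself to be made deterministic, for example by ordering before the limit.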

        Cheolsoo Park added a comment -

        Here is a patch that is rebased to the current HEAD of tez branch. I will commit this after running tests.

        In terms of changes, Mark's patch includes 1) the POPackage refactoring and 2) incremental changes for multiple inputs, and I eliminated #1 from the patch. Since yesterday, I have also been going through the patch carefully to make sure that I am not losing any changes.

        Mark Wagner added a comment -

        Update with the POPackage refactoring. This patch depends on the one in PIG-3595.

        Cheolsoo Park added a comment -

        Mark Wagner, thank you very much for the great work! Overall it looks good. I am still going through the patch, but I made two high-level comments in the RB.

        Mark Wagner added a comment -

        Here's an initial patch. There are some things that I need to clean up, and I've marked these with TODOs. I've posted a review at https://reviews.apache.org/r/15194/. One interesting thing to note is that after attaching inputs directly to the operator pipeline, I observed a ~40% speedup. I believe this is because there are fewer calls returning STATUS_EOP, but I haven't tested this.


          People

          • Assignee:
            Mark Wagner
            Reporter:
            Mark Wagner
          • Votes:
            0
            Watchers:
            2
