  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4848

pig.noSplitCombination=true should always be set internally for a merge join

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels:
      None

      Description

      In Spark mode, pig.noSplitCombination is NOT set to true internally for a merge join. As a result, the input splits are ordered by file size, and the join output is out of order.

      Scenario:
      cat input1

      1	1
      

      cat input2

      2	2
      

      cat input3

      33	33
      

      A = LOAD 'input*' as (a:int, b:int);
      B = LOAD 'input*' as (a:int, b:int);
      C = JOIN A BY $0, B BY $0 USING 'merge';
      DUMP C;

      expected result:

      (1,1,1,1)
      (2,2,2,2)
      (33,33,33,33)
      

      actual result:

      (33,33,33,33)
      (1,1,1,1)
      (2,2,2,2)
      

      In MR mode, the flag is set to true internally for a merge join (see PIG-2773). However, that no longer works: the output is still out of order, because hadoop-client orders the splits again. In Spark mode, we can solve this issue.
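
The ordering effect described above can be illustrated with a small toy model. This is only a sketch, not Pig or Hadoop code: the split tuples, the `merge_join_scan` helper, and the byte sizes are hypothetical, and the "largest first" ordering is an assumption chosen because it matches the actual result reported in this issue.

```python
# Toy model, not Pig or Hadoop code.
# Each "split" is (file name, size in bytes, rows already sorted by key).
# Sizes assume one tab-separated line plus a trailing newline per file.
splits = [
    ("input1", 4, [(1, 1)]),
    ("input2", 4, [(2, 2)]),
    ("input3", 6, [(33, 33)]),
]


def merge_join_scan(ordered_splits):
    """Emit self-join rows in the order the splits are scanned.

    Hypothetical helper: a merge join relies on its input arriving
    in key order, so the scan order of the splits decides the output order.
    """
    return [row + row for _, _, rows in ordered_splits for row in rows]


# Splits kept in file order: the output stays sorted by key.
in_file_order = merge_join_scan(splits)

# Splits re-sorted by size, largest first (assumed ordering; the sort is stable,
# so equally sized splits keep their relative order).
by_size = sorted(splits, key=lambda s: s[1], reverse=True)
in_size_order = merge_join_scan(by_size)

print(in_file_order)
print(in_size_order)
```

With these assumed sizes, the file-order scan reproduces the expected result, while the size-ordered scan puts the row from input3 first, as in the actual result.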

        Attachments

        1. PIG-4848-hotfix.patch
          0.8 kB
          Xianda Ke
        2. PIG-4848-2.patch
          2 kB
          Xianda Ke
        3. PIG-4848.patch
          2 kB
          Xianda Ke

            People

            • Assignee:
              kexianda Xianda Ke
              Reporter:
              kexianda Xianda Ke
            • Votes:
              0
              Watchers:
              5
