Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4059 Pig on Spark
  3. PIG-4848

pig.noSplitCombination=true should always be set internally for a merge join

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • spark-branch
    • spark
    • None

    Description

      In spark mode, for a merge join, the flag is NOT set as true internally. The input splits will be in the order of file size. The output is out of order.

      Scenaro:
      cat input1

      1	1
      

      cat input2

      2	2
      

      cat input3

      33	33
      

      A = LOAD 'input*' as (a:int, b:int);
      B = LOAD 'input*' as (a:int, b:int);
      C = JOIN A BY $0, B BY $0 USING 'merge';
      DUMP C;

      expected result:

      (1,1,1,1)
      (2,2,2,2)
      (33,33,33,33)
      

      actual result:

      (33,33,33,33)
      (1,1,1,1)
      (2,2,2,2)
      

      In MR mode, the flag was set as true internally for a merge join(see: PIG-2773). However, it doesn't work now. The output is still out of order, because the splits will be ordered again by hadoop-client. In spark mode, we can solve this issue.

      Attachments

        1. PIG-4848.patch
          2 kB
          Xianda Ke
        2. PIG-4848-2.patch
          2 kB
          Xianda Ke
        3. PIG-4848-hotfix.patch
          0.8 kB
          Xianda Ke

        Activity

          People

            kexianda Xianda Ke
            kexianda Xianda Ke
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: