Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-2690

Pig Documentation regarding Merge Join is confusing

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.7.0, 0.8.1
    • Fix Version/s: None
    • Component/s: documentation, site
    • Labels:
    • Tags:
      Docs

      Description

      The Documentation regarding merge join in pig is a bit off.

      http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html#Merge+Joins

      "For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be less than 128 MB). If the total input size (including all part files) is greater than blocksize, then the part files should be uniform in size (without large skews in sizes)."

      This is confusing and should read something more akin to this:
      http://wiki.apache.org/pig/PigMergeJoin

      For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be > 128 MB). If the total input size (including all part files) is < a blocksize, then the part files should be uniform in size (without large skews in sizes). The main idea is to eliminate skew in the amount of input the final map job performing the merge-join will process.

        Attachments

        1. fixDocs_0.patch
          3 kB
          Jeff Lord

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jlord Jeff Lord
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: