Pig / PIG-5255 Improvements to bloom join / PIG-5342

Add setting to turn off bloom join combiner

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      1) Add a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for bloom join. When the keys are all unique, the combiner is unnecessary overhead.
      2) Previously, the keys were the bloom filter index and the values were the join keys. Combining involved doing a distinct on the bag of values, which has memory issues for more than 10 million records. That arrangement needs to be flipped and a distinct combiner used, so that it scales to billions of records.
      3) Mention in the documentation that bloom join is also ideal for a right outer join with the smaller dataset on the right. Replicated join only supports left outer join.
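
      As a rough illustration of points 1 and 3, the script below sketches a right outer bloom join with the combiner turned off. It assumes the property keeps the name proposed in this issue (pig.bloomjoin.nocombiner) and that the join keys are known to be unique. The relation names, paths and fields are hypothetical.

      ```pig
      -- Assumption: property name as proposed in this issue.
      -- Only worthwhile when the join keys are (nearly) all unique,
      -- so the combiner would have nothing to collapse.
      SET pig.bloomjoin.nocombiner true;

      -- Hypothetical inputs: a large relation and a much smaller one.
      big   = LOAD 'big_data'   AS (b1, b2, b3);
      small = LOAD 'small_data' AS (s1, s2, s3);

      -- Right outer join with the smaller relation on the right,
      -- the case that replicated join does not support.
      C = JOIN big BY b1 RIGHT OUTER, small BY s1 USING 'bloom';

      STORE C INTO 'joined_output';
      ```

      Leaving the setting at its default keeps the combiner on, which is likely the better choice when join keys repeat heavily.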

       

        Attachments

        1. PIG-5342-1.patch
          26 kB
          Satish Saley
        2. PIG-5342-2.patch
          46 kB
          Satish Saley
        3. PIG-5342-3.patch
          48 kB
          Satish Saley
        4. PIG-5342-4.patch
          41 kB
          Satish Saley
        5. PIG-5342-5.patch
          41 kB
          Satish Saley
        6. PIG-5342-6.patch
          43 kB
          Satish Saley
        7. PIG-5342-7.patch
          43 kB
          Satish Saley
        8. PIG-5342-8.patch
          43 kB
          Satish Saley

            People

            • Assignee:
              satishsaley Satish Saley
              Reporter:
              satishsaley Satish Saley
            • Votes:
              0
              Watchers:
              2
