  1. Pig
  2. PIG-4963

Add a Bloom join

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In PIG-4925, an option was added to pass a BloomFilter as a scalar to the bloom function. In practice, however, using it on big data that required a huge vector size was very inefficient and led to OOM.
      I had initially calculated that a vector size of 100 million would need a bytearray of around 12MB ((100000000 + 7) / 8 = 12500000 bytes), and that this would be the scalar value broadcast, so it would not take much space. The problem is that the 12MB was written out for every input record by BuildBloom$Initial, before the aggregation happens and we arrive at the final BloomFilter vector. With POPartialAgg, this runs into OOM issues.
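
      The arithmetic above can be checked with a short sketch. The class and method names here are hypothetical and are not part of Pig; the record count of 1,000 is an illustrative assumption chosen only to show how quickly a per-record vector adds up before aggregation.

```java
public class BloomSizing {
    // Bytes needed to hold a bit vector of the given size,
    // rounded up to a whole number of bytes.
    static long bytesForBits(long vectorSizeBits) {
        return (vectorSizeBits + 7) / 8;
    }

    // Total bytes produced if one full-size vector is written out
    // for every input record before any aggregation takes place.
    static long bytesBeforeAggregation(long vectorSizeBits, long records) {
        return bytesForBits(vectorSizeBits) * records;
    }

    public static void main(String[] args) {
        long bits = 100_000_000L;       // vector size from the description
        long records = 1_000L;          // hypothetical record count
        System.out.println(bytesForBits(bits));                    // 12500000
        System.out.println(bytesBeforeAggregation(bits, records)); // 12500000000
    }
}
```

      A single 12.5 MB vector is cheap to broadcast, but at one vector per record even a thousand records amount to about 12.5 GB of intermediate data.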

      If we added a bloom join implementation that can be combined with hash or skewed join, it would boost performance for a lot of jobs. The Bloom filter of the smaller tables can be sent to the bigger tables as a scalar, and the data filtered before the hash or skewed join is applied.
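
      The idea can be illustrated with a minimal in-memory sketch. This is not the patch's implementation and does not use Pig's operators or Hadoop's BloomFilter class; the `SimpleBloom` class, its hash scheme, and the `join` method are hypothetical stand-ins. It builds a filter from the keys of the smaller table, drops rows of the bigger table whose keys cannot match, and then runs an ordinary hash join on the survivors.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class BloomJoinSketch {
    // Toy Bloom filter (hypothetical, for illustration only).
    static final class SimpleBloom {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        SimpleBloom(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // Derive the i-th bit position from two hash values (double hashing).
        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1;
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        // False means the key is definitely absent; true means it may be present.
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    // Joins big rows {key, value} against the small table on key.
    static List<String> join(Map<String, String> small, List<String[]> big) {
        // Step 1: build the filter from the smaller table's keys.
        SimpleBloom bloom = new SimpleBloom(1024, 3);
        for (String key : small.keySet()) {
            bloom.add(key);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : big) {
            // Step 2: discard rows that cannot match, before the join itself.
            if (!bloom.mightContain(row[0])) {
                continue;
            }
            // Step 3: regular hash join; this lookup also removes false positives.
            String match = small.get(row[0]);
            if (match != null) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }
}
```

      Because the filter can only produce false positives, never false negatives, the join result is the same as without it; the gain comes from the rows of the bigger table that are dropped before the join.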

        Attachments

        1. PIG-4963-1.patch
          150 kB
          Rohini Palaniswamy
        2. PIG-4963-2.patch
          151 kB
          Rohini Palaniswamy
        3. PIG-4963-3.patch
          152 kB
          Rohini Palaniswamy
        4. PIG-4963-4.patch
          155 kB
          Rohini Palaniswamy
        5. PIG-4963-5.patch
          155 kB
          Rohini Palaniswamy
        6. PIG-4963-6.patch
          155 kB
          Rohini Palaniswamy

              People

              • Assignee: Rohini Palaniswamy
              • Reporter: Rohini Palaniswamy