Pig
  1. Pig
  2. PIG-2328

Add builtin UDFs for building and using bloom filters

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: internal-udfs
    • Labels:
      None
    • Release Note:
      Hide
      Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join.

      {code}
      define bb BuildBloom('128', '3', 'jenkins');
      small = load 'S' as (x, y, z);
      grpd = group small all;
      fltrd = foreach grpd generate bb(small.x);
      store fltrd in 'mybloom';
      exec;
      define bloom Bloom('mybloom');
      large = load 'L' as (a, b, c);
      flarge = filter large by bloom(L.a);
      joined = join small by x, flarge by a;
      store joined into 'results';
      {code}

      When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions.

      This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.
      Show
      Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join. {code} define bb BuildBloom('128', '3', 'jenkins'); small = load 'S' as (x, y, z); grpd = group small all; fltrd = foreach grpd generate bb(small.x); store fltrd in 'mybloom'; exec; define bloom Bloom('mybloom'); large = load 'L' as (a, b, c); flarge = filter large by bloom(L.a); joined = join small by x, flarge by a; store joined into 'results'; {code} When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions. This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.

      Description

      Bloom filters are a common way to do select a limited set of records before moving data for a join or other heavy weight operation. Pig should add UDFs to support building and using bloom filters.

      1. PIG-bloom-3.patch
        32 kB
        Alan Gates
      2. PIG-bloom-2.patch
        32 kB
        Alan Gates
      3. PIG-bloom.patch
        26 kB
        Alan Gates

        Issue Links

          Activity

          Alan Gates created issue -
          Alan Gates made changes -
          Field Original Value New Value
          Attachment PIG-bloom.patch [ 12499740 ]
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alan Gates made changes -
          Release Note Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join.

          {code}
          define bb BuildBloom('128', '3', 'jenkins');
          small = load 'S' as (x, y, z);
          grpd = group small all;
          fltrd = foreach grpd generate bb(small.x);
          store fltrd in 'mybloom';
          exec;
          define bloom Bloom('mybloom');
          large = load 'L' as (a, b, c);
          flarge = filter large by bloom(L.a);
          joined = join small by x, flarge by a;
          store joined into 'results';
          {code}

          When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions.

          This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alan Gates made changes -
          Attachment PIG-bloom-2.patch [ 12500515 ]
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alan Gates made changes -
          Attachment PIG-bloom-3.patch [ 12502081 ]
          Alan Gates made changes -
          Link This issue relates to PIG-2348 [ PIG-2348 ]
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.11 [ 12318878 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Alan Gates
              Reporter:
              Alan Gates
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development