Pig
  1. Pig
  2. PIG-2328

Add builtin UDFs for building and using bloom filters

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: internal-udfs
    • Labels:
      None
    • Release Note:
      Hide
      Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join.

      {code}
      define bb BuildBloom('128', '3', 'jenkins');
      small = load 'S' as (x, y, z);
      grpd = group small all;
      fltrd = foreach grpd generate bb(small.x);
      store fltrd in 'mybloom';
      exec;
      define bloom Bloom('mybloom');
      large = load 'L' as (a, b, c);
      flarge = filter large by bloom(L.a);
      joined = join small by x, flarge by a;
      store joined into 'results';
      {code}

      When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions.

      This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.
      Show
      Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join. {code} define bb BuildBloom('128', '3', 'jenkins'); small = load 'S' as (x, y, z); grpd = group small all; fltrd = foreach grpd generate bb(small.x); store fltrd in 'mybloom'; exec; define bloom Bloom('mybloom'); large = load 'L' as (a, b, c); flarge = filter large by bloom(L.a); joined = join small by x, flarge by a; store joined into 'results'; {code} When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions. This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.

      Description

      Bloom filters are a common way to do select a limited set of records before moving data for a join or other heavy weight operation. Pig should add UDFs to support building and using bloom filters.

      1. PIG-bloom-3.patch
        32 kB
        Alan Gates
      2. PIG-bloom-2.patch
        32 kB
        Alan Gates
      3. PIG-bloom.patch
        26 kB
        Alan Gates

        Issue Links

          Activity

          Alan Gates created issue -
          Alan Gates made changes -
          Field Original Value New Value
          Attachment PIG-bloom.patch [ 12499740 ]
          Hide
          Alan Gates added a comment -

          This patch includes two new UDFs, BuildBloom and Bloom. BuildBloom will build a bloom filter. It is intended to be used in a group all script where all records passed to it will be added to the filter. The results are stored to the output of the group all Pig job. Bloom can then be used to apply that filter to inputs in subsequent queries, such as a join. See the tests in the patch for examples.

          Show
          Alan Gates added a comment - This patch includes two new UDFs, BuildBloom and Bloom. BuildBloom will build a bloom filter. It is intended to be used in a group all script where all records passed to it will be added to the filter. The results are stored to the output of the group all Pig job. Bloom can then be used to apply that filter to inputs in subsequent queries, such as a join. See the tests in the patch for examples.
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alan Gates made changes -
          Release Note Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join.

          {code}
          define bb BuildBloom('128', '3', 'jenkins');
          small = load 'S' as (x, y, z);
          grpd = group small all;
          fltrd = foreach grpd generate bb(small.x);
          store fltrd in 'mybloom';
          exec;
          define bloom Bloom('mybloom');
          large = load 'L' as (a, b, c);
          flarge = filter large by bloom(L.a);
          joined = join small by x, flarge by a;
          store joined into 'results';
          {code}

          When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions.

          This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Right on.

          Quick questions from reading the patch:

          Correct me if I am wrong, but this doesn't work if you use 2 different bloom filters in a single task.

          Why "contains" test for jenkins and murmur?

          In addition to defining a bloom filter by array size and number of functions, you can construct bloom filters by expected number of elements and desired accuracy (math's on wikipedia, or you can check my Bloom::Faster perl module ). That's probably more generally useful, since people don't necessarily know how to choose the right number of functions.

          Why does the Bloom function only work on files? It'd be pretty simple to do Bloom( (bytearray) mybloom.$0, item) where mybloom is the Bloom relation.

          Show
          Dmitriy V. Ryaboy added a comment - Right on. Quick questions from reading the patch: Correct me if I am wrong, but this doesn't work if you use 2 different bloom filters in a single task. Why "contains" test for jenkins and murmur? In addition to defining a bloom filter by array size and number of functions, you can construct bloom filters by expected number of elements and desired accuracy (math's on wikipedia, or you can check my Bloom::Faster perl module ). That's probably more generally useful, since people don't necessarily know how to choose the right number of functions. Why does the Bloom function only work on files? It'd be pretty simple to do Bloom( (bytearray) mybloom.$0, item) where mybloom is the Bloom relation.
          Hide
          Alan Gates added a comment -

          Correct me if I am wrong, but this doesn't work if you use 2 different bloom filters in a single task.

          Glad you caught that. I'd meant to fix it and forgot.

          Why "contains" test for jenkins and murmur?

          The Hadoop names for these are Hash.JENKINS_HASH and Hash.MURMUR_HASH. I assumed people might copy some or all of those strings from the Hadoop docs and use them, and I wanted it to work whether they used "jenkins" "jenkins_hash" or "Hash.JENKINS_HASH"

          On the definition by number of elements and desired accuracy I agree that would be nice. I may put that in a follow on patch though, we'll see if I can finish it in the next few days.

          Same on operating directly on a relation. I'll see if I can get it working soon, if not I may do it in a follow on patch.

          Show
          Alan Gates added a comment - Correct me if I am wrong, but this doesn't work if you use 2 different bloom filters in a single task. Glad you caught that. I'd meant to fix it and forgot. Why "contains" test for jenkins and murmur? The Hadoop names for these are Hash.JENKINS_HASH and Hash.MURMUR_HASH. I assumed people might copy some or all of those strings from the Hadoop docs and use them, and I wanted it to work whether they used "jenkins" "jenkins_hash" or "Hash.JENKINS_HASH" On the definition by number of elements and desired accuracy I agree that would be nice. I may put that in a follow on patch though, we'll see if I can finish it in the next few days. Same on operating directly on a relation. I'll see if I can get it working soon, if not I may do it in a follow on patch.
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Alan Gates added a comment -

          A new version of the patch that fixes the single bloom filter per query issue and makes it so the user can specify desired false positive rate and estimate number of elements instead of number of bits and hash functions.

          Show
          Alan Gates added a comment - A new version of the patch that fixes the single bloom filter per query issue and makes it so the user can specify desired false positive rate and estimate number of elements instead of number of bits and hash functions.
          Alan Gates made changes -
          Attachment PIG-bloom-2.patch [ 12500515 ]
          Alan Gates made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Dmitriy V. Ryaboy added a comment -

          This looks great. My only remaining issue is that you are getting around the multiple bloom filters issue by grabbing the filename (dirname, really) of the bloom filter we are loading, while it may be the same. For example, it's not unreasonable to load "mydataset/sellers/bloom" and "/mydataset/buyers/bloom" . Perhaps a simple replacement of "/" with "_" would be better?

          Show
          Dmitriy V. Ryaboy added a comment - This looks great. My only remaining issue is that you are getting around the multiple bloom filters issue by grabbing the filename (dirname, really) of the bloom filter we are loading, while it may be the same. For example, it's not unreasonable to load "mydataset/sellers/bloom" and "/mydataset/buyers/bloom" . Perhaps a simple replacement of "/" with "_" would be better?
          Hide
          Daniel Dai added a comment -

          Here are some comments:
          1. javadoc sample is wrong:
          define bb BuildBloom(100, 3, Hash.JENKINS_HASH); => define bb BuildBloom('jenkins', '100', '0.1');
          C = filter B by Bloom(mybloom, z); =>C = filter B by Bloom(z);
          2. It should be trivial to convert it into scalar, so that we get out of the business to figure out the symbol link name:

          define bb BuildBloom('jenkins', '10', '0.1');
          small = load 'S' as (x, y, z);
          grpd = group small all;
          fltrd = foreach grpd generate bb(small.x) as a0;
          
          large = load 'L' as (a, b, c);
          flarge = filter large by Bloom(fltrd.a0, a);
          joined = join small by x, flarge by a;
          store joined into 'results';
          

          Wanna me to upload a patch?

          Show
          Daniel Dai added a comment - Here are some comments: 1. javadoc sample is wrong: define bb BuildBloom(100, 3, Hash.JENKINS_HASH); => define bb BuildBloom('jenkins', '100', '0.1'); C = filter B by Bloom(mybloom, z); =>C = filter B by Bloom(z); 2. It should be trivial to convert it into scalar, so that we get out of the business to figure out the symbol link name: define bb BuildBloom('jenkins', '10', '0.1'); small = load 'S' as (x, y, z); grpd = group small all; fltrd = foreach grpd generate bb(small.x) as a0; large = load 'L' as (a, b, c); flarge = filter large by Bloom(fltrd.a0, a); joined = join small by x, flarge by a; store joined into 'results'; Wanna me to upload a patch?
          Alan Gates made changes -
          Attachment PIG-bloom-3.patch [ 12502081 ]
          Hide
          Alan Gates added a comment -

          Uploaded new patch that follows Dmitriy's suggestion of replacing / with _ for file name uniqueness and fixes the javadoc issues Daniel brought up.

          It should be trivial to convert it into scalar, so that we get out of the business to figure out the symbol link name:

          While I agree it would be great to have the scalar option, I want to keep the storing it to a file option as well. One use case I envision is people building a bloom filter once (say against a small lookup table) and using it repeatedly.

          I'm going to commit this patch as is. I'll file a separate JIRA to add using the scalar in Bloom. Daniel, if you want to submit a patch for that, go ahead. It will be a while before I get to it.

          Also, if no one objects I'd like to port this to 0.10. It doesn't touch any existing code so it shouldn't break anything that already works.

          Show
          Alan Gates added a comment - Uploaded new patch that follows Dmitriy's suggestion of replacing / with _ for file name uniqueness and fixes the javadoc issues Daniel brought up. It should be trivial to convert it into scalar, so that we get out of the business to figure out the symbol link name: While I agree it would be great to have the scalar option, I want to keep the storing it to a file option as well. One use case I envision is people building a bloom filter once (say against a small lookup table) and using it repeatedly. I'm going to commit this patch as is. I'll file a separate JIRA to add using the scalar in Bloom. Daniel, if you want to submit a patch for that, go ahead. It will be a while before I get to it. Also, if no one objects I'd like to port this to 0.10. It doesn't touch any existing code so it shouldn't break anything that already works.
          Hide
          Alan Gates added a comment -

          Checked into trunk. I'll hold the JIRA open until I get feedback on checking it into 0.10.

          Show
          Alan Gates added a comment - Checked into trunk. I'll hold the JIRA open until I get feedback on checking it into 0.10.
          Hide
          Daniel Dai added a comment -

          +1 for 0.10.

          Show
          Daniel Dai added a comment - +1 for 0.10.
          Alan Gates made changes -
          Link This issue relates to PIG-2348 [ PIG-2348 ]
          Hide
          Alan Gates added a comment -

          Created PIG-2348 to track follow on feature of allowing a relation in addition to a file.

          Show
          Alan Gates added a comment - Created PIG-2348 to track follow on feature of allowing a relation in addition to a file.
          Hide
          Dmitriy V. Ryaboy added a comment -

          +1

          Show
          Dmitriy V. Ryaboy added a comment - +1
          Hide
          Alan Gates added a comment -

          Checked into 0.10 branch as well.

          Show
          Alan Gates added a comment - Checked into 0.10 branch as well.
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.11 [ 12318878 ]
          Resolution Fixed [ 1 ]
          Daniel Dai made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Olga Natkovich added a comment -

          Looks like the change made it into 10 but what about documentation? I could not find it ib builtins but just want to make sure it was not put in some other place?

          Show
          Olga Natkovich added a comment - Looks like the change made it into 10 but what about documentation? I could not find it ib builtins but just want to make sure it was not put in some other place?
          Hide
          Daniel Dai added a comment -

          They are piggybank UDFs which we have javadoc but no forest doc.

          Show
          Daniel Dai added a comment - They are piggybank UDFs which we have javadoc but no forest doc.
          Hide
          Olga Natkovich added a comment -

          This one are in builtins at least according to the patch, so they need to be in docs. I will create a doc patch, I just was not sure if it was in a different place

          Show
          Olga Natkovich added a comment - This one are in builtins at least according to the patch, so they need to be in docs. I will create a doc patch, I just was not sure if it was in a different place
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Patch Available Patch Available Open Open
          4d 23h 29m 1 Alan Gates 24/Oct/11 19:59
          Open Open Patch Available Patch Available
          6m 12s 2 Alan Gates 24/Oct/11 20:01
          Patch Available Patch Available Resolved Resolved
          10d 2h 43m 1 Alan Gates 03/Nov/11 21:44
          Resolved Resolved Closed Closed
          174d 22h 47m 1 Daniel Dai 26/Apr/12 21:32

            People

            • Assignee:
              Alan Gates
              Reporter:
              Alan Gates
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development