[PIG-2328] Add builtin UDFs for building and using bloom filters - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10.0, 0.11
Component/s: internal-udfs
Labels:
None

Release Note:

Hide
Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join.

{code}
define bb BuildBloom('128', '3', 'jenkins');
small = load 'S' as (x, y, z);
grpd = group small all;
fltrd = foreach grpd generate bb(small.x);
store fltrd in 'mybloom';
exec;
define bloom Bloom('mybloom');
large = load 'L' as (a, b, c);
flarge = filter large by bloom(L.a);
joined = join small by x, flarge by a;
store joined into 'results';
{code}

When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions.

This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.

Show
Bloom filters are a common way to select a limited set of records before moving data for a join or other heavy weight operation. For example, if one wanted to join a very large data set L with a smaller set S, and it was known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding the join. {code} define bb BuildBloom('128', '3', 'jenkins'); small = load 'S' as (x, y, z); grpd = group small all; fltrd = foreach grpd generate bb(small.x); store fltrd in 'mybloom'; exec; define bloom Bloom('mybloom'); large = load 'L' as (a, b, c); flarge = filter large by bloom(L.a); joined = join small by x, flarge by a; store joined into 'results'; {code} When constructing BuildBloom, the three arguments passed are the number of bits in the bloom filter, the number of hash functions used in constructing the bloom filter, and the type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'. See http://en.wikipedia.org/wiki/Bloom_filter for a discussion of how to select the number of bits and the number of hash functions. This uses Hadoop's bloom filters (org.apache.hadoop.util.bloom.BloomFilter) internally.

Description

Bloom filters are a common way to do select a limited set of records before moving data for a join or other heavy weight operation. Pig should add UDFs to support building and using bloom filters.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

PIG-bloom-3.patch
03/Nov/11 00:57
32 kB
Alan Gates
PIG-bloom-2.patch
24/Oct/11 19:01
32 kB
Alan Gates
PIG-bloom.patch
19/Oct/11 19:27
26 kB
Alan Gates

Issue Links

relates to

PIG-2348 Bloom should be able to take a relation or a file

Resolved

PIG-4713 Document Bloom UDF

Closed

Activity

People

Assignee:: Alan Gates

Reporter:: Alan Gates

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 19/Oct/11 19:25

Updated:: 30/Oct/15 23:48

Resolved:: 03/Nov/11 21:44