Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
PIG-4925 added an option to pass a BloomFilter as a scalar to the bloom function. However, actually using it on big data that required a huge vector size turned out to be very inefficient and led to OOM.
I had initially calculated that a vector size of 100 million would need around a 12MB bytearray ((100000000 + 7) / 8 = 12500000 bytes), and since that would be the scalar value being broadcast, it would not take much space. The problem is that the 12MB is written out for every input record by BuildBloom$Initial before the aggregation happens and we arrive at the final BloomFilter vector, and with POPartialAgg it runs into OOM issues.
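As a reference, here is a minimal sketch of the sizing arithmetic above; the class and helper names are hypothetical, only the (vectorSize + 7) / 8 figure comes from the description:

{code:java}
public class BloomVectorSizing {
    /** Bytes needed to back a bit vector of the given size (bits rounded up to whole bytes). */
    static long vectorBytes(long vectorSizeBits) {
        return (vectorSizeBits + 7) / 8;
    }

    public static void main(String[] args) {
        long bits = 100_000_000L;          // example vector size from the description
        long bytes = vectorBytes(bits);    // (100000000 + 7) / 8 = 12500000 bytes, ~12MB
        System.out.println(bytes + " bytes per BloomFilter payload");
        // The issue: before partial aggregation collapses them, one such ~12MB
        // payload is emitted per input record by BuildBloom$Initial, so memory
        // use scales with the record count instead of staying a single 12MB scalar.
    }
}
{code}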
Adding a bloom join implementation that can be combined with hash or skewed join would boost performance for a lot of jobs. The bloom filter of the smaller tables can be sent to the bigger tables as a scalar, and the data filtered with it before the hash or skewed join runs, as in the sketch below.
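A minimal sketch of that pre-filtering idea, using Hadoop's org.apache.hadoop.util.bloom.BloomFilter; the table contents and the vectorSize/nbHash values are illustrative assumptions, not the eventual Pig implementation:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomJoinSketch {
    public static void main(String[] args) {
        // Build the bloom filter from the join keys of the smaller table.
        // vectorSize and nbHash are made up here; real values depend on the
        // expected key count and acceptable false-positive rate.
        BloomFilter filter = new BloomFilter(1_000_000, 3, Hash.MURMUR_HASH);
        List<String> smallTableKeys = List.of("k1", "k2", "k3");
        for (String k : smallTableKeys) {
            filter.add(new Key(k.getBytes(StandardCharsets.UTF_8)));
        }

        // Ship the filter to the bigger table's side (as a scalar) and drop rows
        // whose keys cannot possibly match before the real join runs.
        List<String> bigTableKeys = List.of("k2", "k9", "k3", "k7");
        List<String> survivors = new ArrayList<>();
        for (String k : bigTableKeys) {
            if (filter.membershipTest(new Key(k.getBytes(StandardCharsets.UTF_8)))) {
                survivors.add(k);   // only these rows proceed to the hash/skewed join
            }
        }
        System.out.println("Rows passed to the join: " + survivors);
    }
}
{code}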
Attachments
Issue Links
- is related to: PIG-5117 Implement Bloom join for Spark execution engine (Open)
- links to