[FLINK-8918] Introduce Runtime Filter Join - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Abandoned
Affects Version/s: None
Fix Version/s: None
Component/s: Table SQL / API
Labels: None

Description

In general, stream join is one of the most expensive operations in terms of performance. For every record from either side, we need to query the state of the other side, which leads to poor performance when the state is huge. As a result, in production we often need to spend a lot of slots to handle stream joins. However, we can improve on this by exploiting a phenomenon that is commonly observed in production: the `joined ratio` of a stream join is often very low. For example:

Stream join in promotion analysis: the job needs to join the promotion log with the action (click, view, payment, collection, retweet) log on the `promotion_id` to analyze the effect of the promotion.
Stream join in AD (advertising) attribution: the job needs to join the AD click log with the item payment log on the `click_id` to find which click on which AD led to the payment, in order to do attribution.
Stream join in click log analysis of docs: the job needs to join the view log (docs viewed by users) with the click log (docs clicked by users) to analyze the reasons for the clicks and the properties of the users.
... and so on.

All these cases share one property: the joined ratio is very low. As an example, imagine that we have 10000 records from the left stream and 10000 records from the right stream, and we execute select * from leftStream l join rightStream r on l.id = r.id, but get only 100 records in the result. That is a case of low joined ratio. This example uses an inner join, but the same applies to left and right joins.

There are more examples of low joined ratio that I could come up with, but the most important point I want to make is that a low joined ratio for stream joins in production is a very common phenomenon (maybe the most common one in some companies; at least that is the case in our company).

Then how to improve it?

As the case above shows, joining 10000 records with 10000 records yields only 100 results. That means we query the state 20000 times (10000 for the left stream and 10000 for the right stream), but only 100 of those queries are meaningful! If we could reduce the number of useless queries, we could definitely improve the performance of stream join.

The way we improved this is to introduce the Runtime Filter Join. The main idea is to build a filter for the state on each side (left stream & right stream). When we need to query the state on one side, we first check the corresponding filter to see whether the key can possibly be in the state. If the filter says "no, it cannot be in the state", we skip the state query; if it says "hmm, it may be in the state", we query the state. As you can see, the best choice for the filter is a Bloom Filter, which has all the features we expect: extremely good performance and no false negatives.
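
To make the "definitely not" versus "maybe" behaviour concrete, here is a minimal Bloom filter sketch in Java. It is only an illustration of the general technique, not the filter proposed for Flink: the class name `SimpleBloomFilter`, its sizing parameters, and the double-hashing scheme are all assumptions made for this example.

```java
import java.util.BitSet;

/** Illustrative Bloom filter (hypothetical class, not a Flink API). */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records the key by setting one bit per hash probe. */
    void add(Object key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false: the key was definitely never added; true: the key may have been added. */
    boolean mightContain(Object key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th probe position from two base hashes.
    private int index(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        System.out.println(filter.mightContain("click-1")); // false: nothing added yet
        filter.add("click-1");
        System.out.println(filter.mightContain("click-1")); // true: added keys are never missed
    }
}
```

A key that was added always answers "maybe", so skipping the state query on a "no" can never lose a join result. A key that was never added can occasionally answer "maybe" as well (a false positive), which only costs one unnecessary state query.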

The simplest pseudo code for Runtime Filter Join (the inline comments are based on RocksDBBackend):

void performJoinNormally(Record recordFromLeftStream) {
    // Performs a `seek()` on RocksDB and then iterates one by one;
    // this is an expensive operation, especially when the key can't be found in RocksDB.
    Iterator<Record> rightIterator = rightStreamState.iterator();
    while (rightIterator.hasNext()) {
        Record recordFromRightState = rightIterator.next();
        .......
    }
}

void performRuntimeFilterJoin(Record recordFromLeftStream) {
    Iterator<Record> rightIterator = Collections.emptyIterator();
    // Perform the `seek()` only when the filter says the current key may be in the right state.
    if (rightStreamFilter.containsCurrentKey()) {
        rightIterator = rightStreamState.iterator();
    }
    while (rightIterator.hasNext()) {
        Record recordFromRightState = rightIterator.next();
        .......
    }

    // Add the current key to the filter of the left stream.
    leftStreamFilter.addCurrentKey();
}
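
The pseudo code above can be turned into a small, self-contained simulation of the 10000 x 10000 example. This is only a sketch under stated assumptions: the class `RuntimeFilterJoinSketch`, the in-memory `HashMap` standing in for RocksDB state, the two-probe bit-set filter, and the way keys are generated are all hypothetical and chosen for illustration. It counts how many state queries are issued with and without the filter.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of a runtime filter join; not Flink code. */
final class RuntimeFilterJoinSketch {
    static final int FILTER_BITS = 1 << 18; // assumed filter size

    /** One side of the join: keyed state plus a Bloom-style filter over its keys. */
    static final class Side {
        final Map<Integer, List<String>> state = new HashMap<>(); // stand-in for RocksDB state
        final BitSet filter = new BitSet(FILTER_BITS);
        long stateQueries = 0; // how often this side's state was queried

        boolean mightContain(int key) {
            return filter.get(probe(key, 0)) && filter.get(probe(key, 1));
        }

        void add(int key, String value) {
            state.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            filter.set(probe(key, 0));
            filter.set(probe(key, 1));
        }

        List<String> query(int key) {
            stateQueries++;
            return state.getOrDefault(key, Collections.<String>emptyList());
        }
    }

    // Two cheap hash probes per key (an assumption; any decent hash would do).
    static int probe(int key, int i) {
        int h = key * 0x9E3779B9 + i * 0x85EBCA6B;
        h ^= h >>> 16;
        return Math.floorMod(h, FILTER_BITS);
    }

    /** Handles one record arriving on `own`: probe `other`, then store the record in `own`. */
    static List<String> onRecord(Side own, Side other, int key, String value, boolean useFilter) {
        List<String> joined = new ArrayList<>();
        // Query the other side's state only if the filter cannot rule the key out.
        if (!useFilter || other.mightContain(key)) {
            for (String match : other.query(key)) {
                joined.add(value + "|" + match);
            }
        }
        own.add(key, value); // also adds the key to this side's filter
        return joined;
    }

    /** Feeds 10000 left and 10000 right records sharing 100 keys; returns {results, stateQueries}. */
    static long[] run(boolean useFilter) {
        Side left = new Side();
        Side right = new Side();
        long results = 0;
        for (int i = 0; i < 10_000; i++) {
            results += onRecord(left, right, i, "L" + i, useFilter).size();
            // Right keys are 9900..19899, so only 9900..9999 overlap with the left keys.
            results += onRecord(right, left, i + 9_900, "R" + i, useFilter).size();
        }
        return new long[] {results, left.stateQueries + right.stateQueries};
    }

    public static void main(String[] args) {
        long[] plain = run(false);
        long[] filtered = run(true);
        System.out.println("no filter:   results=" + plain[0] + " stateQueries=" + plain[1]);
        System.out.println("with filter: results=" + filtered[0] + " stateQueries=" + filtered[1]);
    }
}
```

Both runs produce the same 100 join results, because the filter never returns a false negative. Without the filter every one of the 20000 records triggers a state query; with the filter the count drops to the 100 real matches plus however many false positives the filter produces, which depends on the filter size and hash quality assumed here.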

A description of Runtime Filter Join for batch joins can be found here (although it was not originally written for stream joins, the idea carries over easily to `stream join`).

Issue Links

is blocked by

FLINK-8601 Introduce ElasticBloomFilter for Approximate calculation and other situations of performance optimization

Closed

People

Assignee: Sihua Zhou

Reporter: Sihua Zhou

Votes: 0

Watchers: 8

Dates

Created: 12/Mar/18 04:59

Updated: 30/Jan/19 13:44

Resolved: 30/Jan/19 10:39