[FLINK-17800] RocksDB optimizeForPointLookup results in missing time windows - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.10.0, 1.10.1
Fix Version/s: 1.10.2, 1.11.0, 1.12.0
Component/s: Runtime / State Backends
Labels:
- pull-request-available

Release Note:

With FLINK-17800, `setTotalOrderSeek` is set to true by default on RocksDB's `ReadOptions`, to prevent users from misusing `optimizeForPointLookup`. `ReadOptions` can now also be customized through `RocksDBOptionsFactory`. Set `setTotalOrderSeek` back to false if you observe a performance regression (according to our testing, this normally does not happen).
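
For readers who need the opt-out mentioned in the release note, the following is a minimal sketch of a custom `RocksDBOptionsFactory`. It assumes a Flink version that includes this fix, with `flink-statebackend-rocksdb` on the classpath. The class name `RelaxedSeekOptionsFactory` is hypothetical, and how the factory is registered depends on your setup.

```scala
import java.util.{Collection => JCollection}

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory
import org.rocksdb.{ColumnFamilyOptions, DBOptions, ReadOptions}

// Hypothetical factory: keeps Flink's DB and column family options untouched
// and only switches total-order seek back off for reads.
class RelaxedSeekOptionsFactory extends RocksDBOptionsFactory {

  override def createDBOptions(
      currentOptions: DBOptions,
      handlesToClose: JCollection[AutoCloseable]): DBOptions =
    currentOptions

  override def createColumnOptions(
      currentOptions: ColumnFamilyOptions,
      handlesToClose: JCollection[AutoCloseable]): ColumnFamilyOptions =
    // Deliberately NOT calling optimizeForPointLookup here: per this issue,
    // combining it with total-order seek disabled is what lost windows.
    currentOptions

  override def createReadOptions(
      currentOptions: ReadOptions,
      handlesToClose: JCollection[AutoCloseable]): ReadOptions =
    // Opt out of the new default only if a regression was actually measured.
    currentOptions.setTotalOrderSeek(false)
}
```

Leaving `createReadOptions` out entirely keeps the new default (`setTotalOrderSeek(true)`), which is the safe choice when `optimizeForPointLookup` is in use.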

Description

My Setup:

We have been using the RocksDB option `optimizeForPointLookup` and running Flink 1.7 for years. After upgrading to Flink 1.10 we started seeing strange behavior: time windows go missing in a streaming Flink job. For testing purposes I also tried earlier Flink versions (1.8, 1.9, 1.9.3), and none of them showed the problem.

A sample of the code demonstrating the problem is here:

    val datastream = env
      .addSource(KafkaSource.keyedElements(config.kafkaElements, List(config.kafkaBootstrapServer)))

    val result = datastream
      .keyBy(_ => 1)
      .timeWindow(Time.milliseconds(1))
      .print()

The source consists of 3 streams (either 3 Kafka partitions or 3 Kafka topics), and the elements within each stream are increasing independently of the other streams. The elements carry event-time timestamps that start at 1 and increase by 1. The first partition would consist of timestamps 1, 2, 10, 15..., the second of 4, 5, 6, 11..., the third of 3, 7, 8, 9...
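
As a rough illustration of what the job above should produce, here is a small standalone sketch (no Flink dependency) that assigns the timestamps listed explicitly in this report to 1 ms tumbling windows. The `windowStart` arithmetic is assumed to mirror Flink's tumbling-window start computation with a zero offset. The object and method names are made up for this example.

```scala
object WindowAssignmentSketch {

  // Start of the tumbling window containing `timestamp` (offset assumed to be 0).
  def windowStart(timestamp: Long, windowSize: Long): Long =
    timestamp - (timestamp + windowSize) % windowSize

  // Only the timestamps spelled out in the report; the elided tails are not guessed.
  val partitions: Seq[Seq[Long]] = Seq(
    Seq(1L, 2L, 10L, 15L),
    Seq(4L, 5L, 6L, 11L),
    Seq(3L, 7L, 8L, 9L)
  )

  // Distinct window starts expected across all partitions, in ascending order.
  def expectedWindowStarts(windowSize: Long): Seq[Long] =
    partitions.flatten.map(windowStart(_, windowSize)).distinct.sorted

  def main(args: Array[String]): Unit =
    println(expectedWindowStarts(1L).mkString(", "))
}
```

With a window size of 1 ms every timestamp falls into its own window, so each listed timestamp should show up as exactly one window in the output.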

What I observe:

The time windows open as I expect for the first 127 timestamps. Then there is a huge gap with no opened windows; if the source has many elements, the next open window has a timestamp in the thousands. This leaves a gap of hundreds of elements that appear to be 'lost'. Those elements are not reported as late (tested with the .sideOutputLateData operator). We have been using the option by setting it inside the config like so:

etherbi.rocksDB.columnOptions.optimizeForPointLookup=268435456

We have been using it for performance reasons, as we have a huge RocksDB state backend.
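
To pin down where windows go missing when reproducing this, one option is to collect the start timestamps of the windows that actually fired and scan them for jumps. The sketch below is standalone Scala. `findGaps` is a hypothetical helper, and the sample data in `main` is illustrative only (the value 2000 is made up to stand in for "a timestamp in the thousands").

```scala
object WindowGapSketch {

  // Returns (lastSeen, nextSeen) pairs wherever consecutive window starts
  // are further apart than `step`.
  def findGaps(windowStarts: Seq[Long], step: Long = 1L): Seq[(Long, Long)] = {
    val sorted = windowStarts.distinct.sorted
    sorted.zip(sorted.drop(1)).filter { case (prev, next) => next - prev > step }
  }

  def main(args: Array[String]): Unit = {
    // Illustrative data: windows 1..127 fire, then nothing until a much later one.
    val observed = (1L to 127L) ++ Seq(2000L, 2001L)
    findGaps(observed).foreach { case (prev, next) =>
      println(s"gap after window $prev, next window at $next")
    }
  }
}
```

An empty result means every expected 1 ms window fired. Any pair returned marks the last window before a gap and the first window after it.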

Attachments

- MyMissingWindows.scala, 28/May/20 15:24, 3 kB, Yordan Pavlov
- MyMissingWindows.scala, 27/May/20 03:40, 2 kB, Yun Tang
- MissingWindows.scala, 19/May/20 10:15, 2 kB, Yordan Pavlov

Issue Links

causes

- FLINK-18338 RocksDB tests crash the JVM on CI (Closed)

is related to

- FLINK-23789 Remove unnecessary setTotalOrderForSeek for Rocks iterator (Resolved)
- FLINK-14482 Bump up rocksdb version (Closed)

links to

- GitHub Pull Request #12514
- GitHub Pull Request #12669
- GitHub Pull Request #12736

People

Assignee: Yun Tang

Reporter: Yordan Pavlov

Votes: 1

Watchers: 12

Dates

Created: 18/May/20 15:32

Updated: 16/Aug/21 06:27

Resolved: 26/Jun/20 16:48