[KYLIN-5339] Renew Epoch Retry did not interrupt the old thread in time, and the new thread failed to write data, resulting in kylin losing epoch - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0-alpha
Fix Version/s: 5.0-alpha
Component/s: None
Labels: None

Description

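Epoch renewal is retried twice, and each attempt has a 60s timeout. The renewal runs in a thread pool whose size is set by kylin.server.renew-epoch-pool-size=3. The problem is that when a renew thread times out after 60s, that thread is not terminated; instead, another renew thread is started and updates the same data. Because the first thread was never terminated, it eventually completes its renewal and increments the MVCC of the data by 1. When the later thread then renews, its update is conditional on the MVCC value it holds.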

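At this point no row satisfies the condition, so the update returns affectedRows = 0. As a result, the current node loses ownership of all projects. The flow is shown in the attached diagram.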
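The race described above can be reproduced in a minimal, in-memory sketch. This is not Kylin's code: the class name, the `conditionalUpdate` helper, the starting MVCC value of 5, and the 50 ms wait (a shortened stand-in for the 60s timeout) are all assumptions made for illustration. The ordering is also simplified so the result is deterministic. The first attempt commits before the retry runs, and both attempts carry the MVCC value that was read before the first attempt started.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

class EpochRenewRaceSketch {
    // Hypothetical in-memory stand-in for the epoch row's MVCC column.
    static final AtomicLong mvcc = new AtomicLong(5);

    // Mimics "UPDATE ... SET mvcc = mvcc + 1 WHERE mvcc = ?" and returns the affected row count.
    static int conditionalUpdate(long expectedMvcc) {
        return mvcc.compareAndSet(expectedMvcc, expectedMvcc + 1) ? 1 : 0;
    }

    // Returns {affectedRows of the first attempt, affectedRows of the retry}.
    static int[] simulate() {
        mvcc.set(5);
        final long staleMvcc = mvcc.get(); // both attempts carry the same version
        ExecutorService pool = Executors.newFixedThreadPool(3); // cf. renew-epoch-pool-size=3
        CountDownLatch slowDb = new CountDownLatch(1);
        try {
            Future<Integer> first = pool.submit(() -> {
                slowDb.await(); // simulated slow database
                return conditionalUpdate(staleMvcc);
            });
            try {
                first.get(50, TimeUnit.MILLISECONDS); // shortened stand-in for the 60s timeout
            } catch (TimeoutException e) {
                // The bug being illustrated: the first attempt is NOT cancelled here.
            }
            slowDb.countDown();          // the database recovers
            int firstRows = first.get(); // the first attempt commits; mvcc becomes 6
            int retryRows = pool.submit(() -> conditionalUpdate(staleMvcc)).get();
            return new int[] {firstRows, retryRows};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] rows = simulate();
        System.out.println("first attempt affectedRows = " + rows[0]);
        System.out.println("retry affectedRows = " + rows[1]);
    }
}
```

In this sketch the timed-out attempt still succeeds and the retry is the one that sees zero affected rows, which is the outcome the description reports.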

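Fix design
Epoch renewal has a retry-on-timeout mechanism (kylin.server.leader-race.heart-beat-timeout=60s). On retry, the original transaction was not stopped, and a new transaction was started to update the database. Because the epoch update validates the mvcc value, the second transaction conflicted with the first. To address this, a transaction timeout is added, set to kylin.server.leader-race.heart-beat-timeout minus 1s (59s with the 60s value above). A timed-out transaction is rolled back automatically, which avoids the transaction conflict when the renewal is retried.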
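The effect of the transaction timeout can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `transactionTimeout`, `EpochRow`, and `renew` are hypothetical names, the time each transaction takes is passed in as a value rather than measured, and this issue does not show how Kylin wires the timeout into its transaction manager.

```java
import java.time.Duration;

class EpochTxTimeoutSketch {
    // Hypothetical helper: the transaction timeout is the heartbeat timeout minus 1s.
    static Duration transactionTimeout(Duration heartBeatTimeout) {
        return heartBeatTimeout.minusSeconds(1);
    }

    // Hypothetical in-memory stand-in for the epoch row.
    static final class EpochRow {
        long mvcc;

        EpochRow(long mvcc) {
            this.mvcc = mvcc;
        }
    }

    // Simulates one renew transaction that needs `elapsed` time to reach its commit.
    // Returns the affected row count; 0 means rolled back or MVCC mismatch.
    static int renew(EpochRow row, long expectedMvcc, Duration elapsed, Duration txTimeout) {
        if (elapsed.compareTo(txTimeout) > 0) {
            return 0; // timeout expired before commit: rolled back, row untouched
        }
        if (row.mvcc != expectedMvcc) {
            return 0; // MVCC check failed: no row matches the condition
        }
        row.mvcc++;
        return 1;
    }

    public static void main(String[] args) {
        Duration txTimeout = transactionTimeout(Duration.ofSeconds(60)); // 59s
        EpochRow row = new EpochRow(5);
        // The first attempt hangs for 75s, so it is rolled back and cannot bump the MVCC later.
        int first = renew(row, 5, Duration.ofSeconds(75), txTimeout);
        // The retry completes in 2s and still sees the MVCC value it expects.
        int retry = renew(row, 5, Duration.ofSeconds(2), txTimeout);
        System.out.println("first=" + first + ", retry=" + retry + ", mvcc=" + row.mvcc);
    }
}
```

Because the transaction timeout is shorter than the retry timeout, the old transaction has already rolled back by the time the retry begins, so the retry's MVCC check is made against an unchanged row.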

Attachments

31c439f4-0a2b-4616-949d-415f4b417f2e.png, 06/Dec/22 06:44, 57 kB, sibing.zhang
602360ee-fa81-4c8c-a7d2-4fdd73d284ff.png, 06/Dec/22 06:47, 171 kB, sibing.zhang

People

Assignee: Unassigned

Reporter: sibing.zhang

Votes: 0

Watchers: 1

Dates

Created: 06/Dec/22 06:55

Updated: 29/Mar/23 07:15

Resolved: 29/Mar/23 07:15