[FLINK-20827] Just read record correlating to join key in FilesystemLookUpFunc - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Connectors / FileSystem, Connectors / Hive
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

When a temporal table join is used against a Hive table, all of the table's records are loaded into the cache. Sometimes the Hive temporal table is larger than expected, and users cannot tell in advance how much memory it will occupy. In that situation errors occur, for example `GC overhead limit exceeded` or a TaskManager heartbeat timeout (caused by GC).

Maybe we can reduce the number of records read from the Hive table? If the upstream records are hash-partitioned solely by the join key, each subtask only needs to cache the records whose hashed join key equals the one fixed hash value assigned to that subtask. If that can be done, the whole table is divided among the parallel subtasks instead of being cached in full by every one of them. I don't know whether the upstream partitioning can be achieved under the existing framework, but the filtering itself is easy to support in `FileSystemLookupFunction`.
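To make the idea concrete, here is a minimal sketch in plain Java. It is not the actual `FileSystemLookupFunction`; the class `PartitionedLookupCache`, the method `targetSubtask`, and the modulo-based partitioner are all hypothetical names and assumptions for illustration. In a real job the filter would have to use exactly the same key-to-subtask mapping as the upstream exchange, otherwise lookups would miss rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a lookup cache that keeps only the rows owned by one subtask. */
class PartitionedLookupCache {
    private final int subtaskIndex;
    private final int parallelism;
    private final Map<Object, List<Object[]>> cache = new HashMap<>();

    PartitionedLookupCache(int subtaskIndex, int parallelism) {
        this.subtaskIndex = subtaskIndex;
        this.parallelism = parallelism;
    }

    /**
     * Assumed partitioner (plain hash modulo parallelism). It must match
     * whatever mapping the upstream uses to route records by join key.
     */
    static int targetSubtask(Object joinKey, int parallelism) {
        return Math.floorMod(joinKey.hashCode(), parallelism);
    }

    /** Reloads the cache, skipping rows that belong to other subtasks. Returns rows kept. */
    int load(Iterable<Object[]> rows, int keyIndex) {
        cache.clear();
        int kept = 0;
        for (Object[] row : rows) {
            Object key = row[keyIndex];
            // Null keys never match in an equi-join, so they are not cached.
            if (key != null && targetSubtask(key, parallelism) == subtaskIndex) {
                cache.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
                kept++;
            }
        }
        return kept;
    }

    /** Returns the cached rows for a join key, or an empty list if none are cached. */
    List<Object[]> lookup(Object joinKey) {
        return cache.getOrDefault(joinKey, Collections.emptyList());
    }
}
```

With this filter each subtask holds only its own share of the table, at the cost of every subtask still scanning all files once per reload.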

If not, we can add log messages that report the size of the cache, to help users set the TaskManager memory size or other TaskManager parameters.
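A rough sketch of what such a log message could look like follows. The class `CacheSizeReporter` and its byte estimate are hypothetical: the estimate counts only field payloads (2 bytes per `String` character, 8 bytes per other non-null field) and ignores JVM object overhead, so it is a crude lower bound rather than a real heap measurement. It uses `java.util.logging` only to stay self-contained.

```java
import java.util.List;
import java.util.Locale;
import java.util.logging.Logger;

/** Hypothetical sketch: report how many rows a lookup cache loaded and roughly how big they are. */
class CacheSizeReporter {
    private static final Logger LOG = Logger.getLogger(CacheSizeReporter.class.getName());

    /** Crude lower bound on payload size; ignores object headers and map overhead. */
    static long estimateBytes(List<Object[]> rows) {
        long bytes = 0;
        for (Object[] row : rows) {
            for (Object field : row) {
                if (field instanceof String) {
                    bytes += 2L * ((String) field).length();
                } else if (field != null) {
                    bytes += 8;
                }
            }
        }
        return bytes;
    }

    /** Logs the summary after a cache reload and returns the message. */
    static String report(List<Object[]> rows) {
        String msg = String.format(
                Locale.ROOT,
                "Loaded %d rows into lookup cache, at least %d bytes of field data",
                rows.size(),
                estimateBytes(rows));
        LOG.info(msg);
        return msg;
    }
}
```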

People

Assignee: Unassigned

Reporter: zoucao

Votes: 0

Watchers: 4

Dates

Created: 01/Jan/21 10:26

Updated: 07/Nov/21 10:38