[FLINK-9998] FlinkKafkaConsumer produces lag -Inf when the pipeline lags - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: Connectors / Kafka
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

I reported this in the list, but now I have enough information to understand what's going on.

Sometimes, the kafkaConsumer will report a lag (flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max) of -Inf.

The problem seems to related to the capture time.

If the pipeline (defines with EXACTLY_ONCE) starts lagging at some point, there won't be enough information in a certain period and the reported lag becomes -Inf.

Example: We had an external SQL Sink, pointing to a RDS source, but with a cluster outside AWS. This produced a flush time of about 2 minutes for 500 records (captured 'cause we added a metric around `upload.executeBatch()` inside JDBCOutputFormat); although absurd (which is another problem), during this time, the metric would report `-Inf` and return the a proper value once the stream finished.

So it seems the lag, instead of being a captured value and kept in memory, it's calculated from time to time instead of being kept in memory and updated from time to time (just because there wasn't any record processed in a certain period, it doesn't mean the lag went down).

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Julio Biason

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 30/Jul/18 16:50

Updated:: 06/Dec/21 10:39