  1. Flink
  2. FLINK-36071

Using System.nanoTime to measure the elapsed time instead of System.currentTimeMillis


Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Runtime / Metrics
    • None

    Description

      A number of Flink metrics use System.currentTimeMillis[1] to measure elapsed time. I propose refactoring them to use System.nanoTime[2] instead.

      Why do we need to refactor it?

      Note: Higher precision is not the reason for this refactoring.

      Actually, System.currentTimeMillis() and System.nanoTime() have completely different semantics.

      System.currentTimeMillis() != System.nanoTime() / 1_000_000

      • System.currentTimeMillis() returns the current system (wall-clock) time of the server.
        • The time can be updated by NTP[3], or it can be adjusted manually.
        • Therefore, when we use System.currentTimeMillis to measure a duration, the end time may be less than the start time.
      • System.nanoTime() returns the time elapsed since a fixed but arbitrary origin, which is usually the moment the operating system was booted.
        • So System.nanoTime isn't system time, and it's not affected by changes to the system time.
        • System.nanoTime (inside a single process) is monotonically increasing and never goes backwards.
        • As the Javadoc[2] states: this method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
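
As a rough illustration of the points above, here is a minimal sketch of measuring a duration with System.nanoTime. The class and method names (ElapsedTimeSketch, measureNanos, hasElapsed) are hypothetical and not taken from Flink. It also follows the Javadoc[2] advice to compare two nanoTime readings by subtracting them rather than comparing them directly, since the raw values may overflow.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Flink.
final class ElapsedTimeSketch {

    /** Runs the task and returns its duration in nanoseconds, read from the monotonic clock. */
    static long measureNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        // The difference of two readings is meaningful; the absolute values are not.
        return System.nanoTime() - start;
    }

    /** Returns true once at least timeoutNanos have passed between the two readings. */
    static boolean hasElapsed(long startNanos, long nowNanos, long timeoutNanos) {
        // Subtract first, then compare: this stays correct even if the raw values wrap around.
        return nowNanos - startNanos >= timeoutNanos;
    }

    public static void main(String[] args) {
        long nanos = measureNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
        });
        // Convert to milliseconds only at the reporting boundary.
        System.out.println("elapsed ms: " + TimeUnit.NANOSECONDS.toMillis(nanos));
    }
}
```

Keeping the value in nanoseconds internally and converting only when the metric is reported avoids mixing the two time sources in one calculation.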

      This blog post[4] explains the difference between them in detail.

      Current use cases:

      Based on the previous section, System.nanoTime is the recommended way to measure a duration.

      Most tracing systems use it, and Flink already uses it to measure the duration for some metrics, such as:

      • all latency tracking of the state backends
      • SubtaskCheckpointCoordinatorImpl#takeSnapshotSync measures the checkpoint Sync Duration
      • etc

      In addition, Flink's Clock[5] already exposes absoluteTimeMillis, relativeTimeMillis and relativeTimeNanos, but I suspect most developers are not aware of the difference between them.

      • absoluteTimeMillis uses System.currentTimeMillis
      • relativeTimeMillis and relativeTimeNanos use System.nanoTime
      • It's better to call relativeTimeNanos or relativeTimeMillis instead of absoluteTimeMillis for all duration-related metrics
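
To make the difference between the absolute and relative methods concrete, the following sketch uses a manually driven clock. SketchClock and ManualSketchClock are hypothetical stand-ins that only borrow the three method names listed above; they are not Flink's Clock class. The demo simulates 200 ms of work during which the wall clock is stepped back by 500 ms, as could happen with an NTP or manual adjustment.

```java
// Hypothetical demo; class names are illustrative and not part of Flink.
final class ClockSketchDemo {
    public static void main(String[] args) {
        ManualSketchClock clock = new ManualSketchClock(1_700_000_000_000L);
        long startAbsolute = clock.absoluteTimeMillis();
        long startRelative = clock.relativeTimeMillis();

        clock.advance(200);          // 200 ms of real work
        clock.adjustWallClock(-500); // wall clock stepped back by 500 ms

        // prints "absolute-based duration: -300 ms"
        System.out.println("absolute-based duration: "
                + (clock.absoluteTimeMillis() - startAbsolute) + " ms");
        // prints "relative-based duration: 200 ms"
        System.out.println("relative-based duration: "
                + (clock.relativeTimeMillis() - startRelative) + " ms");
    }
}

/** Stand-in interface that mirrors the three method names discussed above. */
interface SketchClock {
    long absoluteTimeMillis();
    long relativeTimeMillis();
    long relativeTimeNanos();
}

/** Manually driven clock: wall-clock time and monotonic time can diverge. */
final class ManualSketchClock implements SketchClock {
    private long wallClockMillis;
    private long monotonicNanos;

    ManualSketchClock(long initialWallClockMillis) {
        this.wallClockMillis = initialWallClockMillis;
    }

    /** Real time passes: both the wall clock and the monotonic clock advance. */
    void advance(long millis) {
        wallClockMillis += millis;
        monotonicNanos += millis * 1_000_000L;
    }

    /** Simulates an NTP or manual adjustment: only the wall clock changes. */
    void adjustWallClock(long deltaMillis) {
        wallClockMillis += deltaMillis;
    }

    @Override
    public long absoluteTimeMillis() {
        return wallClockMillis;
    }

    @Override
    public long relativeTimeMillis() {
        return monotonicNanos / 1_000_000L;
    }

    @Override
    public long relativeTimeNanos() {
        return monotonicNanos;
    }
}
```

The duration derived from the absolute time comes out negative, while the one derived from the relative time matches the work actually done.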

      Proposed changes:

      This Jira proposes that Flink use System.nanoTime uniformly for duration calculation.

      Currently, many components still use System.currentTimeMillis to calculate durations, including:

      • TimerGauge
      • TaskIOMetricGroup
      • ThroughputCalculator
      • DeploymentStateTimeMetrics
      • A lot of methods in StreamTask
      • etc

      [1] https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#currentTimeMillis--

      [2] https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime--

      [3] https://en.wikipedia.org/wiki/Network_Time_Protocol

      [4] https://www.javaadvent.com/2019/12/measuring-time-from-java-to-kernel-and-back.html

      [5] https://github.com/apache/flink/blob/729b8b81a77ba6c32711216b88a1bf57ccddfadc/flink-core/src/main/java/org/apache/flink/util/clock/Clock.java#L40

       


          People

            fanrui Rui Fan
            fanrui Rui Fan