Apache Ozone / HDDS-9821

XceiverServerRatis SyncTimeoutRetry is overridden


Details

    Description

In XceiverServerRatis#newRaftProperties, setSyncTimeoutRetry is set twice.

First, it is set to

(int) nodeFailureTimeoutMs / dataSyncTimeout.toIntExact(TimeUnit.MILLISECONDS)

which by default comes to 300_000 ms / 10_000 ms = 30 retries.
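The arithmetic behind the finite retry count can be sketched as below; the timeout values are the defaults cited above, and the variable names are simplified stand-ins for the TimeDuration-based fields in XceiverServerRatis, not the real code.

```java
public class SyncRetryCalc {
    public static void main(String[] args) {
        // Assumed defaults from the description:
        // node failure timeout = 5 minutes, data sync timeout = 10 seconds.
        long nodeFailureTimeoutMs = 300_000L;
        long dataSyncTimeoutMs = 10_000L;

        // One sync retry per data-sync-timeout interval until the node
        // failure timeout elapses.
        int retries = (int) (nodeFailureTimeoutMs / dataSyncTimeoutMs);
        System.out.println(retries); // 30
    }
}
```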

      From the comment, the intention of setting a finite number of retries is:

      Even if the leader is not able to complete write calls within the timeout seconds, it should just fail the operation and trigger pipeline close. failing the writeStateMachine call with limited retries will ensure even the leader initiates a pipeline close if its not able to complete write in the timeout configured.

However, it is then overridden by

      int numSyncRetries = conf.getInt(
          OzoneConfigKeys.DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES,
          OzoneConfigKeys.
              DFS_CONTAINER_RATIS_STATEMACHINEDATA_SYNC_RETRIES_DEFAULT);
      RaftServerConfigKeys.Log.StateMachineData.setSyncTimeoutRetry(properties,
          numSyncRetries); 

which resets it to the default value of -1 (retry indefinitely).

This might cause the leader to never initiate a pipeline close when its writeStateMachine calls time out (e.g. a write chunk timeout due to an I/O issue).

I propose we use the finite retry calculation above as the default configuration.
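One possible shape for the fix is sketched below: keep the computed finite value as the default and only honor the sync-retries key when the operator has explicitly set it. The method and parameter names are hypothetical simplifications, not the actual Ozone/Ratis configuration API.

```java
public class SyncRetryDefault {
    // Returns the retry count to apply: the explicitly configured value if
    // one was set (null means "not configured"), otherwise the finite value
    // derived from the two timeouts.
    static int resolveSyncTimeoutRetry(Integer configuredRetries,
                                       long nodeFailureTimeoutMs,
                                       long dataSyncTimeoutMs) {
        int computedDefault = (int) (nodeFailureTimeoutMs / dataSyncTimeoutMs);
        return configuredRetries != null ? configuredRetries : computedDefault;
    }

    public static void main(String[] args) {
        // No explicit config: fall back to the finite computed default.
        System.out.println(resolveSyncTimeoutRetry(null, 300_000L, 10_000L));
        // Operator explicitly configured -1: retry indefinitely.
        System.out.println(resolveSyncTimeoutRetry(-1, 300_000L, 10_000L));
    }
}
```

This preserves the pipeline-close behavior described in the code comment by default, while still letting an operator opt back into indefinite retries.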

       

      This is also a good avenue to re-evaluate the state machine data policy in Container State Machine.


People

  Assignee: Ivan Andika
  Reporter: Ivan Andika
