Jackrabbit Oak / OAK-5034

FileStoreUtil#readSegmentWithRetry max retry delay is too short to be functional


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Segment Tar 0.0.16
    • Fix Version/s: 1.5.13, 1.6.0
    • Component/s: segment-tar
    • Labels: None
    • Flags: Patch

    Description

      Commit 1765838 introduced the FileStoreUtil#readSegmentWithRetry utility and reduced the delay between two attempts from 2 s to 0.125 s, while the total number of attempts did not change.

      This shortens the overall retry window and does not give the server enough time to find references and segments, thus causing exceptions such as

      29.10.2016 05:07:37.242 *ERROR* [sling-default-2-Registered Service.605] org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSync Failed synchronizing state.
      java.lang.IllegalStateException: Unable to read references of segment 5168c878-3a3f-49d0-aea9-b8b57d5d867f from primary
              at org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSyncExecution.readReferences(StandbyClientSyncExecution.java:196)
              at org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSyncExecution.copySegmentHierarchyFromPrimary(StandbyClientSyncExecution.java:130)
              at org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSyncExecution.compareAgainstBaseState(StandbyClientSyncExecution.java:94)
              at org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSyncExecution.execute(StandbyClientSyncExecution.java:74)
              at org.apache.jackrabbit.oak.segment.standby.client.StandbyClientSync.run(StandbyClientSync.java:143)
              at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:118)
              at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      As a result, the client throws exceptions, which ultimately causes the integration tests to fail.

      IIUC, the minimum retry period should be longer than a TarMK flush cycle (5 s).
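
The arithmetic behind this can be sketched as follows. This is a minimal illustration, not the Oak code: the class RetrySketch, its method names, and the attempt count of 5 are hypothetical (the issue does not state the actual number of tries), and the helper assumes one sleep between consecutive attempts and none after the last.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class RetrySketch {

    /** Worst-case time spent sleeping: one delay after every failed attempt except the last. */
    public static long totalWaitMillis(int attempts, long delayMillis) {
        return Math.max(0, attempts - 1) * delayMillis;
    }

    /** Smallest number of attempts whose total wait is at least windowMillis. */
    public static int attemptsToCover(long windowMillis, long delayMillis) {
        long delays = (windowMillis + delayMillis - 1) / delayMillis; // ceiling division
        return (int) delays + 1;
    }

    /** Calls read until it returns a non-null value or the attempts run out. */
    public static <T> Optional<T> readWithRetry(Supplier<T> read, int attempts, long delayMillis) {
        for (int i = 0; i < attempts; i++) {
            T value = read.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status and give up early.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int attempts = 5;            // hypothetical; the real count is not given in the issue
        long flushCycleMillis = 5000; // TarMK flush cycle mentioned above

        System.out.println("old window (2 s delay):     " + totalWaitMillis(attempts, 2000) + " ms");
        System.out.println("new window (0.125 s delay): " + totalWaitMillis(attempts, 125) + " ms");
        System.out.println("attempts needed at 0.125 s: " + attemptsToCover(flushCycleMillis, 125));
    }
}
```

Under these assumptions, five attempts at 0.125 s wait at most 0.5 s in total, compared with 8 s at the old 2 s delay. Keeping the 0.125 s delay would take 41 attempts to span a 5 s flush cycle.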


          People

            Assignee: Francesco Mari (frm)
            Reporter: Timothee Maret (marett)
            Votes: 0
            Watchers: 4
