Jackrabbit Content Repository
  1. Jackrabbit Content Repository
  2. JCR-3146

Text extraction may congest thread pool in the repository

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.12, 2.3.4
    • Component/s: jackrabbit-core
    • Labels:
      None

      Description

      Text extraction congests the thread pool in the repository when e.g. many PDFs are loaded into the workspace. Tasks submitted by the index merger are delayed because of that and will result in many index segment folders.

      1. JCR-3146.patch
        14 kB
        Alex Parvulescu

        Issue Links

          Activity

          Hide
          Alex Parvulescu added a comment -

          The solution is to define another queue for the tasks considered as low priority, so that they don't fill the execution queue.
          Then, depending on the executor's load poll this queue for additional work items.

          The secondary queue will only be used as needed, and the load is configurable via the system property
          "org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks"
          This property is meant to be used as a percent. 0 means disabled / the default is 75.

          There are some timing issues with the indexing tests on account of this new async text extraction. I've tried to fix all of them, but there may be more.

          I haven't touched yet on the tika extraction that happens in a different process. I think that will need some minor refactoring as well.

          Attaching proposed patch.

          Show
          Alex Parvulescu added a comment - The solution is to define another queue for the tasks considered as low priority, so that they don't fill the execution queue. Then, depending on the executor's load poll this queue for additional work items. The secondary queue will only be used as needed, and the load is configurable via the system property "org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks" This property is meant to be used as a percent. 0 means disabled / the default is 75. There are some timing issues with the indexing tests on account of this new async text extraction. I've tried to fix all of them, but there may be more. I haven't touched yet on the tika extraction that happens in a different process. I think that will need some minor refactoring as well. Attaching proposed patch.
          Hide
          Marcel Reutegger added a comment -

          In general looks good to me.

          I'm not sure the communication between the threads in JackrabbitThreadPool
          is 100% correct. E.g. the first statement in RetryLowPriorityTask.run()
          checks if the queue is empty. To me it seems like this should never
          happen, right?

          Style:
          Should we rather keep the JackrabbitThreadPool package private and
          only expose the marker as public interface? How about renaming
          the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from
          Runnable? That way a client wouldn't have to implement Runnable
          and the marker interface.

          Minor:
          The method waitForTextExtractionTasksToFinish() already does an
          index flush at the end. Aren't the additional index flush calls
          in IndexingQueueTest now obsolete?

          Show
          Marcel Reutegger added a comment - In general looks good to me. I'm not sure the communication between the threads in JackrabbitThreadPool is 100% correct. E.g. the first statement in RetryLowPriorityTask.run() checks if the queue is empty. To me it seems like this should never happen, right? Style: Should we rather keep the JackrabbitThreadPool package private and only expose the marker as public interface? How about renaming the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from Runnable? That way a client wouldn't have to implement Runnable and the marker interface. Minor: The method waitForTextExtractionTasksToFinish() already does an index flush at the end. Aren't the additional index flush calls in IndexingQueueTest now obsolete?
          Hide
          Alex Parvulescu added a comment -

          thanks for taking the time to review the patch

          > To me it seems like this should never happen, right?
          yes, that is just premature optimization. I'll remove it.

          > Should we rather keep the JackrabbitThreadPool package private and only expose the marker as public interface?
          you mean like moving the interface to the util package?

          > How about renaming the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from Runnable?
          that is a good idea

          > Aren't the additional index flush calls in IndexingQueueTest now obsolete?
          true. I had some issues with timing which are now hopefully fixed, so yes we can remove the extra flush

          Show
          Alex Parvulescu added a comment - thanks for taking the time to review the patch > To me it seems like this should never happen, right? yes, that is just premature optimization. I'll remove it. > Should we rather keep the JackrabbitThreadPool package private and only expose the marker as public interface? you mean like moving the interface to the util package? > How about renaming the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from Runnable? that is a good idea > Aren't the additional index flush calls in IndexingQueueTest now obsolete? true. I had some issues with timing which are now hopefully fixed, so yes we can remove the extra flush
          Hide
          Marcel Reutegger added a comment -

          > you mean like moving the interface to the util package?

          no, I would still keep it in the same package as the JackrabbitThreadPool

          Show
          Marcel Reutegger added a comment - > you mean like moving the interface to the util package? no, I would still keep it in the same package as the JackrabbitThreadPool
          Hide
          Alex Parvulescu added a comment -

          fixed in rev:1202192

          Show
          Alex Parvulescu added a comment - fixed in rev:1202192
          Hide
          Jukka Zitting added a comment -

          Merged to the 2.2 branch in revision 1242468.

          Show
          Jukka Zitting added a comment - Merged to the 2.2 branch in revision 1242468.

            People

            • Assignee:
              Alex Parvulescu
              Reporter:
              Alex Parvulescu
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development