Solr
  1. Solr
  2. SOLR-5201

UIMAUpdateRequestProcessor should reuse the AnalysisEngine

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, Trunk
    • Component/s: contrib - UIMA
    • Labels:
      None

      Description

      As reported in http://markmail.org/thread/2psiyl4ukaejl4fx UIMAUpdateRequestProcessor instantiates an AnalysisEngine for each request which is bad for performance therefore it'd be nice if such AEs could be reused whenever that's possible.

        Activity

        Hide
        Tommaso Teofili added a comment -

        ok, now I recall why the caching logic was put in the AEProvider. Basically an UpdateRequestProcessor is instantiated on each update request (it's not reused) and therefore caching it locally wouldn't help.

        Show
        Tommaso Teofili added a comment - ok, now I recall why the caching logic was put in the AEProvider. Basically an UpdateRequestProcessor is instantiated on each update request (it's not reused) and therefore caching it locally wouldn't help.
        Hide
        Hoss Man added a comment -

        Tommaso Teofili: I'm not sure what kind of state the AnalysisEngine maintains that might be reused/pollute subsequent requests, but there are two things you could do to "cache" an AnalysisEngine for various durations depending on what you're looking for...

        • you could create & cache the engine in the UIAMAUpdateRequestProcessor object and then it will be re-used for each document included in a single update request
        • you could create & cache the engine in the UIAMAUpdateRequestProcessorFactory, passing it to each UIAMAUpdateRequestProcessor it creates, and then it will be re-used for every document included in every request.
        Show
        Hoss Man added a comment - Tommaso Teofili : I'm not sure what kind of state the AnalysisEngine maintains that might be reused/pollute subsequent requests, but there are two things you could do to "cache" an AnalysisEngine for various durations depending on what you're looking for... you could create & cache the engine in the UIAMAUpdateRequestProcessor object and then it will be re-used for each document included in a single update request you could create & cache the engine in the UIAMAUpdateRequestProcessorFactory, passing it to each UIAMAUpdateRequestProcessor it creates, and then it will be re-used for every document included in every request.
        Hide
        Jun Ohtani added a comment -

        I try to write a patch that UIMAUpdateRequestProcessor reuse the AnalysisEngine only single update request
        Sorry, this patch is based on branch_4x and I write a simple test only.
        Maybe this patch that improved a little...

        Show
        Jun Ohtani added a comment - I try to write a patch that UIMAUpdateRequestProcessor reuse the AnalysisEngine only single update request Sorry, this patch is based on branch_4x and I write a simple test only. Maybe this patch that improved a little...
        Hide
        Jun Ohtani added a comment -

        I try to write second patch that AnalysisEngine is created in UIMAUpdateRequestProcessorFactory.getInstance() and is re-used every update request.
        This patch is based on branch_4x too.

        But this code is dirty…
        Please, check and refactor this patch.

        Show
        Jun Ohtani added a comment - I try to write second patch that AnalysisEngine is created in UIMAUpdateRequestProcessorFactory.getInstance() and is re-used every update request. This patch is based on branch_4x too. But this code is dirty… Please, check and refactor this patch.
        Hide
        Tommaso Teofili added a comment -

        Thanks Hoss Man for your hints and [~johtani] for your patches.

        The first option of reusing an AnalysisEngine in each UpdateRequestProcessor instance for reuse in "batch" update requests (first option/patch) is surely the easiest solution but the performance improvement depends on the no. of docs that are sent together in each update request.
        The second option sounds nice but I wonder if that would cause a problem with multiple configurations (2 update chains with 2 different configurations of UIMAUpdateRequestProcessorFactory), I'll do some testing on this scenario using John's patch so that we can decide which design it's better to support.

        Show
        Tommaso Teofili added a comment - Thanks Hoss Man for your hints and [~johtani] for your patches. The first option of reusing an AnalysisEngine in each UpdateRequestProcessor instance for reuse in "batch" update requests (first option/patch) is surely the easiest solution but the performance improvement depends on the no. of docs that are sent together in each update request. The second option sounds nice but I wonder if that would cause a problem with multiple configurations (2 update chains with 2 different configurations of UIMAUpdateRequestProcessorFactory ), I'll do some testing on this scenario using John's patch so that we can decide which design it's better to support.
        Hide
        Hoss Man added a comment -

        The second option sounds nice but I wonder if that would cause a problem with multiple configurations (2 update chains with 2 different configurations of UIMAUpdateRequestProcessorFactory),

        It depends on what kinds or problems you are worried about ... each UIMAUpdateRequestProcessorFactory instance (ie: one instance per chain) should each have it's own AnalysisEngine using it's own configuration ... unless the AnalysisEngine constructor/factory/provider does something special to keep track of them, they won't know anything about eachother.

        If you want them to know about eachother (ie: to share an AnalysisEngine between chains, or between chains in different SolrCores) then something a lot more special case would need to be done.

        Show
        Hoss Man added a comment - The second option sounds nice but I wonder if that would cause a problem with multiple configurations (2 update chains with 2 different configurations of UIMAUpdateRequestProcessorFactory), It depends on what kinds or problems you are worried about ... each UIMAUpdateRequestProcessorFactory instance (ie: one instance per chain) should each have it's own AnalysisEngine using it's own configuration ... unless the AnalysisEngine constructor/factory/provider does something special to keep track of them, they won't know anything about eachother. If you want them to know about eachother (ie: to share an AnalysisEngine between chains, or between chains in different SolrCores) then something a lot more special case would need to be done.
        Hide
        Tommaso Teofili added a comment -

        unless the AnalysisEngine constructor/factory/provider does something special to keep track of them, they won't know anything about eachother

        ok, so I think we can go with the second option of having the UIMAUpdateRequestProcessorFactory serve the AnalysisEngine to UIMAUpdateRequestProcessors.
        I'll post a patch later today.

        Show
        Tommaso Teofili added a comment - unless the AnalysisEngine constructor/factory/provider does something special to keep track of them, they won't know anything about eachother ok, so I think we can go with the second option of having the UIMAUpdateRequestProcessorFactory serve the AnalysisEngine to UIMAUpdateRequestProcessors . I'll post a patch later today.
        Hide
        Tommaso Teofili added a comment -

        here's a draft patch: https://github.com/tteofili/lucene-solr/compare/apache:trunk...solr-5201.patch

        AnalysisEngines are initialized inside UIMAUpdateRequestProcessorFactories together with a JCasPool to better handle multiple concurrent requests.
        My benchmarks (ran 'ant clean test -Dtests.multiplier=100' with and without the above patch) show execution of UIMAUpdateRequestProcessorTest#testMultiplierProcessing is ~10 times faster and less memory consumptive (~240MB saved over ~650MB heap)

        Show
        Tommaso Teofili added a comment - here's a draft patch: https://github.com/tteofili/lucene-solr/compare/apache:trunk...solr-5201.patch AnalysisEngines are initialized inside UIMAUpdateRequestProcessorFactories together with a JCasPool to better handle multiple concurrent requests. My benchmarks (ran 'ant clean test -Dtests.multiplier=100' with and without the above patch) show execution of UIMAUpdateRequestProcessorTest#testMultiplierProcessing is ~10 times faster and less memory consumptive (~240MB saved over ~650MB heap)
        Hide
        ASF subversion and git services added a comment -

        Commit 1520239 from Tommaso Teofili in branch 'dev/trunk'
        [ https://svn.apache.org/r1520239 ]

        SOLR-5201 - AnalysisEngines are now created in the factory and passed to the processors with a JCas pool

        Show
        ASF subversion and git services added a comment - Commit 1520239 from Tommaso Teofili in branch 'dev/trunk' [ https://svn.apache.org/r1520239 ] SOLR-5201 - AnalysisEngines are now created in the factory and passed to the processors with a JCas pool
        Hide
        Tommaso Teofili added a comment -

        I've committed the above patch to trunk in r1520239

        Show
        Tommaso Teofili added a comment - I've committed the above patch to trunk in r1520239
        Hide
        Jun Ohtani added a comment -

        Thanks Tommaso.

        Sorry, I misunderstood about the relationship between UIMAUpdateRequestProcessorFactory and AnalysisEngine.
        My co-woker use this patch, it work without problems.

        Do you commit the above patch to branch_4x?

        Show
        Jun Ohtani added a comment - Thanks Tommaso. Sorry, I misunderstood about the relationship between UIMAUpdateRequestProcessorFactory and AnalysisEngine. My co-woker use this patch, it work without problems. Do you commit the above patch to branch_4x?
        Hide
        Tommaso Teofili added a comment -

        ok good, thanks. I'll merge it to branch_4x too.

        Show
        Tommaso Teofili added a comment - ok good, thanks. I'll merge it to branch_4x too.
        Hide
        ASF subversion and git services added a comment -

        Commit 1520859 from Tommaso Teofili in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1520859 ]

        SOLR-5201 - patch backported to branch_4x

        Show
        ASF subversion and git services added a comment - Commit 1520859 from Tommaso Teofili in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1520859 ] SOLR-5201 - patch backported to branch_4x
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Tommaso Teofili
            Reporter:
            Tommaso Teofili
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development