JSPWiki
  1. JSPWiki
  2. JSPWIKI-498

Performance Issues with Lucene Index

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Incomplete
    • Affects Version/s: 2.8.4
    • Fix Version/s: FutureVersion
    • Component/s: Default template
    • Labels:
      None
    • Environment:

      Win 2003 on XEN-Server, 3 GB RAM, 4 x Intel Xeon 2Ghz

      Description

      Our Wiki-Luceneindex has swollen to 33 MB and if someone searches words like "find" he gets about 500 results. Only problem is, it takes about 30 seconds to complete the search.

      I have looked a bit into the problem and there is something strange to me. JSPWiki searches twice every query.
      Once in search.jsp and afterwards in AjaxSearch.jsp.

      The filteredList from search.jsp is written to the pagecontext but this information is never read within the AjaxSearch.jsp. Thus AjaxSearch.jsp makes a second search.

      1. patch_lucene_search.txt
        35 kB
        Kurt Stein
      2. screenshot-1.jpg
        53 kB
        Kurt Stein
      3. patch_lucene_search_2.txt
        15 kB
        Kurt Stein
      4. lucene_search.patch
        16 kB
        Kurt Stein

        Activity

        Hide
        Kurt Stein added a comment -

        Update to my Research on LuceneSearchProvider.

        I added some Stopwatches to see where the bottleneck is. It is the LuceneHighlighter!

        The Problem is the part where it is getting the fragments for all searchresults.

        This is unessesary because the page only displays the first 20 Searchresults(1-20).

        And if you go to the next page JSPWIKI start a new query but displays only results 21-40.

        In my case i had 500 results which took me 30 seconds, I dont want this to happen on a Wiki with more than 10000 pages as Searchresult.

        Suggestion:
        1.Transmit the "resultsdisplayed" parameter pagesize as HitLength from
        AjaxSearch.jsp :

        wiki:SetPagination start="$

        Unknown macro: {param.start}

        " total="<%=list.size()%>" pagesize="20")

        2.And then call the LuceneSearchProvider.findPages(query, flags, startFromHit, HitLength)

        3.Get only the fragments from startFromHi * till *startFromHi + HitLength

        What do you think?

        Show
        Kurt Stein added a comment - Update to my Research on LuceneSearchProvider. I added some Stopwatches to see where the bottleneck is. It is the LuceneHighlighter! The Problem is the part where it is getting the fragments for all searchresults. This is unessesary because the page only displays the first 20 Searchresults(1-20). And if you go to the next page JSPWIKI start a new query but displays only results 21-40. In my case i had 500 results which took me 30 seconds, I dont want this to happen on a Wiki with more than 10000 pages as Searchresult. Suggestion: 1.Transmit the "resultsdisplayed" parameter pagesize as HitLength from AjaxSearch.jsp : wiki:SetPagination start="$ Unknown macro: {param.start} " total="<%=list.size()%>" pagesize="20") 2.And then call the LuceneSearchProvider.findPages(query, flags, startFromHit, HitLength) 3.Get only the fragments from startFromHi * till *startFromHi + HitLength What do you think?
        Hide
        Janne Jalkanen added a comment -

        Sounds okay to me (except that the findPages(query,flags) needs to be retained for API compatibility, so you need to overload findPages().

        In addition, I don't like AjaxSearch.jsp. It should really be completely integrated with SearchManager. Our entire search is currently a hodgepodge of things, and someone should go and clean the whole thing up...

        Show
        Janne Jalkanen added a comment - Sounds okay to me (except that the findPages(query,flags) needs to be retained for API compatibility, so you need to overload findPages(). In addition, I don't like AjaxSearch.jsp. It should really be completely integrated with SearchManager. Our entire search is currently a hodgepodge of things, and someone should go and clean the whole thing up...
        Hide
        Kurt Stein added a comment -

        OK, I have finished my work.

        This one fixes the search dilemma for LuceneSearchProvider described above.

        Additionally it shows all pages for a query. But it shows secret content(user without pagepermission) only to authorized users.

        Reason: We often have users saying: This is not in the wiki. But it is in the wiki, they are just not authorized.(We still need a little bit of internalization for this piece of code).

        This way, they see there is a page an so they can ask the admin for access to this page.

        As soon as they changes are accomplished I will provide my search template. But i don´t want to mix to much code for you at one time.

        PS: There is a dependency in LuceneSearchProvider for PDBox because I am actually indexing my pdf files I think this is a good part to get pdf indexed so I commit this too.

        Show
        Kurt Stein added a comment - OK, I have finished my work. This one fixes the search dilemma for LuceneSearchProvider described above. Additionally it shows all pages for a query. But it shows secret content(user without pagepermission) only to authorized users. Reason: We often have users saying: This is not in the wiki. But it is in the wiki, they are just not authorized.(We still need a little bit of internalization for this piece of code). This way, they see there is a page an so they can ask the admin for access to this page. As soon as they changes are accomplished I will provide my search template. But i don´t want to mix to much code for you at one time. PS: There is a dependency in LuceneSearchProvider for PDBox because I am actually indexing my pdf files I think this is a good part to get pdf indexed so I commit this too.
        Hide
        Kurt Stein added a comment -

        Result

        Show
        Kurt Stein added a comment - Result
        Hide
        Janne Jalkanen added a comment -

        I see at least the following problems with this patch:

        • SearchMatcher.notAuthorized() is not localized.
        • The patch does not adhere to JSPWiki coding guidelines (e.g. the use of tabs instead of spaces, curly braces and so on)
        • I don't understand why SearchMatcher needs to be augmented with a new routine breaking binary and source compatibility (big no-no for patch releases)
        • WikiEngine.findPages(String,int,int) should be in SearchManager.
        • LuceneSearchProvider is relicensed under LGPL
        • There is a dependency to PDFBox (this should be a separate issue and patch)
        • Generics are replaced with non-generics (undoing the work we did for 2.8)
        • Some old bugs are reintroduced by failing to catch exceptions properly
        • There is some strange "owiki.tld" being included in AjaxSearch.jsp
        • The patch is way too big to be examined properly - there are lots of changes to AjaxSearch.jsp for example, most of look like reindentation.

        All in all, a strong -1 from me for this patch. Please clean it up and make it really short and sweet so that it can be properly reviewed.

        Show
        Janne Jalkanen added a comment - I see at least the following problems with this patch: SearchMatcher.notAuthorized() is not localized. The patch does not adhere to JSPWiki coding guidelines (e.g. the use of tabs instead of spaces, curly braces and so on) I don't understand why SearchMatcher needs to be augmented with a new routine breaking binary and source compatibility (big no-no for patch releases) WikiEngine.findPages(String,int,int) should be in SearchManager. LuceneSearchProvider is relicensed under LGPL There is a dependency to PDFBox (this should be a separate issue and patch) Generics are replaced with non-generics (undoing the work we did for 2.8) Some old bugs are reintroduced by failing to catch exceptions properly There is some strange "owiki.tld" being included in AjaxSearch.jsp The patch is way too big to be examined properly - there are lots of changes to AjaxSearch.jsp for example, most of look like reindentation. All in all, a strong -1 from me for this patch. Please clean it up and make it really short and sweet so that it can be properly reviewed.
        Hide
        Kurt Stein added a comment -

        Yoe totally destroyed me. But this is OK. Thanks for your comments.

        I have a few questions regarding to your points.

        *SearchMatcher needs to be augmented with a new routine breaking binary and source compatibility
        -->It is because The not authorized need to be implemnted for SearchMatcher too. Actually it is not part of the patch, so i will leave it out. For this patch.

        *WikiEngine.findPages(String,int,int) should be in SearchManager.
        -->dont understand this part. Where is the problem?

        Do you have a Java class where you use localization. I can only find it in JSP Pages.

        Show
        Kurt Stein added a comment - Yoe totally destroyed me. But this is OK. Thanks for your comments. I have a few questions regarding to your points. *SearchMatcher needs to be augmented with a new routine breaking binary and source compatibility -->It is because The not authorized need to be implemnted for SearchMatcher too. Actually it is not part of the patch, so i will leave it out. For this patch. *WikiEngine.findPages(String,int,int) should be in SearchManager. -->dont understand this part. Where is the problem? Do you have a Java class where you use localization. I can only find it in JSP Pages.
        Hide
        Kurt Stein added a comment -

        An other questions. There are a lot of brnaches. On codebase of which branch should I do the patch?

        Show
        Kurt Stein added a comment - An other questions. There are a lot of brnaches. On codebase of which branch should I do the patch?
        Hide
        Janne Jalkanen added a comment -

        Sorry :-/. Didn't mean to devastate you... But it's in everybody's interest if we get good patches!

        WikiEngine is too fat as it is, so the current trend is to enhance the Managers instead of WikiEngine with new methods. The old methods in WikiEngine are there for backwards compatibility, and new methods should be added to the relevant Manager class.

        All new major features should be based on the trunk. Minor features and updates should go to the JSPWIKI_2_8_BRANCH. It's okay to break binary compatibility in the trunk, but for 2.8 it's a no-no.

        For example, the addition of PDFBox should be against trunk, but speed optimizations should be against the 2.8 branch.

        In general, the smaller the patch, the better. A really big patch may have side-effects which cause massive problems down the line, so in general we don't like those (unless we're sure that you know what you're doing - but in that case you're probably already a committer).

        Show
        Janne Jalkanen added a comment - Sorry :-/. Didn't mean to devastate you... But it's in everybody's interest if we get good patches! WikiEngine is too fat as it is, so the current trend is to enhance the Managers instead of WikiEngine with new methods. The old methods in WikiEngine are there for backwards compatibility, and new methods should be added to the relevant Manager class. All new major features should be based on the trunk. Minor features and updates should go to the JSPWIKI_2_8_BRANCH. It's okay to break binary compatibility in the trunk, but for 2.8 it's a no-no. For example, the addition of PDFBox should be against trunk, but speed optimizations should be against the 2.8 branch. In general, the smaller the patch, the better. A really big patch may have side-effects which cause massive problems down the line, so in general we don't like those (unless we're sure that you know what you're doing - but in that case you're probably already a committer).
        Hide
        Kurt Stein added a comment -

        The problem is as long as I don't get the trunk started (Because of all the pending JCR stuff) to see what I am doing I prefer to work on the branches.

        So i will make my patch working on JSPWIKI_2_8_BRANCH.

        So you want the JSP-Page to call the Searchmanger and not the wikiengine to perform the search?

        In general I think it would be best to make the call from AjaxSearch.jsp and Search.jsp the same way to reduce duplicate code.

        Show
        Kurt Stein added a comment - The problem is as long as I don't get the trunk started (Because of all the pending JCR stuff) to see what I am doing I prefer to work on the branches. So i will make my patch working on JSPWIKI_2_8_BRANCH. So you want the JSP-Page to call the Searchmanger and not the wikiengine to perform the search? In general I think it would be best to make the call from AjaxSearch.jsp and Search.jsp the same way to reduce duplicate code.
        Hide
        Kurt Stein added a comment -

        This is a pure patch for this issue. I will open more issue to improve the Wiki-Search functionality.

        I hope I have merged all the necessary stuff.

        Show
        Kurt Stein added a comment - This is a pure patch for this issue. I will open more issue to improve the Wiki-Search functionality. I hope I have merged all the necessary stuff.
        Hide
        Harry Metske added a comment -

        Kurt,
        a bit more room for improvement still:

        • name your patch files properly, using the .patch suffix (makes handling easier with most editors, and gives automatic correct syntax highlighting)
        • some parameter names do not adhere to the standards
        • method docs are incomplete
        • please use the JSPWiki code formatting (http://www.jspwiki.org/wiki/JSPWikiCodingStandard).
        • an unnecessary import of StopWatch in Search.jsp
        • there is still a findPages(String,int,int) left in WikiEngine
        • is it possible to include one or more JUnit tests (to reduce the chance that we break your work in the future) ?

        Another suggestion: Have you tried using the Eclipse CheckStyle plugin ?

        thanks,
        Harry

        Show
        Harry Metske added a comment - Kurt, a bit more room for improvement still: name your patch files properly, using the .patch suffix (makes handling easier with most editors, and gives automatic correct syntax highlighting) some parameter names do not adhere to the standards method docs are incomplete please use the JSPWiki code formatting ( http://www.jspwiki.org/wiki/JSPWikiCodingStandard ). an unnecessary import of StopWatch in Search.jsp there is still a findPages(String,int,int) left in WikiEngine is it possible to include one or more JUnit tests (to reduce the chance that we break your work in the future) ? Another suggestion: Have you tried using the Eclipse CheckStyle plugin ? thanks, Harry
        Hide
        Janne Jalkanen added a comment -

        Also, this extends SearchProvider interface. We can't do that in a minor release - it needs to go either to 2.9 or 3.0 (and considering that there is no 2.9, I think this patch needs to be against 3.0, or be refactored not to change existing interfaces).

        Show
        Janne Jalkanen added a comment - Also, this extends SearchProvider interface. We can't do that in a minor release - it needs to go either to 2.9 or 3.0 (and considering that there is no 2.9, I think this patch needs to be against 3.0, or be refactored not to change existing interfaces).
        Hide
        Kurt Stein added a comment -

        Ok 2 Issues.
        First. This is a really heavy bottleneck for performance. If there is any way to avoid the Interface I will do so. But I don´t see the point. And I really want a fix for 2.8 Versions (Saying that for JSPWiki Community because my Wiki-Server has the fix and the search is running like a leopard)
        Thus I need to pass these 2 additional paramters/variables to the LuceneSearchprovider. This leads to the second Issue. I can avoid using the wikiengine find(String,int,int) method. So do you want me to call the searchmanager directly from ajaxsearch.jsp and search.jsp ?

        I will checkout your JPSwiki Styles but personally I dont like the linebreak in methodcalls (as I am working with a 22" Flatscreen -overview is not a issue)

        PS: The stopwatch in search.jsp import can be removed. I think it is some legacy from my tests for finding the bottleneck. Can you please specify which paramters don´t adhere the standards?

        Show
        Kurt Stein added a comment - Ok 2 Issues. First. This is a really heavy bottleneck for performance. If there is any way to avoid the Interface I will do so. But I don´t see the point. And I really want a fix for 2.8 Versions (Saying that for JSPWiki Community because my Wiki-Server has the fix and the search is running like a leopard) Thus I need to pass these 2 additional paramters/variables to the LuceneSearchprovider. This leads to the second Issue. I can avoid using the wikiengine find(String,int,int) method. So do you want me to call the searchmanager directly from ajaxsearch.jsp and search.jsp ? I will checkout your JPSwiki Styles but personally I dont like the linebreak in methodcalls (as I am working with a 22" Flatscreen -overview is not a issue) PS: The stopwatch in search.jsp import can be removed. I think it is some legacy from my tests for finding the bottleneck. Can you please specify which paramters don´t adhere the standards?
        Hide
        Janne Jalkanen added a comment -

        Yupyup, we do recognize this as a problem, but unfortunately there are some ground rules we have to follow, or else we will simply lose the ability to maintain the code. One of these are the code style instructions; the other one is maintaining the compatibility rules. For example, if someone has written their own SearchProvider, and we casually break the interface, they will have to fix it in order to take a new minor release into use. In many corporate environments, this may be difficult.

        One possibility is to create a PaginatingSearchProvider interface, which has the findPages(String,int,int) method, and have LuceneSearchProvider implement that in addition to the regular SearchProvider interface. SearchManager can then provide a findPages(String,int,int) method which checks whether the given SearchProvider implements this PaginatingSearchProvider, and functions accordingly.

        And yes, the JSP files should call SearchManager directly.

        Show
        Janne Jalkanen added a comment - Yupyup, we do recognize this as a problem, but unfortunately there are some ground rules we have to follow, or else we will simply lose the ability to maintain the code. One of these are the code style instructions; the other one is maintaining the compatibility rules. For example, if someone has written their own SearchProvider, and we casually break the interface, they will have to fix it in order to take a new minor release into use. In many corporate environments, this may be difficult. One possibility is to create a PaginatingSearchProvider interface, which has the findPages(String,int,int) method, and have LuceneSearchProvider implement that in addition to the regular SearchProvider interface. SearchManager can then provide a findPages(String,int,int) method which checks whether the given SearchProvider implements this PaginatingSearchProvider, and functions accordingly. And yes, the JSP files should call SearchManager directly.
        Hide
        Harry Metske added a comment -

        We also very welcome performance improvements to searching, no discussions about that.

        Just install the CheckStyle plugin, and it will tell you all code style violations. And again, just minor things, easy to fix.
        For example:

        • the casing of method parameters, for example : "int SearchItemAmount" should be "int searchItemAmount".
        • the following snippet:
               /**
               *  BasicSearchProvider needs to parse all pages, thus Start- and Endpoint are useless
               */
              public Collection<ArrayList<SearchResultImpl>> findPages( String query, int startShowFrom, int showAmount )
                                                                                                                         throws ProviderException,
                                                                                                                             IOException
          
          

          could be:

              /**
               *  {@inheritDoc}
               */
              public Collection<ArrayList<SearchResultImpl>> findPages( String query, int startShowFrom, int showAmount )
                                                                                                                         throws ProviderException,
                                                                                                                             IOException
          
        • the following snippet seems to have been copy/pasted, but the added parameters are not in the method doc:
              /**
               *  Sends a search to the current search provider. The query is is whatever native format
               *  the query engine wants to use.
               *
               * @param query The query.  Null is safe, and is interpreted as an empty query.
               * @return A collection of WikiPages that matched.
               * @throws ProviderException If the provider fails and a search cannot be completed.
               * @throws IOException If something else goes wrong.
               */
              public Collection<ArrayList<SearchResultImpl>> findPages( String query ,int startSearchFrom ,int searchItemAmount )
                  throws ProviderException, IOException
              {
          
        Show
        Harry Metske added a comment - We also very welcome performance improvements to searching, no discussions about that. Just install the CheckStyle plugin, and it will tell you all code style violations. And again, just minor things, easy to fix. For example: the casing of method parameters, for example : "int SearchItemAmount" should be "int searchItemAmount". the following snippet: /** * BasicSearchProvider needs to parse all pages, thus Start- and Endpoint are useless */ public Collection<ArrayList<SearchResultImpl>> findPages( String query, int startShowFrom, int showAmount ) throws ProviderException, IOException could be: /** * {@inheritDoc} */ public Collection<ArrayList<SearchResultImpl>> findPages( String query, int startShowFrom, int showAmount ) throws ProviderException, IOException the following snippet seems to have been copy/pasted, but the added parameters are not in the method doc: /** * Sends a search to the current search provider. The query is is whatever native format * the query engine wants to use. * * @param query The query. Null is safe, and is interpreted as an empty query. * @ return A collection of WikiPages that matched. * @ throws ProviderException If the provider fails and a search cannot be completed. * @ throws IOException If something else goes wrong. */ public Collection<ArrayList<SearchResultImpl>> findPages( String query , int startSearchFrom , int searchItemAmount ) throws ProviderException, IOException {
        Hide
        Harry Metske added a comment -

        And again, JUnit tests is also very important.
        JSPWiki currently has 1049 JUnit tests, and this really helps preventing us breaking things.
        This has been a serious investment and we profit from that every day. We want to keep relying on it.

        Show
        Harry Metske added a comment - And again, JUnit tests is also very important. JSPWiki currently has 1049 JUnit tests, and this really helps preventing us breaking things. This has been a serious investment and we profit from that every day. We want to keep relying on it.
        Hide
        Kurt Stein added a comment -

        Sure unitTest are important. I will add one if I get behind the magic of unitTests.

        I have imported the wiki codestyles into eclipse and run "format" on LuceneSearchProvider. This stuff has corrupted the complete class by adding a lot of whitespace everywhere(even code I havn´t touch).

        Seems to me that the LuceneSearchProvider is not conform with the JSPWikistyle...or I am using the style formatter in an unappropriate way.

        Show
        Kurt Stein added a comment - Sure unitTest are important. I will add one if I get behind the magic of unitTests. I have imported the wiki codestyles into eclipse and run "format" on LuceneSearchProvider. This stuff has corrupted the complete class by adding a lot of whitespace everywhere(even code I havn´t touch). Seems to me that the LuceneSearchProvider is not conform with the JSPWikistyle...or I am using the style formatter in an unappropriate way.
        Hide
        Harry Metske added a comment -

        That is correct, there are many more sources that are not according to JSPWiki code style (JSPWiki is a long existing project).
        So you should not format a complete source if you supply a patch, just select the piece of code you added, and then format it.
        If you want to format the whole source, do that in a separate patch.

        thanks,
        Harry

        Show
        Harry Metske added a comment - That is correct, there are many more sources that are not according to JSPWiki code style (JSPWiki is a long existing project). So you should not format a complete source if you supply a patch, just select the piece of code you added, and then format it. If you want to format the whole source, do that in a separate patch. thanks, Harry
        Hide
        Janne Jalkanen added a comment -

        Also, the eclipse codestyle is not necessarily completely up-to-date.

        Show
        Janne Jalkanen added a comment - Also, the eclipse codestyle is not necessarily completely up-to-date.
        Hide
        Kurt Stein added a comment -

        I was beatifying my code and I noticed, that Lucensearchprovider returns a Collection with ArrayList<SearchResult> and BasicSearchProvider a TreeSet<SearchResult>.

        I need to unify them for the SeachProvider Interface, so which one do you want to keep?
        BasicSearchprovider seems to need the sorting and LuceneSearchprovider gets the sorting from the LuceneIndex.

        I give you the patch and you can go the last mile...

        Show
        Kurt Stein added a comment - I was beatifying my code and I noticed, that Lucensearchprovider returns a Collection with ArrayList<SearchResult> and BasicSearchProvider a TreeSet<SearchResult>. I need to unify them for the SeachProvider Interface, so which one do you want to keep? BasicSearchprovider seems to need the sorting and LuceneSearchprovider gets the sorting from the LuceneIndex. I give you the patch and you can go the last mile...
        Hide
        Kurt Stein added a comment -

        TODO: Generics for BasicSearchProvider and LuceneSearchProvider need to be synchronized.

        Show
        Kurt Stein added a comment - TODO: Generics for BasicSearchProvider and LuceneSearchProvider need to be synchronized.
        Hide
        Janne Jalkanen added a comment -

        I don't understand - SearchProvider.findPages() returns a Collection. There is no need to synchronize anything; they're already correct.

        Show
        Janne Jalkanen added a comment - I don't understand - SearchProvider.findPages() returns a Collection. There is no need to synchronize anything; they're already correct.
        Hide
        Kurt Stein added a comment -

        Yes I know, but whats about the Generics you told me? All the stuff that is within the Collection? Collection<Arraylist<SearchResult>>

        Show
        Kurt Stein added a comment - Yes I know, but whats about the Generics you told me? All the stuff that is within the Collection? Collection<Arraylist<SearchResult>>
        Hide
        Janne Jalkanen added a comment -

        Nonono, not that one. You replaced some internal generics, not the API signatures . Those are still fine without generics (we're generifying them for 3.0).

        Show
        Janne Jalkanen added a comment - Nonono, not that one. You replaced some internal generics, not the API signatures . Those are still fine without generics (we're generifying them for 3.0).
        Hide
        Harry Metske added a comment -

        Kurt,

        it's a long time ago, but can we still expect patches ?
        If not, I'd like to close this issue.

        Show
        Harry Metske added a comment - Kurt, it's a long time ago, but can we still expect patches ? If not, I'd like to close this issue.
        Hide
        Harry Metske added a comment -

        no follow up unfortunately.
        Feel free to re-open is patches are there.

        Show
        Harry Metske added a comment - no follow up unfortunately. Feel free to re-open is patches are there.

          People

          • Assignee:
            Unassigned
            Reporter:
            Kurt Stein
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development