Lucene - Core
LUCENE-2098

Make BaseCharFilter more efficient

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      Performance degradation in Solr 1.4 was reported. See:

      http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4

      The inefficiency had already been pointed out in the BaseCharFilter javadoc by Mike:

      NOTE: This class is not particularly efficient. For example, a new class instance is created for every call to addOffCorrectMap(int, int), which is then appended to a private list.
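
      To make the warning concrete, the pattern it describes looks roughly like
      this (an illustrative sketch; the actual field and class names in
      BaseCharFilter may differ):

          // Sketch of the inefficient pattern: one object is allocated and
          // appended to a private list on every call.
          private final List<OffCorrectMap> pcmList = new ArrayList<OffCorrectMap>();

          protected void addOffCorrectMap(int off, int cumulativeDiff) {
            pcmList.add(new OffCorrectMap(off, cumulativeDiff)); // new instance per call
          }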

      Attachments

      1. LUCENE-2098.patch
        2 kB
        Robert Muir
      2. LUCENE-2098.patch
        4 kB
        Robert Muir
      3. LUCENE-2098.patch
        4 kB
        Koji Sekiguchi

        Activity

        Koji Sekiguchi created issue -
        Robert Muir added a comment -

        I haven't benchmarked to see if this is any faster; maybe it's even worse.

        But it's no longer a linear algorithm.

        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-2098.patch [ 12438866 ]
        Michael McCandless added a comment -

        Why did this cause Solr to slow down...? Did Solr previously have a more efficient impl and then cut over to Lucene's?

        Michael McCandless made changes -
        Affects Version/s 3.1 [ 12314025 ]
        Affects Version/s 2.9 [ 12312682 ]
        Michael McCandless added a comment -

        Patch looks like it should be a good net/net improvement: lookups of the offset correction should now be fast (though insertion cost is probably higher; we likely create 3 new objects (2 Integers, one TreeMap$Entry) per insert), but I expect that's a good tradeoff.

        Uwe Schindler added a comment -

        Why did this cause Solr to slow down...? Did Solr previously have a more efficient impl and then cut over to Lucene's?

        Solr used another Filter in 1.3.

        Michael McCandless added a comment -

        Ahh, OK.

        Probably we should switch to parallel arrays here, to make it very fast... yes, this will consume RAM (8 bytes per position, if we keep all of them).

        Really, most apps do not need all positions stored; i.e., typically they only need to see the current token. So maybe we could make a filter that takes a "lookbehind size" and have it keep only that number of mappings cached? That'd have to be > the max size of any token you may analyze, so it's hard to bound perfectly, but e.g. setting this to the max allowed token length in IndexWriter would guarantee that we'd never have a miss?

        For analyzers that buffer tokens... they'd have to set this max to infinity, or ensure they remap the offsets before capturing the token's state?

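        A minimal sketch of the parallel-arrays idea Mike describes above (the names offsets, diffs, and size are illustrative assumptions, not the committed code):

            // Two growable int arrays replace the per-call object allocation:
            // offsets[i] is the original offset where a correction starts, and
            // diffs[i] is the cumulative correction from that offset on.
            private int[] offsets = new int[16];
            private int[] diffs = new int[16];
            private int size = 0;

            protected void addOffCorrectMap(int off, int cumulativeDiff) {
              if (size == offsets.length) { // grow both arrays in lockstep
                offsets = java.util.Arrays.copyOf(offsets, size * 2);
                diffs = java.util.Arrays.copyOf(diffs, size * 2);
              }
              offsets[size] = off;
              diffs[size] = cumulativeDiff;
              size++;
            }

        At two ints per entry, this matches the 8-bytes-per-position RAM estimate above.
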
        Robert Muir added a comment -

        Mark did some quick tests, and this patch only seems to make things slower.

        Really, most apps do not need all positions stored; i.e., typically they only need to see the current token.

        I think this is why it got slower with my patch: in practice, it didn't matter that this thing did a 'backwards linear lookup', for this same reason?

        Michael McCandless added a comment -

        I think this is why it got slower with my patch: in practice, it didn't matter that this thing did a 'backwards linear lookup', for this same reason?

        Ahh yes, since presumably the test was simply looking up the offsets for the current token...

        Robert Muir added a comment -

        I think the best way to proceed would be to make it easy to benchmark
        CharFilters in contrib/benchmark, especially this HTML stripping one.

        Honestly, we don't even know for sure that the performance degradation reported
        in the original link is really due to BaseCharFilter yet, so I think we need
        to benchmark and profile.

        Robert Muir made changes -
        Component/s contrib/analyzers [ 12312333 ]
        Component/s Analysis [ 12310230 ]
        Uwe Schindler made changes -
        Affects Version/s 3.1 [ 12314822 ]
        Affects Version/s 4.0 [ 12314025 ]
        Robert Muir added a comment -

        OK, I think this one is fixed.

        I ran a loop with the example doc in the tests and tested both removing the object creation and switching to binary search; both help.

        I'd like to commit to trunk and 3.x tomorrow.

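        A sketch of the binary-search lookup over the parallel arrays from the earlier sketch (again illustrative; the committed method may differ):

            // Find the last recorded correction point <= currentOff and apply
            // its cumulative diff; before the first point, no correction applies.
            protected int correct(int currentOff) {
              if (size == 0 || currentOff < offsets[0]) {
                return currentOff;
              }
              int lo = 0, hi = size - 1;
              while (lo < hi) {
                int mid = (lo + hi + 1) >>> 1; // round up so the loop terminates
                if (offsets[mid] <= currentOff) {
                  lo = mid;
                } else {
                  hi = mid - 1;
                }
              }
              return currentOff + diffs[lo];
            }
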
        Robert Muir made changes -
        Attachment LUCENE-2098.patch [ 12453217 ]
        Robert Muir made changes -
        Assignee Robert Muir [ rcmuir ]
        Robert Muir made changes -
        Fix Version/s 3.1 [ 12314822 ]
        Fix Version/s 4.0 [ 12314025 ]
        Lucene Fields [New] [New, Patch Available]
        Robert Muir added a comment -

        Here are the files I tested: htmlStripCharFilterTest.html (from the test, 12 kB) and http://en.wikipedia.org/wiki/Benjamin_Franklin (360 kB).

        I ran each 3 times:

        file                          before                       after
        htmlStripCharFilterTest.html  9709ms, 9560ms, 9587ms       8755ms, 8697ms, 8708ms
        benFranklin.html              26877ms, 26963ms, 26495ms    17593ms, 17674ms, 17694ms

        Here is the code (crude, but I think it shows the point: the larger the file, the worse the offset correction got):

            // Crude micro-benchmark: strip HTML from the file 10,000 times and
            // consume every token, exercising BaseCharFilter's offset correction.
            Charset charset = Charset.forName("UTF-8");
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(""));
            long startMS = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {
              InputStream stream = HTMLStripCharFilterTest.class.getResourceAsStream("htmlStripReaderTest.html");
              HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new InputStreamReader(stream, charset)));
              tokenizer.reset(reader); // reuse the tokenizer with the new reader
              tokenizer.reset();       // reset the stream state before consuming
              while (tokenizer.incrementToken())
                ; // discard the tokens; we only care about elapsed time
            }
            System.out.println("time=" + (System.currentTimeMillis() - startMS));
        
        Koji Sekiguchi added a comment -

        Patch looks good, Robert!

        I added a check on currentOff before doing the binary search.

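        The added check presumably short-circuits the search when the offset falls outside the recorded range; a hedged guess at what such a guard could look like:

            // Fast paths before the binary search (the exact check in the
            // patch may differ):
            if (size == 0 || currentOff < offsets[0]) {
              return currentOff;                   // before the first correction point
            }
            if (currentOff >= offsets[size - 1]) {
              return currentOff + diffs[size - 1]; // at or past the last point
            }
            // otherwise fall through to the binary search shown earlier
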
        Koji Sekiguchi made changes -
        Attachment LUCENE-2098.patch [ 12453239 ]
        rmuir committed 990167 (52 files)
        Reviews: none

        LUCENE-2098: speed up BaseCharFilter

        Lucene branch_3x
        Robert Muir added a comment -

        Committed revisions 990161 (trunk) and 990167 (3.x).

        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Robert Muir added a comment -

        Reopening for potential backport.

        I think this one (n^2) is really buggish territory; the original user reports a > 2x slowdown with Solr 1.4.

        I think the fix is safe: it's been baked into trunk/3.x for a while, and if we have perf bugs like this with no API breaks,
        backporting makes sense... but if someone objects I won't backport.

        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Robert Muir made changes -
        Fix Version/s 2.9.4 [ 12315148 ]
        Fix Version/s 3.0.3 [ 12315147 ]
        Koji Sekiguchi added a comment -

        Robert, please backport.

        rmuir committed 1029077 (28 files)
        Reviews: none

        LUCENE-2098: speed up BaseCharFilter

        Lucene lucene_3_0
        rmuir committed 1029087 (30 files)
        Reviews: none

        LUCENE-2098: speed up BaseCharFilter

        Lucene lucene_2_9
        Robert Muir added a comment -

        Committed revision 1029077 to 3.0.x.
        Committed revision 1029087 to 2.9.x.

        I also cleared up the svn:eol-style problems on these branches;
        if you are merging on Windows you shouldn't see any property conflicts anymore.

        Robert Muir made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12483408 ] Default workflow, editable Closed status [ 12564072 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564072 ] jira [ 12585536 ]
        Shai Erera made changes -
        Component/s modules/analysis [ 12310230 ]
        Component/s contrib/analyzers [ 12312333 ]

          People

          • Assignee:
            Robert Muir
          • Reporter:
            Koji Sekiguchi
          • Votes:
            0
          • Watchers:
            3
