Lucene - Core / LUCENE-7525

ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.2.1
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

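      The ASCIIFoldingFilter.foldToASCII method contains an enormous switch statement and is too large for the HotSpot JIT compiler to compile, which causes a performance problem.

      The method compiles to about 13 KB, versus HotSpot's 8 KB limit on the size of methods it will JIT-compile (larger methods are left to the interpreter unless -XX:-DontCompileHugeMethods is set). Splitting the method in half therefore works around the problem.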

      In my tests splitting the method in half resulted in a 5X performance increase.
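
      The split described above could look roughly like the sketch below. It is only an illustration of the technique, not the Lucene code or the attached patch: the class name SplitFoldSketch, the helper names foldLatin1 and foldOther, the split point at U+0100, and the handful of mappings are all assumptions chosen to keep the example short. The idea is that the hot loop and each switch live in separate methods, so each one stays well under the size limit.

```java
// Hypothetical sketch of splitting one huge switch into smaller methods.
// Names, the split point, and the tiny mapping tables are illustrative only.
final class SplitFoldSketch {

    // Hot loop: ASCII fast path plus dispatch. Small enough to be JIT-compiled.
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos) {
            final char c = input[pos];
            if (c < '\u0080') {
                output[outputPos++] = c; // ASCII passes through unchanged
            } else if (c < '\u0100') {
                outputPos = foldLatin1(c, output, outputPos);
            } else {
                outputPos = foldOther(c, output, outputPos);
            }
        }
        return outputPos; // number of chars written, counted from index 0
    }

    // First part of the original switch (here: U+0080..U+00FF).
    static int foldLatin1(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00E9': output[outputPos++] = 'e'; break;                            // e acute
            case '\u00FA': output[outputPos++] = 'u'; break;                            // u acute
            case '\u00DF': output[outputPos++] = 's'; output[outputPos++] = 's'; break; // sharp s
            default: output[outputPos++] = c; // unmapped: copy as-is
        }
        return outputPos;
    }

    // Second part of the original switch (here: U+0100 and above).
    static int foldOther(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u0153': output[outputPos++] = 'o'; output[outputPos++] = 'e'; break; // oe ligature
            case '\u0142': output[outputPos++] = 'l'; break;                            // l with stroke
            default: output[outputPos++] = c; // unmapped: copy as-is
        }
        return outputPos;
    }
}
```

      As with foldToASCII, the caller is assumed to size the output array at four times the input length so that multi-character expansions always fit.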

      In the test code below you can see how slow the fold method is, even when it is using the shortcut when the character is less than 0x80, compared to an inline implementation of the same shortcut.

      So a workaround is to split the method. I'm happy to provide a patch. It's a hack, of course. Perhaps using the MappingCharFilterFactory with an input file as per SOLR-2013 would be a better replacement for this method in this class?

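      import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
      import org.junit.Assert;
      import org.junit.Test;
      
      // Compares foldToASCII against an inlined copy of its ASCII shortcut.
      public class ASCIIFoldingFilterPerformanceTest {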
      
      	private static final int ITERATIONS = 1_000_000;
      
      	// ASCII-only input: every character takes the (c < 0x80) shortcut.
      	@Test
      	public void testFoldShortString() {
      		char[] input = "testing".toCharArray();
      		char[] output = new char[input.length * 4];
      
      		for (int i = 0; i < ITERATIONS; i++) {
      			ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
      		}
      	}
      
      	// Accented input: every character goes through the large switch.
      	@Test
      	public void testFoldShortAccentedString() {
      		char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
      		char[] output = new char[input.length * 4];
      
      		for (int i = 0; i < ITERATIONS; i++) {
      			ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
      		}
      	}
      
      	// Baseline: the same ASCII shortcut, inlined instead of calling foldToASCII.
      	@Test
      	public void testManualFoldTinyString() {
      		char[] input = "t".toCharArray();
      		char[] output = new char[input.length * 4];
      
      		for (int i = 0; i < ITERATIONS; i++) {
      			int k = 0;
      			for (int j = 0; j < input.length; ++j) {
      				final char c = input[j];
      				if (c < '\u0080') {
      					output[k++] = c;
      				} else {
      					Assert.fail("input is expected to be ASCII-only");
      				}
      			}
      		}
      	}
      }
      

      Attachments

        1. ASCIIFolding.java (88 kB, Karl von Randow)
        2. ASCIIFoldingFilter.java (109 kB, Karl von Randow)
        3. LUCENE-7525.patch (5 kB, Adrien Grand)
        4. LUCENE-7525.patch (5 kB, Adrien Grand)
        5. TestASCIIFolding.java (2 kB, Karl von Randow)

          People

            Assignee: Unassigned
            Reporter: Karl von Randow (karlvr)
