[LUCENE-7525] ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.2.1
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

The ASCIIFoldingFilter.foldToASCII method contains an enormous switch statement, which makes it too large for the HotSpot JIT compiler to compile. The method is therefore only ever interpreted, which causes a performance problem.

The method compiles to about 13K of bytecode, versus HotSpot's limit of roughly 8KB per method (8000 bytes of bytecode by default), so splitting the method in half works around the problem.

In my tests, splitting the method in half resulted in a 5x performance increase.
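A minimal sketch of what such a split could look like, assuming the switch is divided by code point range. The class name SplitFoldSketch, the helper names, and the choice of U+0100 as the split point are hypothetical, and only a handful of mappings are shown; this is not the attached patch.

```java
// Illustrative sketch only: a tiny subset of mappings, split across two small methods
// so that each one stays well under the size at which HotSpot refuses to compile it.
final class SplitFoldSketch {

    private SplitFoldSketch() {}

    /** Small dispatcher: handles the ASCII shortcut inline and routes the rest by range. */
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; pos++) {
            final char c = input[pos];
            if (c < 0x80) {
                output[outputPos++] = c; // ASCII: copy unchanged
            } else if (c < 0x100) {
                outputPos = foldLatin1(c, output, outputPos);
            } else {
                outputPos = foldBeyondLatin1(c, output, outputPos);
            }
        }
        return outputPos;
    }

    /** First half of the (hypothetically split) switch: U+0080 to U+00FF. */
    private static int foldLatin1(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00E4': output[outputPos++] = 'a'; break; // a with diaeresis
            case '\u00E9': output[outputPos++] = 'e'; break; // e with acute
            case '\u00F8': output[outputPos++] = 'o'; break; // o with stroke
            case '\u00FA': output[outputPos++] = 'u'; break; // u with acute
            case '\u00FC': output[outputPos++] = 'u'; break; // u with diaeresis
            case '\u00DF':                                   // sharp s
                output[outputPos++] = 's';
                output[outputPos++] = 's';
                break;
            default: output[outputPos++] = c;                // no mapping in this sketch
        }
        return outputPos;
    }

    /** Second half of the (hypothetically split) switch: U+0100 and above. */
    private static int foldBeyondLatin1(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u0142': output[outputPos++] = 'l'; break; // l with stroke
            case '\u0152':                                   // ligature OE
                output[outputPos++] = 'O';
                output[outputPos++] = 'E';
                break;
            default: output[outputPos++] = c;                // no mapping in this sketch
        }
        return outputPos;
    }
}
```

The dispatcher keeps the same argument order as foldToASCII, so a caller would not need to change. As with the original, the output array must be large enough to hold the expanded result.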

In the test code below you can see how slow the fold method is, even when it only takes the shortcut for characters below 0x80, compared to an inline implementation of the same shortcut.

So a workaround is to split the method. I'm happy to provide a patch. It's a hack, of course. Perhaps using the MappingCharFilterFactory with an input file as per SOLR-2013 would be a better replacement for this method in this class?

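import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.junit.Assert;
import org.junit.Test;

public class ASCIIFoldingFilterPerformanceTest {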

	private static final int ITERATIONS = 1_000_000;

	@Test
	public void testFoldShortString() {
		char[] input = "testing".toCharArray();
		char[] output = new char[input.length * 4];

		for (int i = 0; i < ITERATIONS; i++) {
			ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
		}
	}

	@Test
	public void testFoldShortAccentedString() {
		char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
		char[] output = new char[input.length * 4];

		for (int i = 0; i < ITERATIONS; i++) {
			ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
		}
	}

	@Test
	public void testManualFoldTinyString() {
		char[] input = "t".toCharArray();
		char[] output = new char[input.length * 4];

		for (int i = 0; i < ITERATIONS; i++) {
			int k = 0;
			for (int j = 0; j < input.length; ++j) {
				final char c = input[j];
				if (c < '\u0080') {
					// Same ASCII shortcut that foldToASCII takes, inlined here
					output[k++] = c;
				} else {
					Assert.fail("unexpected non-ASCII character");
				}
			}
		}
	}
}
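The tests above rely on the test runner's reported time per test. A minimal sketch of a standalone timing harness with an explicit warm-up phase follows; the class name FoldTimingSketch and its methods are hypothetical and not part of the attachments. It times only the inline ASCII copy, and the lambda in main could be swapped for a call to ASCIIFoldingFilter.foldToASCII to compare the two.

```java
// Illustrative sketch only: times a task after a warm-up phase so the JIT has had a
// chance to compile whatever it is able to compile before measurement starts.
final class FoldTimingSketch {

    private FoldTimingSketch() {}

    /** Inline ASCII shortcut, equivalent to the manual loop in testManualFoldTinyString. */
    static int copyAscii(char[] input, char[] output) {
        int k = 0;
        for (int j = 0; j < input.length; j++) {
            final char c = input[j];
            if (c >= 0x80) {
                throw new IllegalArgumentException("non-ASCII character at index " + j);
            }
            output[k++] = c;
        }
        return k;
    }

    /** Runs the task warmupIterations times untimed, then returns elapsed nanoseconds for iterations runs. */
    static long timeNanos(Runnable task, int warmupIterations, int iterations) {
        for (int i = 0; i < warmupIterations; i++) {
            task.run();
        }
        final long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final char[] input = "testing".toCharArray();
        final char[] output = new char[input.length * 4];
        final long nanos = timeNanos(() -> copyAscii(input, output), 100_000, 1_000_000);
        System.out.println("inline ASCII copy: " + (nanos / 1_000_000) + " ms");
    }
}
```

A simple loop like this is only a rough indicator; a benchmark framework such as JMH would give more trustworthy numbers.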

Attachments

- LUCENE-7525.patch, 27/Jan/17 14:08, 5 kB, Adrien Grand
- LUCENE-7525.patch, 27/Jan/17 13:50, 5 kB, Adrien Grand
- ASCIIFoldingFilter.java, 28/Oct/16 00:25, 109 kB, Karl von Randow
- TestASCIIFolding.java, 28/Oct/16 00:15, 2 kB, Karl von Randow
- ASCIIFolding.java, 28/Oct/16 00:15, 88 kB, Karl von Randow

People

Assignee: Unassigned

Reporter: Karl von Randow

Votes: 1

Watchers: 8

Dates

Created: 27/Oct/16 05:13

Updated: 28/Aug/22 15:05