[LUCENE-2798] Randomize indexed collation key testing - ASF JIRA

Details

Type: Test
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1, 4.0-ALPHA
Fix Version/s: 4.9, 6.0
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

Robert Muir noted on the #lucene IRC channel today that Lucene's indexed collation key testing is currently fragile (for example, the tests had to be revisited when Robert upgraded the ICU dependency in LUCENE-2797, because of Unicode 6.0 collation changes) and that coverage is trivial (only 5 locales are tested, and no collator options are exercised). This affects both the JDK implementation in modules/analysis/common/ and the ICU implementation under modules/icu/.

The key thing to test is that the order of the indexed terms is the same as that provided by the Collator itself. Instead of the current set of static tests, this could be achieved by indexing the collation keys of randomly generated terms (with randomly chosen collator options) and then comparing the order of the indexed terms to the order the Collator produces over the original terms.

Since different terms may produce the same collation key, however, the order of indexed terms is inherently unstable. When performing runtime collation, the Collator addresses the sort stability issue by adding a secondary sort over the normalized original terms. In order to directly compare Collator's sort with Lucene's collation key sort, a secondary sort will need to be applied to Lucene's indexed terms as well. Robert has suggested indexing the original terms in addition to their collation keys, then using a Sort over the original terms as the secondary sort.
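The comparison described above can be sketched without an index, using only the JDK Collator. This is a minimal illustration and not the attached patch: the class name, the small alphabet, and the helper names are hypothetical, and sorting the raw collation key bytes stands in for the order of the indexed terms. Both sides use String.compareTo as the tiebreak when two terms collate as equal.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical sketch: names and alphabet are assumptions, not taken from the patch.
class CollationKeyOrderSketch {
    // Small alphabet so that ties (terms with equal collation keys) occur often.
    static final String ALPHABET = "aAbBcCeE\u00e9\u00c9";

    // Unsigned, most-significant-byte-first comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    static String randomTerm(Random random) {
        int length = 1 + random.nextInt(5);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Returns true if sorting by collation key bytes matches sorting by Collator.compare.
    static boolean keyOrderMatchesCollator(long seed, int numTerms) {
        Random random = new Random(seed);
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        collator.setStrength(strengths[random.nextInt(strengths.length)]); // random option

        List<String> terms = new ArrayList<>();
        for (int i = 0; i < numTerms; i++) terms.add(randomTerm(random));

        // Expected order: Collator first, original term as the secondary sort.
        List<String> expected = new ArrayList<>(terms);
        expected.sort((a, b) -> {
            int cmp = collator.compare(a, b);
            return cmp != 0 ? cmp : a.compareTo(b);
        });

        // Stand-in for the index: key bytes first, original term as the secondary sort.
        List<String> actual = new ArrayList<>(terms);
        actual.sort((a, b) -> {
            int cmp = compareUnsigned(collator.getCollationKey(a).toByteArray(),
                                      collator.getCollationKey(b).toByteArray());
            return cmp != 0 ? cmp : a.compareTo(b);
        });
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        System.out.println(keyOrderMatchesCollator(System.nanoTime(), 1000));
    }
}
```

A real test would also randomize the locale and decomposition mode, and would read the terms back from an index sorted on the collated field followed by the original term.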

Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and trunk uses UTF-8 order, so the implemented secondary sort will need to respect that.

From #lucene:

rmuir__: so i think we have to on 3.x, sort the 'expected list' with Collator.compare, if thats equal, then as a tiebreak use String.compareTo
rmuir__: and in the index sort on the collated field, followed by the original term
rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the tiebreak for the expected list
rmuir__: instead compare codepoints (iterating character.codepointAt, or comparing .getBytes("UTF-8"))
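
The two tiebreaks discussed above can be sketched as a pair of comparators for the expected list. This is only an illustration under assumptions: the class and method names are hypothetical, and the code point comparison is one of the two options mentioned in the chat (it yields the same order as comparing UTF-8 bytes for well-formed strings).

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical sketch: names are assumptions, not taken from the patch.
class TiebreakOrderSketch {
    // Compares two strings by Unicode code point rather than by UTF-16 code unit.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) return Integer.compare(cpA, cpB);
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // 3.x style: Collator order, then String.compareTo (UTF-16 code unit order).
    static Comparator<String> expected3x(Collator collator) {
        return (a, b) -> {
            int cmp = collator.compare(a, b);
            return cmp != 0 ? cmp : a.compareTo(b);
        };
    }

    // 4.x style: Collator order, then code point order.
    static Comparator<String> expected4x(Collator collator) {
        return (a, b) -> {
            int cmp = collator.compare(a, b);
            return cmp != 0 ? cmp : compareCodePoints(a, b);
        };
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";              // U+FF5E, a single UTF-16 code unit
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair
        // The two tiebreaks disagree on this pair, which is why the choice matters.
        System.out.println(Integer.signum(bmp.compareTo(supplementary)));
        System.out.println(Integer.signum(compareCodePoints(bmp, supplementary)));

        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        System.out.println(Integer.signum(expected3x(collator).compare("A", "a")));
    }
}
```

The tiebreaks only diverge when a supplementary character is compared against a BMP character above the surrogate range (U+E000 to U+FFFF), so randomly generated terms would need to include both to exercise the difference.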

Attachments

- LUCENE-2798.patch, 11/Apr/11 07:25, 13 kB, Steven Rowe
- LUCENE-2798.patch, 11/Apr/11 15:19, 17 kB, Steven Rowe

Issue Links

relates to: LUCENE-2797 upgrade icu to 4.6 (Closed)

People

Assignee: Steven Rowe

Reporter: Steven Rowe

Votes: 0

Watchers: 0

Dates

Created: 04/Dec/10 16:57

Updated: 28/Aug/22 12:37