[LUCENE-8419] Return token unchanged for pathological Stempel tokens - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels:
- stemmer
- stemming

Lucene Fields: New

Description

In the aggregate, Stempel does a good job, but certain tokens get stemmed pathologically, conflating completely unrelated words in the search index. Depending on the scoring function, documents returned may have no form of the word that was in the query, only unrelated forms (see ć examples below).

It's probably not possible to fix the stemmer, and it's probably not possible to catch every error. Still, catching and ignoring certain large classes of errors would greatly improve precision. Doing this inside the stemmer would also avoid the recall losses that come from cleaning up these errors outside the stemmer.

An obvious example is that numbers ending in 1 have their last two digits replaced with ć, so 12341 is stemmed as 123ć. Numbers ending in 31 have their last four digits replaced with ć, so 12331 is stemmed as 1ć. Tokens mixing letters and digits are treated the same way: abc123451 is stemmed as abc1234ć, and abc1231 is stemmed as abcć.

Proposed solution: any token that ends in a digit should not be stemmed; it should be returned unchanged.
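
The rule above can be sketched as a thin wrapper around the stemmer. This is only a Python illustration, not Stempel's code: `stem` is a hypothetical stand-in for whatever function does the actual stemming, and `fake_stem` merely reproduces the outputs reported in this issue.

```python
from typing import Callable

ASCII_DIGITS = "0123456789"


def stem_unless_numeric(token: str, stem: Callable[[str], str]) -> str:
    """Return the token unchanged if it ends in a digit; otherwise stem it."""
    if token and token[-1] in ASCII_DIGITS:
        return token
    return stem(token)


def fake_stem(token: str) -> str:
    """Hypothetical stand-in that reproduces outputs reported in this issue."""
    reported = {"12341": "123ć", "12331": "1ć", "abc1231": "abcć"}
    return reported.get(token, token)


print(stem_unless_numeric("12341", fake_stem))    # 12341
print(stem_unless_numeric("abc1231", fake_stem))  # abc1231
```

The check runs before the stemmer is called, so tokens ending in a digit never reach it.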

One-letter stems from the set [a-zńć] are generally useless and often absurd.

ć is the worst offender by far (it's the ending of the infinitive form of verbs). All of these tokens (found on Polish Wikipedia/Wiktionary) get stemmed to ć:

acque Adrien aguas Águas Alainem Alandh Amores Ansoe Arau asinaio aŭdas audyt Awiwie Ayres Baby badż Baina Bains Balue Baon baque Barbola Bazy Beau beim Beroe Betz Blaue blenda bleue Blizzard boor Boruca Boym Brodła Brogi Bronksie Brydż Budgie Budiafa bujny Buon Buot Button Caan Cains Canoe Canona caon Celu Charl Chloe ciag Cioma Cmdr Conseil Conso Cotton Cramp Creel Cuyk cyan czcią Czermny czto D.III Daws Daxue dazzle decy Defoe Dereń Detroit digue Dior Ditton Dojlido dosei douk DRaaS drag drau Dudacy dudas Dutton Duty Dziób eayd Edwy Edyp eiro Eltz Emain erar ESaaS faan Fetz figurar Fitz foam Frau Fugue GAAB gaan Gabirol Gaon gasue Gaup Geol GeoMIP Getz gigue Ginny Gioią Girl Goam Gołymin Gosei Götz grasso Grodnie Gula Guroo gyan HAAB Haan Heim Héroe Hitz Hoam Hohenho Hosei Huon Hutton Huub hyaina Iberii inkuby Inoue Issue ITaaS Iudas Izmaile Jaan Jaws jedyn Jews jira Josepho Jost Josue Judas Kaan Kaleido Karoo Katz Kazue Kehoe khayag kiwa Kiwu Klaas kmdr Kokei Konoe kozer kpią Kringle ksiezyce Któż Kutz L231 L331 Laan Lalli Laon Laws łebka Leroo Liban Ligue Liro Lisoli Logue Loja Londyn Lubomyr Luque Lutz Lytton łzawy Maan mains Mainy malpaco Mammal mandag MBaaS meeki Merl Metz MIDAS middag Miras mmol modą moins Monty Moryń motz mróż Mutz Müzesi MVaaS Naam nabrzeża Nadab Nadala Nalewki Nd:YAG neol News Nieszawa Nimue Nyam ÖAAB oblał oddala okala Olień opar oppi Orioł Osioł osoagi Osyki Otóż Output Oxalido pasmową Patton Pearl Peau peoplk Petz poar Pobrzeża poecie Pogue Pono posagi posł Praha Pringle probie progi Prońko Prosper prwdę Psioł Pułka Putz QDTOE Quien Qwest radża raga Rains reht Reich Retz Revue Right RITZ Roam Rogue Roque rosii RU31 Rutki Ryan SAAB saasso salue Sampaio Satz Sears Sekisho semo Setton Sgan Siloe Sitz Skopje Slot Šmarje Smrkci Soar sopo sozinho springa Steel Stip Straz Strip Suez sukuby Sumach Surgucie Sutton svasso Szosą szto Tadas Taira tęczy Teodorą teol Tisii Tisza Toluca Tomoe Toque TPMŻ Traiana Trask Traue Tulyag Tuque Turinga 
Undas Uniw usque Vague Value Venue Vidas Vogue Voor W331 Waringa weht Weich Weija Wheel widmem WKAG worku Wotton Wryk Wschowie wsiach wsiami Wybrzeża wydala Wyraz XLIII XVIII XXIII Yaski yeol YONO Yorki zakręcie Zijab zipo.

Four-character tokens ending in 31 (like 2,31 9,31 1031 1131 7431 8331 a331) also all get stemmed to ć.

Below are examples of other tokens (from Polish Wikipedia/Wiktionary) that get stemmed to one-letter tokens in [a-zńć]. Note that i, o, u, w, and z are stop words, so those one-letter tokens themselves don't show up as inputs in their own rows.

a: a, addo, adygea, jhwh, also
b: b, bdrm, barr, bebek, berr, bounty, bures, burr, berm, birm
c: alzira, c, carr, county, haight, hermas, kidoń, paich, pieter, połóż, radoń, soest, tatort, voight, zaba, biegną, pokaż, wskaż, zoisyt
d: award, d, dlek, deeb
e: e, eddy, eloi
f: f, farr, firm
g: g, geagea, grunty, gwdy, gyro, górą
h: h
i: inre, isro
j: j, judo
k: k, kgtj, kpzr, karr, kerr, ksok
l: l, leeb, loeb
m: m, magazyn, marr, mayor, merr, mnsi, murr, mgły, najmu
n: johnowi, n
o: obzr, offy
p: p, pace, paoli, parr, pasji, pawełek, pyro, pirsy, plmb
q: q
r: r, rite, rrek
s: s, sarr, site, sowie, szok
t: leźnie, t, tnsw, tooi
u: noite
w: wmro, warr, wifi, wyspom, wątki
x: x
y: jesteś, lafleur, nate, nowsze, violeur, y, yach, douleur
z: czok, skrawek
ń: cisew, esso

All other one-character stems I have encountered have been for one-character input tokens (especially those in other writing systems).

Proposed solution: if a token gets stemmed to a one-letter stem (either in general, or specifically if the letter is one of [a-zńć]), the input token should be returned unchanged.
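
A minimal sketch of this guard, covering both variants (any one-character stem, or only those in [a-zńć]). The names `stem_unless_one_letter`, `only_listed_letters`, and `fake_stem` are hypothetical, and the sketch assumes stems are in precomposed (NFC) form so that ć and ń count as single characters.

```python
import re
from typing import Callable

# The one-letter stems singled out above: a-z plus ń and ć.
ONE_LETTER = re.compile(r"[a-zńć]")


def stem_unless_one_letter(
    token: str, stem: Callable[[str], str], only_listed_letters: bool = True
) -> str:
    """Return the input token when its stem is a single letter."""
    candidate = stem(token)
    if len(candidate) != 1:
        return candidate
    if only_listed_letters and not ONE_LETTER.fullmatch(candidate):
        return candidate
    return token


def fake_stem(token: str) -> str:
    """Hypothetical stand-in reproducing a few outputs reported in this issue."""
    reported = {"Detroit": "ć", "judo": "j", "magazyn": "m"}
    return reported.get(token, token)


print(stem_unless_one_letter("Detroit", fake_stem))  # Detroit
print(stem_unless_one_letter("judo", fake_stem))     # judo
```

With `only_listed_letters=False` the guard applies to every one-character stem, which corresponds to the "in general" variant of the proposal.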

There are other patterns of unreliable stems, though the ones above are the worst.

Two-letter stems are generally unreliable (see attachment twoletter.txt). The specific stems my, um, ąc, and ły are particularly random.

Two- and three-letter stems fitting the patterns .ć and ..ć are generally not useful (see attachments dotc.txt and dotdotc.txt for full lists of examples). The specific stems ać, eć, yć, ąć, ść, and źć are particularly random.

The specific stems ować, iwać, obić, snąć, ywać, ium also stand out as egregious:

ium: IIIC, Treze
iwać: Blefa, Crew, Iwano, Krall, Leseur, Maksiu, Stefa, Wrycz, cygar, horou
obić: Dawka, Obiło, dawka, obicia, obito
ować: Abdou, Bangu, Beess, Biblie, Birmie, Bohle, Bredy, Buddę, Czubą, Darją, Fatou, Firmie, Füssli, Ghany, Haeng, Katją, Koszyc, Ligę, Limie, Madou, Ozmy, Pitou, Riess, Sloane, Smółka, Soeng, TheFa, UWSS, firmie, ligę, szury, úzkost
snąć: Koziej, Schwab, Serial, Spain, serial
ywać: Ariza, odkuł, sorgo

Proposed solution: Return the input token if the stem meets one or more of the following criteria:

1. stem matches /^[a-zął][a-zćń]$/
2. stem matches /^.ć$/
3. stem is one of my, um, ąc, ły, ać, eć, yć, ąć, ść, or źć
4. stem matches /^..ć$/
5. stem is one of ować, iwać, obić, snąć, ywać, ium

Note: (1) covers most of (2) and (3). Of the stems listed in (3), only ść and źć fall outside (1), and both are caught by (2). (2) does not cover my, um, ąc, or ły in (3), so (2) and that part of (3) could be combined.
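
The criteria above can be combined into a single predicate, sketched here in Python. The function and constant names are hypothetical, and the sketch assumes each pattern is meant to match the whole stem (hence `fullmatch`), in line with the description of ".ć" and "..ć" as two- and three-letter stems.

```python
import re

TWO_LETTER = re.compile(r"[a-zął][a-zćń]")  # first criterion
DOT_C = re.compile(r".ć")                    # second criterion
DOT_DOT_C = re.compile(r"..ć")               # fourth criterion
SHORT_STEMS = {"my", "um", "ąc", "ły", "ać", "eć", "yć", "ąć", "ść", "źć"}  # third
LONG_STEMS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}                # fifth


def is_unreliable_stem(stem: str) -> bool:
    """True if the stem meets at least one of the listed criteria."""
    return bool(
        TWO_LETTER.fullmatch(stem)
        or DOT_C.fullmatch(stem)
        or stem in SHORT_STEMS
        or DOT_DOT_C.fullmatch(stem)
        or stem in LONG_STEMS
    )


def guarded_stem(token: str, stem: str) -> str:
    """Return the input token when its stem is judged unreliable."""
    return token if is_unreliable_stem(stem) else stem


print(guarded_stem("Schwab", "snąć"))  # Schwab
print(guarded_stem("domy", "dom"))     # dom
```

Keeping each criterion as a separate constant makes it easy to drop or tighten one rule without touching the others.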

General workaround: Unpack Stempel into its constituent parts. Recreate Stempel's stopword list as a stop filter (see LUCENE-8417); use polish_stem as the stemmer; use a pattern_replace filter to replace /^([a-zął]?[a-zćń]|..ć|\d.*ć)$/ with ''; add a length filter to remove the resulting zero-length tokens; and add a stop filter with ować, iwać, obić, snąć, ywać, ium. Since many tokens are lost by this process, you also need an unstemmed index of the same text so you don't lose recall. (That's not exactly "easy", but it's what I've had to do.)
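
The post-stemming part of this workaround can be simulated in a few lines of Python to check which stems survive. This is not the actual analyzer configuration: `clean_stems` is a hypothetical helper that mimics the pattern_replace, length, and stop steps on a list of stems that have already been produced.

```python
import re

# Same pattern as in the workaround: one- and two-letter stems,
# three-letter stems ending in ć, and digit-initial stems ending in ć.
BAD_STEM = re.compile(r"^([a-zął]?[a-zćń]|..ć|\d.*ć)$")
STOP_STEMS = {"ować", "iwać", "obić", "snąć", "ywać", "ium"}


def clean_stems(stems):
    """Drop stems the workaround would remove; keep the rest in order."""
    kept = []
    for stem in stems:
        stem = BAD_STEM.sub("", stem)  # pattern_replace step
        if len(stem) == 0:             # length filter step
            continue
        if stem in STOP_STEMS:         # final stop filter step
            continue
        kept.append(stem)
    return kept


print(clean_stems(["ć", "123ć", "my", "być", "ować", "dom"]))  # ['dom']
```

Every dropped stem is a token lost from the stemmed field, which is why the workaround needs a parallel unstemmed index.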
