  1. Lucene - Core
  2. LUCENE-9687

Hunspell support improvements

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      I'd like Lucene's Hunspell support to be on a par with the native C++ Hunspell for spellchecking and suggestions, at least for some languages. So I propose to:

      • support the affix rules necessary for English, German, French, Spanish and
        Russian dictionaries, possibly more languages later
      • mirror Hunspell's suggestion algorithm in Lucene
      • provide public APIs for spellchecking, suggestion, stemming, and morphological data
      • check corpora for specific languages to find and fix spellchecking/suggestion discrepancies between Lucene's implementation and Hunspell/C++
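
      The corpus check in the last bullet amounts to a differential test: run every word through both implementations and keep the words where the verdicts differ. The following is a minimal Java sketch, not code from this issue. `DiscrepancyFinder` and `findDiscrepancies` are hypothetical names, and both checkers are modeled as plain `Predicate<String>` values (toy word sets here) standing in for Lucene's implementation and for a wrapper around Hunspell/C++.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of Lucene.
public class DiscrepancyFinder {

    /** Returns the corpus words on which the two checkers disagree, in corpus order. */
    public static List<String> findDiscrepancies(
            Iterable<String> corpus,
            Predicate<String> lucene,      // assumed: true if the Lucene side accepts the word
            Predicate<String> reference) { // assumed: true if the reference side accepts the word
        List<String> result = new ArrayList<>();
        for (String word : corpus) {
            if (lucene.test(word) != reference.test(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the two spellcheckers: each accepts a fixed word set.
        Set<String> luceneAccepts = Set.of("cat", "cats");
        Set<String> referenceAccepts = Set.of("cat", "cats", "cat's");
        List<String> corpus = List.of("cat", "cats", "cat's", "catz");

        // Both sides agree on "cat", "cats" and "catz"; only "cat's" is reported.
        System.out.println(
            findDiscrepancies(corpus, luceneAccepts::contains, referenceAccepts::contains));
    }
}
```

      In a real run the predicates would wrap the actual spellcheckers, and each reported word would be a candidate for a fix or a new test case.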

      Sub-Tasks

        1. Hunspell: support COMPOUNDRULE (Sub-task, Closed, Unassigned; Time Spent: 1h 50m)
        2. Hunspell: support default encoding (Sub-task, Resolved, Unassigned; Time Spent: 1h 10m)
        3. Hunspell: prefix condition is only checked on suffix, not stem (Sub-task, Closed, Dawid Weiss; Time Spent: 2h 10m)
        4. Hunspell spellchecker: support numbers with separators (Sub-task, Resolved, Unassigned)
        5. Hunspell: deduplicate decodeFlags+hasFlag checks (Sub-task, Closed, Unassigned; Time Spent: 2h)
        6. Hunspell: fix off-by-one error to support prefixes of word.length - 1 (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        7. Hunspell: simplify Dictionary.affixData storage (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        8. Hunspell: improve stemming of all-caps words (Sub-task, Closed, Dawid Weiss; Time Spent: 40m)
        9. Hunspell: shorten Stemmer.applyAffix (Sub-task, Closed, Unassigned; Time Spent: 20m)
        10. Hunspell: add a spellchecker, support BREAK and FORBIDDENWORD affix rules (Sub-task, Resolved, Unassigned; Time Spent: 2h)
        11. Hunspell support: fix most IntelliJ warnings, cleanup (Sub-task, Resolved, Unassigned; Time Spent: 3h 40m)
        12. Hunspell: consider prefix's continuation flags when applying suffix (Sub-task, Resolved, Unassigned; Time Spent: 1h 50m)
        13. Hunspell: support special title-case for words with apostrophe (Sub-task, Closed, Unassigned; Time Spent: 50m)
        14. Hunspell: support trailing comments on aff option lines (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        15. Hunspell: extract Stemmer.stripAffix from similar code in prefix/suffix processing (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        16. Hunspell: check that all flags are > 0 and fit char (Sub-task, Closed, Unassigned; Time Spent: 2h 10m)
        17. Hunspell Stemmer: use the same FST.BytesReader on all recursion levels (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        18. Hunspell: reuse char[] when possible when stripping affix (Sub-task, Closed, Unassigned; Time Spent: 50m)
        19. Support German-like compound words (Sub-task, Closed, Unassigned; Time Spent: 20m)
        20. Hunspell: support words with trailing dots (Sub-task, Closed, Unassigned; Time Spent: 1.5h)
        21. Hunspell: implement simple REP-based suggestion algorithm (Sub-task, Closed, Unassigned; Time Spent: 20m)
        22. Hunspell: support alternate casing for short language codes (Sub-task, Closed, Unassigned; Time Spent: 1h 20m)
        23. Hunspell: prohibit FORBIDDENWORD words and their case variations (Sub-task, Closed, Unassigned; Time Spent: 20m)
        24. Hunspell: support capitalization for German ß (Sub-task, Closed, Unassigned; Time Spent: 2h 40m)
        25. Hunspell: support NEEDAFFIX flag on affixes (Sub-task, Closed, Unassigned; Time Spent: 20m)
        26. Hunspell: check Lucene's implementation against Hunspell's test data (Sub-task, Closed, Unassigned; Time Spent: 6h 40m)
        27. Hunspell: support FLAG UTF-8 in absence of SET UTF-8 (Sub-task, Closed, Unassigned; Time Spent: 40m)
        28. Hunspell: no special dotted i treatment outside tr/az languages (Sub-task, Closed, Unassigned; Time Spent: 20m)
        29. Hunspell: support minor compounding-related flags (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        30. Hunspell: support flag usage before its format is even specified (Sub-task, Closed, Unassigned; Time Spent: 3h 40m)
        31. Hunspell: support CHECKCOMPOUNDPATTERN (Sub-task, Closed, Unassigned; Time Spent: 20m)
        32. Hunspell: more ways to vary misspelled word variations for suggestions (Sub-task, Closed, Unassigned; Time Spent: 50m)
        33. Hunspell: disallow ONLYINCOMPOUND suffixes at the very end of compound words (Sub-task, Closed, Unassigned; Time Spent: 20m)
        34. Hunspell: update sanity tests that load all dictionaries (Sub-task, Closed, Unassigned; Time Spent: 50m)
        35. Hunspell: tolerate existing aff/dic file typos (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        36. Hunspell: speed up spellchecking by stopping at a single found stem (Sub-task, Closed, Unassigned; Time Spent: 50m)
        37. Add build-side support for running full validation checks against hunspell repos (Sub-task, Closed, Dawid Weiss; Time Spent: 2.5h)
        38. Hunspell: add a performance test (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        39. Hunspell: support CHECKCOMPOUNDREP flags (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        40. Clean up temporary folder management in Dictionary (Sub-task, Closed, Dawid Weiss)
        41. Hunspell: support dictionary entries starting with slash (Sub-task, Closed, Unassigned; Time Spent: 1h 20m)
        42. Hunspell: exception when loading dictionaries with mixed-case words and aliased flags (Sub-task, Closed, Unassigned; Time Spent: 40m)
        43. Hunspell: support suggestions based on "ph" morphological data (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        44. Hunspell: speed up flag checks by avoiding allocations (Sub-task, Closed, Unassigned; Time Spent: 20m)
        45. Hunspell: support MAP-based suggestions for groups of similar letters (Sub-task, Closed, Unassigned; Time Spent: 20m)
        46. Hunspell: speed up numeric flag parsing (Sub-task, Closed, Unassigned; Time Spent: 20m)
        47. Avoid buffering and double-scan of flags in *.aff file (Sub-task, Closed, Dawid Weiss; Time Spent: 2h 50m)
        48. Hunspell: suggest dictionary entries similar to the misspelled word (Sub-task, Closed, Unassigned; Time Spent: 2h 10m)
        49. Hunspell: ignore original tests which are out of scope for now (Sub-task, Closed, Unassigned; Time Spent: 1h)
        50. Hunspell: tolerate more aff/dic file typos (Sub-task, Closed, Unassigned; Time Spent: 20m)
        51. Hunspell: unify case variation logic in Stemmer and SpellChecker (Sub-task, Closed, Unassigned; Time Spent: 20m)
        52. Hunspell: suggest inflected dictionary entries similar to the misspelled word (Sub-task, Closed, Unassigned; Time Spent: 2h 10m)
        53. Hunspell: apply output conversion (OCONV) to the suggestions (Sub-task, Closed, Unassigned; Time Spent: 20m)
        54. Hunspell: improve suggestions for mixed-case misspelled words (Sub-task, Closed, Unassigned; Time Spent: 1h)
        55. Hunspell Stemmer: reduce parameter count (Sub-task, Closed, Unassigned; Time Spent: 40m)
        56. Hunspell: disallow compounds with parts present in dictionary space-separated (Sub-task, Closed, Unassigned; Time Spent: 20m)
        57. Hunspell: support NOSUGGEST option (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        58. Hunspell: add more to TestHunspellRepositoryTestCases.EXPECTED_FAILURES (Sub-task, Closed, Unassigned; Time Spent: 20m)
        59. Hunspell: print total memory usage in TestAllDictionaries, cleanup (Sub-task, Closed, Unassigned; Time Spent: 1h)
        60. Hunspell: check that FLAG and SET don't occur too far in the file, cleanup (Sub-task, Closed, Unassigned; Time Spent: 20m)
        61. Hunspell: fix FORBIDDENWORD support (Sub-task, Closed, Unassigned; Time Spent: 1h 50m)
        62. Hunspell: try title case as FORCEUCASE misspelled word suggestions (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        63. Hunspell: rename SpellChecker to Hunspell, fix test name, update javadoc and CHANGES.txt (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        64. Hunspell: add API for retrieving dictionary morphological data and stemming (Sub-task, Closed, Unassigned; Time Spent: 1h 20m)
        65. Hunspell: KEEPCASE should take precedence over affixed forms (Sub-task, Closed, Unassigned; Time Spent: 1h)
        66. Hunspell: don't perform compound check recursively when looking for space-separated word pairs (Sub-task, Closed, Unassigned; Time Spent: 20m)
        67. Hunspell: don't lookup word roots unnecessarily to check flags (Sub-task, Closed, Unassigned; Time Spent: 20m)
        68. Hunspell: CHECKCOMPOUNDCASE shouldn't prohibit dash-separated uppercase compounds (Sub-task, Closed, Unassigned; Time Spent: 20m)
        69. Hunspell: make FORCEUCASE work when the first compound word is inherently title-case (Sub-task, Closed, Unassigned; Time Spent: 40m)
        70. Hunspell: allow to inflect the last part of COMPOUNDRULE compound (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        71. Hunspell: speed up input conversion (Sub-task, Closed, Unassigned; Time Spent: 1h 40m)
        72. Hunspell: add an API to interrupt long computations (Sub-task, Closed, Unassigned; Time Spent: 50m)
        73. Hunspell suggestions: split by space (but not dash) also before last char (Sub-task, Closed, Unassigned; Time Spent: 20m)
        74. Hunspell: don't suggest more than 4 ngram corrections by default (Sub-task, Closed, Unassigned; Time Spent: 20m)
        75. Hunspell suggestions: use US keyboard in absence of KEY option (Sub-task, Closed, Unassigned; Time Spent: 20m)
        76. Hunspell: don't check case in compound middle and end (Sub-task, Closed, Unassigned; Time Spent: 20m)
        77. Hunspell suggestions: try moving the last character into the middle (Sub-task, Closed, Unassigned; Time Spent: 20m)
        78. Hunspell: speed up suggesting a bit by not creating a huge TreeSet (Sub-task, Closed, Unassigned; Time Spent: 20m)
        79. Hunspell: avoid slow dictionary lookup if the word's hash isn't there (Sub-task, Closed, Unassigned; Time Spent: 1h)
        80. Add automation for running regression tests (Sub-task, Closed, Dawid Weiss; Time Spent: 20m)
        81. Hunspell: don't check second-level affixes when the first level isn't a continuation (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        82. Hunspell: put a time limit on suggestion calculation (Sub-task, Closed, Unassigned; Time Spent: 1h 40m)
        83. Hunspell suggestions: speed up expandWord by enumerating only applicable affixes (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        84. Hunspell: don't check second stage suffixes if the first stage flag only occurs in prefixes (Sub-task, Closed, Unassigned; Time Spent: 20m)
        85. Hunspell: fix most similar dictionary entry search by reversing the comparator (Sub-task, Closed, Unassigned; Time Spent: 0.5h)
        86. Hunspell: fix space + mixed case heuristics on suggestions (Sub-task, Closed, Unassigned; Time Spent: 20m)
        87. Hunspell: speed up affix condition checking (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        88. Hunspell suggestions: consider space/dash-separated words for each case variation (Sub-task, Closed, Unassigned; Time Spent: 20m)
        89. Hunspell: when generating suggestions, skip too deep word FST subtrees (Sub-task, Closed, Unassigned; Time Spent: 1h)
        90. Hunspell suggestions: speed up ngram calculation by not searching for substrings in impossible places (Sub-task, Closed, Unassigned; Time Spent: 20m)
        91. Hunspell: honor empty stripping affixes when generating suggestions (Sub-task, Closed, Unassigned; Time Spent: 20m)
        92. Hunspell suggestions: speed up ngram score calculation for each dictionary entry (Sub-task, Closed, Unassigned; Time Spent: 2h 40m)
        93. Hunspell: reverse the "words" trie for faster word lookup/suggestions (Sub-task, Resolved, Unassigned; Time Spent: 9h 40m)
        94. Hunspell: store word length for faster dictionary lookup/enumeration (Sub-task, Closed, Unassigned; Time Spent: 1h 10m)
        95. Hunspell GeneratingSuggester: faster flag & case checks, less allocations (Sub-task, Closed, Unassigned; Time Spent: 1h 50m)
        96. Hunspell: SIOOBE in GeneratingSuggester.expandRoot (Sub-task, Closed, Unassigned; Time Spent: 20m)
        97. Hunspell: AssertionError in WordStorage.lookupWord (Sub-task, Closed, Unassigned; Time Spent: 20m)
        98. Hunspell suggestions: speed up for some non-Latin scripts (Sub-task, Closed, Unassigned)

          People

            Assignee: Unassigned
            Reporter: Peter Gromov
            Votes: 0
            Watchers: 6

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 105.5h