Lucene - Core
  LUCENE-5826

Support proper hunspell case handling and related options

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When ignoreCase=false, we should accept title-cased and upper-cased forms of dictionary words, just like hunspell -m does. There are also some options that affect this:

      • LANG: can turn on alternate casing for Turkish/Azeri
      • KEEPCASE: can prevent acceptance of title/upper cased forms for words
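The case handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class `CaseVariantSketch`, the toy dictionary, and the `accepts`/`lookup` helpers are all hypothetical names, and the sketch assumes that an upper-cased input is tried as exact, title-cased, and lower-cased forms, while a title-cased input is tried as exact and lower-cased forms. Entries flagged KEEPCASE are only accepted on an exact match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of case-variant lookup; not Lucene's Stemmer. */
public class CaseVariantSketch {
  // Toy dictionary (made-up data): entry -> true if it carries a KEEPCASE-style flag.
  static final Map<String, Boolean> DICT = Map.of(
      "apple", false,
      "Paris", false,
      "kg", true);

  /** An entry matches unless it is KEEPCASE and we reached it via a case variant. */
  static boolean accepts(String form, boolean caseVariant) {
    Boolean keepCase = DICT.get(form);
    if (keepCase == null) {
      return false;
    }
    return !(caseVariant && keepCase);
  }

  /** Returns the dictionary entries that the input word is allowed to match. */
  static List<String> lookup(String word) {
    String lower = word.toLowerCase(Locale.ROOT);
    String upper = word.toUpperCase(Locale.ROOT);
    // Simplification: title-casing by first char only (ignores surrogate pairs).
    String title = word.isEmpty()
        ? word
        : word.substring(0, 1).toUpperCase(Locale.ROOT) + lower.substring(1);

    boolean isUpper = word.equals(upper) && !word.equals(lower);
    boolean isTitle = word.equals(title) && !word.equals(lower);

    List<String> hits = new ArrayList<>();
    if (accepts(word, false)) {
      hits.add(word); // the exact form is always tried
    }
    if (isUpper && !title.equals(word) && accepts(title, true)) {
      hits.add(title); // upper-cased input may match a title-cased entry
    }
    if ((isUpper || isTitle) && accepts(lower, true)) {
      hits.add(lower); // upper- or title-cased input may match a lower-cased entry
    }
    return hits;
  }
}
```

With this toy data, `lookup("Apple")` and `lookup("APPLE")` both find `apple`, `lookup("PARIS")` finds `Paris`, and `lookup("KG")` finds nothing because `kg` is marked KEEPCASE. A lower-cased input such as `paris` is never expanded to other casings.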

      While we are here setting up the same logic anyway, add support for similar options:

      • NEEDAFFIX/PSEUDOROOT: form is invalid without being affixed
      • ONLYINCOMPOUND: form/affixes only make sense inside compounds.
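As a rough illustration of how these two flags could filter candidates, here is a minimal sketch. The class `FlagFilterSketch`, the `Flag` enum, and `validStandalone` are hypothetical names, not Lucene's API. The sketch assumes no decompounding is available, so anything marked ONLYINCOMPOUND is rejected outright, and it only checks NEEDAFFIX on the stem.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of NEEDAFFIX / ONLYINCOMPOUND filtering; not Lucene's code. */
public class FlagFilterSketch {
  enum Flag { NEEDAFFIX, ONLYINCOMPOUND }

  /**
   * Decides whether a stem plus the affixes stripped from the input form a
   * valid standalone word (i.e. outside of any compound).
   *
   * @param stemFlags  flags on the dictionary entry for the stem
   * @param affixFlags flags of each affix that was stripped; empty if none
   */
  static boolean validStandalone(Set<Flag> stemFlags, List<Set<Flag>> affixFlags) {
    // ONLYINCOMPOUND on the stem: never valid on its own.
    if (stemFlags.contains(Flag.ONLYINCOMPOUND)) {
      return false;
    }
    // ONLYINCOMPOUND on any affix: that affix only makes sense inside a compound.
    for (Set<Flag> affix : affixFlags) {
      if (affix.contains(Flag.ONLYINCOMPOUND)) {
        return false;
      }
    }
    // NEEDAFFIX on the stem: the bare stem is invalid, at least one affix is required.
    if (stemFlags.contains(Flag.NEEDAFFIX) && affixFlags.isEmpty()) {
      return false;
    }
    return true;
  }
}
```

For example, a NEEDAFFIX stem with no stripped affix is rejected, while the same stem with one ordinary suffix is accepted.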

      None of this is related to the ignoreCase=true option. If you do use that option, though, it now applies the correct alternate casing for tr_TR/az_AZ.
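To make "alternate casing" concrete: in Turkish and Azeri, upper-case I lowers to dotless ı (U+0131), and dotted İ (U+0130) lowers to i. A minimal sketch of such case folding follows. The class `CaseFoldSketch` and the `alternateCasing` parameter are hypothetical names chosen for illustration, and the char-by-char loop is a simplification that ignores supplementary characters.

```java
/** Hypothetical sketch of Turkish/Azeri-aware case folding; not Lucene's code. */
public class CaseFoldSketch {
  /** Lower-cases one char, optionally using Turkish/Azeri rules for I and İ. */
  static char foldChar(char c, boolean alternateCasing) {
    if (alternateCasing) {
      if (c == 'I') {
        return '\u0131'; // I -> dotless i
      }
      if (c == '\u0130') {
        return 'i'; // I with dot above -> i
      }
    }
    return Character.toLowerCase(c);
  }

  /** Folds a whole word char by char (BMP characters only in this sketch). */
  static String fold(String word, boolean alternateCasing) {
    StringBuilder sb = new StringBuilder(word.length());
    for (int i = 0; i < word.length(); i++) {
      sb.append(foldChar(word.charAt(i), alternateCasing));
    }
    return sb.toString();
  }
}
```

Here `fold("ISIK", false)` yields `isik`, while `fold("ISIK", true)` yields `ısık`, which is the form a Turkish dictionary would contain.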

      I haven't implemented CHECKSHARPS yet because it seems more complicated; I have to figure out what the logic there should be first.

        Activity

        Robert Muir added a comment -

        Patch with tests for these options and casing behavior.

        Ryan Ernst added a comment -

        Looks good. A couple minor comments.

        • Can the TODO around line 176 of Stemmer.java be removed?
        • stem() is pretty long. Can the block that computes compatible be moved out? It is almost exactly the same for the prefix and suffix loops.
        Robert Muir added a comment -

        Thanks for looking.

        Hmm, I removed the TODO locally; I don't know how that didn't make it into the patch.

        As far as refactoring stem() goes, I am opposed to this; it's too early for that. Once the core features (e.g. decompounding) are implemented, then I think it will be the right time. Until then it will just cause pain with zero gain: useless abstractions, oversharing, and bugs.

        Robert Muir added a comment -

        I created a follow-up issue to try to factor that big method after decomposition is implemented: LUCENE-5829


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            1
