  1. Lucene - Core
  2. LUCENE-5826

Support proper hunspell case handling and related options

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When ignoreCase=false, we should accept title-cased and upper-cased forms of dictionary words, just like hunspell -m does. Several options affect this behavior:

      • LANG: can turn on alternate casing for Turkish/Azeri
      • KEEPCASE: can prevent acceptance of title/upper cased forms for words
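The casing rule described above can be illustrated with a minimal Java sketch. This is not the code from the patch: the class `CaseFallbackSketch`, the `WordCase` enum, and all method names are hypothetical, the dictionary is modeled as a plain map from word to a KEEPCASE boolean, and the LANG switch is reduced to a boolean that selects Turkish lowercasing rules. It also ignores surrogate pairs and affixes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of case fallback with KEEPCASE and alternate casing.
final class CaseFallbackSketch {
  enum WordCase { LOWER, TITLE, UPPER, MIXED }

  // Maps each dictionary word to true if the entry carries KEEPCASE (assumed model).
  private final Map<String, Boolean> entries = new HashMap<>();
  private final Locale locale;

  // alternateCasing stands in for LANG naming Turkish or Azeri.
  CaseFallbackSketch(boolean alternateCasing) {
    this.locale = alternateCasing ? Locale.forLanguageTag("tr") : Locale.ROOT;
  }

  void add(String word, boolean keepCase) {
    entries.put(word, keepCase);
  }

  static WordCase caseOf(String word) {
    if (word.isEmpty()) return WordCase.LOWER;
    boolean firstUpper = Character.isUpperCase(word.charAt(0));
    boolean restHasUpper = false;
    boolean restHasLower = false;
    for (int i = 1; i < word.length(); i++) {
      char c = word.charAt(i);
      if (Character.isUpperCase(c)) restHasUpper = true;
      if (Character.isLowerCase(c)) restHasLower = true;
    }
    if (!firstUpper) return restHasUpper ? WordCase.MIXED : WordCase.LOWER;
    if (!restHasUpper) return WordCase.TITLE;
    return restHasLower ? WordCase.MIXED : WordCase.UPPER;
  }

  boolean accepts(String word) {
    // The exact dictionary form is always accepted, KEEPCASE or not.
    if (entries.containsKey(word)) return true;
    WordCase wc = caseOf(word);
    if (wc == WordCase.TITLE) {
      // Title case: retry with only the first letter lowercased.
      return acceptsVariant(word.substring(0, 1).toLowerCase(locale) + word.substring(1));
    }
    if (wc == WordCase.UPPER) {
      // Upper case: retry fully lowercased, then title-cased.
      if (acceptsVariant(word.toLowerCase(locale))) return true;
      return acceptsVariant(word.substring(0, 1) + word.substring(1).toLowerCase(locale));
    }
    return false;
  }

  // A case-folded variant only counts if the entry exists and lacks KEEPCASE.
  private boolean acceptsVariant(String variant) {
    Boolean keepCase = entries.get(variant);
    return keepCase != null && !keepCase;
  }

  public static void main(String[] args) {
    CaseFallbackSketch dict = new CaseFallbackSketch(false);
    dict.add("hello", false);
    dict.add("kg", true); // KEEPCASE entry
    System.out.println(dict.accepts("Hello"));
    System.out.println(dict.accepts("KG"));
  }
}
```

With the Turkish setting, an upper-cased input such as "ISI" is lowercased with dotless ı before lookup, which is the kind of alternate casing LANG is meant to enable.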

      While we are here setting up the same logic anyway, add support for similar options:

      • NEEDAFFIX/PSEUDOROOT: form is invalid without being affixed
      • ONLYINCOMPOUND: form/affixes only make sense inside compounds.
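A rough Java sketch of how these two flags might filter stem candidates follows. It is an assumption-laden illustration, not the patch: `StemFlagFilterSketch`, the `Flag` enum, and `isValidStandalone` are hypothetical names, NEEDAFFIX is modeled on stems only, and because compounds are not handled here, anything marked ONLYINCOMPOUND is simply rejected.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of NEEDAFFIX / ONLYINCOMPOUND filtering.
final class StemFlagFilterSketch {
  enum Flag { NEEDAFFIX, ONLYINCOMPOUND }

  // stemFlags: flags on the dictionary entry.
  // affixFlags: union of flags on the affixes that were stripped (empty if none).
  // affixApplied: whether at least one affix was stripped to reach the stem.
  static boolean isValidStandalone(Set<Flag> stemFlags, Set<Flag> affixFlags, boolean affixApplied) {
    // NEEDAFFIX: the bare form is invalid without being affixed.
    if (stemFlags.contains(Flag.NEEDAFFIX) && !affixApplied) return false;
    // ONLYINCOMPOUND: the form or affix only makes sense inside a compound,
    // and this sketch never analyzes compounds.
    if (stemFlags.contains(Flag.ONLYINCOMPOUND)) return false;
    if (affixFlags.contains(Flag.ONLYINCOMPOUND)) return false;
    return true;
  }

  public static void main(String[] args) {
    Set<Flag> none = EnumSet.noneOf(Flag.class);
    System.out.println(isValidStandalone(EnumSet.of(Flag.NEEDAFFIX), none, false));
    System.out.println(isValidStandalone(EnumSet.of(Flag.NEEDAFFIX), none, true));
  }
}
```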

      None of this is related to the ignoreCase=true option. If you do use that option, though, it now applies the correct alternate casing for tr_TR/az_AZ.

      I haven't implemented CHECKSHARPS yet because it seems more complicated; I have to figure out what the logic there should be first.

        Activity

        rcmuir Robert Muir added a comment -

        I created a follow-up issue to try to refactor that big method once decompounding is implemented: LUCENE-5829

        rcmuir Robert Muir added a comment -

        Thanks for looking.

        Hmm, I removed the TODO locally; I don't know how it didn't make it into the patch.

        As for refactoring stem(), I am opposed to this; it's too early for that. Once the core features (e.g. decompounding) are implemented, I think it will be the right time. Until then it will just cause pain with zero gain: useless abstractions, oversharing, and bugs.

        rjernst Ryan Ernst added a comment -

        Looks good. A couple minor comments.

        • Can the TODO around line 176 of Stemmer.java be removed?
        • stem() is pretty long. Can the block that computes compatible be moved out? It is almost exactly the same for the prefix and suffix loops.
        rcmuir Robert Muir added a comment -

        Patch with tests for these options and casing behavior.


          People

          • Assignee:
            Unassigned
            Reporter:
            rcmuir Robert Muir