  1. Lucene - Core
  2. LUCENE-5013

ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      This filter augments the output of ASCIIFoldingFilter:
      it folds the double vowels aa, ae, ao, oe and oo, keeping only the first letter.

      blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
      räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas

      Caveats:
      Since this filter runs on top of ASCIIFoldingFilter, ä, ö, å, ø and æ have already been folded to a, o, a, o and ae by the time it sees them, which causes effects such as:

      bøen -> boen -> bon
      åene -> aene -> ane

      I find this to be a trivial problem compared to not finding anything at all.
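
      The folding and the caveat above can be sketched in plain Java. This is only an illustration of the behaviour described in this issue, not the code in the attached patches: the class ScandinavianFoldSketch and its methods are hypothetical names, asciiFold is a stand-in for ASCIIFoldingFilter that covers just the five lowercase letters discussed here, and the left-to-right, non-overlapping pair matching is an assumption.

```java
// Hypothetical sketch, not the Lucene implementation.
final class ScandinavianFoldSketch {

    // Stand-in for ASCIIFoldingFilter, limited to the five lowercase
    // letters named in the description (assumption: lowercase input only).
    static String asciiFold(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00e5': // a with ring above
                case '\u00e4': // a with diaeresis
                    out.append('a');
                    break;
                case '\u00e6': // ae ligature folds to two letters
                    out.append("ae");
                    break;
                case '\u00f6': // o with diaeresis
                case '\u00f8': // o with stroke
                    out.append('o');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // True for the pairs listed in the description: aa, ae, ao, oe, oo.
    static boolean isFoldedPair(char first, char second) {
        return (first == 'a' && (second == 'a' || second == 'e' || second == 'o'))
            || (first == 'o' && (second == 'e' || second == 'o'));
    }

    // Scans left to right; when a listed pair is found, keeps the first
    // letter and skips the second (assumed matching order).
    static String collapsePairs(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            out.append(c);
            if (i + 1 < s.length() && isFoldedPair(c, s.charAt(i + 1))) {
                i += 2; // drop the second vowel of the pair
            } else {
                i += 1;
            }
        }
        return out.toString();
    }

    // ASCII folding first, then pair collapsing, as the description lays out.
    static String fold(String s) {
        return collapsePairs(asciiFold(s));
    }
}
```

      Under these assumptions, ScandinavianFoldSketch.fold maps all four spellings of blåbærsyltetøj to blabarsyltetoj, and maps bøen to bon, which is the over-folding the caveat describes.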

      Background:
      Swedish å, ä, ö are in fact the same letters as Norwegian and Danish å, æ, ø, and are thus interchangeable between these languages. They are, however, folded differently when people type them on a keyboard lacking these characters, and ASCIIFoldingFilter handles ä and æ differently.

      When Swedes lack these characters on the keyboard, they consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, a, o.

      In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a, o. I've also seen oo, ao, etc., and permutations of these. I'm not sure about Denmark, but the pattern is probably the same.

      This filter solves that problem, but might also cause new ones.

      Attachments

        1. LUCENE-5013.txt
          8 kB
          Karl Wettin
        2. LUCENE-5013-2.txt
          8 kB
          Karl Wettin
        3. LUCENE-5013-3.txt
          8 kB
          Karl Wettin
        4. LUCENE-5013-4.txt
          27 kB
          Karl Wettin
        5. LUCENE-5013-5.txt
          27 kB
          Karl Wettin
        6. LUCENE-5013-6.txt
          28 kB
          Karl Wettin
        7. LUCENE-5013.patch
          28 kB
          Karl Wettin

            People

              Assignee: janhoy Jan Høydahl
              Reporter: karl.wettin Karl Wettin
              Votes: 0
              Watchers: 6
