Lucene - Core / LUCENE-5013

ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This filter is an augmentation of output from ASCIIFoldingFilter:
      it folds the double vowels aa, ae, ao, oe and oo, keeping just the first vowel of each pair.

      blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
      räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas

      Caveats:
      Since this filter runs on top of ASCIIFoldingFilter, äöåøæ have already been folded to a, o, a, o and ae by the time it sees them, which causes effects such as:

      bøen -> boen -> bon
      åene -> aene -> ane

      I find this to be a trivial problem compared to not finding anything at all.

      Background:
      Swedish åäö are in fact the same letters as Norwegian and Danish åæø and are thus interchangeable between these languages. They are, however, folded differently when people type them on a keyboard lacking these characters, and ASCIIFoldingFilter handles ä and æ differently.

      When a Swedish person is lacking umlauted characters on the keyboard they consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, a, o.

      In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a, o. I've also seen oo, ao, etc. And permutations. Not sure about Denmark but the pattern is probably the same.

      This filter solves that problem, but might also cause new ones.
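
      The two-step idea described above can be sketched as follows. This is a minimal stand-alone illustration, not the code from the attached patch: the class and method names are hypothetical, asciiFold only stands in for ASCIIFoldingFilter on the five lowercase letters discussed here, and uppercase input is ignored.

```java
/** Hypothetical stand-alone sketch; it is not the code from LUCENE-5013.patch. */
final class DoubleVowelFoldingSketch {

  /** Stand-in for ASCIIFoldingFilter, covering only the five lowercase letters discussed here. */
  static String asciiFold(String term) {
    StringBuilder out = new StringBuilder(term.length() + 4);
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      switch (c) {
        case '\u00E5': // a with ring
        case '\u00E4': // a with diaeresis
          out.append('a');
          break;
        case '\u00F6': // o with diaeresis
        case '\u00F8': // o with stroke
          out.append('o');
          break;
        case '\u00E6': // ae ligature
          out.append("ae");
          break;
        default:
          out.append(c);
      }
    }
    return out.toString();
  }

  /** True for the pairs aa, ae, ao, oe and oo. */
  static boolean isFoldedPair(char first, char second) {
    if (first == 'a') {
      return second == 'a' || second == 'e' || second == 'o';
    }
    return first == 'o' && (second == 'e' || second == 'o');
  }

  /** Keeps only the first vowel of each listed pair. */
  static String foldPairs(String term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      out.append(c);
      if (i + 1 < term.length() && isFoldedPair(c, term.charAt(i + 1))) {
        i++; // skip the second vowel of the pair
      }
    }
    return out.toString();
  }

  /** ASCII folding first, pair folding second, as in the description. */
  static String fold(String term) {
    return foldPairs(asciiFold(term));
  }
}
```

      Under these assumptions, calling DoubleVowelFoldingSketch.fold on blåbærsyltetøj and on blaabaarsyltetoej yields the same string, and the caveat case bøen ends up as bon because the intermediate boen still contains the pair oe.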

      1. LUCENE-5013.patch
        28 kB
        Karl Wettin
      2. LUCENE-5013.txt
        8 kB
        Karl Wettin
      3. LUCENE-5013-2.txt
        8 kB
        Karl Wettin
      4. LUCENE-5013-3.txt
        8 kB
        Karl Wettin
      5. LUCENE-5013-4.txt
        27 kB
        Karl Wettin
      6. LUCENE-5013-5.txt
        27 kB
        Karl Wettin
      7. LUCENE-5013-6.txt
        28 kB
        Karl Wettin

          Activity

          Karl Wettin added a comment -

          Code blessed with ASL2

          Shawn Heisey added a comment -

          Karl Wettin I'm clueless when it comes to Scandinavian characters and languages ... but I do have a question. Does this filter do anything that isn't already accomplished by ICUNormalizer2Filter, also incorporated in ICUFoldingFilter?

          http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
          http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUNormalizer2FilterFactory

          Karl Wettin added a comment -

          I do indeed believe that this does something different, at least as far as I can see.

          Example:

          People in Norway would spell the Swedish village of Särdal as Særdal, but when lacking those characters on their keyboard they would write Saerdal. In Sweden people would write Sardal. ASCIIFoldingFilter and friends would fold æ as ae and ä as a. The mismatch is primarily when a query contains the folded text, such as Saerdal. Folding every ä to ae will cause problems for people who just write an a rather than ä. The same sort of mismatch will occur for å->aa, å->a, å->ao, ø->oe, ö->o. People tend to use different permutations of these alternatives, and this filter normalizes them.

          So this is a filter that solves mismatches on ASCII folds for people in Norway and Denmark searching in a Swedish index and vice versa.

          See what I mean?

          Robert Muir added a comment -

          This is conceptually similar to the one for German (algorithm created by the Snowball folks, but factored out of their stemmer):
          http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanNormalizationFilter.java?view=markup

          I think it's nice to provide filters like this with language-specific normalizations. Though maybe the name could be simpler (ScandinavianNormalizationFilter?)

          Steve Rowe added a comment -

          GermanNormalizationFilter.java

          This one operates directly on the input buffer, instead of copying to a (fixed 512 char, potentially too small) output buffer and then swapping.

          Though maybe the name could be simpler, (ScandinavianNormalizationFilter?)

          +1

          Christian Moen added a comment -

          Though maybe the name could be simpler, (ScandinavianNormalizationFilter?)

          +1

          Jan Høydahl added a comment -

          Nice and needed component.

          I have one question though, whether it is too aggressive to fold å->a, ö->o, æ->a etc?

          In my experience it is better to skip the generic folding of ø/ö->oe/o, æ/ä->ae/a, å->aa/a which is rather destructive and instead normalize across Norwegian/Swedish/Danish the opposite way, preserving the semantic meaning:

          ø,ö,oe->ø
          æ,ä,ae->æ
          å,aa->å
          

          This will support most common cases and give:

          blåbærsyltetøj == blåbärsyltetöj == blaabaersyltetoej (but not blabarsyltetoj)
          räksmörgås == ræksmørgås == ræksmörgaas == raeksmoergaas (but not raksmorgas)

          I think this would be a good compromise which avoids many false matches between ø/o, å/a, æ/a. One other example is the Norwegian word for "berry": bær. With the aggressive approach it would become bar, which clashes with the words for "bare" and "bar" and also with bår folded to bar. Other unfortunate Norwegian examples are bør/bor, klær/klår/klar, får/far, før/for, klør/klor, møte/mote, blå/bla... Perhaps the aggressive option could be a configuration option?

          Btw. I have never seen the use of eo for ø or ea for æ
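
          The three mappings proposed above (ø,ö,oe->ø; æ,ä,ae->æ; å,aa->å) can be sketched like this. It is a minimal illustration with hypothetical names, handling lowercase only, and is not the filter that was eventually committed.

```java
/** Hypothetical sketch of the proposed normalization; lowercase only. */
final class ScandinavianNormalizationSketch {

  static String normalize(String term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      // '\0' marks "no following character" at the end of the term.
      char next = i + 1 < term.length() ? term.charAt(i + 1) : '\0';
      if (c == '\u00F6') {                 // o with diaeresis -> o with stroke
        out.append('\u00F8');
      } else if (c == '\u00E4') {          // a with diaeresis -> ae ligature
        out.append('\u00E6');
      } else if (c == 'o' && next == 'e') { // oe -> o with stroke
        out.append('\u00F8');
        i++;
      } else if (c == 'a' && next == 'e') { // ae -> ae ligature
        out.append('\u00E6');
        i++;
      } else if (c == 'a' && next == 'a') { // aa -> a with ring
        out.append('\u00E5');
        i++;
      } else {
        out.append(c);                     // everything else is left untouched
      }
    }
    return out.toString();
  }
}
```

          With this sketch, blåbärsyltetöj and blaabaersyltetoej both normalize to blåbærsyltetøj, while blabarsyltetoj is left as it is and therefore does not match, which is the trade-off described in the comment.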

          Though maybe the name could be simpler, (ScandinavianNormalizationFilter?)

          +1

          Karl Wettin added a comment -

          I have one question though, whether it is too aggressive

          You do indeed have a point I never thought of before. It makes a lot of sense to also go from ø,ö,oe->ø for those that are using a Scandinavian keyboard. This is a feature I too want now.

          But the problem isn't just that we use ä and you use æ; it's native and non-native speakers sitting in front of the wrong sort of keyboard. Swedish people will most definitely in that situation write raksmorgas when searching for räksmörgås and most probably blabarsyltetoj when searching for blåbærsyltetøj, while my guess is that an American would write raksmorgas and blabaersyltetoj.

          I ran a test to see how bad the Norwegian mismatches are using the "Norsk scrabbleforbund" dictionary:

          593526 Norwegian words in dictionary.
          4698 Norwegian mismatches using ScandinavianNormalizerFilter.
          3943 Norwegian mismatches using ASCIIFoldingFilter.

          That's something like 0.6%-0.8%. I find that totally acceptable, but I also suppose it depends on how you implement your index. If you're indexing nothing but the folded text then it might be a problem, but if it's something secondary in a disjunction with a lower boost, then it's hopefully just a matter of a few extra CPU cycles and FS seeks.
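
          A count of this kind could be reproduced along the following lines. The sketch assumes that a "mismatch" means a dictionary word whose folded form is shared with at least one other distinct word; the comment does not spell out the definition, so that is an assumption. The class name, method names and the simplistic aggressiveFold stand-in are all hypothetical.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical helper for estimating how many words a folding function conflates. */
final class FoldingCollisionCounter {

  /** Simplistic stand-in for an aggressive fold; single lowercase letters only. */
  static String aggressiveFold(String word) {
    return word
        .replace('\u00E5', 'a').replace('\u00E4', 'a').replace('\u00E6', 'a')
        .replace('\u00F6', 'o').replace('\u00F8', 'o');
  }

  /** Counts the distinct words that share their folded form with another distinct word. */
  static int countCollidingWords(Collection<String> words, UnaryOperator<String> fold) {
    Map<String, Set<String>> byFoldedForm = new HashMap<>();
    for (String word : words) {
      byFoldedForm.computeIfAbsent(fold.apply(word), key -> new HashSet<>()).add(word);
    }
    int colliding = 0;
    for (Set<String> group : byFoldedForm.values()) {
      if (group.size() > 1) {
        colliding += group.size();
      }
    }
    return colliding;
  }
}
```

          Dividing the result by the dictionary size gives a collision rate comparable to the percentages quoted above, provided the same definition of a mismatch is used.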

          Shawn Heisey added a comment -

          Does it make sense to have this filter do the Scandinavian folding before the ascii folding, rather than after? Would that cause fewer search misses and false positives, or more? Would it make sense to leave the ASCII step out, and let the user run it separately, either before or after according to the way they want it to work?

          One of the things I really like about the ICU filters is that they handle international notions of uppercase and lowercase, so you're not dealing with just ASCII characters. The example given on the wiki page is ß/SS, which honestly means little to me with my uneducated (American) viewpoint. If this filter can do something similar for the differences between Scandinavian languages, that would really be useful.

          Karl Wettin added a comment -

          Does it make sense to have this filter do the Scandinavian folding before the ascii folding, rather than after?

          I implemented it the way I did because I want all the features of ASCIIFoldingFilter but slightly improved for my Scandinavian corpora. I suppose it's not completely wrong to say that ASCIIFoldingFilter is in this case used to fold æ->ae and is thus required to be executed prior to the Scandinavian normalization.

          What possibly makes most sense is to not rely on ASCIIFoldingFilter at all. That would make it a pure ScandinavianNormalizationFilter without ü, ß and whatnot; people would have to run a second pass through some ICU filter in order to get those.

          Karl Wettin added a comment -

          A nice comment appeared on java-users, I'm pasting it in here to gather everything in one place.

          On 22 May 2013 at 20:29, Petite Abeille wrote:

          On May 22, 2013, at 7:08 PM, Karl Wettin <karl.wettin@kodapan.se> wrote:

          • Use a filter after ASCIIFoldingFilter that folds every use of ae, oe, oo, and other combinations of double vowels, just keeping the first one.

          I ended up with that solution.

          https://issues.apache.org/jira/browse/LUCENE-5013

          Interesting problem… perhaps you could generalize your solution a bit… for example, in, say, German, one could substitute 'ue' for 'ü', etc… so it looks like what you are after is folding double vowels… irrespectively of how they got there…

          So… assuming something along the lines of Sean M. Burke Unidecode [1] for the purpose of ASCII transliteration, what's left is simply to fold double vowels, e.g.:

          -- Unidecode is assumed to be a Lua port of Text::Unidecode [1]; the extra
          -- parentheses keep only gsub's first result, dropping the match count.
          print( 1, ( Unidecode( 'blåbærsyltetøj' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 2, ( Unidecode( 'blåbärsyltetöj' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 3, ( Unidecode( 'blaabaarsyltetoej' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 4, ( Unidecode( 'blabarsyltetoj' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 5, ( Unidecode( 'Räksmörgås' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 6, ( Unidecode( 'Göteborg' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 7, ( Unidecode( 'Gøteborg' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 8, ( Unidecode( 'Über' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 9, ( Unidecode( 'ueber' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 10, ( Unidecode( 'uber' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
          print( 11, ( Unidecode( 'uuber' ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )

          1 blabarsyltetoj
          2 blabarsyltetoj
          3 blabarsyltetoj
          4 blabarsyltetoj
          5 raksmorgas
          6 goteborg
          7 goteborg
          8 uber
          9 uber
          10 uber
          11 uber

          [1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

          Karl Wettin added a comment -

          Hmmm, interesting thought. I have to consider whether it makes sense to make it this generic. I think it might be problematic for some languages though, especially Dutch.

          Markus Jelsma added a comment -

          Dutch does not use many accents except for some French loan words, so the ASCIIFoldingFilter should suffice. Frisian does use the grave, acute and circumflex accents quite frequently.

          Karl Wettin added a comment - - edited

          Dutch does not use many accents

          My comment was regarding Petite's idea to use a more generic double vowel-removal filter. I fear it might be too destructive.

          heersen -> hersen
          noors -> nors
          een -> en
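
          The concern can be reproduced with a small sketch of the generic rule, under the assumption that "generic" means keeping only the first vowel of any two adjacent vowels from the set aeiou, as in the quoted pattern. The class and method names are hypothetical.

```java
/** Hypothetical sketch of a language-agnostic double-vowel fold. */
final class GenericDoubleVowelSketch {

  static boolean isVowel(char c) {
    return "aeiou".indexOf(c) >= 0;
  }

  /** Keeps the first vowel of every pair of adjacent vowels, whatever the vowels are. */
  static String collapseVowelPairs(String term) {
    StringBuilder out = new StringBuilder(term.length());
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      out.append(c);
      if (isVowel(c) && i + 1 < term.length() && isVowel(term.charAt(i + 1))) {
        i++; // drop the second vowel of the pair
      }
    }
    return out.toString();
  }
}
```

          The same rule that helpfully maps ueber to uber also rewrites the three Dutch words listed above, which is why a Scandinavian-specific pair list looks safer than a generic one.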

          Karl Wettin added a comment -

          I have one question though, whether it is too aggressive to fold å->a, ö->o, æ->a etc?
          In my experience it is better to skip the generic folding of ø/ö->oe/o, æ/ä->ae/a, å->aa/a which is rather destructive and instead normalize across Norwegian/Swedish/Danish the opposite way, preserving the semantic meaning:
          ø,ö,oe->ø
          æ,ä,ae->æ
          å,aa->å

          I think it should be two different filters rather than a setting.

          ScandinavianFoldingFilter (æ, ä,ae->a) and ScandinavianNormalizationFilter (ae,ä,æ->æ)?

          Karl Wettin added a comment - - edited
          • Renamed to ScandinavianFoldingFilter
          • Does not use ASCIIFoldingFilter (less destructive, bøen -> boen rather than bøen -> bon as previously)
          • Modifies the input term char buffer rather than copying and switching
          • \escaped utf-8 in code
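
          The changes listed above can be illustrated with a rough sketch that folds a term buffer in place and does not depend on ASCIIFoldingFilter. It is not the patch itself: the names are hypothetical, only lowercase letters are handled, and the real filter works on a Lucene term attribute rather than a bare char array.

```java
/** Hypothetical sketch of in-place folding on a term buffer; lowercase only. */
final class InPlaceFoldingSketch {

  /** True for the pairs aa, ae, ao, oe and oo. */
  static boolean isFoldedPair(char first, char second) {
    if (first == 'a') {
      return second == 'a' || second == 'e' || second == 'o';
    }
    return first == 'o' && (second == 'e' || second == 'o');
  }

  /** Folds buffer[0..length) in place and returns the new logical length. */
  static int foldInPlace(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      char c = buffer[i];
      if (c == '\u00E5' || c == '\u00E4' || c == '\u00E6') {
        buffer[i] = 'a'; // single letter replaced; the following character is not inspected
      } else if (c == '\u00F6' || c == '\u00F8') {
        buffer[i] = 'o';
      } else if (i + 1 < length && isFoldedPair(c, buffer[i + 1])) {
        // Delete the second vowel by shifting the tail one position to the left.
        // The i + 1 < length check keeps the lookahead inside the used part of the buffer.
        System.arraycopy(buffer, i + 2, buffer, i + 1, length - i - 2);
        length--;
      }
    }
    return length;
  }

  /** Convenience wrapper for trying the sketch on plain strings. */
  static String fold(String term) {
    char[] buffer = term.toCharArray();
    int length = foldInPlace(buffer, buffer.length);
    return new String(buffer, 0, length);
  }
}
```

          Because a replaced letter is not re-examined together with its neighbour, bøen comes out as boen in this sketch, matching the less destructive behaviour described in the second bullet.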
          Steve Rowe added a comment -

          Karl, I like this approach better - focussed and self-contained.

          Does not use ASCIIFoldingFilter

          I think the class javadoc needs updating? E.g. "This filter is an augmentation of output from ASCIIFoldingFilter"

          Also, @author tags aren't allowed anymore - CHANGES.txt is where attribution happens.

          Karl Wettin added a comment -

          Oops, artifacts from copying and pasting between two projects. Sorry. I'll send a new patch.

          Karl Wettin added a comment -

          Cleaned up docs

          Karl Wettin added a comment -

          Just realized that this new patch can cause an ArrayIndexOutOfBoundsException. Will send an updated version tomorrow.

          Robert Muir added a comment -

          Can the test be changed to use BaseTokenStreamTestCase?

          here's an example: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/de/TestGermanNormalizationFilter.java?view=markup

          We should also add a factory (and a test for that).

          Karl Wettin added a comment -
          • ScandinavianNormalizationFilter (new, feature as described by Jan)
          • ScandinavianFoldingFilter
          • Factories
          • Factory tests, but they're failing with SPI exceptions for me; not sure what to do here.

          Jan, would you mind spending a few minutes reading javadocs of the filters to see if you think it seems to make sense?

          Jan Høydahl added a comment -

          Comments for class ScandinavianFoldingFilter:

          • Typo in sentence "It's is a semantically more..."
          • "I've also seen oo, ao, etc." -> Don't use personal "I" in JavaDocs
          • "Not sure about Denmark..." -> Better not to mention Denmark if you're not sure

          Comments for class ScandinavianFoldingFilterFactory:

          • Comment "Creates a new ScandinavianFoldingFilterFactory" does not add any value

          Comments for class ScandinavianNormalizationFilter:

          • "...æäÆÄöøÖØ...translating them to åæøÅÆØ" -> Should perhaps be "æÆäÄöÖøØ...to æÆæÆøØøØ"

          Comments for class ScandinavianNormalizationFilterFactory:

          • Unnecessary comment for constructor

          Have not tested or really reviewed the code, but unit tests seem sound.

          PS: Karl, you can use the same name LUCENE-5013.patch for every upload. JIRA will take care of greying out the older ones.

          Karl Wettin added a comment -

          Cleaned up the javadocs.

          This is as far as I can take this patch myself:

          I need help with the TestFilterFactories, they throw SPI exceptions stating the factories are not available in the classpath via lookup. I shouldn't have to register them somewhere, right?

          Steve Rowe added a comment -

          I need help with the TestFilterFactories, they throw SPI exceptions stating the factories are not available in the classpath via lookup. I shouldn't have to register them somewhere, right?

          The following lines need to be added to src/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory - all tests pass for me when I do this:

          org.apache.lucene.analysis.miscellaneous.ScandinavianFoldingFilterFactory
          org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilterFactory
          
          Karl Wettin added a comment - - edited

          It's all good now.

          Thanks for the help and input, everybody. Have fun, and I hope someone other than me finds this useful.

          Jan Høydahl added a comment -

          Can you upload the patch as LUCENE-5013.patch? That's the standard naming convention around here.

          Karl Wettin added a comment -

          Patch blessed with ASL2

          Karl Wettin added a comment -

          Patch blessed with ASL2.

          ASF subversion and git services added a comment -

          Commit 1499382 from janhoy@apache.org
          [ https://svn.apache.org/r1499382 ]

          LUCENE-5013: ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory

          Jan Høydahl added a comment -

          Oops, added at wrong root path, will fix

          ASF subversion and git services added a comment -

          Commit 1499392 from janhoy@apache.org
          [ https://svn.apache.org/r1499392 ]

          LUCENE-5013: Revert bad commit

          ASF subversion and git services added a comment -

          Commit 1499409 from janhoy@apache.org
          [ https://svn.apache.org/r1499409 ]

          LUCENE-5013: ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory

          ASF subversion and git services added a comment -

          Commit 1499437 from janhoy@apache.org
          [ https://svn.apache.org/r1499437 ]

          LUCENE-5013: ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory (backport)

          Karl Wettin added a comment -

          Thanks, Jan! <3

          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Jan Høydahl added a comment -

          Refguide paragraph added (SOLR-4493): https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Scandinavian

            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Karl Wettin
            • Votes:
              0
              Watchers:
              6
