Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      This is an analyzer for the Serbian language, so far consisting only of a normalizer. Serbian uses both the Cyrillic and Latin alphabets, so the normalizer works with both.
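A minimal sketch of what normalizing both alphabets to one form can look like, written in plain Java rather than as a Lucene TokenFilter. The class name `SerbianFoldingSketch`, the method `normalize`, and the mapping table itself are illustrative assumptions; the attached patch may use different mappings.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: folds lowercase Serbian text written in either
 * alphabet to a single plain-Latin form. The mapping table is an assumption,
 * not the table from the attached patch.
 */
final class SerbianFoldingSketch {

    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        // The 30 lowercase Serbian Cyrillic letters, in alphabet order.
        String cyrillic = "абвгдђежзијклљмнњопрстћуфхцчџш";
        // Assumed plain-Latin target for each letter, in the same order.
        String[] latin = {
            "a", "b", "v", "g", "d", "dj", "e", "z", "z", "i",
            "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
            "s", "t", "c", "u", "f", "h", "c", "c", "dz", "s"
        };
        for (int i = 0; i < latin.length; i++) {
            TABLE.put(cyrillic.charAt(i), latin[i]);
        }
        // Latin letters with diacritics fold to the same targets.
        TABLE.put('đ', "dj");
        TABLE.put('ž', "z");
        TABLE.put('č', "c");
        TABLE.put('ć', "c");
        TABLE.put('š', "s");
    }

    /** Folds already-lowercased input; characters without a mapping pass through. */
    static String normalize(String lowercased) {
        StringBuilder out = new StringBuilder(lowercased.length());
        for (int i = 0; i < lowercased.length(); i++) {
            char c = lowercased.charAt(i);
            String mapped = TABLE.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a table like this, the Cyrillic and Latin spellings of the same word (for example "ђорђе" and "đorđe") fold to the same token, which is what lets a query in one alphabet match text indexed in the other.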

      In the future, I plan to add stopwords, a stemmer, and so on.

      1. LUCENE-Serbian-1.patch
        17 kB
        Nikola Smolenski

        Activity

        Robert Muir added a comment -

        Looks good (caveat: I am not intimately familiar with the normalizations of diacritics here).

        Should we add a note to SerbianNormalizationFilter that it expects lowercase input?
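To illustrate why such a note matters, here is a small sketch of a fold step that only knows lowercase letters, so uppercase input passes through untouched unless it is lowercased first. The class `LowercaseFirstSketch` and both method names are hypothetical, and the four mappings are an assumed subset chosen for illustration. In a Lucene analysis chain, the equivalent would presumably be placing a lowercasing filter such as LowerCaseFilter ahead of SerbianNormalizationFilter.

```java
import java.util.Locale;

/** Illustrative sketch only: shows the effect of a lowercase-only mapping. */
final class LowercaseFirstSketch {

    /** Hypothetical fold step that handles only lowercase letters (assumed subset). */
    static String foldLowercaseOnly(String s) {
        return s.replace("š", "s")
                .replace("ш", "s")
                .replace("đ", "dj")
                .replace("ђ", "dj");
    }

    /** Lowercases first, so uppercase letters reach the fold step in a form it knows. */
    static String lowercaseThenFold(String s) {
        return foldLowercaseOnly(s.toLowerCase(Locale.ROOT));
    }
}
```

Calling `foldLowercaseOnly("Šuma")` leaves the capital Š in place, while `lowercaseThenFold("Šuma")` folds it as intended.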

        Nikola Smolenski added a comment -

        Why not, here is the patch with the comment.

        Robert Muir added a comment -

        Looks great: thank you. I plan to commit this patch later today!

        Please open additional issues if you feel inspired to add a stemmer, stopwords, etc.

        ASF subversion and git services added a comment -

        Commit 1638220 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1638220 ]

        LUCENE-6053: add Serbian analyzer

        ASF subversion and git services added a comment -

        Commit 1638221 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1638221 ]

        LUCENE-6053: add Serbian analyzer

        Michael McCandless added a comment -

        Thanks Nikola!

        Nikola Smolenski added a comment -

        Thank you for committing so quickly.

        Otis Gospodnetic added a comment -

        Hm, calling this Serbian is a bit limiting: languages from all ex-Yugoslavian countries use exactly the same diacritic characters (the "abcčćddžđefghijklljmnnjoprsštuvzž" ones, not the Cyrillic ones). Nikola Smolenski, do you think you could reorganize things a bit to isolate the Cyrillic part and thus make the rest reusable?

        Nikola Smolenski added a comment -

        Well, there is already nothing stopping you from using it, if you don't mind losing some CPU cycles in search of non-existent Cyrillic letters. In fact, you could even use it for Slovene!

        I was thinking of making some sort of unified name for the analyzer, but decided against it mainly for two reasons:

        • The various dictionaries and tools would be different, and I believe the dictionary name should match the language name.
        • We might lose "political" support or good will from institutions or people who would be willing to work on the dictionaries.

        In short, if CPU cycles are the problem, it should be easy to make a separate Croatian analyzer, just by copying this one and removing all the Cyrillic branches. For the Slovene language, ć and đ could also be removed, if that is necessary.

        (Macedonian doesn't use the same system, BTW.)
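The copy-and-trim approach described above can be sketched as follows, in plain Java rather than as a Lucene filter. The class `LatinOnlyFoldingSketch`, the methods `foldCroatian` and `foldSlovene`, and the specific target spellings are hypothetical names and mappings assumed for illustration; they are not taken from the patch.

```java
/** Illustrative sketch only: Latin-only fold steps with no Cyrillic branches. */
final class LatinOnlyFoldingSketch {

    /** Assumed Croatian variant: only the five Latin letters with diacritics. */
    static String foldCroatian(String lowercased) {
        StringBuilder out = new StringBuilder(lowercased.length());
        for (int i = 0; i < lowercased.length(); i++) {
            char c = lowercased.charAt(i);
            switch (c) {
                case 'đ': out.append("dj"); break;
                case 'ž': out.append('z'); break;
                case 'č':
                case 'ć': out.append('c'); break;
                case 'š': out.append('s'); break;
                default:  out.append(c);   // everything else, Cyrillic included, passes through
            }
        }
        return out.toString();
    }

    /** Assumed Slovene variant: the same loop with the ć and đ branches removed. */
    static String foldSlovene(String lowercased) {
        StringBuilder out = new StringBuilder(lowercased.length());
        for (int i = 0; i < lowercased.length(); i++) {
            char c = lowercased.charAt(i);
            switch (c) {
                case 'ž': out.append('z'); break;
                case 'č': out.append('c'); break;
                case 'š': out.append('s'); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

Each variant differs from the combined one only in which branches it keeps, which is why the trade-off discussed in this comment is about CPU cycles and naming rather than about behaviour on Latin text.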

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Unassigned
          • Reporter: Nikola Smolenski
          • Votes: 0
          • Watchers: 5
