Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      This is a new Serbian filter that works with regular Latin text (the current filter works with "bald" Latin). I described in detail what it does and why it is necessary on the wiki.

      1. LUCENE-6875.patch
        14 kB
        Dawid Weiss
      2. Lucene-Serbian-Regular-1.patch
        11 kB
        Nikola Smolenski

        Activity

        Dawid Weiss added a comment -

        Interesting. Are these transliteration rules somehow normalized? Or are they something you came up with? If they're normalized it'd be nice to include a reference in the JavaDoc. For example џ is transliterated as ž, but so is ж? I only found this, but it doesn't talk much on the subject:

        https://en.wikipedia.org/wiki/Romanization_of_Serbian

        Nikola Smolenski added a comment -

        I'm not sure what you mean by "normalized". There are the two alphabets, and this is the conversion between them. This is the common conversion, not something I came up with. Regarding the letters you mentioned, ж is transliterated as ž, but џ is transliterated as dž.
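
To make the conversion discussed here concrete, the following is a minimal sketch of a Serbian Cyrillic to regular Latin mapping, in which ж becomes ž and џ becomes the digraph dž. It is not the code from the attached patch: the class name `SerbianTranslitSketch` and the method `toRegularLatin` are hypothetical, and the sketch assumes the input has already been lowercased.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not the filter from the patch.
final class SerbianTranslitSketch {
    // Lowercase Serbian Cyrillic letters and their common Latin equivalents.
    private static final Map<Character, String> CYRILLIC_TO_LATIN = new HashMap<>();

    static {
        String[][] pairs = {
            {"а", "a"}, {"б", "b"}, {"в", "v"}, {"г", "g"}, {"д", "d"},
            {"ђ", "đ"}, {"е", "e"}, {"ж", "ž"}, {"з", "z"}, {"и", "i"},
            {"ј", "j"}, {"к", "k"}, {"л", "l"}, {"љ", "lj"}, {"м", "m"},
            {"н", "n"}, {"њ", "nj"}, {"о", "o"}, {"п", "p"}, {"р", "r"},
            {"с", "s"}, {"т", "t"}, {"ћ", "ć"}, {"у", "u"}, {"ф", "f"},
            {"х", "h"}, {"ц", "c"}, {"ч", "č"}, {"џ", "dž"}, {"ш", "š"}
        };
        for (String[] pair : pairs) {
            CYRILLIC_TO_LATIN.put(pair[0].charAt(0), pair[1]);
        }
    }

    // Replaces each Cyrillic letter with its Latin form; other characters pass through unchanged.
    static String toRegularLatin(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String latin = CYRILLIC_TO_LATIN.get(c);
            if (latin != null) {
                out.append(latin);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegularLatin("жаба"));
        System.out.println(toRegularLatin("џем"));
    }
}
```

Because three Cyrillic letters (љ, њ, џ) map to two-letter Latin digraphs, the output can be longer than the input, which is why the sketch builds a new string instead of replacing characters in place.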

        Dawid Weiss added a comment -

        By normalized I meant some kind of standard that defines this transliteration. I'm (fairly) confident there is a transliteration guide for doing Cyrillic -> Polish (Latin and diacritics), but I'd have to look for an exact reference.

        I was just curious, it wasn't meant to be a negative remark.

        "Regarding the letters you mentioned, ж is transliterated as ž, but џ is transliterated as dž."

        Oops, sorry – I only looked at the patch and the aligned conversion strings being compared, my bad.

        Robert Muir added a comment -

        Dawid, at least http://geonames.nga.mil/gns/html/Romanization/Romanization_Serbian.pdf defines it in this way.

        Dawid Weiss added a comment -

        Thanks Robert. I looked up the Polish norm (note the lack of quotes, it actually is a Polish country standard); it is a translated (and adopted) version of ISO-9 [2]. A rough scan is at [3], and the norm is available (for a fee) at [1].

        [1] http://sklep.pkn.pl/pn-iso-9-2000p.html
        [2] https://en.wikipedia.org/wiki/ISO_9
        [3] http://bg.p.lodz.pl/dokumenty/cyrylica1.pdf

        Nikola Smolenski added a comment -

        This is so ubiquitous that I can't find a reference. The official orthography of Serbian lists the two alphabets, but doesn't explicitly specify how to convert between them. You can see that various other software projects use the same conversion, for example GNU GetText http://cvs.savannah.gnu.org/viewvc/gettext/gettext-tools/src/filter-sr-latin.c?revision=1.4&root=gettext&view=markup or MediaWiki https://phabricator.wikimedia.org/diffusion/MW/browse/master/languages/classes/LanguageSr.php

        I have never seen ISO 9 used in practice, and it wouldn't be useful here anyway, since no one would enter the queries in ISO 9.

        Robert Muir added a comment -

        I think the scheme is fine.

        In the patch, the "regular" filter actually documents that it goes to "bald". I think this is just an accident?

        Dawid Weiss added a comment -

        It's fine with me as well, I was just curious. I am definitely not the authority to tell whether it's good or bad.

        Nikola Smolenski added a comment -

        Yes, that was a remnant of the copy/paste. Here is the new patch with the corrected comment.

        Dawid Weiss added a comment -

        TestAllAnalyzersHaveFactories does not pass (SPI entry missing). Run ant test.

        Dawid Weiss added a comment -

        Hmm... I think this is in fact a problem with the test, because the factory is there, but there are two different filters that accompany it:

        SerbianNormalizationFilter.java
        SerbianNormalizationFilterFactory.java
        SerbianNormalizationRegularFilter.java
        

        and the test complains about the other one:

        [09:53:30.679] ERROR   1.09s J3 | TestAllAnalyzersHaveFactories.test <<<
           > Throwable #1: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory with name 'SerbianNormalizationRegular' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [apostrophe, arabicnormalization, arabicstem, bulgarianstem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, asciifolding, capitalization, codepointcount, fingerprint, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, trim, truncate, worddelimiter, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, turkishlowercase, elision]
        

        Robert, should there be a separate factory for that filter?

        Robert Muir added a comment -

        In general most are 1-1, but in this case I think the factory setup is fine. I think there should be an exception list in the test?
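
The idea of an exception list can be sketched as follows. This is a simplified illustration, not the actual TestAllAnalyzersHaveFactories code: the class `FactoryCoverageSketch` and its methods are hypothetical, and the name derivation (strip the `Filter` suffix, then lowercase) is an assumption inferred from the error message above, which looks up 'SerbianNormalizationRegular' against a list of lowercase names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hypothetical stand-in for the one-filter-one-factory check.
final class FactoryCoverageSketch {
    // Assumed convention: "FooFilter" is looked up under the lowercase name "foo".
    static String spiName(String filterSimpleName) {
        String suffix = "Filter";
        String base = filterSimpleName.endsWith(suffix)
            ? filterSimpleName.substring(0, filterSimpleName.length() - suffix.length())
            : filterSimpleName;
        return base.toLowerCase(Locale.ROOT);
    }

    // Returns the filters that have no registered factory name and are not explicitly exempted.
    static List<String> filtersWithoutFactory(List<String> filters,
                                              Set<String> registeredNames,
                                              Set<String> exceptions) {
        List<String> missing = new ArrayList<>();
        for (String filter : filters) {
            if (exceptions.contains(filter)) {
                continue; // this filter deliberately shares a factory with another filter
            }
            if (!registeredNames.contains(spiName(filter))) {
                missing.add(filter);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> filters = Arrays.asList(
            "SerbianNormalizationFilter", "SerbianNormalizationRegularFilter");
        Set<String> registered = new HashSet<>(Arrays.asList("serbiannormalization"));
        Set<String> exceptions = new HashSet<>(Arrays.asList("SerbianNormalizationRegularFilter"));

        // Without the exception list the second filter is reported; with it, nothing is.
        System.out.println(filtersWithoutFactory(filters, registered, new HashSet<>()));
        System.out.println(filtersWithoutFactory(filters, registered, exceptions));
    }
}
```

The trade-off of this design is that each exempted filter must be listed by hand, so the list only stays honest if entries are added for filters that really are reachable through another factory.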

        Nikola Smolenski added a comment -

        I was considering making two separate factories, but in the end I decided against it because all the other analyzers in the chain might need to be separate as well (for example, there could be a regular stemmer and a bald stemmer, etc.) and so all would need separate factories...

        Dawid Weiss added a comment -

        Added an exception to the test. Added CHANGES.txt entry. Nikola, it'd be good if you could perhaps add a sentence or two in CHANGES on what the "new" filter does. There are actually people who read CHANGES.txt.

        ASF subversion and git services added a comment -

        Commit 1713712 from Dawid Weiss in branch 'dev/trunk'
        [ https://svn.apache.org/r1713712 ]

        LUCENE-6875: New Serbian Filter. (Nikola Smolenski via Robert Muir, Dawid Weiss)

        ASF subversion and git services added a comment -

        Commit 1713714 from Dawid Weiss in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1713714 ]

        LUCENE-6875: New Serbian Filter. (Nikola Smolenski via Robert Muir, Dawid Weiss)

        Dawid Weiss added a comment -

        Thanks Nikola.

        Steve Rowe added a comment -

        Dawid Weiss, my Jenkins reports that TestAllAnalyzersHaveFactories is failing: http://jenkins.sarowe.net/job/Lucene-Solr-tests-5.x-Java8/3309/ and http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/3647/

        1 tests failed.
        FAILED:  org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test
        
        Error Message:
        A SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory with name 'SerbianNormalizationRegular' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [apostrophe, arabicnormalization, arabicstem, bulgarianstem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, asciifolding, capitalization, codepointcount, fingerprint, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, trim, truncate, worddelimiter, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, turkishlowercase, elision]
        
        Dawid Weiss added a comment -

        Hmm... looking into it.

        ASF subversion and git services added a comment -

        Commit 1713716 from Dawid Weiss in branch 'dev/trunk'
        [ https://svn.apache.org/r1713716 ]

        Reverting 1713712 (LUCENE-6875), wrong patch.

        ASF subversion and git services added a comment -

        Commit 1713717 from Dawid Weiss in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1713717 ]

        Reverting 1713712 (LUCENE-6875), wrong patch.

        Dawid Weiss added a comment -

        Don't know how it happened, but I committed the wrong patch... Sorry about it! Thanks for the heads up, Steve. I've reverted the wrong patch and will commit the corrected one shortly.

        ASF subversion and git services added a comment -

        Commit 1713719 from Dawid Weiss in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1713719 ]

        LUCENE-6875: New Serbian Filter. (Nikola Smolenski via Robert Muir, Dawid Weiss)

        ASF subversion and git services added a comment -

        Commit 1713720 from Dawid Weiss in branch 'dev/trunk'
        [ https://svn.apache.org/r1713720 ]

        LUCENE-6875: New Serbian Filter. (Nikola Smolenski via Robert Muir, Dawid Weiss)

        ASF subversion and git services added a comment -

        Commit 1713737 from janhoy@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1713737 ]

        LUCENE-6875: Fix svn eol-style and space instead of tab, to pass precommit

        ASF subversion and git services added a comment -

        Commit 1713740 from janhoy@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1713740 ]

        LUCENE-6875: Fix svn eol-style and space instead of tab, to pass precommit

        Hoss Man added a comment -

        Nikola: huge thank you for creating that Solr wiki page - very helpful for understanding the pros/cons of the different approaches.


          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Nikola Smolenski
          • Votes:
            0
            Watchers:
            3
