  Lucene - Core
  LUCENE-7287

New lemmatizer plugin for the Ukrainian language.

    Details

    • Lucene Fields:
      New

      Description

      Hi all,

      I wonder whether you are interested in supporting a plugin which provides a mapping between Ukrainian word forms and their lemmas. Some tests and docs come out of the box =).

      https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer

      It's really simple but still works and generates some value for its users.

      More: https://github.com/elastic/elasticsearch/issues/18303

      1. LUCENE-7287.patch
        29 kB
        Michael McCandless
      2. Screen Shot 2016-06-23 at 8.23.01 PM.png
        63 kB
        Ahmet Arslan
      3. Screen Shot 2016-06-23 at 8.41.28 PM.png
        52 kB
        Ahmet Arslan

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          Github user arysin closed the pull request at:

          https://github.com/apache/lucene-solr/pull/45

          ctargett Cassandra Targett added a comment -

          I missed it last go-around. I don't know if I will have time to add it for 6.3, but I added it to the TODO list (https://cwiki.apache.org/confluence/display/solr/Internal+-+TODO+List) so we'll at least know it needs to get done.

          arysin Andriy Rysin added a comment -

          Cassandra, it looks like 6.2 is out; could you please add a Ukrainian section to https://cwiki.apache.org/confluence/display/solr/Language+Analysis ?

          mikemccand Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.

          arysin Andriy Rysin added a comment -

          Thanks Michael, much appreciated!

          mikemccand Michael McCandless added a comment -

          Andriy Rysin, I pushed the normalization changes above, thank you!

          jira-bot ASF subversion and git services added a comment -

          Commit 6c730ab74f2ac8a865d2d514344db18572f059da in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c730ab ]

          LUCENE-7287: normalize Ukrainian morfologik dictionary to have unique token+lemma pairs

          jira-bot ASF subversion and git services added a comment -

          Commit bc502bd9c91669cec72f40fd6fc13b6a68e90c52 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc502bd ]

          LUCENE-7287: normalize Ukrainian morfologik dictionary to have unique token+lemma pairs

          mikemccand Michael McCandless added a comment -

          Andriy Rysin thank you! I'll merge this likely early next week ...

          arysin Andriy Rysin added a comment -

          Hey Michael McCandless, can we please merge the pull request above? That should wrap up the dictionary-based analyzer for Ukrainian. Thanks!

          arysin Andriy Rysin added a comment -

          Ok, I was able to run solr with Ukrainian analyzer and I can confirm it generates unique lemmas.
          I've created a pull request https://github.com/apache/lucene-solr/pull/45

          I've also added mapping_uk.txt so we can use the mapping filter in Solr; once it's merged we can add this line:
          <charFilter class="solr.MappingCharFilterFactory" mapping="org/apache/lucene/analysis/uk/mapping_uk.txt"/>

          We could potentially change UkrainianMorfologikAnalyzer to use MappingCharFilterFactory to read from the same file (so we don't have the mapping both in the code and in the file), but I'm not sure how appropriate using factories in Lucene is.

          Many thanks to Ahmet who helped with solr integration and found duplicate tokens!

          githubbot ASF GitHub Bot added a comment -

          GitHub user arysin opened a pull request:

          https://github.com/apache/lucene-solr/pull/45

          LUCENE-7287: normalize Ukrainian morfologik dictionary to have unique…

          … token+lemma pair

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/arysin/lucene-solr master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/45.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #45


          commit 45d1a80899ceb1afb467433529fe66d29e1a1d2b
          Author: Andriy Rysin <arysin@gmail.com>
          Date: 2016-06-24T23:41:07Z

          LUCENE-7287: normalize Ukrainian morfologik dictionary to have unique token+lemma pair


          arysin Andriy Rysin added a comment -

          I've created a dictionary that collapses token+lemma into one record (like the Polish dictionary does) and added tests to make sure we don't generate duplicate lemmas.
          I'll do a bit more testing and will create a pull request.
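
          To make that concrete, here is a rough standalone sketch of the kind of duplicate check described above (the sample sentence is just a placeholder, and the no-argument UkrainianMorfologikAnalyzer constructor is an assumption; the actual tests in the pull request may differ):

            import java.io.IOException;
            import java.util.HashSet;
            import java.util.Set;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
            import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
            import org.apache.lucene.analysis.uk.UkrainianMorfologikAnalyzer;

            public class CheckUniqueLemmas {
              public static void main(String[] args) throws IOException {
                Analyzer a = new UkrainianMorfologikAnalyzer();
                String text = "Ніч яка місячна, зоряна, ясная"; // placeholder sample text
                try (TokenStream ts = a.tokenStream("field", text)) {
                  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                  PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
                  ts.reset();
                  Set<String> lemmasAtPosition = new HashSet<>();
                  while (ts.incrementToken()) {
                    if (posInc.getPositionIncrement() > 0) {
                      lemmasAtPosition.clear(); // new position, start a fresh set
                    }
                    if (!lemmasAtPosition.add(term.toString())) {
                      System.out.println("duplicate lemma at the same position: " + term);
                    }
                  }
                  ts.end();
                }
              }
            }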

          arysin Andriy Rysin added a comment -

          Ok, then I'll prepare the changes as part of this ticket.

          I've looked deeper into the morfologik dictionaries we have in LanguageTool: the Polish one has token+lemma normalized (with POS tags concatenated for each unique token+lemma), while other dictionaries, including Ukrainian, have separate records, so token+lemma is not unique. I've sent an email to the morfologik guys and once I get an explanation I'll update the dictionary appropriately so we don't have duplicates.

          iorixxx Ahmet Arslan added a comment -

          This is a new feature that has never been released, so a new ticket may not be needed.

          arysin Andriy Rysin added a comment -

          Hmm, that does not look right. Yes, we can either use RemoveDuplicatesTokenFilterFactory (we'll have to add that to the UkrainianMorfologikAnalyzer too) or I need to rebuild the dictionary to remove the duplicates (probably the preferred way).
          The problem is that the dictionary is currently a POS dictionary, so there may be duplicate lemma records as long as the POS tags are different.
          I am thinking of filing a new JIRA issue for that and providing a pull request; does that make sense?

          iorixxx Ahmet Arslan added a comment -

          Hi,
          multiple tokens are OK, but multiple identical tokens look weird, no?
          Have you checked the screenshot that includes RemoveDuplicatesTokenFilterFactory (RDTF)?

          Shall I create mappings_uk.txt so we can use it in solr?

          Let's ask Michael.
          Either a separate file, or we can just recommend using the mapping char filter with the recommended mappings.
          Maybe we can place the uk_mappings.txt file under https://github.com/apache/lucene-solr/tree/master/solr/server/solr/configsets/sample_techproducts_configs/conf/lang

          arysin Andriy Rysin added a comment -

          Thanks Ahmet!
          Shall I create mappings_uk.txt so we can use it in solr?
          As for the multiple tokens, MorfologikFilter produces lemmas so (as I understand it) it may have multiple tokens in the output for a single token in the input.

          iorixxx Ahmet Arslan added a comment -

          Please see the screenshots in the attachments section at the beginning of the page and let me know what you think.

          iorixxx Ahmet Arslan added a comment -

          Here is the screenshot of the analysis admin page, with RemoveDuplicatesTokenFilter added.

             <!-- Ukrainian -->
              <fieldType name="text_uk" class="solr.TextField" positionIncrementGap="100">
                <analyzer> 
                  <tokenizer class="solr.StandardTokenizerFactory"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/uk/stopwords.txt" />
                  <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict" />
                  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                </analyzer>
              </fieldType>
          
          iorixxx Ahmet Arslan added a comment -
            <!-- Ukrainian -->
              <fieldType name="text_uk" class="solr.TextField" positionIncrementGap="100">
                <analyzer> 
                  <tokenizer class="solr.StandardTokenizerFactory"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/uk/stopwords.txt" />
                  <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict" />
                </analyzer>
              </fieldType>
          
          iorixxx Ahmet Arslan added a comment -

          Hi,

          I was able to run the analyzer successfully without a mapping char filter, because the character mappings are hardcoded into the code.
          I am attaching an analysis screenshot. However, it looks like we need a remove-duplicates token filter at the end.
          It looks like the Morfologik filter injects multiple tokens at the same position.

          arysin Andriy Rysin added a comment -

          Sure, I can add a comment, but I guess I need to test the solution first, and as I am not familiar with Solr it may take me a few days. Unless Ahmet Arslan already verified this solution, in which case we can just post it.

          ctargett Cassandra Targett added a comment - edited

          If you do that (make a comment on the page with some text), I'll make sure it gets into the Solr Ref Guide. Just so you know, since this is for 6.2, I won't be able to add the content until after the current Ref Guide for 6.1 is released (vote going on now).

          edit: removed some of my earlier comment, I got this confused with SOLR-7739.

          iorixxx Ahmet Arslan added a comment -

          Only committers have rights to edit the Confluence wiki. Contributors include the proposed change/addition as a comment at the end of the page.

          arysin Andriy Rysin added a comment -

          I've logged in to cwiki but I don't seem to have rights to edit the page.

          iorixxx Ahmet Arslan added a comment -

          I think you should, as the author of the Ukrainian analyzer. Thanks!

          arysin Andriy Rysin added a comment -

          Thanks Ahmet, that looks good! Would you add/push those changes or shall I work on this?

          iorixxx Ahmet Arslan added a comment -

          So, the Solr field type counterpart of this analyzer would be something like:

              <!-- Ukrainian -->
              <fieldType name="text_uk" class="solr.TextField" positionIncrementGap="100">
                <analyzer> 
                  <charFilter class="solr.MappingCharFilterFactory" mapping="lang/mappings_uk.txt"/>
                  <tokenizer class="solr.StandardTokenizerFactory"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/uk/stopwords.txt" />
                  <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
                </analyzer>
              </fieldType>
              

          It would be nice to add an entry for Ukrainian to https://cwiki.apache.org/confluence/display/solr/Language+Analysis

          arysin Andriy Rysin added a comment - edited

          I don't know much about Solr, but I think MorfologikFilterFactory uses a dictionary= parameter instead of dictionary-resource=:
          https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html

          Also, would that mean that we don't get the stop-words filter and the apostrophe/stress character normalization?

          iorixxx Ahmet Arslan added a comment -

          Can we use this analyzer in solr?

           <filter class="solr.MorfologikFilterFactory" dictionary-resource="uk"/>
          
          jira-bot ASF subversion and git services added a comment -

          Commit 21eb654e408727b56a78c1c6a00541efe6eda31e in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=21eb654 ]

          LUCENE-7287: don't use full paths to resources

          jira-bot ASF subversion and git services added a comment -

          Commit ceb6e21f84414b42f6b1b3866fc5b62e7ab474c0 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ceb6e21 ]

          LUCENE-7287: don't use full paths to resources

          mikemccand Michael McCandless added a comment -

          Uwe Schindler oh yeah I'll fix that!

          thetaphi Uwe Schindler added a comment - edited

          Michael McCandless: Can you remove the absolute path here?

          return Dictionary.read(UkrainianMorfologikAnalyzer.class.getResource("/org/apache/lucene/analysis/uk/ukrainian.dict"));
          

          The file is in the same package, so just the filename should be fine to resolve the URL.
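
          For reference, the relative lookup Uwe suggests would presumably look like this (the same statement as above, minus the absolute path; the class resolves the name against its own package, i.e. org/apache/lucene/analysis/uk/ukrainian.dict on the classpath):

          return Dictionary.read(UkrainianMorfologikAnalyzer.class.getResource("ukrainian.dict"));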

          jira-bot ASF subversion and git services added a comment -

          Commit 4a71e03a32fb5739b15ca4b0f893d50392caeb71 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a71e03 ]

          LUCENE-7287: add UkrainianMorfologikAnalyzer, a dictionary-based analyzer for the Ukrainian language

          jira-bot ASF subversion and git services added a comment -

          Commit 6ef174f52737b37e8b0625208ccc7cc64c3bd5b0 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6ef174f ]

          LUCENE-7287: add UkrainianMorfologikAnalyzer, a dictionary-based analyzer for the Ukrainian language

          mikemccand Michael McCandless added a comment -

          Thanks Andriy Rysin, I'll tweak the javadocs for UkrainianMorfologikAnalyzer stating that it's dictionary based and push shortly. It looks like I have the latest dictionary. Thank you!

          arysin Andriy Rysin added a comment -

          Looks cool, thanks a lot Michael!

          I wonder if we should add a little javadoc for this analyzer saying that it's dictionary-based, so that if we add a light-stemming analyzer users can easily tell the difference.
          Also, since I created the project I've updated the dictionary once (https://github.com/arysin/lucene_uk/commit/7cc8bea59c402e9b9729afd63d0a53cb34045e750); I'm not sure if you're using the latest update.

          I'll open another issue for the "light" stemmer for Ukrainian.

          mikemccand Michael McCandless added a comment -

          OK here's a patch: just a rote copy of the files from Andriy Rysin's project, fixing up a few things ant precommit was unhappy about, plus some small code styling fixes. Tests pass, I think it's ready!

          Thank you Andriy Rysin!

          mikemccand Michael McCandless added a comment -

          Or we could put it under analysis/morfologik (as a .uk subpackage) - it's your call.

          I like this idea!

          If we do that will the stopwords go with the stemmer or should they live under common/ (as they are not morfologik-specific and may be used for other Ukrainian implementations)?

          I think we should keep the stop-words in the same location? I think users seeking Ukrainian tokenization should still be able to find them, under analysis/morfologik?

          I am also thinking about whether we could build a generic stemmer for Ukrainian based on the affix rules we have in the dict_uk project (they are hunspell-like but fully based on regular expressions, which makes them way more compact).

          That sounds compelling! This would be a "light" stemmer, vs what we are adding for this issue (dictionary based)? We should open a separate issue for that I think...

          OK, I'll work on folding your project into Lucene, under analysis/morfologik in a uk sub-package. Thank you for all the hard work here!

          arysin Andriy Rysin added a comment -

          I guess it does not fit under analysis/common as it depends on Morfologik, so analysis/ukrainian is probably a good place. Or we could put it under analysis/morfologik (as a .uk subpackage) - it's your call. If we do that will the stopwords go with the stemmer or should they live under common/ (as they are not morfologik-specific and may be used for other Ukrainian implementations)?
          I am also thinking about whether we could build a generic stemmer for Ukrainian based on the affix rules we have in the dict_uk project (they are hunspell-like but fully based on regular expressions, which makes them way more compact).

          mikemccand Michael McCandless added a comment -

          Andriy Rysin I think this looks nice, thank you! I think we should place it in its own sub-module under Lucene's analysis module? Maybe just analysis/ukrainian?

          arysin Andriy Rysin added a comment -

          Michael McCandless, Ahmet Arslan, does this implementation look good enough for inclusion? Is there anything else that needs to be done? Thanks.

          arysin Andriy Rysin added a comment -

          Thanks for the hint, I've changed the code to use MappingCharFilter.
          It's slightly slower but architecturally more correct.
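
          For illustration, a minimal sketch of what that wiring could look like (the specific characters mapped below, two Unicode apostrophe variants plus the combining acute accent used as the stress mark, are examples rather than the exact set in lucene_uk):

            import java.io.Reader;
            import org.apache.lucene.analysis.charfilter.MappingCharFilter;
            import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

            public class ApostropheNormalization {
              private static final NormalizeCharMap MAP;
              static {
                NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
                builder.add("\u2019", "'");  // right single quotation mark -> ASCII apostrophe
                builder.add("\u02BC", "'");  // modifier letter apostrophe -> ASCII apostrophe
                builder.add("\u0301", "");   // combining acute accent (stress mark) removed
                MAP = builder.build();
              }

              // Would be plugged in via Analyzer#initReader(String, Reader).
              static Reader wrap(Reader reader) {
                return new MappingCharFilter(MAP, reader);
              }
            }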

          iorixxx Ahmet Arslan added a comment -

          Maybe MappingCharFilter could be used instead of a token filter?

          arysin Andriy Rysin added a comment -

          I've added a token filter for Unicode apostrophes and the stress symbol.

          arysin Andriy Rysin added a comment -

          Ok guys, I've created a little project with a Ukrainian analyzer for Lucene using MorfologikAnalyzer: https://github.com/arysin/lucene_uk
          The test (https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/apache/lucene/analysis/uk/TestUkrainianAnalyzer.java) runs successfully inside lucene but I can't run it in my project (I get an NPE at RunListenerPrintReproduceInfo.java:131).
          I can run a simple standalone test app with no problem though: https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/lucene_uk/test/LuceneTest.java
          For simplicity, for now I just included the Ukrainian binary morfologik dictionary in the project itself. The only currently published artifact with the Ukrainian dictionary is http://mvnrepository.com/artifact/org.languagetool/language-uk but it requires languagetool-core, and dragging that into Lucene probably does not make sense. If the PoC is good enough I can take a shot at creating a separate artifact with just the dictionary (this may take some time), or we can just live with the blob in Lucene.

          I would appreciate it if you could take a look and let me know how it looks. If it's acceptable I would need to work on including some of the goodies from Dmytro's project: handling different apostrophes and ignoring the accent character.

          mikemccand Michael McCandless added a comment -

          Thanks for the detailed analysis Andriy Rysin! On where the dictionary lives, I think option 2 is good?

          On #4, whichever is best for you!

          arysin Andriy Rysin added a comment -

          Ok, I've imported lucene-solr and the Ukrainian analyzer project from Dmytro Hambal into Eclipse and looked through the code.
          Unfortunately we can't use the whole morfologik package as is - it's very specific to Polish. We could still probably use part of morfologik for its compact dictionary representation. The whole Ukrainian dictionary in this format with POS tags is ~1.6 MB compared to 98 MB in CSV, and we could probably make it smaller if we strip the tags.
          There are several things I'd like to note:
          1) this dictionary is for inflections (not related words), so this stemming will produce lemmas, not quite root words (this is probably ok and in some cases even better?)
          2) as this is dictionary-based stemming it won't stem unknown words (but the dictionary contains ~200K lemmas so it should give good output)
          3) as Ukrainian has a high level of inflection (nouns produce up to 7 forms, direct verbs up to 20, reverse verbs up to 30 forms) with many rules and exceptions, developing quality rule-based stemming will not be trivial
          4) I was planning to work on the Ukrainian analyzer in a separate project, but if it's better for the review process I can fork lucene-solr and work inside the fork
          5) I am thinking of creating org.apache.lucene.analysis.uk classes based on Dmytro Hambal's work and the csv file we have, and once it's working try a more compact representation

          The question: once we have it working, shall we include the dictionary in the Lucene project or make it an external dependency (like with morfologik-polish.jar)? The first is simpler, but the second will allow easy updates of the dictionary (which I can see being actively developed for another year or two) and will also keep the binary blob out of the project. I am leaning towards the second but am open to discussion.

          mikemccand Michael McCandless added a comment -

          That sounds like a great solution Andriy Rysin! Would that give the same functionality as your original plugin sources? Users can do this today, just by using Morfologik with a custom (your) dictionary?

          arysin Andriy Rysin added a comment -

          I just realized that Lucene includes a morfologik analyzer (https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/morfologik/MorfologikAnalyzer.java). We already use the Ukrainian dictionary in morfologik format for LanguageTool (https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/resource/uk/ukrainian.dict).
          It's about 1.6 MB on disk and should be quite fast and memory efficient.
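
          As a rough illustration of that idea (the dictionary path and the rest of the analysis chain are assumptions, not the eventual implementation), reusing the existing Morfologik filter with the Ukrainian dictionary could look roughly like this:

            import java.io.IOException;
            import java.nio.file.Paths;
            import morfologik.stemming.Dictionary;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.Tokenizer;
            import org.apache.lucene.analysis.core.LowerCaseFilter;
            import org.apache.lucene.analysis.morfologik.MorfologikFilter;
            import org.apache.lucene.analysis.standard.StandardTokenizer;

            public class UkrainianMorfologikSketch {
              public static Analyzer create(String dictPath) throws IOException {
                // Load the Ukrainian morfologik dictionary (location is an assumption).
                final Dictionary dict = Dictionary.read(Paths.get(dictPath).toUri().toURL());
                return new Analyzer() {
                  @Override
                  protected TokenStreamComponents createComponents(String fieldName) {
                    Tokenizer source = new StandardTokenizer();
                    TokenStream result = new LowerCaseFilter(source);
                    result = new MorfologikFilter(result, dict); // lemmatize via the dictionary
                    return new TokenStreamComponents(source, result);
                  }
                };
              }
            }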

          arysin Andriy Rysin added a comment -

          From my point of view we can use dict_uk as a source for Lucene (and we can provide an acceptable license). The question is whether we need hunspell data with affixes that are based on lemmas (a bit more work) or whether we can get away with a flat file as suggested by Ahmet Arslan (which we can do pretty quickly).

          dchaplinsky Dmitry Chaplinsky added a comment -

          I really want this project to happen.

          Ahmet Arslan, Michael McCandless, is there anything I can do to help?

          arysin Andriy Rysin added a comment -

          BTW how does hunspell stemming work for "exceptions"? There are a bunch of words in Ukrainian whose inflections are hard to put into hunspell affix rules.

          arysin Andriy Rysin added a comment -

          So do we need to build a hunspell dictionary (this may take me some time, probably a week or two), or is using StemmerOverrideFilter with the existing dictionary, as suggested by Ahmet, good enough?
          BTW the older Ukrainian hunspell used in http://github.com/elastic/hunspell is not very suitable as it's "too compact": it often combines multiple lemmas together (most frequently direct and reverse verbs, adjectives and adverbs, etc.).

          iorixxx Ahmet Arslan added a comment -

          This looks like a wrapper for a string-to-string mapping. No need to roll custom Lucene code for this: just replace the comma with a tab in the mapping_sorted.csv file and use good old StemmerOverrideFilter, which has a fast lookup that does not require a termAtt.toString() conversion.
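
          A minimal sketch of that suggestion (the tab-separated file path and the placement in the chain are assumptions): load the form/lemma pairs into a StemmerOverrideMap and wrap the stream in StemmerOverrideFilter.

            import java.io.BufferedReader;
            import java.io.IOException;
            import java.nio.charset.StandardCharsets;
            import java.nio.file.Files;
            import java.nio.file.Paths;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
            import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter.StemmerOverrideMap;

            public class StemmerOverrideSketch {
              // Build the override map from a tab-separated "form<TAB>lemma" file.
              public static StemmerOverrideMap load(String tsvPath) throws IOException {
                StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder();
                try (BufferedReader reader = Files.newBufferedReader(Paths.get(tsvPath), StandardCharsets.UTF_8)) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    if (parts.length == 2) {
                      builder.add(parts[0], parts[1]); // word form -> lemma
                    }
                  }
                }
                return builder.build();
              }

              // Replace recognized word forms with their lemmas in an existing chain.
              public static TokenStream wrap(TokenStream in, StemmerOverrideMap map) {
                return new StemmerOverrideFilter(in, map);
              }
            }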

          mikemccand Michael McCandless added a comment -

          There's no alternative open dictionary for Ukrainian with acceptable quality (I know since I've been working on it for the last 10 years).

          OK thanks Andriy Rysin ... it looks like all the Ukrainian dictionaries I can find lead back to you!

          Relicensing your data files (and maybe also the hunspell dictionaries) to ASL2 or MIT or BSD would be wonderful, if you are able/allowed to!

          I think we need to understand how your approach differs from the Hunspell tokenizer Lucene already provides.

          See https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html for some details, and e.g. http://github.com/elastic/hunspell for how it's integrated into ES. This is all quite new to me and I don't have an appreciation for what the differences are, in tokenization accuracy, heap used, tokens per second processing, etc. I know Robert Muir spent quite a bit of time trying to keep heap usage low and tokenization performance high on the original hunspell issues.

          arysin Andriy Rysin added a comment -

          There's no alternative open dictionary for Ukrainian with acceptable quality (I know since I've been working on it for the last 10 years).
          But I can relicense https://github.com/arysin/dict_uk or the derivatives under MIT if it helps.

          mikemccand Michael McCandless added a comment -

          The dictionary originally comes from https://github.com/arysin/dict_uk

          Alas that project is distributed under the GPL license, which we cannot use here. Do you have an alternative dictionary source that has a more reasonable license?

          mr_gambal Dmytro Hambal added a comment -

          Michael McCandless, speaking of this data file, we had an idea to keep it in DAWG format, which should have taken ~8 MB.
          But so far we haven't found an implementation that can handle both a map-like interface and the ability to load the data from files. More details here: https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer/issues/1

          arysin Andriy Rysin added a comment -

          A quick check via jvisualvm shows ~400 MB used by the dictionary map. The dictionary originally comes from https://github.com/arysin/dict_uk; that project developed from the Ukrainian hunspell dictionary (which was very compact: on average each hunspell flag produced 12 words) but has diverged a bit, and now the system of affixes in dict_uk is not compatible with that in hunspell.
          It's on my TODO list to add a converter to produce a hunspell dictionary from the dict_uk sources. If that helps here (I'm not familiar with the hunspell token filter in Lucene) I could put it a bit higher in my priorities.

          mikemccand Michael McCandless added a comment -

          Thanks Dmytro Hambal, this sounds nice! The license is the MIT license, which is compatible with the ASL (good!).

          I looked very briefly and it looks like there is a large (~94 MB) data file that is loaded into the heap ... where did this data/dictionary come from? And, once loaded, how much heap does it consume? It seems like it could be very high (it's loaded as a HashMap<String,String>, I think?). E.g. our hunspell token filter works hard to use a compact in-heap representation, and it also supports Ukrainian, I believe.


            People

            • Assignee:
              Unassigned
              Reporter:
              mr_gambal Dmytro Hambal
            • Votes:
              0
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved:

                Development